US8645090B2 - Automated baseline removal of signal - Google Patents
Automated baseline removal of signal
- Publication number
- US8645090B2 (US12/466,031)
- Authority
- US
- United States
- Prior art keywords
- signal
- peak
- baseline
- determining
- region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01N—INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
- G01N30/00—Investigating or analysing materials by separation into components using adsorption, absorption or similar phenomena or using ion-exchange, e.g. chromatography or field flow fractionation
- G01N30/02—Column chromatography
- G01N30/86—Signal analysis
- G01N30/8624—Detection of slopes or peaks; baseline correction
- G01N30/8641—Baseline
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2218/00—Aspects of pattern recognition specially adapted for signal processing
- G06F2218/02—Preprocessing
Definitions
- Baseline removal is important for applications that involve quantitation, like estimating the amount of each of the compounds generating peaks in the signal. Baseline removal is also important for numerical processing applications performed prior to quantitation, such as estimating the number of peaks present in a signal. Background removal is not only necessary in chromatography, but it is also required to interpret signals in mass spectrometry, for example.
- FIG. 1A is a graph showing a signal curve 110 along with a baseline curve 120.
- FIG. 1B shows the signal curve 115 with the baseline removed.
- The baseline curve 120 changes over a longer time frame than the peaks of the signal curve 110.
- Baselines generally are smoother than peaks, as can be seen in FIG. 1A. Therefore, in order to estimate the baseline of a signal, as is necessary, e.g., to determine the amplitudes of the peaks, the smoothness of the baseline curve 120 must be reconciled with the accuracy (or fidelity) with which the baseline curve 120 matches the signal curve 110. In other words, the lower bound of the signal curve 110 should be followed as closely as possible without overestimating the baseline curve 120, especially when peaks of the signal curve 110 have a high degree of overlap.
- FIG. 2 is a graph showing a signal curve 210 along with an overestimated baseline curve 220, for example, according to a conventional method for baseline estimation.
- The estimated baseline curve 220 of FIG. 2 is less smooth and follows the signal curve 210 more closely than the baseline curve 120 of FIG. 1A.
- The asterisks on the baseline curve 220 show instances in which overfitting leads to inaccurate quantitation of the peak areas of the signal curve 210.
- FIGS. 1A and 1B are graphs showing a signal with and without baseline, according to a representative embodiment.
- FIG. 2 is a graph showing a conventional signal with baseline that overfits.
- FIG. 3 is a functional block diagram illustrating a system for removing baseline of a signal, according to a representative embodiment.
- FIG. 4 is a flow diagram of a method for removing baseline of a signal, according to a representative embodiment.
- FIG. 5 is a flow diagram of a method for determining the peak-free area estimator of FIG. 4, according to a representative embodiment.
- FIG. 6 is a graph showing variance of a signal, according to a representative embodiment.
- FIGS. 7A-7D are graphs showing determination of weight functions, according to a representative embodiment.
- FIG. 8 is a flow diagram of a method for determining a cutoff value of peak region values, according to a representative embodiment.
- FIG. 9 is a flow diagram of a method for estimating a baseline of FIG. 4, according to a representative embodiment.
- FIGS. 10A-10D are graphs showing estimation of a baseline for a signal having highly overlapped peaks, according to a representative embodiment.
- FIGS. 11A-11B are graphs showing estimation of a baseline for a signal having noise, according to a representative embodiment.
- FIGS. 12A-12D are graphs showing iterative estimation of a baseline, according to a representative embodiment.
- FIGS. 13A-13D are graphs showing estimation of baselines of experimental signals, according to representative embodiments.
- FIG. 14 is a functional block diagram illustrating a system for removing baseline of a signal, according to a representative embodiment.
- An automated process provides for estimation and removal of baselines of signals without the need for user intervention.
- The process is intended for baseline removal a posteriori, that is, after data acquisition has taken place.
- The process does not require adjustment of parameters, and operations having high computational loads can be efficiently parallelized, making it suitable for processing signals consisting of large numbers of points. Also, the process is resistant to overfitting.
- the approach to modeling or estimating a baseline includes rejecting domains of a signal having peaks (peak-containing regions) and then modeling the baseline using the remaining regions (peak-free regions) of the signal. This is achieved, for example, through a binary classification scheme (peak-containing versus peak-free regions) or through a continuous normalized weight function with values that are close to zero in peak-containing regions and values close to one in peak-free regions.
- An estimator based on the slope of the signal, for example, is not sufficiently reliable, since even a small amount of noise, e.g., noise remaining after de-noising the signal, can confuse the estimator.
- Representative embodiments therefore include an estimator that naturally takes into account the noise in the signal in determining the peak-free regions of the signal. Then, the peak-free regions are fitted with smoothing splines.
- Two parameters, e.g., the smoothing factor and the weight cutoff, have intuitive meanings and may be automatically estimated based on properties of the signal, as discussed below. However, should the user have prior knowledge about the properties of the baseline, these values can be overridden.
- FIG. 3 is a functional block diagram illustrating a system 300, according to a representative embodiment.
- the system 300 may be any system for receiving and processing one-dimensional or two-dimensional signals, such as signals produced in accordance with chromatography, mass spectrometry, spectroscopy, electrophoresis, imaging, electronic measurements and the like.
- the system 300 may represent a liquid chromatography/mass spectrometry (LC/MS) system, for example, which generally combines separation functionality of liquid chromatography with mass analysis ability of mass spectrometry. While the representative embodiments relate to LC/MS applications, the present teachings are more broadly applicable to one-dimensional signals with background features that include an unwanted slowly changing component.
- The parts shown in FIG. 3 may be physically implemented using a software-controlled microprocessor, hard-wired logic circuits, or a combination thereof. Also, while the parts are functionally segregated in FIG. 3 for explanation purposes, they may be combined variously in any physical implementation.
- The system 300 includes a signal generator 310 and a baseline removal system 330.
- The signal generator 310 is configured to generate a signal that exhibits peaks and has a corresponding baseline, as shown for example in FIG. 1A.
- The signal generator 310 may be embodied as a sample separator, for example, which separates samples to provide signals indicating chemical entities included in the samples.
- Another example of a signal generator 310 is a mass spectrometer, which generates signals having peaks indicating masses corresponding to molecular contents of samples.
- The signal generator 310 provides (one-dimensional or two-dimensional) signals to the baseline removal system 330, in various embodiments.
- the baseline removal system 330 performs processing operations on the received signals, e.g., from sample separations, including the baseline removal process, in accordance with various embodiments discussed below.
- the baseline removal system 330 may also execute software that controls the basic functionality of the system 300 .
- The baseline removal system 330 may be implemented as a microprocessor, a digital signal processor (DSP), or the like, or at least in part by hard-wired logic circuits or customizable hardware. As stated above, although depicted separately, the baseline removal system 330 may be included within the sample separator 310, in various embodiments.
- FIG. 4 is a flow diagram illustrating a method for baseline removal, according to a representative embodiment.
- The various operations of the method depicted in FIG. 4 may correspond to modules, realized by hard-wired logic circuits or customizable hardware, a program running on a processor, or any combination thereof, indicated by the baseline removal system 330.
- A signal is received from a signal source, such as the sample separator 310, which has generated the signal in accordance with a number of possible procedures, including chromatography, mass spectrometry, spectroscopy, electrophoresis, imaging, electronic measurements and the like.
- a weight function representing peak-free regions of the received signal is determined at block 420 .
- the determined peak-free region weights of the received signal and the received signal are used in a smoothing spline fitting procedure at block 430 , and the spline resulting from block 430 is denoted as an estimated baseline.
- the estimated baseline from block 430 is subtracted from the received signal at block 440 , removing the estimated baseline.
- The processes indicated in blocks 420 and 430 are described in detail below with respect to FIGS. 5, 8 and 9. Further, in an embodiment, the processes indicated in blocks 420 and 430 are performed in parallel.
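The flow of blocks 420 to 440 can be sketched as a small driver that takes the two estimation steps as callables. This is only an illustrative skeleton: the function names are hypothetical, and `toy_weights` and `toy_fit` are deliberately crude placeholders, not the weight function or spline fit described in this document.

```python
def remove_baseline(signal, estimate_weights, fit_baseline):
    """Blocks 420-440 of FIG. 4, with both estimators passed in as callables."""
    weights = estimate_weights(signal)                      # block 420
    baseline = fit_baseline(signal, weights)                # block 430
    corrected = [y - b for y, b in zip(signal, baseline)]   # block 440
    return corrected, baseline


def toy_weights(signal):
    """Placeholder: weight 1 where the signal sits at its minimum, else 0."""
    floor = min(signal)
    return [1.0 if y == floor else 0.0 for y in signal]


def toy_fit(signal, weights):
    """Placeholder: a flat baseline at the weighted mean of the signal."""
    level = sum(w * y for w, y in zip(weights, signal)) / sum(weights)
    return [level] * len(signal)
```

Passing the estimators as arguments keeps the subtraction step independent of how the weights and the baseline are obtained.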
- FIG. 5 is a flow diagram illustrating a method for determining the weight function for peak-free regions of a signal, indicated by the process of block 420 of FIG. 4, according to a representative embodiment.
- The various operations of the method depicted in FIG. 5 may correspond to modules, realized by hard-wired logic circuits or customizable hardware, a program running on a processor, or any combination thereof, indicated by the baseline removal system 330.
- The determination of peak-free regions initially includes estimating peak-containing regions using variance of the signal, and more specifically, the rate of growth of the signal variance (e.g., blocks 510-530 of FIG. 5), further discussed with reference to FIGS. 6A and 6B, below.
- An estimator using the rate of growth of signal variance naturally takes into account noise present in the signal.
- A weight function is built for baseline estimation using a previously calculated cutoff value (e.g., blocks 540-550 of FIG. 5), further discussed with reference to FIG. 8, below.
- the peak regions have regression lines with higher slopes and higher intercepts than the baseline region.
- FIGS. 6A and 6B show graphs depicting variances of a signal as a function of window size, according to a representative embodiment, showing rate of growth of signal variance.
- Two synthetic peaks 610 and 612 of signal 605, shown in FIG. 6A, are fused and superimposed on a changing baseline.
- FIG. 6A also shows corresponding signal variances 610R/612R using windows that extend to the right of peaks 610 and 612, and corresponding signal variances 610L/612L using windows that extend to the left of peaks 610 and 612.
- FIG. 6B is a magnification of the signal 605 in the region in which the peaks 610 and 612 appear, showing variances 610R/612R and 610L/612L.
- The window sizes shown start at three, for example, and increase in unit increments. Although both contributions are actually added to determine the signal variances, 610R/612R and 610L/612L are shown separately for purposes of illustration. The location of the peak-containing region is thus apparent from the variance in the signal, as discussed below.
- Other estimators for identifying peak-containing regions may be incorporated, such as determining the maximum range of values of the signal in the window, although using signal variance appears to be the most robust. For example, as the slope of the baseline increases, signal variance becomes non-negligible even in peak-free areas. When dealing with steep baselines and/or increasing amounts of noise, the distinction between peak-containing regions and peak-free regions may blur. However, the peak-containing regions and peak-free regions may still be distinguished based on the increase of the variance with increasing window sizes, which remains linear in baseline regions, but not in peak-containing regions.
- Estimating peak-containing regions begins with recording variance values of the signal as window size increases at block 510.
- The mathematical form of the estimator, defined in terms of the rate of growth of the signal variance with increasing window size, may differ in various embodiments. The appropriate range of the window size is discussed below.
- a linear regression (e.g., a simple linear regression with one independent variable) of the variance values as a function of the window size is performed.
- the value of the maximum window size m should be large enough to be statistically significant, but small enough to capture only local variations of the variance, the determination of which would be apparent.
- m may be set in the range of 6-10, in a representative embodiment, to analyze both simulated and real signals. For peaks having fewer than six data points, for example, the window size may be reduced.
- Although equation (2) shows a representative embodiment in which the first and second contributions are simply added, any linear combination or monotonic function of the two may provide similar results.
- Block 530 yields a set of peak region values X_peaks, which are estimator values corresponding to each point i of the signal.
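The variance-growth estimator of blocks 510 to 530 can be sketched as follows. Equations (1) and (2) are not reproduced in this text, so the form of the two contributions here is an assumption: the first is taken as the residual sum of squares of the linear regression (departure from linear growth) and the second as the regression slope, with the right-hand and left-hand windows simply added. The function names and the default `m=8` are hypothetical choices within the 6-10 range mentioned above.

```python
from statistics import pvariance


def linear_fit(xs, ys):
    """Ordinary least squares y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, rss


def one_sided_estimate(signal, i, step, m):
    """Variance growth for windows of size 3..m starting at i (step=+1 right, -1 left)."""
    sizes, variances = [], []
    for k in range(3, m + 1):
        last = i + step * (k - 1)
        if last < 0 or last >= len(signal):
            break  # window would run off the end of the signal
        window = [signal[i + step * j] for j in range(k)]
        sizes.append(k)
        variances.append(pvariance(window))
    if len(sizes) < 2:
        return 0.0  # too few window sizes for a regression
    _, slope, rss = linear_fit(sizes, variances)
    return rss + slope  # assumed combination of the two contributions


def peak_region_values(signal, m=8):
    """Estimator value for every point: right-hand plus left-hand contribution."""
    return [
        one_sided_estimate(signal, i, +1, m) + one_sided_estimate(signal, i, -1, m)
        for i in range(len(signal))
    ]
```

On a flat, noise-free stretch every window variance is zero, so the estimate vanishes there, while points on the flank of a peak give clearly positive values.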
- FIGS. 7A-7D are graphs showing peak-free and peak-containing regions, according to a representative embodiment.
- curve 720 depicts a sample signal having multiple peaks.
- the signal may be synthetic or may result from various processes, such as chromatography, mass spectrometry, spectroscopy, electrophoresis, imaging, electronic measurements or the like.
- the depicted signal 720 is a one-dimensional signal, although it is understood that the process may be generalized to accommodate two-dimensional signals, as well.
- Curve 730 depicts the peak region values X_peaks, for example, derived according to equation (2) in block 530, above.
- Curve 730 depicting peak region values X_peaks tends to follow the baseline (having a value of approximately 0.525) in non-peak regions and to spike in peak regions of the signal curve 720.
- Curve 740 of FIG. 7A depicts a baseline weight function, which is determined in accordance with blocks 540-550 of FIG. 5, discussed below.
- Peak-containing regions, in which the variance of the signal, σ², grows non-linearly and at a lower rate than a linear function, tend to have non-negligible values of the first contribution.
- The rate of growth of σ² in peak-free regions is approximately linear, and the first contribution tends to have negligible values.
- The second contribution provides the slope of the regression line previously mentioned, which also tends to be higher in peak-containing regions compared to peak-free regions (the slope is always positive).
- Peak region values X_peaks are transformed into quantities having values in the interval (0, 1), which can be used as weights to evaluate the presence (or absence) of peaks.
- A cutoff value of the peak region values X_peaks, indicated by X_cutoff, is determined at block 540 of FIG. 5.
- The cutoff value X_cutoff is where statistically significant departures from zero begin to occur, shown for example at location 760 of curve 780 in FIG. 7B.
- Curve 780 of FIG. 7B is an example of peak region values X_peaks sorted in ascending order.
- The cutoff value X_cutoff of the peak region values X_peaks may be reliably determined in various manners, since baseline estimation is robust with respect to the precise value of the cutoff.
- FIG. 8 is a flow diagram illustrating an example of a process for determining a cutoff value X_cutoff, according to a representative embodiment, indicated by the process of block 540 of FIG. 5.
- The various operations of the method depicted in FIG. 8 may correspond to modules, realized by hard-wired logic circuits or customizable hardware, a program running on a processor, such as processor 330, or any combination thereof.
- The cutoff value X_cutoff may be determined by first sorting the peak region values X_peaks in ascending order and normalizing the sorted peak region values X_peaks to the length of the data at block 810.
- Values of the variance var of the values obtained in block 810 are then recorded over moving windows of a given size s at block 820. Recording the variance var may be accomplished by sliding the window over each point along the sorted list of peak region values X_peaks.
- values of point z are calculated for all possible values of threshold t and window sizes s.
- a threshold t is specified as a number that can take values between one and a fraction of the total length of the data (e.g., any relatively large number).
- a point z* is calculated at block 840 based on the functional form from block 830 .
- The cutoff value X_cutoff may then be determined at block 850 based on the calculated value of z* and the curve of the sorted and normalized values of the peak region values X_peaks.
- Point z* must satisfy the following relationship, where var(s_z) is the variance of a window of size s starting at point z in the sorted series: $\mathrm{var}(s_z) > t \, \mathrm{var}(s_{z+1})$ (3)
- FIG. 7C indicates temporary cutoffs calculated with intervals of increasing stringency and decreasing window size within a given stringency. More particularly, FIG. 7C is a plot of values of points z (along the vertical axis) for all possible values of threshold t and window sizes s. In this figure, points z are calculated for increasing values of the threshold t, indicated by linearly increasing value 771 , where for a given value of t, decreasing values of s are explored. Since each point z is a function of two variables, t and s, the curve in FIG. 7C shows all values of s corresponding to each value of t for each point z. When the value of point z is equal to the length of the dataset (the first occurrence being highlighted by rectangle 770 in FIG.
- FIG. 7D is a zoom or magnification of the region of interest 770, showing more detail, where a first spike 776 large enough to be substantially equal to the length of the data (e.g., 500 in the depicted example) appears for the first time.
- The values of s go from highest to lowest for each value of t.
- The value preceding the breakdown of formula (3) is designated point z*.
- The values of z preceding the breakdown of formula (3) are indicated by dotted oval 775 in FIG. 7D, for example.
- The final cutoff value X_cutoff is calculated at block 850 for the peak region values X_peaks in the sorted list of FIG. 7B evaluated at a point determined in FIGS. 7C and 7D.
- The first quartile of the distribution of points preceding the breakdown of formula (3), e.g., formed by points within the dotted oval 775, is used, which works well for diverse baselines and peak configurations (e.g., involving different peak numbers and different degrees of overlap). This quartile may then be used as the index in the sorted list (block 810) to calculate the final value of X_cutoff, as shown in block 850.
- a peak-free region or baseline weight function w(i) may be calculated at block 550 as follows, where i represents points along the signal:
- $$w(i) = e^{-X_{peaks}(i) / X_{cutoff}} \qquad (4)$$
- Baseline estimation is robust against changes in this function, and variants (e.g., involving the use of a hyperbolic tangent in conjunction with a cutoff) could be used.
- the values of the baseline weight function w(i) corresponding to the example are indicated by curve 740 of FIG. 7A . As shown, the baseline weight function w(i) has values that are arbitrarily close to 1 in each peak-free region and arbitrarily close to 0 in each peak-containing region.
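Equation (4) maps each peak region value to a weight through a decaying exponential. A minimal sketch, assuming the peak region values are non-negative and the cutoff is positive (the function name `baseline_weights` is hypothetical):

```python
import math


def baseline_weights(x_peaks, x_cutoff):
    """Weights w(i) = exp(-X_peaks(i) / X_cutoff) from equation (4)."""
    if x_cutoff <= 0:
        raise ValueError("x_cutoff must be positive")
    return [math.exp(-x / x_cutoff) for x in x_peaks]
```

Values well below the cutoff give weights near 1 (peak-free), and values many times the cutoff give weights near 0 (peak-containing).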
- the weight function w(i) is applied to the signal in the same process as the spline fitting, such that determination of the peak-free regions and the spline fitting are sub-operations of a parallelized fitting procedure.
- the signal weighted in accordance with the weight function is fitted by a smoothing spline procedure at block 430 to provide an estimated baseline. That is, the next process is to interpolate a smooth curve that effectively includes the peak-free regions of the signal obtained by multiplying the signal by the weight function, which may be achieved using piecewise polynomials, for example.
- the estimated baseline is subtracted from the received signal, in accordance with various apparent techniques, to obtain the actual signal.
- Piecewise polynomials interpolate data (e.g., from the signal) along a finite series of intervals. The points defining these intervals are called knots.
- splines are used as the piecewise polynomials, since splines are less erratic at boundaries and have smaller fluctuations than standard polynomials because they satisfy the constraint that all associated derivatives are continuous at the knots.
- Cubic splines are used, which are cubic polynomials satisfying continuity of the function, its slope and its curvature at the knots.
- Fidelity means that the interpolated smooth curve of the estimated baseline closely follows the peak-free regions of the signal data.
- Smoothness means that the estimated baseline should not overfit peak-containing regions, which have not been selected from the signal data, as depicted by the asterisks in FIG. 2, for example. Fidelity and smoothness are incompatible to some degree, and therefore a compromise is needed in the form of an optimization.
- When fitting scattered data using a spline, a function g may be defined as follows, where (y_i, x_i) is a data set consisting of N points, S is the cubic spline, and the "smoothing factor" is a positive real number:
- The first term on the right side of equation (5) (i.e., (y_i − S(x_i))²) is a standard term used in linear regression to penalize departures of the spline from the data set.
- The second term on the right side of equation (5) (i.e., the smoothing factor multiplied by ∫S″(t)² dt) penalizes high curvature of the spline or lack of smoothness.
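Equation (5) itself is not reproduced in this text. Based on the two terms described above, it plausibly has the standard smoothing-spline form shown below; the symbol $\lambda$ for the smoothing factor and the integration limits are assumptions.

```latex
g = \sum_{i=1}^{N} \bigl( y_i - S(x_i) \bigr)^2 + \lambda \int_{x_1}^{x_N} S''(t)^2 \, dt \qquad (5)
```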
- The smoothing factor defines how much weight smoothness has over data fidelity. When g is optimized for a given value of the smoothing factor, a "smoothing spline" will be obtained, as described for example in Reinsch, Smoothing by Spline Functions, NUMER. MATH.
- The function g from equation (5) may be minimized as follows, where w(i) are the weights determined according to equation (4) (e.g., as indicated by line 740 of FIG. 7A):
- Equation (6) does not include a penalty for situations in which the spline exceeds the signal (S(x_i) > y_i) at points where the weight w(i) is small.
- The baseline may end up having higher values than the real signal, and thus baseline subtraction may result in a negative net signal.
- To avoid this, w(i) may be assigned a (predetermined) finite value at all points where the relationship S(x_i) > y_i holds. The finite value assigned to w(i) need not be precise, although normally the finite value should be in the range of 1 to 100.
- Finite values larger than 1 appropriately penalize the situations in which the spline exceeds the signal where w(i) is small, while finite values less than 100 avoid slight underfitting of the baseline, e.g., when large amounts of white noise are present in the signal.
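The weighted objective and the overshoot rule can be illustrated with a discrete stand-in. Equation (6) is not reproduced in this text, so the sketch assumes it is equation (5) with each squared residual multiplied by w(i); the second-difference approximation of the curvature integral, the function name, and the default `overshoot_weight=10.0` (a value inside the 1 to 100 range) are likewise assumptions.

```python
def weighted_spline_cost(y, s, w, smoothing_factor, overshoot_weight=10.0, h=1.0):
    """Discrete stand-in for the weighted smoothing-spline objective.

    y: signal samples, s: candidate baseline sampled at the same points,
    w: baseline weights in [0, 1], h: sample spacing.
    """
    fidelity = 0.0
    for y_i, s_i, w_i in zip(y, s, w):
        # Wherever the candidate baseline rises above the signal, replace the
        # weight by a finite value so that the overshoot is penalized.
        effective = overshoot_weight if s_i > y_i else w_i
        fidelity += effective * (y_i - s_i) ** 2
    # Second differences approximate S''; the sum of their squares times h
    # approximates the integral of S''(t)^2.
    roughness = h * sum(
        ((s[k - 1] - 2.0 * s[k] + s[k + 1]) / (h * h)) ** 2
        for k in range(1, len(s) - 1)
    )
    return fidelity + smoothing_factor * roughness
```

With zero weights, a baseline lying below the signal costs nothing, whereas a baseline lying above it is penalized through `overshoot_weight`, which is what keeps the net signal from going negative.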
- FIG. 9 is a flow diagram illustrating fitting the weighted signal indicated by the process of block 430 in FIG. 4 using equation (6), according to a representative embodiment.
- The various operations of the method depicted in FIG. 9 may correspond to modules, realized by hard-wired logic circuits or customizable hardware, a program running on a processor, or any combination thereof, indicated by the baseline removal system 330.
- The smoothing factor is determined in block 910.
- The value of the smoothing factor is set in a self-consistent manner, such as in accordance with a generalized cross-validation procedure, as described, for example, in Craven et al., Smoothing Noisy Data with Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation, NUMER. MATH. 31:377-403 (1979) (hereinafter Craven et al.), the contents of which are hereby incorporated by reference.
- Alternatively, the smoothing factor may be estimated: an automated routine may estimate its value from the curvature of the peak-free regions of the signal (i.e., the regions of the signal where the weight function w(i) is close enough to 1, as discussed above).
- a curvature g′′ of each region through the largest possible five-point stencil (e.g., as defined by equation (7)), is numerically estimated in accordance with the following equation, where x is a point in the middle of the region, h is the segment length divided by five, and f:
- any other finite difference approximation of the curvature may be used, such as the following:
- The maximum absolute curvature among the segments, max|g″|, is determined.
- The smoothing factor is then defined to be proportional to 1/max|g″|.
- Because max|g″| among the segments may be arbitrarily low due to noise, a functional form of the smoothing factor may be selected that has an upper bound as follows, where ε = 1/1000 (the chosen value of 1000 being arbitrary):
- The optimization of equation (6) is substantially insensitive to the precise value of the smoothing factor. This may hold, for example, for variations of four to five orders of magnitude. Therefore, the estimation of the smoothing factor proposed here is sufficient, and is a computationally less expensive alternative to the generalized cross-validation procedure, for example, as described in Craven et al.
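The curvature-based estimate can be sketched with standard finite differences. The equations for the stencils and for the bounded smoothing factor are not reproduced in this text, so the following are assumptions: the five-point formula is the usual fourth-order central difference for a second derivative, the three-point formula is the usual second-order alternative, and the bounded form 1/(ε + max|g″|) is one plausible choice whose upper bound is 1/ε = 1000. All function names are hypothetical.

```python
def curvature_five_point(f, x, h):
    """Fourth-order central estimate of f''(x) on the stencil x-2h .. x+2h."""
    return (
        -f(x - 2 * h) + 16 * f(x - h) - 30 * f(x) + 16 * f(x + h) - f(x + 2 * h)
    ) / (12 * h * h)


def curvature_three_point(f, x, h):
    """Second-order central estimate of f''(x) on the stencil x-h .. x+h."""
    return (f(x - h) - 2 * f(x) + f(x + h)) / (h * h)


def bounded_smoothing_factor(curvatures, eps=1.0 / 1000):
    """Smoothing factor that shrinks as the largest |curvature| grows.

    Assumed form: 1 / (eps + max|g''|), which never exceeds 1 / eps.
    """
    worst = max(abs(c) for c in curvatures)
    return 1.0 / (eps + worst)
```

A nearly straight baseline (curvature close to zero) therefore gets the largest smoothing factor, while a strongly curved baseline gets a smaller one so the spline is free to bend.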
- the number of knots is determined for defining the smoothing splines S of equation (6).
- the number of knots is the number of pieces of the piecewise polynomial that will form the baseline. A large number of knots is generally desirable, since it is unknown, a priori, how fast the baseline will change from the beginning to the end of the signal. However, because the number of variables to minimize is equal to the number of knots, the computational cost rapidly increases with the number of knots, which factors in favor of fewer knots. Also, equation (6) prevents the curvature of the spline from rising independently of the number of knots, so a large number of knots is not necessarily needed.
- knots equally spaced along the domain of the signal are used, and up to 15 knots provides essentially indistinguishable results, for example.
- the knots are not equally spaced, but rather may be distributed with a density proportional to an absolute value of curvature g′′ calculated in equation (7), for example.
- The signal data is interpolated along the knots using splines, e.g., in accordance with equation (6), above. That is, function g is defined for data set (y_i, x_i) of the signal consisting of N points in each cubic spline S.
- The resulting series of cubic splines provides the smoothed, interpolated curve representing the estimated baseline of the signal. Referring again to FIG. 4, this smoothed, interpolated curve is subtracted from the signal at block 440 to remove the estimated baseline.
- FIGS. 10A-10D are graphs showing an example of determining and removing baselines, according to a representative embodiment.
- the minimization of equation (6) was performed using Particle Swarm Optimization, although other optimization methods may be used, as discussed above.
- Particle Swarm Optimization is a stochastic, population-based evolutionary computer algorithm used for optimization.
- An ensemble of particles, called “swarm,” is modeled multidimensionally, such that each particle has a position and a velocity.
- the particles move through multidimensional space sampling the function that they are optimizing.
- Each particle remembers its own best position and is aware of its neighboring particles' best positions (referred to as "social cognition").
- Swarm members communicate with each other and adjust their trajectories based on the quality of their positions.
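The particle swarm description above corresponds to the following minimal sketch. It uses a global-best topology (every particle sees the best position of the whole swarm), which is a simplification of the neighborhood scheme described; the inertia and acceleration constants are common textbook defaults rather than values from this document, and the function name is hypothetical.

```python
import random


def particle_swarm_minimize(cost, dim, bounds, n_particles=30, n_iter=200,
                            inertia=0.7, c_self=1.5, c_social=1.5, seed=0):
    """Minimize cost(position) over dim dimensions; returns (position, value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # each particle's best position
    best_val = [cost(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    swarm_pos, swarm_val = best_pos[g][:], best_val[g]   # best of the swarm
    for _ in range(n_iter):
        for k in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best position, and pull toward the swarm's best position.
                vel[k][d] = (inertia * vel[k][d]
                             + c_self * r1 * (best_pos[k][d] - pos[k][d])
                             + c_social * r2 * (swarm_pos[d] - pos[k][d]))
                pos[k][d] += vel[k][d]
            val = cost(pos[k])
            if val < best_val[k]:
                best_val[k], best_pos[k] = val, pos[k][:]
                if val < swarm_val:
                    swarm_val, swarm_pos = val, pos[k][:]
    return swarm_pos, swarm_val
```

In the baseline setting, `cost` would evaluate the fitting objective for a candidate set of spline values at the knots, so `dim` would equal the number of knots.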
- the determination of peak-free regions and the fitting procedure may be parallelized, e.g., for efficient analysis of long signals.
- Testing indicates that normalizing the first term of the right-hand side of equation (6) by the length of the signal and using absolute values of errors instead of squared errors makes the baseline removal process slightly more consistent for a noisy signal that changes in length. For example, signal lengths tested ranged from 100 to 10,000 points.
- Gaussian peaks of different heights and degree of overlap were randomly generated.
- The overlap between two consecutive peaks was defined as the area of the smaller peak falling under the curve of the larger of the two peaks.
- The maximum height of each peak and the window of allowed overlap among peaks were specified by the user.
- The absolute amount of white noise added to the signal was also user specified in the testing.
- Different baseline types were added to the indicated number of peaks before baseline estimation. Automated calculations of the cutoff for the weights of the weight function and the value of the smoothing factor are shown for each signal.
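A test signal of the kind described (Gaussian peaks on a chosen baseline with added white noise) can be generated along the following lines. The function names, the unit sample spacing, and the `(center, height, width)` peak parameterization are assumptions made for illustration; the random placement of peaks and the overlap window used in the testing are not reproduced here.

```python
import math
import random


def gaussian(x, center, height, width):
    """Gaussian peak with the given maximum height and standard deviation."""
    return height * math.exp(-0.5 * ((x - center) / width) ** 2)


def synthetic_signal(n_points, peaks, baseline, noise_sd=0.0, seed=0):
    """Peaks plus baseline plus white noise, sampled at x = 0 .. n_points-1.

    peaks: iterable of (center, height, width) tuples.
    baseline: callable giving the baseline value at x.
    noise_sd: standard deviation of the added white (Gaussian) noise.
    """
    rng = random.Random(seed)
    signal = []
    for x in range(n_points):
        value = baseline(x) + sum(gaussian(x, c, h, w) for c, h, w in peaks)
        signal.append(value + rng.gauss(0.0, noise_sd))
    return signal
```

For example, `synthetic_signal(200, [(60, 5.0, 4.0), (70, 3.0, 4.0)], lambda x: 1.0 + 0.01 * x, noise_sd=0.05)` gives two overlapping peaks on a slowly rising linear baseline.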
- FIGS. 10A-10D show that the baseline removal process does not overfit signals having highly overlapped peaks.
- FIGS. 11A and 11B show a signal curve 1110 a and corresponding baseline curve 1120 a , obtained through baseline estimation performed in accordance with an embodiment of the baseline removal process
- FIG. 11B shows a signal curve 110 b with a 100 fold increase in noise level and corresponding baseline curve 1120 b , obtained through the baseline estimation.
- the estimated baseline curve 1120 b indicates negligible distortion as compared to the estimated baseline 1120 a , regardless of the substantial increase in the noise level.
- the signal 1110 b having the excessive noise level may be de-noised in a manner preserving first and second derivatives, for example, by using a Savitzky-Golay filter, described by Savitzky et al., Smoothing and Differentiation of Data by Simplified Least Squares Procedures, Anal. Chem., 36 (1964), pp. 1627-1639, the contents of which are hereby incorporated by reference.
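For orientation, the simplest Savitzky-Golay smoother can be written in a few lines. The sketch below uses the classical 5-point quadratic kernel with coefficients (-3, 12, 17, 12, -3)/35; the window length, the function name `savitzky_golay_5pt` and the choice to leave the two samples at each end unsmoothed are assumptions for illustration, since the text does not state which filter settings were used. Library routines such as SciPy's `scipy.signal.savgol_filter` offer general window lengths and polynomial orders.

```python
def savitzky_golay_5pt(y):
    """Smooth y with the 5-point quadratic Savitzky-Golay kernel.

    Each interior sample is replaced by the value at the window center of the
    least-squares parabola fitted to the sample and its four neighbors.
    """
    coeffs = (-3.0, 12.0, 17.0, 12.0, -3.0)
    norm = 35.0
    out = list(y)  # assumption: the two samples at each end are left as they are
    for i in range(2, len(y) - 2):
        window = y[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(coeffs, window)) / norm
    return out


# A parabola passes through unchanged, so smooth curvature is not flattened.
parabola = [0.5 * k * k - 3.0 * k + 2.0 for k in range(20)]
smoothed = savitzky_golay_5pt(parabola)
```

Because the kernel reproduces any quadratic exactly, locally smooth slope and curvature survive the smoothing, which is the property the passage above relies on.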
- the baseline removal process may be applied two or more times, e.g., by consecutively subjecting the baseline-subtracted signal obtained by a previously applied baseline removal process to the baseline removal process. For example, when baselines change too fast, a single application of the baseline removal process discussed above may not capture portions of the baseline that have high slope. This may result in underfitting the signal, as shown in FIG. 12A, for example. However, as shown in FIG. 12B, by applying the baseline removal process iteratively, the missing portions of the baseline are recovered without overfitting the portions of the signal that were accurate after the first pass.
- FIG. 12A is a graph showing a signal curve 1210 a having a fast changing baseline curve 1220 a, which cannot be removed by a single pass of the baseline removal process, as indicated by the substantially flat weight factor curve 1230 a.
- FIG. 12B is a graph showing the signal curve 1210 b and corresponding estimated baseline curve 1220 b after subjecting the signal curve 1210 a remaining after baseline subtraction to a second baseline removal process. As shown, the resulting weight factor curve 1230 b indicates the peak-free regions for establishing the baseline curve 1220 b.
- When the first pass already yields an accurate baseline, applying the baseline removal process a second time has no deleterious effects, as shown in FIGS. 12C and 12D, for example. That is, FIG. 12C is a graph showing a signal curve 1210 c having an accurately determined baseline curve 1220 c.
- FIG. 12D is a graph showing the signal curve 1210 d and corresponding estimated baseline 1220 d after subjecting the signal 1210 c remaining after baseline subtraction to a second baseline removal process, resulting in no overfitting.
- applying the baseline removal process twice incurs double the computational cost, but effectively treats fast changing baselines without risk of negative effects.
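The repeated application described above amounts to a short loop around whatever baseline estimator is in use. In the sketch below, `remove_baseline_iteratively` and its `estimate_baseline` callable are hypothetical names, and `endpoint_line` is only a placeholder estimator (a straight line through the first and last samples) so that the example runs; it is not the estimator of the embodiments.

```python
def remove_baseline_iteratively(signal, estimate_baseline, passes=2):
    """Apply a baseline estimator repeatedly to the baseline-subtracted signal.

    Returns the final residual and the accumulated baseline over all passes.
    """
    residual = list(signal)
    total_baseline = [0.0] * len(residual)
    for _ in range(passes):
        baseline = estimate_baseline(residual)
        residual = [s - b for s, b in zip(residual, baseline)]
        total_baseline = [t + b for t, b in zip(total_baseline, baseline)]
    return residual, total_baseline


def endpoint_line(values):
    """Placeholder estimator: straight line through the first and last samples."""
    n = len(values)
    step = (values[-1] - values[0]) / (n - 1)
    return [values[0] + step * i for i in range(n)]


# A sloped line with one peak at index 5. The first pass removes the line;
# the second pass finds nothing left to remove and leaves the peak untouched.
line = [1.0 + 0.5 * i for i in range(11)]
signal = list(line)
signal[5] += 4.0
residual, baseline = remove_baseline_iteratively(signal, endpoint_line, passes=2)
```

The cost grows linearly with `passes`, which matches the doubling of computational cost noted for two applications.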
- FIGS. 13A-13D show baseline estimates of the baseline removal process, according to representative embodiments.
- the baseline removal process was applied to curves 1310 a - 1310 d of experimental signals obtained from gas chromatography of kerosene samples, according to representative embodiments.
- the chromatograms of FIGS. 13A-13D are separations along the second (hydrophobic) dimension of a two-dimensional chromatography experiment.
- FIG. 13A includes insert 1345 , which includes a magnification of the lower portion of the signal curve 1310 a and baseline curve 1340 a.
- the baseline estimate curves 1340 a - 1340 d of FIGS. 13A-13D are based on different fractions 623, 953, 866 and 790, respectively, corresponding to the second dimension in the two-dimensional gas chromatography separation of the kerosene.
- the kerosene sample undergoes a two-dimensional gas chromatography separation, and the elutions corresponding to time points of the separation in the first dimension are called “fractions.”
- each fraction is further separated through a second separation, or dimension.
- the horizontal axis of each of FIGS. 13A-13D corresponds to the time course of the second separation.
- the fractions analyzed were selected by visual inspection to have the highest degree of peak overlap among more than 1500 fractions available.
- the resulting baseline estimate curves 1340 a - 1340 d demonstrate that no overfitting takes place in the experimental signals.
- FIG. 14 is a functional block diagram illustrating a system 1400 , according to a representative embodiment.
- the system 1400 may be any system for receiving and processing one-dimensional or two-dimensional signals, such as signals produced in accordance with chromatography, mass spectrometry, spectroscopy, electrophoresis, imaging, electronic measurements and the like.
- the system 1400 may represent an LC/MS system, which generally combines separation functionality of liquid chromatography with mass analysis ability of mass spectrometry.
- the system 1400 includes a sample separator 1410 and a baseline removal system 1430 .
- the sample separator 1410 receives samples, which may include various mixtures of molecules (e.g., peptides, proteins, or the like) to be identified. As stated above, the samples may be separated by the sample separator 1410 via various types of separation processing to reduce the complexity of the mixture and to isolate, to the extent possible, individual compounds contained within the sample. Isolation may occur spatially or temporally, for example.
- the sample separator 1410 may perform separation processing in accordance with any appropriate separation technique, including two-dimensional gel electrophoresis and LC/MS, and may be implemented in part using a microfluidic device, for example.
- the sample separator 1410 provides (one-dimensional or two-dimensional) signals to the baseline removal system 1430 , in various embodiments.
- the baseline removal system 1430 performs processing on the received signals, e.g., from sample separations, including the baseline removal process, in accordance with various embodiments discussed below.
- the baseline removal system 1430 may be a computer processor, for example, and includes central processing unit (CPU) 1431 , internal memory 1432 , bus 1439 and interfaces 1435 - 1438 , and is configured to interface with the sample separator 1410 through a respective interface 1412 , which may be a universal serial bus (USB) interface, an IEEE 1394 interface, or a parallel port interface, for example.
- the internal memory 1432 includes at least nonvolatile read only memory (ROM) 1433 and volatile random access memory (RAM) 1434 , although the internal memory 1432 may be implemented as any number, type and combination of ROM and RAM, and may provide look-up tables and/or other relational functionality.
- the internal memory 1432 may include a disk drive or flash memory, for example.
- the internal memory 1432 may store program instructions and results of calculations or summaries performed by CPU 1431 , discussed below.
- the CPU 1431 is configured to execute one or more software algorithms to perform the baseline removal process of the embodiments described herein, in conjunction with the internal memory 1432 .
- the CPU 1431 may also execute software algorithms to control the basic functionality of the system 1400 .
- the CPU 1431 may include its own memory (e.g., nonvolatile memory) for storing executable software code that allows it to perform the various functions.
- the executable code may be stored in designated memory locations within internal memory 1432 .
- the executable code may be written in C, C++ or another suitable programming language.
- the CPU 1431 executes an operating system, such as a Windows® operating system available from Microsoft Corporation, a Linux operating system, a Unix operating system (e.g., Solaris™ available from Sun Microsystems, Inc.), or a NetWare® operating system available from Novell, Inc.
- the operating system may control execution of other programs, including programs that cause sample separator 1410 to perform such operations as collection and separation of samples and output of corresponding signals.
- a user and/or other computers may interact with the baseline removal system 1430 using input device(s) 1445 through I/O interface 1435 .
- the input device(s) 1445 may include any type of input device, for example, a keyboard, a track ball, a mouse, a touch pad or touch-sensitive display, and the like.
- information may be displayed by the baseline removal system 1430 on display 1446 through display interface 1436 , which may include any type of graphical user interface (GUI), for example.
- the displayed information includes the processing results obtained by the CPU 1431 executing the method of peak detection, described herein.
- the processing results of the CPU 1431 may also be stored in the database 1448 through memory interface 1438 .
- the database 1448 may include any type and combination of volatile and/or nonvolatile storage medium and corresponding interface, including hard disk, compact disc (e.g., CD-R/CD-RW), USB, flash memory, or the like.
- the stored processing results may be viewed, e.g., on the display 1446 , and/or further processed at a later time. Also, the processing results may be provided to other computer systems connected to network 1447 through network interface 1437 .
- the network 1447 may be any network capable of transporting electronic data, such as the Internet, a local area network (LAN), a wireless LAN, and the like.
- the network interface 1437 may include, for example, a transceiver (not shown), including a receiver and a transmitter, that provides functionality for the system 1400 to communicate wirelessly over the data network through an antenna system (not shown), according to appropriate standard protocols.
- the network interface 1437 may include any type of interface (wired or wireless) with the communications network, including various types of digital modems, for example.
- the baseline determination process is intended to determine and remove the baseline of a signal generated using a number of possible signal sources, including chromatography, mass spectrometry, spectroscopy, electrophoresis, imaging, electronic measurements and the like.
- the baseline determination process may determine and remove background signals, such as background illumination on an image, e.g., when the method is adapted to two-dimensional signals.
- the various “parts” shown in FIG. 14 of the baseline removal system 1430 may be physically implemented using a software-controlled microprocessor, hard-wired logic circuits, or a combination thereof. Also, while the parts are functionally segregated for explanation purposes, they may be combined variously in any physical implementation.
- baseline determination of the baseline removal process produces baselines that do not overfit the corresponding signals, even in the presence of congested or overcrowded peaks.
- the baseline determination is automated and parameter-less, although a user may optionally control two underlying parameters, i.e., the smoothness factor and the weight cutoff, which may be useful when information about the nature of the sample/measurement provides meaningful constraints on the shape of the baseline.
- the main computational burden of the baseline determination (the calculation of peak free regions and a numerical optimization) is efficiently parallelized, making it well suited for long signals (e.g., a large number of points).
- the baseline determination may be easily generalized and applied to two-dimensional signals, broadening its scope to baseline removal in multidimensional separations, for example, in LC/MS systems, and removal of background from images.
Description
$$y_j = \beta_1 + \beta_2\, j + \varepsilon_j \qquad (1)$$
The value of the maximum window size m should be large enough to be statistically significant, but small enough to capture only local variations of the variance, the determination of which would be apparent. For example, m may be set in the range of 6-10, in a representative embodiment, to analyze both simulated and real signals. For peaks having fewer than six data points, for example, the window size may be reduced.
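The local linear model of equation (1) can be fitted in each window by ordinary least squares, as in the sketch below. Several points are assumptions rather than details from the text: the window is centered on each point and shifted inward at the signal ends, the index j runs from 0 within each window, and the names `fit_line` and `sliding_fits` are hypothetical. The sketch stops at the per-window fit and does not attempt to reproduce how the fitted quantities are combined into the peak estimator.

```python
def fit_line(window):
    """Least-squares fit of y_j = b1 + b2*j + e_j for j = 0..n-1.

    Returns (b1, b2, residual variance). Requires at least two samples.
    """
    n = len(window)
    j_mean = (n - 1) / 2.0
    y_mean = sum(window) / n
    sxx = sum((j - j_mean) ** 2 for j in range(n))
    sxy = sum((j - j_mean) * (y - y_mean) for j, y in enumerate(window))
    b2 = sxy / sxx
    b1 = y_mean - b2 * j_mean
    rss = sum((y - (b1 + b2 * j)) ** 2 for j, y in enumerate(window))
    resid_var = rss / (n - 2) if n > 2 else 0.0
    return b1, b2, resid_var


def sliding_fits(signal, m=8):
    """Fit equation (1) in a window of up to m samples around every point."""
    half = m // 2
    fits = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), lo + m)
        lo = max(0, hi - m)  # shift the window inward at the right-hand end
        fits.append(fit_line(signal[lo:hi]))
    return fits


# On a noise-free straight line every window recovers slope 3 with zero residual.
ramp = [2.0 + 3.0 * j for j in range(20)]
fits = sliding_fits(ramp, m=8)
```

The default `m=8` sits inside the 6-10 range mentioned above; repeating the fit for smaller window sizes only requires calling `sliding_fits` with a different `m`.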
$$X^{(i)}_{\text{peaks}} = \left|\beta^{(i)}_1\right| + \beta^{(i)}_2 \qquad (2)$$
Although both peak-free regions and peak-containing regions will have a variance that tends to shrink to zero with decreasing window size, the variance will rarely be zero in peak-containing regions. Also, while equation (2) shows a representative embodiment in which the contributions β(i)1 and β(i)2 are simply added, any linear combination or monotonic function of the two may provide similar results. Thus, block 530 yields a set of peak region values Xpeaks, which are estimator values corresponding to each point i of the signal.
$$\operatorname{var}(s_z) > t\,\operatorname{var}(s_{z+1}) \qquad (3)$$
Again, baseline estimation is robust against changes in this function, and variants (e.g., involving the use of a hyperbolic tangent in conjunction with a cutoff) could be used. The values of the baseline weight function w(i) corresponding to the example are indicated by
Other functions that are proportional to the inverse of the curvature may likewise be used.
Claims (16)
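$$y_j = \beta_1 + \beta_2\, j + \varepsilon_j$$
$$y_j = \beta_1 + \beta_2\, j + \varepsilon_j$$
$$X^{(i)}_{\text{peaks}} = \left|\beta^{(i)}_1\right| + \beta^{(i)}_2$$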
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/466,031 US8645090B2 (en) | 2009-05-14 | 2009-05-14 | Automated baseline removal of signal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/466,031 US8645090B2 (en) | 2009-05-14 | 2009-05-14 | Automated baseline removal of signal |
Publications (2)
Publication Number | Publication Date |
---|---|
US20100292957A1 (en) | 2010-11-18 |
US8645090B2 (en) | 2014-02-04 |
Family
ID=43069228
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
US12/466,031 US8645090B2 (en) | Active, expires 2031-08-01 | 2009-05-14 | 2009-05-14 | Automated baseline removal of signal |
Country Status (1)
Country | Link |
---|---|
US (1) | US8645090B2 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102009038112B4 (en) * | 2009-08-19 | 2012-08-09 | Siemens Aktiengesellschaft | Method for improving the chromatographic detection limit for an analyte |
US8716025B2 (en) * | 2011-07-08 | 2014-05-06 | Agilent Technologies, Inc. | Drifting two-dimensional separation with adaption of second dimension gradient to actual first dimension condition |
GB201509244D0 (en) * | 2015-05-29 | 2015-07-15 | Micromass Ltd | A method of mass analysis using ion filtering |
CN108603867B (en) * | 2015-12-03 | 2020-11-06 | 株式会社岛津制作所 | Peak detection method and data processing device |
ES2820228T3 (en) * | 2018-03-29 | 2021-04-20 | Leica Microsystems | Apparatus and method, particularly for microscopes and endoscopes, using baseline estimation and semi-quadratic minimization to eliminate image blurring |
JP7070014B2 (en) * | 2018-04-18 | 2022-05-18 | 東ソー株式会社 | Peak signal processing method in chromatogram |
GB201817028D0 (en) * | 2018-10-19 | 2018-12-05 | Renishaw Plc | Spectroscopic apparatus and methods |
CN113049725A (en) * | 2019-12-26 | 2021-06-29 | 深圳迈瑞生物医疗电子股份有限公司 | Method for calculating content of detected object, detection system and sample analyzer |
CN120253038B (en) * | 2025-06-03 | 2025-08-15 | 深圳市汇众智慧科技有限公司 | Quality evaluation method and system for monitoring signals |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6253162B1 (en) * | 1999-04-07 | 2001-06-26 | Battelle Memorial Institute | Method of identifying features in indexed data |
US7219038B2 (en) * | 2005-03-22 | 2007-05-15 | College Of William And Mary | Automatic peak identification method |
US20070110202A1 (en) * | 2005-11-03 | 2007-05-17 | Casler David C | Using statistics to locate signals in noise |
US7337066B2 (en) | 2005-12-08 | 2008-02-26 | Chemimage, Corporation | System and method for automated baseline correction for Raman spectra |
Non-Patent Citations (8)
Also Published As
Publication number | Publication date |
---|---|
US20100292957A1 (en) | 2010-11-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: AGILENT TECHNOLOGIES, INC., COLORADO. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: SATULOVSKY, JAVIER E.; REEL/FRAME: 022690/0511. Effective date: 20090513 |
 | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
 | CC | Certificate of correction | |
 | FPAY | Fee payment | Year of fee payment: 4 |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 8 |
 | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1553); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 12 |