MSBACKADJ provides background correction for a signal with peaks YOUT = MSBACKADJ(X,Y) adjusts the variable background (baseline) of a signal with peaks by following three steps: 1) estimates the background within multiple shifted windows of width 200 separation units along the x-axis, 2) regresses the varying baseline to the window points using a spline approximation, and 3) adjusts the background of the input signal Y. X and Y are column vectors where paired values represent points in the signal. Y can be a matrix with several signals, all sharing the same X scale. Units in the X scale (separation units or s.u.) may quantify wavelength, frequency, distance, time or m/z depending on the type of instrument that generates the signal. MSBACKADJ(...,'WINDOWSIZE',WS) sets the width for the shifting window. The default is 200 s.u., which means a background point is estimated for windows of 200 s.u. in width. WS may also be a function handle. The referenced function is evaluated at each X value to compute a variable width for the windows. This option is useful for cases where the resolution of the signal is dissimilar at different regions. MSBACKADJ(...,'STEPSIZE',SS) sets the steps for the shifting window. The default is 200 s.u., which means a background point is estimated for windows at every 200 s.u. SS may also be a function handle. The referenced function is evaluated at each X value to compute the distance between adjacent windows. MSBACKADJ(...,'REGRESSIONMETHOD',RM) sets the method used to regress the window estimated points to a soft curve. The default is 'pchip'; i.e., shape-preserving piecewise cubic interpolation. Other options are 'linear' and 'spline' interpolation. MSBACKADJ(...,'ESTIMATIONMETHOD',EM) sets the method used to find the likely background value at every window. Default is 'quantile', in which the quantile value is set to 10%. An alternative method is 'em', which assumes a doubly stochastic model; i.e., every sample is the i.i.d. draw of any of two normal distributed classes (background or peaks). Because the class label is hidden, the distributions are estimated with an expectation-maximization algorithm. The ultimate background value is the mean of the background class. MSBACKADJ(...,'SMOOTHMETHOD',SM) sets the method used to smooth the curve of estimated points, useful to eliminate the effect of possible outliers. Options are 'none' (default), 'lowess' (linear fit), 'loess' (quadratic fit), or 'rlowess' and 'rloess' (robust linear and quadratic fit). MSBACKADJ(...,'QUANTILEVALUE',QV) changes the default quantile value. The default is 0.10. MSBACKADJ(...,'PRESERVEHEIGHTS',true) sets the baseline subtraction mode to preserve the height of the tallest peak in the signal when subtracting the baseline. By default heights are not preserved. MSBACKADJ(...,'SHOWPLOT',SP) plots the background estimated points, the regressed baseline, and the original signal. When SP is TRUE, the first signal in Y is used. If MSBACKADJ is called without output arguments, a plot will be shown unless SP is FALSE. SP can also contain an index to one of the signals in Y. Examples: % Correct the baseline of SELDI-TOF mass-spectrograms: load sample_lo_res % Adjust the baseline of a group of spectrograms. YB = msbackadj(MZ_lo_res,Y_lo_res); % Plot the third spectrogram in Y_lo_res and its estimated baseline. msbackadj(MZ_lo_res,Y_lo_res,'SHOWPLOT',3); % Plot the estimated baseline for the fourth spectrogram in Y_lo_res % using an anonymous function to describe a MZ dependent parameter. wf = @(mz) 200 + .001 .* mz; msbackadj(MZ_lo_res,Y_lo_res(:,4),'STEPSIZE',wf); See also MSALIGN, MSHEATMAP, MSLOWESS, MSNORM, MSPREPRODEMO, MSRESAMPLE, MSSGOLAY, MSVIEWER.
0001 function Y = remove_background(X,Y,varargin) 0002 %MSBACKADJ provides background correction for a signal with peaks 0003 % 0004 % YOUT = MSBACKADJ(X,Y) adjusts the variable background (baseline) of a 0005 % signal with peaks by following three steps: 1) estimates the background 0006 % within multiple shifted windows of width 200 separation units along the 0007 % x-axis, 2) regresses the varying baseline to the window points using a 0008 % spline approximation, and 3) adjusts the background of the input signal Y. 0009 % 0010 % X and Y are column vectors where paired values represent points in the 0011 % signal. Y can be a matrix with several signals, all sharing the same X 0012 % scale. Units in the X scale (separation units or s.u.) may quantify 0013 % wavelength, frequency, distance, time or m/z depending on the type of 0014 % instrument that generates the signal. 0015 % 0016 % MSBACKADJ(...,'WINDOWSIZE',WS) sets the width for the shifting window. 0017 % The default is 200 s.u., which means a background point is estimated 0018 % for windows of 200 s.u. in width. WS may also be a function handle. 0019 % The referenced function is evaluated at each X value to compute 0020 % a variable width for the windows. This option is useful for cases where 0021 % the resolution of the signal is dissimilar at different regions. 0022 % 0023 % MSBACKADJ(...,'STEPSIZE',SS) sets the steps for the shifting window. 0024 % The default is 200 s.u., which means a background point is estimated 0025 % for windows at every 200 s.u. SS may also be a function handle. The 0026 % referenced function is evaluated at each X value to compute the 0027 % distance between adjacent windows. 0028 % 0029 % MSBACKADJ(...,'REGRESSIONMETHOD',RM) sets the method used to regress 0030 % the window estimated points to a soft curve. The default is 'pchip'; 0031 % i.e., shape-preserving piecewise cubic interpolation. Other options are 0032 % 'linear' and 'spline' interpolation. 0033 % 0034 % MSBACKADJ(...,'ESTIMATIONMETHOD',EM) sets the method used to find the 0035 % likely background value at every window. Default is 'quantile', in 0036 % which the quantile value is set to 10%. An alternative method is 'em', 0037 % which assumes a doubly stochastic model; i.e., every sample is the 0038 % i.i.d. draw of any of two normal distributed classes (background or 0039 % peaks). Because the class label is hidden, the distributions are 0040 % estimated with an expectation-maximization algorithm. The ultimate 0041 % background value is the mean of the background class. 0042 % 0043 % MSBACKADJ(...,'SMOOTHMETHOD',SM) sets the method used to smooth the 0044 % curve of estimated points, useful to eliminate the effect of possible 0045 % outliers. Options are 'none' (default), 'lowess' (linear fit), 'loess' 0046 % (quadratic fit), or 'rlowess' and 'rloess' (robust linear and quadratic 0047 % fit). 0048 % 0049 % MSBACKADJ(...,'QUANTILEVALUE',QV) changes the default quantile value. 0050 % The default is 0.10. 0051 % 0052 % MSBACKADJ(...,'PRESERVEHEIGHTS',true) sets the baseline subtraction 0053 % mode to preserve the height of the tallest peak in the signal when 0054 % subtracting the baseline. By default heights are not preserved. 0055 % 0056 % MSBACKADJ(...,'SHOWPLOT',SP) plots the background estimated points, the 0057 % regressed baseline, and the original signal. When SP is TRUE, the first 0058 % signal in Y is used. If MSBACKADJ is called without output arguments, a 0059 % plot will be shown unless SP is FALSE. SP can also contain an index to 0060 % one of the signals in Y. 0061 % 0062 % Examples: 0063 % 0064 % % Correct the baseline of SELDI-TOF mass-spectrograms: 0065 % load sample_lo_res 0066 % 0067 % % Adjust the baseline of a group of spectrograms. 0068 % YB = msbackadj(MZ_lo_res,Y_lo_res); 0069 % 0070 % % Plot the third spectrogram in Y_lo_res and its estimated baseline. 0071 % msbackadj(MZ_lo_res,Y_lo_res,'SHOWPLOT',3); 0072 % 0073 % % Plot the estimated baseline for the fourth spectrogram in Y_lo_res 0074 % % using an anonymous function to describe a MZ dependent parameter. 0075 % wf = @(mz) 200 + .001 .* mz; 0076 % msbackadj(MZ_lo_res,Y_lo_res(:,4),'STEPSIZE',wf); 0077 % 0078 % 0079 % See also MSALIGN, MSHEATMAP, MSLOWESS, MSNORM, MSPREPRODEMO, 0080 % MSRESAMPLE, MSSGOLAY, MSVIEWER. 0081 0082 % Copyright 2003-2008 The MathWorks, Inc. 0083 % $Revision: 1.2 $ $Date: 2014/05/21 21:29:51 $ 0084 0085 % References: 0086 % [1] Lucio Andrade and Elias Manolakos, "Signal Background Estimation and 0087 % Baseline Correction Algorithms for Accurate DNA Sequencing" Journal 0088 % of VLSI, special issue on Bioinformatics 35:3 pp 229-243 (2003) 0089 0090 % check inputs 0091 %bioinfochecknargin(nargin,2,mfilename); 0092 % set defaults 0093 stepSize = 200; 0094 windowSize = 200; 0095 regressionMethod = 'pchip'; 0096 estimationMethod = 'quantile'; 0097 smoothMethod = 'none'; 0098 quantileValue = 0.1; 0099 preserveHeights = false; 0100 maxNumWindows = 10000; 0101 if nargout == 1 0102 plotId = 0; 0103 else 0104 plotId = 1; 0105 end 0106 0107 if nargin > 2 0108 if rem(nargin,2) == 1 0109 error('Bioinfo:msbackadj:IncorrectNumberOfArguments',... 0110 'Incorrect number of arguments to %s.',mfilename); 0111 end 0112 okargs = {'stepsize','windowsize','regressionmethod',... 0113 'estimationmethod','quantilevalue','preserveheights',... 0114 'smoothmethod','showplot'}; 0115 for j=1:2:nargin-2 0116 pname = varargin{j}; 0117 pval = varargin{j+1}; 0118 k = find(strncmpi(pname, okargs,length(pname))); 0119 if isempty(k) 0120 error('Bioinfo:msbackadj:UnknownParameterName',... 0121 'Unknown parameter name: %s.',pname); 0122 elseif length(k)>1 0123 error('Bioinfo:msbackadj:AmbiguousParameterName',... 0124 'Ambiguous parameter name: %s.',pname); 0125 else 0126 switch(k) 0127 case 1 % step size 0128 stepSize = pval; 0129 if ~isscalar(stepSize) && ~isa(stepSize,'function_handle') 0130 error('Bioinfo:msbackadj:NotValidStepSize',... 0131 'STEP must be a scalar or a function handle.') 0132 end 0133 case 2 % window size 0134 windowSize = pval; 0135 if ~isscalar(windowSize) && class(windowSize,'function_handler') 0136 error('Bioinfo:msbackadj:NotValidWindowSize',... 0137 'WIDTH must be a scalar or a function handle.') 0138 end 0139 case 3 % regression method 0140 regressionMethods = {'cubic','pchip','spline','linear'}; 0141 regressionMethod = strmatch(lower(pval),regressionMethods); 0142 if isempty(regressionMethod) 0143 error('Bioinfo:msbackadj:NotValidRegressionMethod',... 0144 'Not a valid regression method.') 0145 end 0146 regressionMethod = regressionMethods{max(2,regressionMethod)}; 0147 case 4 % estimation method 0148 estimationMethods = {'quantile','em'}; 0149 estimationMethod = strmatch(lower(pval),estimationMethods); 0150 if isempty(estimationMethods) 0151 error('Bioinfo:msbackadj:NotValidEstimationMethod',... 0152 'Not a valid estimation method.') 0153 end 0154 estimationMethod = estimationMethods{estimationMethod}; 0155 case 5 % quantile value 0156 quantileValue = pval; 0157 case 6 % preserve heights 0158 preserveHeights = opttf(pval); 0159 case 7 % smoothing method 0160 smoothMethods = {'none','lowess','loess','rlowess','rloess'}; 0161 smoothMethod = strmatch(lower(pval),smoothMethods); 0162 if isempty(smoothMethod) 0163 error('Bioinfo:msbackadj:NotValidSmoothMethod',... 0164 'Not a valid smoothing method.') 0165 elseif length(smoothMethod)>1 0166 error('Bioinfo:msbackadj:AmbiguousSmoothMethod',... 0167 'Ambiguous smoothing method: %s.',pname); 0168 end 0169 smoothMethod = smoothMethods{smoothMethod}; 0170 case 8 % show 0171 if opttf(pval) 0172 if isnumeric(pval) 0173 if isscalar(pval) 0174 plotId = double(pval); 0175 else 0176 plotId = 1; 0177 warning('Bioinfo:msbackadj:SPNoScalar',... 0178 'SHOWPLOT must be a single index to one of the signals in Y or a logical. Plotting the first column in Y only.') 0179 end 0180 else 0181 plotId = 1; 0182 end 0183 else 0184 plotId = 0; 0185 end 0186 end 0187 end 0188 end 0189 end 0190 0191 % validate X and Y 0192 0193 if ~isnumeric(Y) || ~isreal(Y) 0194 error('Bioinfo:msbackadj:IntensityNotNumericAndReal',... 0195 'Intensities must be numeric and real.') 0196 end 0197 0198 if ~isnumeric(X) || ~isreal(X) 0199 error('Bioinfo:msbackadj:XNotNumericAndReal',... 0200 'X scale must be numeric and real.') 0201 end 0202 0203 if size(X,1) ~= size(Y,1) 0204 error('Bioinfo:msbackadj:NotEqualNumberOfSamples',... 0205 'The intensity vector(s) must have the same number of samples as the X scale.') 0206 end 0207 0208 numSignals = size(Y,2); 0209 0210 if (plotId~=0) && ~any((1:numSignals)==plotId) 0211 warning('Bioinfo:msbackadj:InvalidPlotIndex',... 0212 'SHOWPLOT value is not a valid signal index, no plot was generated.') 0213 end 0214 0215 multiple_X = false; 0216 if size(X,2)>1 0217 multiple_X = true; 0218 if size(X,2) ~= numSignals 0219 error('Bioinfo:msbackadj:NotEqualNumberOfXScales',... 0220 'A X scale must be supplied for every intensity vector when size(X,2)>1.') 0221 end 0222 end 0223 0224 % change scalars to function handlers 0225 if isnumeric(stepSize) 0226 stepSize = @(x) repmat(stepSize,size(x)); 0227 end 0228 if isnumeric(windowSize) 0229 windowSize = @(x) repmat(windowSize,size(x)); 0230 end 0231 0232 % allocate space for Xp and WE 0233 Xp = zeros(maxNumWindows,1); 0234 WE = nan(maxNumWindows,1); 0235 0236 % calculates the location of the windows (when it is the same for all the 0237 % signals) 0238 if ~multiple_X 0239 Xpid = max(0,X(1)); 0240 Xend = X(end); 0241 id = 1; 0242 while Xpid <= Xend 0243 Xp(id) = Xpid; 0244 Xpid = Xpid + stepSize(Xpid); 0245 id = id + 1; 0246 if id > maxNumWindows 0247 error('Bioinfo:msbackadj:maxNumWindowsExceeded',... 0248 'Maximum number of windows exceeded, verify the STEP value or function handle.') 0249 end 0250 end 0251 numWindows = id-1; 0252 end 0253 0254 % iterate for every signal 0255 for ns = 1:numSignals 0256 if nargout>0 || (ns == plotId) 0257 % find the location of the windows (when it is different for every 0258 % signal, otherwise this was done out of the loop) 0259 if multiple_X 0260 Xpid = max(0,X(1,ns)); 0261 Xend = X(end,ns); 0262 id = 1; Xp = zeros(1,maxNumWindows); 0263 while Xpid <= Xend 0264 Xp(id) = Xpid; 0265 Xpid = Xpid + stepSize(Xpid); 0266 id = id + 1; 0267 if id > maxNumWindows 0268 error('Bioinfo:msbackadj:maxNumWindowsExceeded',... 0269 'Maximum number of windows exceeded, verify the STEP value or function handle.') 0270 end 0271 end 0272 Xp(id:end)=[]; 0273 numWindows = id-1; 0274 nnss = ns; 0275 else 0276 nnss = 1; 0277 end 0278 Xpt = Xp(1:numWindows); 0279 Xw = windowSize(Xpt); 0280 0281 % find the estimated baseline for every window 0282 for nw = 1:numWindows 0283 subw = Y(X(:,nnss)>=Xpt(nw) & X(:,nnss)<= (Xpt(nw)+Xw(nw)),ns); 0284 switch estimationMethod 0285 case 'quantile' 0286 WE(nw) = quantile(subw,quantileValue); 0287 case 'em' 0288 WE(nw) = em2c1d(subw); 0289 end 0290 end % for nw = 1:numWindows 0291 0292 % smooth the estimated points 0293 if ~isequal('none',smoothMethod) 0294 WE(1:numWindows) = ... 0295 masmooth(Xpt+Xw/2,WE(1:numWindows),10,smoothMethod,2); 0296 end 0297 0298 % regress the estimated points 0299 b = interp1(Xpt+Xw/2,WE(1:numWindows),X(:,nnss),regressionMethod); 0300 0301 if (ns == plotId) 0302 figure 0303 plot(X(:,nnss),Y(:,ns)) 0304 hold on 0305 plot(X(:,nnss),b,'r','linewidth',2) 0306 plot(Xpt+Xw/2,WE(1:numWindows),'kx') 0307 title(sprintf('Signal ID: %d',ns)); 0308 xlabel('Separation Units') 0309 ylabel('Relative Intensity') 0310 legend('Original Signal','Regressed baseline','Estimated baseline points') 0311 axis([min(X(:,nnss)) max(X(:,nnss)) min(Y(:,ns)) max(Y(:,ns))]) 0312 grid on 0313 hold off 0314 setAllowAxesRotate(rotate3d(gcf),gca,false) 0315 end 0316 0317 % apply the correction 0318 if preserveHeights 0319 K = 1 - b/max(Y(:,ns)); 0320 Y(:,ns) = (Y(:,ns) - b) ./ K; 0321 %[YMax,locMax] = max(Y(:,ns)); 0322 %K = 1 - b(locMax)/YMax; 0323 %Y(:,ns) = (Y(:,ns) - b) / K; 0324 else 0325 Y(:,ns) = (Y(:,ns) - b); 0326 end 0327 0328 end % if nargout>0 || (ns == plotId) 0329 end % for ns = 1:numSignals 0330 0331 if nargout == 0 0332 clear Y 0333 end 0334 0335 0336 0337 0338