Organism is opposed to chaos… as message is to noise. – (Wiener, 1954)
1 Introduction
The aim of spectral pre-treatment is to improve signal quality before modelling as well as remove physical information from the spectra. Applying a pre-treatment can increase the repeatability/reproducibility of the method, model robustness and accuracy, although there are no guarantees this will actually work. The pre-processing functions currently available in the package are listed in Table 1.
[34mprospectr version 0.2.9 -- proxy[39m
[34mcheck the package repository at: https://l-ramirez-lopez.github.io/prospectr/,
https://github.com/l-ramirez-lopez/prospectr[39m
Table 1: Pre-processing functions available in the package.
Function
Description
movav
Simple moving (or running) average filter
savitzkyGolay
Savitzky–Golay smoothing and derivative
gapDer
Gap–segment derivative
baseline
Baseline removal (similar to continuumRemoval)
continuumRemoval
Computes continuum–removed values
detrend
Detrend normalisation
standardNormalVariate
Standard Normal Variate (SNV) transformation
msc
Multiplicative Scatter Correction
binning
Average a signal in column bins
resample
Resample a signal to new band positions
resample2
Resample a signal using new FWHM values
blockScale
Block scaling
blockNorm
Sum of squares block weighting
We show below how they can be used, using the Near-Infrared (NIR) dataset (NIRsoil) included in the package (Fernandez-Pierna and Dardenne, 2008). Observations should be arranged row-wise.
library(prospectr)data(NIRsoil)# NIRsoil is a data.frame with 825 observations and 5 variables:# Nt (Total Nitrogen), Ciso (Carbon), CEC (Cation Exchange Capacity),# train (0/1 indicator for training (1) and validation (0) samples),# spc (spectral matrix)str(NIRsoil)
'data.frame': 825 obs. of 5 variables:
$ Nt : num 0.3 0.69 0.71 0.85 NA ...
$ Ciso : num 0.22 NA NA NA 0.9 NA NA 0.6 NA 1.28 ...
$ CEC : num NA NA NA NA NA NA NA NA NA NA ...
$ train: num 1 1 1 1 1 1 1 1 1 1 ...
$ spc : num [1:825, 1:700] 0.339 0.308 0.328 0.364 0.237 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:825] "1" "2" "3" "4" ...
.. ..$ : chr [1:700] "1100" "1102" "1104" "1106" ...
2 Noise removal
Noise represents random fluctuations around the signal that can originate from the instrument or environmental laboratory conditions. The simplest solution is to perform \(n\) repeated measurements and average the individual spectra; noise decreases by a factor of \(\sqrt{n}\). When this is not possible, or if residual noise remains, it can be removed mathematically.
2.1 Moving average
A moving average filter is a column-wise operation that averages contiguous wavelengths within a given window size, as illustrated in Figure 1.
# Add some noisenoisy <- NIRsoil$spc +rnorm(length(NIRsoil$spc), 0, 0.001)# Plot the first spectrumplot(x =as.numeric(colnames(NIRsoil$spc)),y = noisy[1, ],type ="l",lwd =1.5,xlab ="Wavelength",ylab ="Absorbance")# Window size of 11 bands; note the 5 first and last bands are lostX <-movav(X = noisy, w =11)lines(x =as.numeric(colnames(X)), y = X[1, ], lwd =1.5, col ="red")grid()legend("topleft",legend =c("Raw", "Moving average"),lty =1,col =c("black", "red"))
Figure 1: Effect of a moving average with window size of 11 bands on a raw spectrum.
2.2 Savitzky-Golay filtering
Savitzky-Golay filtering (Savitzky and Golay, 1964) is a widely used pre-processing technique. It fits a local polynomial regression on the signal and requires equidistant bandwidth. Mathematically, it operates as a weighted sum of neighbouring values:
where \(x_j^*\) is the smoothed value, \(N\) is a normalising coefficient, \(k\) is the number of neighbouring values on each side of \(j\), and \(c_h\) are pre-computed coefficients that depend on the chosen polynomial order, window size, and derivative degree (0 = smoothing, 1 = first derivative, 2 = second derivative).
# p = polynomial order# w = window size (must be odd)# m = m-th derivative (0 = smoothing)# Accepts vectors, data frames, or matrices (observations row-wise)sgvec <-savitzkyGolay(X = NIRsoil$spc[1, ], p =3, w =11, m =0)sg <-savitzkyGolay(X = NIRsoil$spc, p =3, w =11, m =0)# Bands at the edges of the spectral matrix are lostdim(NIRsoil$spc)
[1] 825 700
dim(sg)
[1] 825 690
3 Derivatives
Taking numerical derivatives of the spectra can remove both additive and multiplicative effects and has further consequences summarised in Table 2.
Table 2: Advantages and drawbacks of derivative spectra.
Advantage
Drawback
Reduces baseline offset
Risk of overfitting the calibration model
Can resolve overlapping absorptions
Increases noise; smoothing required
Compensates for instrumental drift
Increases uncertainty in model coefficients
Enhances small spectral absorptions
Complicates spectral interpretation
Often increases predictive accuracy for complex datasets
Removes the baseline
First and second derivatives can be computed with the finite difference method (difference between two subsequent data points), provided that bandwidth is constant:
\[x_i' = x_i - x_{i-1}\]
\[x_i'' = x_{i-1} - 2 \cdot x_i + x_{i+1}\]
In R this can be achieved with the diff function from base, as shown in Figure 2:
d1 <-t(diff(t(NIRsoil$spc), differences =1)) # first derivatived2 <-t(diff(t(NIRsoil$spc), differences =2)) # second derivativeplot(as.numeric(colnames(d1)), d1[1, ],type ="l", lwd =1.5,xlab ="Wavelength", ylab ="")lines(as.numeric(colnames(d2)), d2[1, ], lwd =1.5, col ="red")grid()legend("topleft",legend =c("1st derivative", "2nd derivative"),lty =1, col =c("black", "red"))
Figure 2: First and second derivatives of a raw spectrum.
Derivatives tend to amplify noise. The gap derivative and the Savitzky-Golay algorithm both address this. The gap derivative is computed as:
\[x_i' = x_{i+k} - x_{i-k}\]
\[x_i'' = x_{i-k} - 2 \cdot x_i + x_{i+k}\]
where \(k\) is the gap size. This can be achieved using the lag argument of diff (see Figure 3):
# m = derivative order; w = gap size; s = segment sizegsd1 <-gapDer(X = NIRsoil$spc, m =1, w =11, s =5)plot(as.numeric(colnames(d1)), d1[1, ],type ="l", lwd =1.5,xlab ="Wavelength (nm)", ylab ="1st derivative")par(new =TRUE)plot(as.numeric(colnames(gsd1)), gsd1[1, ],type ="l", lwd =1.5, col ="red",xaxt ="n", yaxt ="n",xlab ="", ylab ="")axis(4, col ="red")mtext("Gap-segment derivative", side =4, line =3, col ="red")grid()legend("topleft",legend =c("1st derivative", "Gap-segment 1st derivative"),lty =1, col =c("black", "red"))par(new =FALSE)
Figure 3: Comparison of first derivative (finite difference) and gap derivative (lag = 10 bands) for the first spectrum.
For greater control over smoothing, the Savitzky-Golay (savitzkyGolay) and gap-segment derivative (gapDer) functions are preferable. The gap-segment algorithm first smooths the signal over a given segment size, then computes a derivative of a given order over a given gap size. See Luo et al. (2005) and Hopkins (2001) for details on these methods.
4 Scatter and baseline corrections
Undesired spectral variations due to light scatter effects and variations in effective path length can be removed using scatter corrections.
4.1 Standard Normal Variate (SNV)
The Standard Normal Variate (SNV) transformation (Barnes et al., 1989) is a row-wise standardisation that corrects for light scatter by centering and scaling each spectrum individually (Figure 4):
\[\text{SNV}_i = \frac{x_i - \bar{x}_i}{s_i}\]
where \(\bar{x}_i\) is the mean and \(s_i\) the standard deviation of the \(i\)-th spectrum (i.e. computed across all bands of that observation).
Figure 4: Raw absorbance spectra (left) and SNV-transformed spectra (right) for the first five observations.
According to Fearn (2008), SNV should be applied after filtering (e.g. Savitzky-Golay) rather than before.
4.2 Multiplicative Scatter Correction (MSC)
Along with SNV, MSC (Geladi et al., 1985) is one of the most widely used pre-processing techniques in NIR spectroscopy (Rinnan et al., 2009). It is a scatter correction method that accounts for additive and multiplicative effects by regressing each spectrum (\(x_i\)) against a reference spectrum (\(x_r\), typically the mean spectrum, Figure 5):
\[x_i = m_i x_r + a_i\]
\[\text{MSC}(x_i) = \frac{x_i - a_i}{m_i}\]
where \(a_i\) and \(m_i\) are the additive and multiplicative scatter coefficients, respectively, estimated by ordinary least squares for each spectrum.
# Reference spectrum is the column mean of the full matrixmsc_spc <-msc(X = NIRsoil$spc, ref_spectrum =colMeans(NIRsoil$spc))plot(as.numeric(colnames(NIRsoil$spc)), NIRsoil$spc[1, ],type ="l", lwd =1.5,xlab ="Wavelength (nm)", ylab ="Absorbance")lines(as.numeric(colnames(msc_spc)), msc_spc[1, ], lwd =1.5, col ="red")grid()legend("topleft",legend =c("Raw", "MSC"),lty =1, col =c("black", "red"))
Figure 5: Effect of MSC on a raw spectrum (reference = mean spectrum).
Because MSC correction depends on a reference spectrum, that reference must be stored and reused when processing new spectra. The following example shows how to apply the correction estimated on a training set to a separate validation set:
# Training and validation partitionsXr <- NIRsoil$spc[NIRsoil$train ==1, ]Xu <- NIRsoil$spc[NIRsoil$train ==0, ]# Fit MSC on the training set (reference = mean of Xr by default)Xr_msc <-msc(Xr)# Apply the same reference spectrum to the validation setXu_msc <-msc(Xu, ref_spectrum =attr(Xr_msc, "Reference spectrum"))
4.3 SNV-Detrend
The SNV-Detrend transformation (Barnes et al., 1989) extends SNV by additionally correcting for wavelength-dependent scattering effects, i.e. residual curvature remaining after SNV. After the SNV transformation, a second-order polynomial is fitted to each spectrum and subtracted from it (Figure 6).
wav <-as.numeric(colnames(NIRsoil$spc))dt <-detrend(X = NIRsoil$spc, wav = wav)# Dual y-axis: raw and detrended spectra are on different scalesplot( wav, NIRsoil$spc[1, ],type ="l", lwd =1.5,xlab ="Wavelength (nm)", ylab ="Absorbance")par(new =TRUE)plot( wav, dt[1, ],type ="l", lwd =1.5, col ="red",xaxt ="n", yaxt ="n",xlab ="", ylab ="")axis(4, col ="red")mtext("Detrended", side =4, line =3, col ="red")grid()legend("topleft",legend =c("Raw", "SNV-Detrend"),lty =1, col =c("black", "red"))par(new =FALSE)
Figure 6: Effect of SNV-Detrend on a raw spectrum. Note the different y-axis scales: raw absorbance (left) and detrended signal (right).
4.4 Baseline removal
The baseline function estimates and removes the baseline of each spectrum in three steps, with the result illustrated in Figure 7:
The convex hull points of each spectrum are identified using grDevices::chull.
The convex hull points are linearly interpolated to the wavelength positions of the original spectrum to produce the baseline.
The baseline is subtracted from the original spectrum.
Figure 7: Raw absorbance spectra (left) and baseline-corrected spectra (right) for the first three observations.
5 Centering and scaling
Centering and scaling transform a given matrix so that columns have zero mean (centering), unit variance (scaling), or both (auto-scaling):
\[Xc_{ij} = X_{ij} - \bar{X}_{j}\]
\[Xs_{ij} = \frac{X_{ij} - \bar{X}_{j}}{s_{j}}\]
where \(Xc\) and \(Xs\) are the mean-centred and auto-scaled matrices, \(\bar{X}_j\) and \(s_j\) are the mean and standard deviation of variable \(j\). In R these operations are obtained with the scale function.
Spectroscopic models can often be improved by incorporating ancillary data (e.g. temperature) alongside the spectra (Fearn, 2010). However, due to the high dimensionality of spectral data, ancillary variables risk being dominated by the spectral block and contributing negligibly to the model (Eriksson et al., 2006). Block scaling addresses this by applying different scaling weights to different blocks of variables. With soft block scaling, each block is divided by a constant such that the sum of its column variances equals \(\sqrt{p}\), where \(p\) is the number of variables in the block. With hard block scaling, the block is scaled so that the sum of its column variances equals 1.
# type = "soft" or "hard"# Returns a list with the scaled matrix (Xscaled) and the divisor (f)bsc <-blockScale(X = NIRsoil$spc, type ="hard")$Xscaledsum(apply(bsc, 2, var)) # should equal 1
[1] 1
A limitation of block scaling is that all variables within a block are brought to the same variance. When this is undesirable, sum-of-squares block weighting (blockNorm) provides an alternative: the spectral matrix is multiplied by a scalar to achieve a pre-specified sum of squares.
# targetnorm = desired Frobenius norm (sum of squares) for Xbn <-blockNorm(X = NIRsoil$spc, targetnorm =1)$Xscaledsum(bn^2) # should equal 1
[1] 1
6 Resampling
To match the spectral response of one instrument to another, a signal can be resampled to new band positions by interpolation (resample) or by convolution using full-width at half-maximum (FWHM) values (resample2). Figure 8 shows an example of resampling a NIR spectrum to a coarser resolution of 10 nm using resample.
Figure 8: Original NIR spectrum and resampled spectrum at a coarser resolution (every 10 nm) using interpolation.
7 Other transformations
7.1 Continuum removal
The continuum removal technique was introduced by Clark and Roush (1984) to highlight absorption features by removing the effect of the overall spectral shape (albedo). The method identifies the convex-hull envelope (continuum) of each spectrum and normalises the original values against it (Figure 9):
\[\phi_{i} = \frac{x_{i}}{c_{i}}, \quad i = \{1, \ldots, p\}\]
where \(x_i\) and \(c_i\) are the original and continuum values at the \(i\)-th wavelength of a set of \(p\) wavelengths. The result \(\phi_i\) is dimensionless and bounded in \([0, 1]\) for reflectance spectra, where values close to 1 indicate no absorption relative to the continuum.
The continuumRemoval function handles both reflectance (type = "R", default) and absorbance (type = "A") spectra. For absorbance data, spectra are first converted to reflectance (\(1/X\)) before computing the continuum, and the result is back-transformed afterwards. This means that for absorbance data, continuumRemoval and baseline are not equivalent: they compute the convex-hull envelope on different scales. For reflectance data the two functions are more directly comparable, differing only in the final step: baseline subtracts the continuum (\(x_i - c_i\)), whereas continuumRemoval divides by it (\(x_i / c_i\)).
Figure 9: Raw absorbance spectra (left) and continuum-removed spectra (right) for the first three observations.
References
Barnes, R., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra, in: Applied Spectroscopy. SAGE Publications Sage UK: London, England, pp. 772–777.
Clark, R.N., Roush, T.L., 1984. Reflectance spectroscopy: Quantitative analysis techniques for remote sensing applications, in: Journal of Geophysical Research: Solid Earth. Wiley Online Library, pp. 6329–6340.
Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J., Wikström, C., Wold, S., 2006. Multi-and megavariate data analysis. Umetrics Sweden.
Fearn, T., 2010. Combining other predictors with NIR spectra. NIR news 21, 13–16.
Fearn, T., 2008. The interaction between standard normal variate and derivatives. NIR news 19, 16–17.
Fernandez-Pierna, J.A., Dardenne, P., 2008. Soil parameter quantification by NIRS as a chemometric challenge at “chimiométrie 2006.” Chemometrics and intelligent laboratory systems 91, 94–98.
Geladi, P., MacDougall, D., Martens, H., 1985. Linearization and scatter-correction for near-infrared reflectance spectra of meat. Applied spectroscopy 39, 491–500.
Hopkins, D.W., 2001. What is a norris derivative? NIR news 12, 3–5.
Luo, J., Ying, K., He, P., Bai, J., 2005. Properties of savitzky–golay digital differentiators. Digital Signal Processing 15, 122–136.
Rinnan, Å., Van Den Berg, F., Engelsen, S.B., 2009. Review of the most common pre-processing techniques for near-infrared spectra. TrAC Trends in Analytical Chemistry 28, 1201–1222.
Savitzky, A., Golay, M.J., 1964. Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry 36, 1627–1639.
Wiener, N., 1954. The human use of human beings: Cybernetics and society, Revised edition. ed. Doubleday Anchor, Garden City, NY.