Practical comparison of sparse methods for classification of Arabica and Robusta coffee species using near infrared hyperspectral imaging

In the present work sparse-based methods are applied to the analysis of hyperspectral images with the aim at studying their capability of being adequate methods for variable selection in a classification framework. The key aspect of sparse methods is the possibility of performing variable selection by forcing the model coefficients related to irrelevant variables to zero. In particular, two different sparse classification approaches, i.e. sPCA+kNN and sPLS-DA, were compared with the corresponding classical methods (PCA+kNN and PLS-DA) to classify Arabica and Robusta coffee species. Green coffee samples were analyzed using near infrared hyperspectral imaging and the average spectra from each hyperspectral image were used to build training and test sets; furthermore a test image was used to evaluate the performances of the considered methods at pixel-level. In our case, sparse methods led to similar results as classical methods, with the advantage of obtaining more interpretable and parsimonious models. An important result to highlight is that variable selection performed with two different sparse classification approaches converged to the selection of same spectral regions, which implies the chemical relevance of those regions in the discrimination of Arabica and Robusta coffee species.


1.INTRODUCTION
Hyperspectral imaging (HSI) combines classical spectroscopic systems with imaging devices in order to obtain both spectral and spatial information from a sample. The resulting hyperspectral images are three dimensional data arrays, often referred to as datacubes, where each pixel contains one spectrum [1]. Each hyperspectral image can be formed by thousands to millions of spectra, and each spectrum can be composed by more than 100 wavelengths. Consequently, when dealing with such a big amount of data a multivariate approach is necessary in order to extract the relevant information contained in the datacube. For this purpose, classical multivariate data analysis techniques have been adapted to the elaboration of hyperspectral images showing great potential and benefit for extracting the desired information [2,3].
The large amount of data contained in the datacube represents at the same time both an advantage and a drawback of HSI. On one hand, the large amount of pixels contained in each image allows a detailed representation of the analyzed sample; on the other hand, it is necessary to face data handling issues such as storage problems and long computational times [4].
Therefore, data reduction is frequently needed in order to preserve only the useful information contained in high-dimensional data [5,6]. When dealing with many images, the most common way to perform data reduction consists of extracting the average spectra form each image or from userdefined Regions of Interest (ROI), to be used for further analysis on the whole dataset. In this manner, data reduction is performed in the {x, y} spatial dimensions, without affecting the spectral dimension, .
Alternatively, the spectral dimension can be reduced by selecting only the informative wavelengths, without loss of relevant features. As in the case of point-wise NIR spectroscopy, the identification of the spectral variables that provide useful information (and the resulting elimination of the signal regions containing noise and information not pertinent to the problem at hand) can lead to better results in classification or calibration issues, and simplifies the chemical interpretation of the results. Moreover, in the specific case of hyperspectral imaging, the selection of spectral variables is essential for the identification of key wavelengths in the development of multispectral imagining systems for on-line applications or for portable devices.
In some situations variable selection can be simply based on a deep knowledge of the spectroscopic properties of a sample, but the use of appropriate algorithms based on multivariate statistics generally leads to better results [7].
To this aim, many variable selection techniques have been proposed in the literature, such as interval-Partial Least Squares (iPLS) [8][9][10], Genetic Algorithms (GA) [11], and Wavelet Transform (WT) [12,13], which work in different manners and are suitable for different applications.
Though very effective, these methods often require long computational times and the optimization of many parameters or, as in the case of iPLS, the selected regions strongly depend upon the defined interval size. Sparse methods have been developed in order to face problems concerning calibration or classification of high-dimensional data, mostly in the field of bioinformatics where the variables usually consist of thousands of genes. Sparse methods have been widely applied also in statistic learning [14], analysis of biological data [15], metabolomics [16] and genomics [17]. Some applications of sparse methods to spectroscopic data in the field of food analysis and control are also available in the literature, for example concerning food-borne bacterial species [18], white wines discrimination [19] or virgin olive oil adulteration [20].
Generally, the term sparse refers to a matrix in which most of the elements are equal to zero. In the case of sparse methods the term sparse refers to the estimated parameter vector of a model, e.g. a regression vector, which is forced to contain many zeros. The main idea of sparse methods is to reduce the influence of the noise contained in irrelevant variables by forcing the model coefficients related to those variables to be equal to zero, consequently performing variable selection. Classical methods, like e.g. PCA or PLS, usually do not set the contribution of uninformative variables to zero, but only to a small absolute value. Therefore, if many variables have small contributions, their global influence could be considerable and lower the predictive ability of the model [21].
In general, sparsity is achieved by adding a penalty term to a given objective function in order to induce some model coefficients to be equal to zero. The level of sparsity to induce in the model is a user-defined parameter that needs to be optimized, for example by minimizing the cross validation error in a similar manner as for selection of number of components. Therefore, sparse methods require two parameters to be tuned, but once the optimal sparsity and dimensionality of the model is optimized , they allow performing classification (or regression) and variable selection at the same time.
Several sparse versions of classical methods have been developed for data exploration, regression and classification problems. Principal Component Analysis (PCA) is the most used chemometric tool for data exploration. This has originated several sparse variants of PCA (sPCA), where sparsity is induced both on the score and loading vectors [22][23][24], or only on the loading vectors [25][26][27]. For regression purposes, sparse versions of Partial Least Squares (sPLS) have been proposed by Chun and Keles, which make use of the elastic net approach [28], and by Lê Cao et al., whose method is instead based on the Lasso penalty [29]. Lê Cao et al. also proposed an extension of their sPLS for a sparse version of Partial Least Squares Discriminant Analysis (sPLS-DA) [30], while Clemmensen et al. proposed a sparse version of Linear Discriminant Analysis [31].
Moreover, in the literature some research studies are reported where sparse methods are compared with other variable selection methods [18,32] demonstrating the potential of sparse methods as a stable variable selection strategy.
In this context, the aim of the present work is to show a practical application of sparse methods such as sPCA-kNN [33] and sPLS-DA [30], which have been also compared with the corresponding non-sparse methods (PCA+kNN and PLS-DA) in terms of classification performances and model interpretability. In particular, sparse methods were applied in order to perform spectral variable selection on NIR hyperspectral images, with the aim of differentiating Arabica and Robusta green coffee beans. Arabica (Coffea arabica) and Robusta (Coffea canephora) coffee are the two main species used for the preparation of commercial coffee beverages, both alone or in blends. Due to its better taste and aroma, Arabica coffee is of higher quality than Robusta coffee, but it is more difficult to grow, even in function of its lower resistance to plant diseases, and therefore it is more expensive [34].
Classical point-wise NIR spectroscopy has been widely used to discriminate Arabica and Robusta species, both on green coffee [35,36] and roasted coffee [37,38]. Moreover some research works have been also reported where hyperspectral imaging is used to characterize these coffee species [39,40]. For these reasons, in the present work green coffee samples were analyzed with NIR-HSI and the hyperspectral images were elaborated in order to test the ability of two different sparse classification methods, i.e., sparse PCA coupled with k-Nearest-Neighbors (sPCA+kNN) and sparse PLS-DA (sPLS-DA), to discriminate Arabica and Robusta coffee. Both these sparse classification methods allowed to perform classification and variable selection at the same time, leading to the identification of the informative NIR regions involved in the discrimination. The performances of the two sparse methods and their corresponding classical (non-sparse) counterparts (PCA+kNN and PLS-DA) were compared.

Classical methods
As a first, very simple approach to discriminate between the average spectra obtained from the hyperspectral images acquired on Arabica and Robusta coffee samples, Principal Component Analysis (PCA) [41] was used in conjunction with k-Nearest-Neighbors (kNN) [42]. In PCA+kNN, firstly a PCA model is computed on the average spectra matrix and then the PCA scores are used for kNN classification. In this manner, a data compression technique is coupled with a simple and robust classification tool. The number k of nearest neighbors to consider in kNN classification and the number of principal components are user-defined parameters.
Moreover, also Partial Least Squares Discriminant Analysis (PLS-DA) [44] was considered as an alternative classification method, since it is a fast linear method often leading to optimal performances. In this case, the proper number of latent variables of the model has to be selected for example by maximizing the classification efficiency in cross validation.

Sparse methods
Sparse methods are extensions of classical methods in which the parameter vectors of a model are forced to contain many zeros by adding a penalty term to the objective function of the considered method [21]. The algorithms used in this work for sPCA and sPLS-DA apply the Least absolute shrinkage and selection operator (Lasso) approach [45] to induce sparsity on the model coefficients.
In the following sections only a brief description of sPCA and of sPLS-DA algorithms is given; for a more detailed explanation the reader is referred to [33] and [30].

sPCA + kNN
Sparse Principal Component Analysis (sPCA) is a PCA-based model in which sparsity is induced on the model parameters: scores, loadings or both of them. Several algorithms were proposed to calculate sPCA models where sparsity is induced on the loadings; the one used in this work is Alternating Shrunken Least Squares (ASLS) [33].
The ASLS algorithm, for a fixed number of components A, estimates a sparse PCA solution of the following objective function: subject to || || 1 1 ≤ and || || 2 where ||. || 2 is the squared Frobenius norm of the matrix (sum of squared elements), T is the scores matrix, p i are the columns of the normalized loadings matrix P for the i th component, and c is the L 1 norm constraint on each loading vector. Therefore, the L 1 norm constraint (|| || 1 1 ≤ ) applied on each normalized (|| || 2 2 = 1) loading vector gives a sPCA model with sparse loadings.
The value of the scalar c (from here onwards referred to as sparsity constraint) controls the sparsity level of the model: the lower is c the higher is the sparsity induced on the loadings. In particular, the sparsity constraint may range between 1 (which corresponds to only one active variable for each Since sPCA is an unsupervised technique, it is necessary to couple it with a classifier (e.g. kNN) in order to obtain an estimate of the classification performance which can be used to tune the model parameters. In general, the choice of the optimal sparse parameters should be addressed to the best compromise between sparsity (i.e. as less variables as possible) and model performance (i.e. high efficiency), in order to keep the lowest possible the number of useful spectral variables that lead to stable models with satisfactory classification results.

sPLS-DA
The algorithm used to perform Sparse Partial Least Squares Discriminant Analysis (sPLS-DA) [30] is an extension of Sparse Partial Least Squares regression (sPLS) [29] applied to classification problems. Similarly to PLS-DA, sPLS-DA is based on the use of PLS regression for discriminant purposes, but a Lasso penalty is added to the model parameters in order to constrain some coefficients to be equal to zero.
In particular, the sPLS algorithm used in this work is based on the PLS-SVD approach [47]. Given a descriptor matrix X with size {n, m} and a response matrix Y with size {n, q}, the PLS-SVD approach is based on SVD decomposition of the cross product = , as follows: where the column vectors of U and V correspond to the PLS loadings vectors of X and Y, respectively. In the same manner as PLS-DA, Y is a dummy matrix containing the binary class vectors. The cross product M is calculated as expressed in Equation 3 only for the first latent variable (h=1); for the subsequent components (h = 2, … ,H) the cross product is calculated on the previously deflated X h-1 and Y h-1 matrices. Indeed the algorithm, in an iterative way, minimizes the squared residuals between the current cross product and the estimated loading vectors and, moreover, adds the Lasso penalty to the X loading vector u h ; subsequently X h and Y h are calculated by a deflation step from matrices X h-1 and Y h-1 . Therefore, the first couple of singular vectors u h and v h (where either ‖ ‖ 2 2 = 1 or ‖ ‖ 2 2 = 1) are the initial estimate of the iterative algorithm which solves the following optimization problem: where ||. || 2 is the squared Frobenius norm of the matrix (sum of squared elements), and λ is a penalty parameter which applies the Lasso componentwise on the loading vectors. Sparsity is induced on the PLS loadings, and consequently on the regression coefficients used to predict unknown samples, therefore thanks to the sPLS-DA approach it is possible to perform both classification and variable selection in one step, by forcing to zero the coefficients of noisy or uninformative variables.
Conversely to sPCA, in sPLS-DA the sparse latent variables are orthogonal to each other, and their directions do not depend upon the number of components used to calculate the model, due to the deflation step performed before calculating each component.
As in sPCA, there are two parameters to tune in sPLS-DA: the number of sparse latent variables (sLVs) and the penalty term λ. Since the sparsity induced on the model is related to the value of λ, for practical reasons the algorithm has been implemented by the authors in a way to define directly the number of variables to select for each sLV, rather than λ [30].

Coffee samples
Samples of green coffee beans of Arabica and Robusta species were provided by a local coffee roasting company. Thirty three green coffee batches were considered in this study, coming from different geographical areas and subjected to different processing methods to separate the seed form the fruit. Despite the different sources of variability in the samples, we focused on the discrimination between Arabica and Robusta coffee species, regardless of processing method or geographical origin.
On the whole, 33 samples were collected in the industrial plant during a period of 6 months: 18 samples of Robusta and 15 samples of Arabica. Each sample consisted of about 500 g of beans that were sampled in order to be as representative as possible of the corresponding batch, and were stored in a sealed package until the day of analysis. From each sample, three aliquots of 70 g of randomly selected beans were kept, and two images were acquired, changing the arrangement of the beans between the two images. This procedure was repeated in a different day to check the day-today variability. All the samples were acquired in random order and the packages were sealed again and stored at room temperature between the different acquisition days. Therefore, for each sample 12 hyperspectral images (= 2 measurement sessions × 3 aliquots × 2 repeated acquisitions) were obtained, leading to a dataset composed by 396 hyperspectral images (33 samples × 12 images).

Image acquisition
The hyperspectral images were acquired using a desktop NIR Spectral Scanner (DV Optic), using a reflectance imaging based spectrometer Specim N17E, coupled to a Xenics XEVA 2608 camera (320 × 256 pixels) and working in the 955-1700 nm spectral range with a spectral resolution of 5 nm. All the images were acquired using as background a black silicon carbide sandpaper sheet, which is characterized by a very low and constant reflectance spectrum [47]. Moreover, a 99% reflectance standard and two ceramic tiles with two different grayscale tones and intermediate reflectance values were included in the images.
The raw data were converted into reflectance values using an instrumental calibration based on the high reflectance standard reference and on dark current [48]. Furthermore, in order to reduce the variability among images over time, an additional internal calibration was performed [49], based on the average reflectance values of the reflectance standard, of the two ceramic tiles and of the black silicon carbide sandpaper.
Then, before further analysis, from each image the pixels related to the black sandpaper background were removed using a thresholding procedure. To this aim, based on the preliminary evaluation of some sample images, the most discriminant wavelength was identified by maximizing the Fisher ratio between background spectra and sample spectra. In this manner, at 1050 nm, all the pixels below the threshold value of 0.1 reflectance units were identified as background and removed.

Data analysis
After background removal, from each image the average spectra were calculated obtaining a dataset Moreover, in order to visually evaluate the classification performances at the pixel level for both non-sparse and sparse models, one image of Arabica coffee and one of Robusta coffee taken from the test samples were merged together in order to create a test image. In this manner, since the test image is made of two images, one for each class, it was possible to know the class belonging of the single coffee beans and thus to obtain a quantitative evaluation of the predictive ability of the models.
Before calculating the classification models, the spectra were preprocessed using Standard Normal Variate followed by first derivative and mean center.
The classification performances were defined using efficiency (EFF), which is the geometric mean between sensitivity (SENS) and specificity (SPEC), i.e.: where sensitivity is the percentage of objects of each class accepted by the class model and specificity is the percentage of objects of the other classes correctly rejected by the class model [43]. Further details are provided in reference [50]). The data were analysed using a personal computer running with Windows 8.1-64 bit and equipped with an Intel Core® i7-3632QM CPU @ 2.20GHz processor and 6.00 GB RAM.

RESULTS AND DISCUSSION
In order to compare both the sparse methods with the corresponding classical methods and all the four classification methods altogether, the following sections are organized as follows:   Figure 1.b) the higher is the c value which is necessary to reach stable conditions. For example, when using 2 sPCs a stable situation is reached with a sparsity constraint value equal to 3.5, while with 4 sPCs the value of c must be at least equal to 5.5.
Therefore, comparing the classification efficiency trends reported in Figure 1.b, the best sPCA+kNN model was chosen as the one calculated using 2 sPCs and a sparsity constraint equal to 3.5, that corresponds to 21% of the variables set to zero. This model was then used to predict the samples of the external test set, obtaining a classification efficiency value equal to 100%.
The comparison between the performance of PCA+kNN and sPCA+kNN is reported in Table 1.
The results obtained for the calibration and for the monitoring sets are comparable for both methods, but sPCA led to a much higher classification efficiency for the test set, which confirms the importance of forcing to zero the coefficients related to uninformative variables. PCA and sPCA models have the same dimensionality in terms of number of PCs, but the number of spectral variables selected with sPCA is definitely lower (21% variables selected in sPCA). As it is shown in Figure 2  Furthermore, the performance of the selected sPCA+kNN model was also evaluated at pixel level using the test image, and the results were compared with those obtained with PCA+kNN (last row of Table 1). The prediction results are also reported under the form of images in Figure 3.a for PCA+kNN and in Figure 3.b for sPCA+kNN, where the pixels predicted as belonging to Arabica coffee are represented in red color, while those predicted as Robusta coffee are represented in green color. Figure 3 shows that, by a qualitative point of view, the results obtained with PCA+kNN and with sPCA+kNN are analogous, since the overall classification of the beans is correct in both cases. The different efficiency values reported in Table 1 are due to some pixel misclassifications ascribable to the round shape of the beans and the presence of the center-cut, whose effects are slightly more evident in sPCA+kNN than in PCA+kNN. Therefore the sparse model are more sensitive to the noise caused by the morphology of the beans and this fact is shown in the difference image in Figure 3.c, where the pixels correctly predicted with both methods are represented in blue color, the pixels misclassified in both methods are represented in purple color and those predicted in different classes are represented in yellow color. In particular, the percentage of pixels correctly predicted in both methods is equal to 84%, the percentage of pixels misclassified in both methods is equal to 7% and the percentage of pixels differently predicted is 9%; 6% of which is correctly predicted only by PCA+kNN while 3% is correctly predicted only by sPCA+kNN.    Also in this case a pixel-level classification of the test image was obtained for both PLS-DA and sPLS-DA models, and the results are reported in Figure 6.a and Figure 6.b as a false-color image,

PCA + kNN
where the pixels predicted as belonging to Arabica coffee are represented in red color and the pixels predicted as belonging to Robusta coffee in green color. Comparing the right parts of Figure 6.a and 6.b (Robusta coffee beans), it is possible to notice that for both PLS-DA and sPLS-DA the same Robusta coffee bean is misclassified; moreover, in sPLS-DA there is one more Robusta bean whose classification is uncertain. Analogous considerations can be drawn for the left parts of the two images, where ambiguous results have been obtained for two coffee beans. In general, the performance of the sPLS-DA model is slightly lower than the performance of the PLS-DA model; this is also confirmed by the efficiency values reported in Table 1 and by the comparison reported in Figure 6.c, where the pixels correctly predicted in in both methods are represented in blue color, the pixel misclassified in both methods are represented in purple color while, those predicted in different classes are represented in yellow color. In particular, the percentage of pixels correctly predicted with both methods is equal to 76%, the percentage of pixels misclassified in both methods is equal to10% and the percentage of pixels differently predicted is equal to 14%, 9% of which is correctly predicted only by PLS-DA and 5% is correctly predicted only by sPLS-DA. Similarly to the case of sPCA+kNN, Figure 6.c shows that the sparse model is more sensitive to the round shape and to the presence of the center-cut of the beans. Figure 6. Prediction on the test image of the best (a) PLS-DA and (b) sPLS-DA models and (c) difference image between the two images of PLS-DA and sPLS-DA models. Each image reports on the left the group of Arabica coffee beans, and on the right the group of Robusta coffee beans. In (a) and (b) the pixels predicted as belonging to Arabica coffee are reported in red color, while those predicted as belonging to Robusta coffee are represented in green color; in (c) the pixels correctly predicted in both methods are represented in blue color, the pixels misclassified in both methods are represented in purple color while those predicted in different classes are represented in yellow color.

Comparison between methods
The comparison between the results reported in Table 1 shows that the classification performances image could be explained by the fact that kNN is a distance-based classifier able to handle nonlinear boundaries between classes, while PLS-DA establishes a linear threshold. Since images are made of thousands of pixel spectra, these have a much greater variability than the average image spectra, and some overlapping between classes of pixels may occur. In these conditions, kNN proves to be a more robust method than PLS-DA.
Regarding variable selection performed by sparse methods, sPLS-DA led to a slightly more parsimonious model including 20 active variables, with respect to the 32 non-zero variables selected by sPCA+kNN. However, comparing the sparse loadings from the sPCA+kNN model (Figure 2.b) with those of the sPLS-DA model ( Figure 5.b), it is evident that there is a substantial convergence of the spectral regions selected by the two different approaches, which confirms the reliability of these sparse methods in highlighting the chemical differences between the two considered classes.
Finally, a comparison between the different classification methods was also done in terms of time

CONCLUSIONS
In the present work we explored the possibility to use sparse methods, such as sPCA+kNN or sPLS-DA, both in order to classify hyperspectral images of Arabica and Robusta green coffee beans, and to select spectral regions relevant for the discrimination between the two classes.
Compared to classical methods, the corresponding sparse methods led to the analogous or even better classification results, evaluated both at the image level and at the pixel-level.
However, sparse methods allowed performing variable selection at the same time as classification, giving much more parsimonious models and enhancing the interpretability in chemical terms of the results, within a reasonable computational time. In particular, the feature selection made with two different sparse classification approaches converged to the same spectral regions, which confirms the chemical relevance of the selected wavelengths.
Furthermore, the high classification efficiency values obtained with sparse methods highlighted the possibility to use the narrow selected spectral regions for the implementation of multispectral systems, to be used for on-line process control applications.