Exploring local spatial features in hyperspectral images

We propose a methodological framework to extract spatial features in hyperspectral imaging data and establish a link between these features and the spectral regions, capturing the observed structural patterns. The proposed approach consists of five main steps: (i) two‐dimensional stationary wavelet transform (2D‐SWT) is applied to a hyperspectral data cube, decomposing each single‐channel image with a selected wavelet filter up to the maximum decomposition level; (ii) a gray‐level co‐occurrence matrix is calculated for every 2D‐SWT image resulting from stage (i); (iii) distinctive spatial features are determined by computing morphological descriptors from each gray‐level co‐occurrence matrix; (iv) the morphological descriptors are rearranged in a two‐dimensional data array; and (v) this data matrix is subjected to principal component analysis (PCA) for exploring the variability of the aforementioned descriptors across spectral channels. As a result, groups of spectral wavelengths associated to specific spatial features can be pointed out yielding a better understanding and interpretation of the data. In principle, this information can also be further exploited, for example, to improve the separation of pure spectral profiles in a multivariate curve resolution context.


| INTRODUCTION
Hyperspectral imaging (HSI) has numerous possible applications that, depending on the instrumentation and the spectral domain covered, can range from environmental surveillance to cellular monitoring. [1][2][3][4] HSI data consist of three-dimensional arrays with two spatial dimensions and one spectral dimension, providing an image for each scanned spectral channel. When HSI is concerned, one is usually interested in retrieving both the pure spectra of the individual components constituting the image and their respective spatial distribution.
In chemistry, one of the most used approaches for achieving this aim is multivariate curve resolution-alternating least squares (MCR-ALS). With MCR-ALS, a hyperspectral image is first unfolded pixel-wise, and afterwards, a bilinear model is fitted to the unfolded data using an ALS approach under appropriate constraints. This permits to unravel the distribution maps and the pure spectral signatures of the physicochemically meaningful constituents of the image. 5 However, unfolding the data results in losing the information on the local spatial features within the dataset. A solution to this issue can be the implementation of spatial constraints as described in Hugelier et al. 6 Nonetheless, if different constituents exhibit distinct spatial features (i.e., textural patterns), and/or two (or more) of these different constituents show a significant overlap along both the spectral and the spatial domain, this approach may not be actionable. Multivariate image analysis (MIA) is a very useful alternative to investigate spatial features in grayscale, RGB, and, to a lesser extent, hyperspectral images. [7][8][9] The basic principle of MIA is to analyze the unfolded data by means of multivariate tools like principal component analysis (PCA) or partial least squares (PLS) regression. Spatial features are captured considering the relationships between each pixel and its neighbors in the unfolding step (see Bharati et al. 7 and Prats-Montalbán et al. 8 for details). MIA has also been coupled to wavelet transform (WT) multiresolution analysis, 10 and this combination has been proven effective in resolving spatial features in multispectral 11 and Raman hyperspectral images. 12 In addition, other strategies, like coclustering 13 and gray-level co-occurrence matrices (GLCMs), 14 have recently been applied to examine texture in HSI datasets. Textural features in HSI have also been explored by using three-dimensional discrete WT (DWT) 15 or by fusion of the two-dimensional DWT decomposition images obtained from each spectral channel. 16 All these approaches are capable of enhancing spatial features in HSI data and of establishing some link between these features and the spectral regions responsible for the observed textural patterns. However, their performance has not been satisfactory in situations where both textural/spatial and spectral information are highly mixed, that is, when pure chemical components and/or distinct physical contributions are overlapped both in the investigated spectral range and in their distribution across the image. 17 Aiming at facing this issue, we propose in this short communication a methodological framework that relies on the capability of two-dimensional stationary wavelet transform (2D-SWT) 18,19 to capture distinct spatial features in disjoint subspaces (different wavelet images can be extracted for every spectral channel) and on the versatility of multivariate data analysis tools to explore the information these spatial features carry. The approach consists of five consecutive steps: (i) 2D-SWT decomposition of the HSI data cube resulting in wavelet subimages; (ii) from these subimages, computation of the GLCM; (iii) calculation of morphological descriptors from each GLCM; (iv) rearrangement of the obtained descriptors into a matrix; and (v) PCA modeling of this matrix for the exploration of the variability of the morphological descriptors across spectral channels.
Such a workflow would allow, for example, spectral wavelengths mostly capturing specific spatial features to be pointed out by the simultaneous investigation of PCA loadings and scores. PCA can also be coupled to other statistical approaches for addressing particular tasks depending on the study at hand. Here, for example, we used the k-means algorithm to cluster the loadings obtained from the PCA of the descriptor matrix and determine the spectral regions that show similar variation patterns in terms of spatial descriptors. These regions can be selective for different pure components underlying HSI data and highlighting them can help users to, for example, improve the quality of MCR-ALS solutions.

| METHODS
The proposed methodology consists of five main steps which are illustrated in Figure 1 and detailed below.

| Step 1: 2D-SWT decomposition
A low-pass filter and a high-pass filter are applied to every spectral channel of the analyzed hyperspectral image (see Figure 1A) to obtain four distinct sets of wavelet coefficients, denoted H, V, D, and A. Each set corresponds to a decomposition block and will be referred to as a wavelet subimage. The horizontal set (H) corresponds to the application of both a low-pass filter, row-wise, and high-pass filter, column-wise; the vertical set (V) to the application of a high-pass filter, row-wise, and a low-pass filter, column-wise; the diagonal set (D) to the application of a high-pass filter, both row-wise and column-wise; and the approximation set (A) to the application of a low-pass filter, both row-wise and column-wise. This scheme is iterated on the approximation subimage until a certain decomposition level has been reached (see Figure 1B). In 2D-SWT, at each iteration of the decomposition, the wavelet filter is upsampled, contrary to standard DWT 20 where the wavelet coefficients are downsampled. In this way, congruent wavelet subimages are yielded, and the position of the spatial patterns in the image is preserved.
For the objective of this short communication, the decomposition level is set to the maximum compatible with the size of the original image. Furthermore, the Haar filter was utilized here even though different decomposition filters exist and can be exploited for the same purpose. 21

| Step 2: GLCM calculation
A GLCM is computed for each slab corresponding to a specific wavelet subimage, that is, for a given block of coefficients (H, V, D, and A) associated to a specific decomposition level and spectral wavelength ( Figure 1C). A GLCM maps the local textures of a given image by counting how often pairs of pixels with certain normalized integer intensity values occur at a particular distance. 22,23 The type of normalization, and the direction along which the distance is calculated, needs to be set a priori. Here, we used a 64-integer intensity range and different directions for each wavelet subimage, matching the spatial pattern every wavelet decomposition image highlights: vertical direction for the horizontal coefficients image (the information retained after the decomposition, in fact, reflects the local spatial changes in that direction); horizontal for the vertical coefficients image; and top-left to bottomright direction for the diagonal coefficients image and the summation of the previous directions for the approximation coefficients image.

| Step 3: Descriptors calculation
A set of eight descriptors (Energy, Contrast, Correlation, Variance, Inverse difference moment, Sum entropy, Information Measure of Correlation 1, and Maximal correlation coefficient 23 ) is computed for each GLCM (see Figure 1D). A brief description of each of these features and their respective formulas are included in Table S1. These descriptors were chosen because they summarize most of the local spatial features one can find in an image. However, depending on the case study, distinct descriptors can be selected based on prior knowledge or on the necessity of specific image features to be highlighted.

| Step 4: Descriptor matrix rearrangement
All the morphological descriptor values estimated for every spectral channel are organized into individual column vectors subsequently gathered in a single data matrix. Thus, the descriptor matrix (see Figure 1E) features a number of rows equal to the number of descriptors (eight) times the number of decomposition levels (which depends on the size of the hyperspectral image) times the number of wavelet subimages per decomposition level (four). The number of columns corresponds to the number of spectral variables (wavelengths).

| Step 5: Multivariate analysis
The descriptor matrix is subjected to PCA for the exploration of the information it carries and, more specifically, for establishing a link between the spatial and spectral information captured by the variation of the morphological descriptors within the investigated spectral range. In this work, a possible pathway to establishing this link in a more systematic way is also explored, that is, the application of k-means clustering to the resulting PCA loadings to get an idea about the spectral channels associated to similar spatial features.

| RESULTS AND DISCUSSION
The aim of this communication is to show how local spatial features extracted with the use of the procedure outlined in the previous section can provide valuable information for HSI data analysis and exploration. For this purpose, the results of two case studies are presented.

| Oil-in-water emulsion
The first case study 24,25 relates to a Raman HSI dataset of an oil-in-water emulsion, which illustrates a situation where the spatial and spectral information are both somehow selective in their respective domains, that is, no severe overlap of the spatial and spectral features is observed. More specifically, the different individual chemical components of the image (featured in the spectral domain) are associated to clearly distinguishable shapes/spatial structures. This dataset is relatively simple and will serve as a proof of concept for the proposed methodology. The Raman imaging system by which these data were collected has a spatial resolution of around 1 μm, and the image is 60 × 60 pixels. The spectral resolution is 3.6 cm −1 , and the investigated spectral range goes from 953.6 to 1861 cm −1 (253 wavelengths). In Figure 2A,B, the mean image, averaged over all the spectral channels, and the mean spectrum, averaged over all the pixels, are shown, respectively.
We applied our approach considering eight descriptors (Section 2) and up to five decomposition levels. The outcomes resulting from the PCA modeling of the descriptor matrix are shown in Figure 2C,D. They display the scores and loading plots of the two first principal components, respectively. The score plot provides a graphical representation of the variation of the morphological descriptors across the wavelet subimages whereas the loading plot accounts for their variation across the spectral channels. As it can be assumed that the most extreme score values are the most informative, as recently pointed out in MCR context, 26 we will focus on the 10 most extreme scores values along PC 1 and PC 2 in Figure 2C. The corresponding points are labeled according to their respective descriptor name and to the wavelet subimage from which such a descriptor has been calculated. In Figure 2D, the loadings lying in the same quadrant along the direction determined by each one of these points and the origin of the PCA subspace (±9 ) are highlighted accordingly (same colors/symbols). This way, 10 different groups of spectral channels were identified (loadings too close to the origin and, thus, contributing very little to the definition of PC 1 and PC 2 were excluded from this assessment). One wavelet subimage can be associated to each group which corresponds to one of the labeled descriptors in the score plot, as represented in Figure 3 (a maximum number of three wavelengths per group is considered here).
The comparative inspection of Figures 2 and 3 enables the simultaneous exploration of the spatial and spectral characteristics of the investigated HSI data. The loading plot gives insights into the spectral domain whereas the score plot together with the wavelet subimages does so for the spatial domain.
From Figure 3, overall, four main spatial contributions are discernible: • two small droplets (see Figure 3A,D): the descriptors capturing these spatial features were Energy-A2 (i.e., energy calculated on the wavelet subimage corresponding to the approximations coefficients at decomposition level two) and Variance-H3; • a larger droplet-like structure (see Figure 3C,H), associated to descriptors Variance-A3 and Sum-Entropy-A2; • a circular border around the larger droplet-like structure (see Figures 3B,E), associated to descriptors Energy-V2 and Variance-V3. It was expected that the vertical details would be able to capture the border shape; and • a background effect (see Figure 3G,J), associated to descriptors Variance-A5 and Information Mean Correlation-V5. Actually, the deepest decomposition level typically captures very smooth textural features. Overall, we may regard the fifth level of decomposition as the one that captures background and/or illumination effects.
On the other hand, Figure 3F,I seem to result from the overlap of some of these contributions. To summarize, the proposed approach enabled to identify and distinguish four different components within the oilin-water scene, spatially unraveled in the wavelet subimages ( Figure 3) and spectrally identified in the loading plot ( Figure 2D). These results are in good agreement with previous findings, 24,25 which discussed the interior droplet as due to oil and the border structure being the oil/water interface, while matching the smaller droplet to oil with a different composition. However, to complement the results of this preliminary visual inspection, k-means clustering was exploited. For this purpose, the descriptor matrix was compressed by PCA (17 PCs, explaining 90.12% variance), and clustering was applied on the estimated PCA loadings (four clusters of wavelengths were retrieved).
As highlighted in Figure 4C, averaging the hyperspectral image across the four extracted clusters of spectral channels led to the isolation of the different structures previously unraveled. It is then clear that for the data at hand, contributions showing a distinctive spatial distribution are also associated to rather selective spectral signatures within different wavelength ranges. These spectral signatures can be inspected in Figure 4A: • Cluster #1, which is associated to the small droplets, coincides with the minor peaks at 1300, 1684, and 1789 cm −1 .
• Cluster #2, which is associated to the border structure, corresponds to three major peaks at 1044, 1126, and 1317 cm −1 . • Cluster #3 mainly encompasses the peaks at 1483 and 1508 cm −1 . It corresponds to the interior droplet.
• Cluster #4 corresponds to the baseline regions observed in the mean spectrum.
A point to take note of is the overall correspondence of the spectral channels belonging to the groups identified by exploratory PCA of the descriptor matrix with the clusters found through k-means, as can be seen by comparing Figures 2D and 4B. In addition, the spatial features revealed by the wavelet subimages in Figure 3 mostly match those highlighted in the clustered mean images in Figure 4C.
F I G U R E 3 Oil in water dataset. The wavelet subimages, at decomposition block and level as reported in the points label in Figure 2C and at the spectral wavelengths corresponding to the ones showing the same symbols in Figure 2D, are shown in separate frames, labeled (A)-(J). The name of the descriptor-block level is reported on top of each frame; for example, the approximation images at the second decomposition level for the three wavenumbers 1695.2, 1760, and 1763.6 cm −1 are shown in Figure 3A, and so on for Figure 3B-J. The corresponding wavenumber is reported above each image Due to the straightforward nature of the results, the dataset has been used to assess the relevance of each individual step of the proposed workflow ( Figure 1). Analysis have been performed by removing some of these steps, that is, applying k-means on the PCA of unfolded HSI data or excluding the wavelet decomposition step and carrying out the GLCM and descriptors calculations directly on the raw images. The obtained results ( Figure S1) showed that the information obtained was less significant than when applying the full original workflow.

| Semen droplet on cotton tissue
The second case study regards a 222 × 220 pixels HSI-near infrared (NIR) image (acquired in the wavelength range 1268.8-2456.2 nm with a spectral resolution of 6.3 nm) of a semen droplet on a piece of white cotton. Further details on the data acquisition are given in Silva et al. 17 The mean image is represented in Figure 5A. A quite complex structure is observed which is characterized, at least at a first glance, by (i) a distinct horizontal pattern across the entire image that is due to the rough surface of the cotton fabric (texture); (ii) an almost indiscernible shadow of the oval-shaped border of the semen droplet; and (iii) a spurious fiber filament in the lower middle area of the image.
It is worth noting that this case study exhibits a much higher complexity than the previous one: the cotton contribution is present everywhere across the image; thus, there exists no spatial area selective for semen. In addition, F I G U R E 4 Oil in water dataset.
Results of the k-means clustering algorithm on the loadings matrix. (A) Mean spectrum, with the point at each wavelength colored according to the cluster's number they were assigned to; (B) PC 1 versus PC 2 loading plot, points colored according to clusters; and (C) mean images for each cluster, mean taken across the spectral channels belonging to the same cluster the spectral profiles of the different constituents of the captured scene are severely overlapped and semen might be not homogeneously distributed over the cotton sample.
In order to explore these data, we applied our approach as detailed in Section 2.
The three first PCs of the descriptor matrix were inspected here. Figure 5 displays the resulting outcomes. Different clusters of wavelength channels were identified within both the PC 1 /PC 2 and PC 2 /PC 3 subspaces. The 10 most extreme score values in Figure 5C,E were taken into consideration, following the same procedure outlined for the emulsion dataset. For the sake of simplicity, not all the corresponding points in the loading plot ( Figures 5D and 5E) were considered to extract the related wavelet subimages, but all were investigated. Among them, only those associated to the wavelet subimages showing easily interpretable spatial patterns were isolated. These wavelet subimages are shown in Figure 6. The images shown in Figure 6 can be categorized into three groups, each showing one of the main spatial contributions underlying this dataset: • The semen stain associated to descriptors Energy-A1 ( Figure 6A) and Variance-A4 ( Figure 6D,E). Two subgroups seem to be present, as two spatially distinct forms of the stain seem to exist. In Figure 6E, for the images taken at 1700 and 1500 nm, the two forms are apparent. This is in good agreement with previous findings showing how the spatial distribution observed might be generated by complementary semen compounds exhibiting a distinct behavior during the drying process of the biological fluid on the cotton fabric. 17 • The background ( Figure 6B) captured by the descriptor Contrast-H1. This textural pattern that is observed across the entire image is most likely caused by the reflection of light on the cotton fibers. The corresponding loadings (purple stars in Figure 5D) are associated to wavelengths 1287.5, 1306.2, and 1318.8 nm.
These are located in the range where cotton absorbs (1268.8-1362.5 nm) 27 and would most likely be related to the second overtones and combination of C-H stretching and C-H deformation. Notice that the fiber filament and the semen stain appear to be slightly overlapped in various wavelet subimages. Another important point to consider is that the best separation of the two semen stain forms resulted from the A4 wavelet subimage ( Figure 6E). This highlights the fact that wavelet decompositions can provide a much greater spatial resolution, as they have the ability to unravel distinct spatial structures. This is observed in Figure 6B,D, where the separation between semen and the horizontal spatial interference is evident (such a separation cannot be visualized in any of the single channel images of the original HSI data, not shown). Considering the isolation of the horizontal details  Figure 6A, and so on for Figure 6B-E ( Figure 6B), the different forms of the semen stain ( Figure 6E), the fiber ( Figure 6A,C), and the "mask" that excludes the semen stain ( Figure 6E, see at 1875 nm), the wavelet subimages feature a high potential for further investigation of the data. For example, considering methods such as MCR-ALS, on one hand, these isolated images can furnish the number of components to use and on the other hand can serve as initial estimates and could, for example, increase the spatial unmixing capability of the current methodology. 28 However, this could come with some ambiguity, as with higher decomposition levels, the wavelet subimages will only retain low frequency signals. This has to be considered carefully and will be explored in a future publication.
In order to corroborate the conclusion drawn after this visual inspection, k-means was applied, as previously explained (in this case, the descriptor matrix was compressed to 18 PC models, explaining 90.61% variance). Figure 7 represents the obtained results. A more ambiguous clustering was obtained here compared with the previous case study, most likely due to the increased complexity and spatial overlap of the different components underlying the HSI dataset. Yet the previously determined components are discernable in Figure 7D. However, they are considerably more mixed. For example, "Cluster #1" encompasses both the fiber and (to a lesser extent) the semen stain. Also, in "Cluster #3" and "Cluster #4," the semen stain, the fiber, and the background are not completely separated from one another. Nonetheless, despite being mixed with the textural background, the two different spatial forms of semen were isolated in Clusters #2 and #3.
The spectral channels corresponding to Clusters #2 and #3 ( Figure 7A) include the wavelengths regions around 1700 nm, and in the range 1500-1575 nm, which are selective for the two forms of semen when wavelet subimage A4 is considered ( Figure 6D,E). It must be emphasized that compared with the emulsion dataset, the results of the clustering are not satisfactory in terms of isolation of the different components ( Figure 7D). However, it has been shown F I G U R E 7 Semen dataset. Results of the k-means clustering algorithm on the loadings matrix. (A) Mean spectrum, with clustering highlighted on the spectral channels; (B) PC 1 versus PC 2 loading plot of points colored according to clusters; (C) PC 2 versus PC 3 loading plot; and (D) mean image (across the spectral dimension) for each cluster (in Figure 6) that the wavelet subimages estimated at specific spectral wavelengths and recovered by the PCA of the descriptor matrix do have the ability to isolate the different components.

| CONCLUDING REMARKS
In this short communication, a methodological framework based on combining both spatial features and spectral information for the analysis of HSI data is proposed. The method decomposes every single-wavelength image of a three-dimensional HSI array by 2D-SWT and computes individual GLCM for every resulting wavelet subimage. Morphological descriptors are afterwards estimated from all the GLCMs. In this way, spatial and spectral information is enhanced and conveyed in a single features matrix, which is finally processed by multivariate data analysis tools like PCA. Depending on the specific tasks the user must address, different multivariate statistical tools can be exploited at this point.
Although the proposed workflow combines different computational steps, which translates into higher complexity, every one of them is a necessary link in the chain. In fact, when some of these steps were skipped, the information obtained was insufficient to unravel all the distinct spatial features underlying the dataset under study and relate them to specific spectral regions.
According to the results obtained in two different case studies, it can be concluded that the proposed strategy is capable of consistently recovering the main spatial features of a HSI dataset and of highlighting the distinctive spectral regions accounting for them. In particular, the outcomes related to the investigation of the semen droplet dataset were found to be particularly promising considering the extreme physicochemical complexity of the examined image. In fact, even though the semen and cotton components show highly overlapped spectra, and cotton fabric is present everywhere in the sample, it was possible to highlight and localize the two different forms of the semen stain as well as the spectral wavelengths at which this spatial separation is effective.
Nonetheless, further developments can be foreseen. GLCM is just one of the possible ways to compress the spatial information encoded in wavelet subimages, and others will be explored in future research. The possibility of utilizing other multivariate data analysis tools in the developed framework will also be assessed. Finally, the improved and at least preliminarily disentangled spatial/spectral information returned by the described approach might constitute a valuable starting point for the design of new constraints to be applied in the context of MCR-ALS. Additional work is currently in progress towards this direction.