Fused adjacency matrices to enhance information extraction: The beer benchmark.

Multivariate exploratory data analysis allows revealing patterns and extracting information from complex multivariate data sets. However, highly complex data may not show evident groupings or trends in the principal component space, e.g. because the variation of the variables is not grouped but rather continuous. In these cases, classical exploratory methods may not provide satisfactory results when the aim is to find distinct groupings in the data. To enhance information extraction in such situations, we propose a novel approach inspired by the concept of combining weak classifiers, but in the unsupervised context. The approach is based on the fusion of several adjacency matrices obtained by different distance measures on data from different analytical platforms. This paper is intended to present and discuss the potential of the approach through a benchmark data set of beer samples. The beer data were acquired using three spectroscopic techniques: Visible (Vis), Near-Infrared (NIR) and Nuclear Magnetic Resonance (NMR). The results of fusing the three data sets via the proposed approach are compared with those from the single data blocks (Vis, NIR and NMR) and from a standard mid-level data fusion methodology. It is shown that, with the suggested approach, groupings related to beer style and other features are efficiently recovered, and are generally more evident.


M A N U S C R I P T
A C C E P T E D ACCEPTED MANUSCRIPT

Introduction
Exploratory multivariate data analysis (EMDA, [1]) offers very powerful tools for looking into

placed right after processing inside a thermally insulated styrofoam box, equipped with ice chips and a lid. This setup was made to keep the specimens in stable conditions while running the experiments.

For each sample three replicates were acquired, but the order of acquisition was randomized both with respect to samples and replicates. A control sample for each batch was also prepared under the same conditions as the other specimens. A pack of six canned beers was purchased from a local

Lorentzian line broadening was applied. The spectra were baseline- and phase-corrected using the TOPSPIN processing tools, in some cases automatically and in others manually, depending on the results of the automatic correction as assessed by a trained NMR user. For all spectra, the ppm scale was referenced to the TSP peak at 0.00 ppm. The spectral window was 20.5 ppm.

After thawing and degassing, the specimens were kept at 5°C. Preparation of the NMR tubes was executed in batches of twelve samples, which were collected from the fridge and placed within a thermally insulated styrofoam box equipped with a bed of ice chips and closed with a lid. The newly prepared tubes were placed into the autosampler rack, which was also stored within the thermal box.

All the specimens were prepared to contain 10% D2O, 0.02% of sodium-3-

NMR data carry different information in different spectral regions. As a consequence, NMR spectra are usually roughly split into three regions [43,49]: the aliphatic/organic acids region (0-3 ppm), the carbohydrates region (3-5 ppm) and the aromatic region (6-9 ppm). These regions mainly differ in the metabolites/molecules involved, the baseline noise, and the average signal intensity [49].
SOM mapping preserves the topology, which means that distances and proximity relations between samples are preserved [10]. As a result, all the nodes that are at the same topological distance from a given node define a "neighbourhood": a representation of the nearest, second- and third-nearest neighbourhoods is given on the top-map in Figure 1.

In our work, a simple two-dimensional, 10-by-10 square grid of nodes was used

where m is the number of samples.
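As a minimal sketch of how such neighbourhoods can be enumerated on a 10-by-10 grid, the snippet below lists the nodes lying at a given topological distance from a reference node. It assumes rectangular neighbourhoods, so that the topological distance is the Chebyshev distance between grid coordinates; the function name `ring` and the (row, column) indexing are illustrative choices, not taken from the original implementation.

```python
def ring(node, g, n_rows=10, n_cols=10):
    """Return the grid nodes at topological distance g from `node`.

    Assumption: rectangular neighbourhoods, i.e. the topological distance
    between two nodes is the Chebyshev distance of their (row, col) indices.
    """
    r0, c0 = node
    return [
        (r, c)
        for r in range(n_rows)
        for c in range(n_cols)
        if max(abs(r - r0), abs(c - c0)) == g
    ]


# Neighbourhood sizes (g = 0, 1, 2, 3) around a central and a corner node.
centre_sizes = [len(ring((5, 5), g)) for g in range(4)]
corner_sizes = [len(ring((0, 0), g)) for g in range(4)]
```

Nodes close to the border of the grid have truncated neighbourhoods, which is why the corner node has fewer neighbours at each level than the central one.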

OPTICS is based on the concept of Reachability Distance (RD). RD is a similarity measure [52], which is basically a Euclidean distance that describes how distant/similar an object is from the one processed at the preceding step. The graphical output of OPTICS is called the Reachability Plot (RP), and it is obtained by plotting the RDs as vertical bars arranged along the x-axis according to the processing sequence.

At each iteration, the OPTICS algorithm selects one object and compares it with all the objects that have not been processed yet. This is done by computing all the pairwise Euclidean distances. Then, the next object to be processed is selected among the k nearest neighbours: the distance at which this next object is found becomes its RD, which is stored unchanged until the end of the procedure.

The final output is therefore a set of RD values, which can be plotted as bars in the RP. A cluster is formed by objects that happen to be very close to each other, so it can be expected that these objects

When a cluster has been processed, the next object is likely to belong to another cluster: the next RD value in the processing sequence is therefore going to be larger than the values preceding it, which are related to the previous cluster. This "jump" from one cluster to another is graphically recognizable in the RP because it corresponds to a very high bar. Clusters therefore appear as hollows created by groups of samples sharing similarly low RDs.
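The way clusters show up as hollows can be illustrated with a deliberately simplified sketch. It is not the full OPTICS algorithm: it omits the core-distance and minimum-points machinery and simply processes, at each step, the unprocessed object closest to any already processed one, storing that distance as its RD. The function name, the toy coordinates and the choice of the starting object are assumptions made for illustration only.

```python
import math


def reachability_sequence(points, start=0):
    """Simplified reachability ordering (not the full OPTICS algorithm).

    At each step the unprocessed object closest to any processed object is
    selected; that distance is stored as its reachability distance (RD).
    The first object has no predecessor, so its RD is set to infinity.
    """
    n = len(points)
    # Smallest distance found so far from each object to the processed set.
    best = [math.inf] * n
    processed = [False] * n
    order, rds = [], []
    current, current_rd = start, math.inf
    for _ in range(n):
        processed[current] = True
        order.append(current)
        rds.append(current_rd)
        # Update the candidate RDs using the object just processed.
        for j in range(n):
            if not processed[j]:
                d = math.dist(points[current], points[j])
                best[j] = min(best[j], d)
        remaining = [j for j in range(n) if not processed[j]]
        if not remaining:
            break
        current = min(remaining, key=lambda j: best[j])
        current_rd = best[current]
    return order, rds


# Two tight toy groups, far apart from each other.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
order, rds = reachability_sequence(toy)
```

Plotting `rds` as bars in the sequence given by `order` would show two hollows of low values separated by a single high bar, which marks the "jump" from the first toy group to the second.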

It is important to consider that the RP does not explicitly cluster the objects [52]; rather, it allows the number of clusters in the data to be deduced.

In the present study, a mid-level data fusion dataset was obtained by creating a matrix augmented in the variables' direction. Seventy-seven features were merged: 7 PCA scores from the Vis dataset and 6 PCA scores from the NIR dataset were merged with the 64 NMR features. To represent the three different blocks evenly, autoscaling followed by block-scaling was performed.

The Fused Adjacency Matrix approach is a two-step procedure: in the first step, information is

The approach is based on the concept of combining different weak sources of information [15-18], as is done, for instance, in the classification context by the Random Forest algorithm (RF, [15]).

In RF, the results of several weak classifiers are merged by counting how many times a sample was assigned to each of the defined categories; the sample is then assigned to the category to which it was most often assigned.
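The vote-counting step alone can be sketched in a few lines. The snippet does not build any classifier; it only shows how a list of category assignments for one sample is reduced to the most frequent one. The function name and the labels are hypothetical.

```python
from collections import Counter


def majority_vote(assignments):
    """Return the category most often assigned to a sample.

    `assignments` holds the categories proposed by the individual weak
    classifiers for a single sample (hypothetical labels in the example).
    """
    counts = Counter(assignments)
    category, _ = counts.most_common(1)[0]
    return category


votes = ["lager", "ale", "lager", "lager", "ale"]
winner = majority_vote(votes)
```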

In our unsupervised case, we convert the distance information into several adjacency matrices, which represent the weak sources of information. Adjacency matrices (AMs) are square, binary, symmetric matrices (m × m) in which a one is present when the adjacency condition is fulfilled by the pair of samples under examination, and a zero is present when this condition is not fulfilled. In other words, these matrices carry the information about whether two samples are close enough to each other (they are "adjacent") as compared to, for instance, a distance threshold (the adjacency condition). Merging these AMs using a sum rule [19] results in a new square symmetric matrix in which pairs of samples that were consistently found adjacent are characterized by high values, while pairs of samples that were consistently found far apart have low values or, even better, values close to zero. This is the overall idea of the proposed approach.
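A minimal sketch of this idea follows, assuming that the adjacency condition is a simple Euclidean distance threshold. The threshold values, the toy samples and the function names are hypothetical; the Mahalanobis and SOM-based matrices used in this work are not reproduced here.

```python
import math


def adjacency_matrix(points, threshold):
    """Binary m x m matrix: 1 if two samples are no farther apart than
    `threshold` (the adjacency condition), 0 otherwise."""
    m = len(points)
    return [
        [1 if math.dist(points[i], points[j]) <= threshold else 0
         for j in range(m)]
        for i in range(m)
    ]


def fuse(matrices):
    """Sum rule: element-wise sum of adjacency matrices of equal size."""
    m = len(matrices[0])
    return [
        [sum(am[i][j] for am in matrices) for j in range(m)]
        for i in range(m)
    ]


# Toy data: samples 0 and 1 are close, sample 2 is far from both.
samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
thresholds = [0.5, 1.5, 5.0]  # hypothetical values
ams = [adjacency_matrix(samples, t) for t in thresholds]
fused = fuse(ams)
```

In the fused matrix the pair (0, 1) receives a value of 2, because it satisfies two of the three thresholds, whereas the pairs involving sample 2 remain at zero.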

In our approach, for a given data block (X in Figure 1, on the left side), fourteen different AMs are obtained. Ten are derived by using Euclidean and Mahalanobis distances (Equation 1), and four by using SOM as a "clustering" method (Equation 2). Due to the number of implemented thresholds, the contribution of each distance measure to form the AM_X was comparable; however, the use of a weighted sum can be advised in the more general case.

two considered samples belong to the same g topological neighbourhood or to a closer one. We defined four topological rectangular [54] neighbourhoods (g = 0, 1, 2, 3), including the "zeroth level", which corresponds to a single node. Since different SOM runs generally produce slightly different outputs, the average over ten runs was taken to make the resulting adjacency matrix AM_SOM more robust.

group is denser than the Ales group, and this can be seen in both the RP (Fig.2a) and the score plot (Fig.2b). The colour scale employed in Figure 2c describes the beer colour intensity, which is defined as the absorption of the sample at 430 nm, taken as the reference wavelength [58]. A colour intensity gradient is recognizable along PC1 (Fig.2c). The sample distribution along PC2 is, on the

The information that could be extracted from the NIR dataset is rather limited, and this can be seen by inspecting the RP (Fig.3a) and the PC1 score plot (Fig.3b), both obtained from the preprocessed NIR spectra.

Two main clusters of samples were identified by inspecting the RP (Fig.3a), a small one which

non-grouped samples (Fig.3b). The non-grouped set is much more scattered, as it has both higher bars in the RP (Fig.3a) and a large variability range along PC1 (Fig.3b).

group at the beginning of the RP, followed by a tail of slowly increasing RDs forming a non-grouped set (Fig.5a). However, the sample distribution obtained by PCA (score plot in Fig.5b) is mainly determined by a few variables, according to the loadings plot (Fig.5c). Features related to

The results obtained by OPTICS and PCA on the Fused Adjacency Matrix, preprocessed as explained in Section 2.2.5, are discussed here and shown in Figure 6.

Two clusters of samples and a non-grouped set can be identified in the RP (Fig.6a). These three

to the results found with the single techniques and the mid-level data fusion approach. It is also interesting to notice the sample distribution within the Lagers group, where the "simple" lager samples (in red in Figure 6b) are tightly grouped on the right side, in a position opposite to that of the Ales group.

PC1 is related to the colour, and when combined with PC4 the samples adopt an arch-like distribution (Fig.6c). The PC1-PC4 score plot not only shows the colour trend, but also suggests

In this section, more detailed comparisons among the results obtained by the different data blocks and data fusion approaches are reported. Table 1 is organized as a summary of these comparisons.

The Lagers group was identifiable in all representations of the data, and it appears to be rather stable. The Vis and AM_Fus datasets showed the best results in terms of sample grouping, which is probably reflected by their similarity, as highlighted by Procrustes Analysis (Section 3.7).

An interesting group of lager-style samples is the HI samples set, which includes beer products

(group A in Figure 6a), while EU.2 one was found further in the OPTICS sequence, suggesting that, only by this approach, a clearer difference based on the treatment was recovered.

These products are described as "summer beers", therefore their presence in the Lagers group is not unforeseen: this product type is intended to be refreshing and easy-to-drink, and it is usually lighter in aromas and alcohol content. For these reasons it can be expected to find these summer beers more similar to the lagers than to the ales.

Figure 4).

The Light samples set was found rather grouped in the data fusion cases (Figures 5b and 6b)

No ABV trend was evident in the Vis case. This is naturally present in the NIR case (Fig.3b), since PC1 describes the ethanol content. The trend is also present in the mid-level data fusion case, since variable PC1 from NIR is highly influential (Fig.5c)

Figure S2a.

The AM_Fus case is rather different. The ABV trend is present in PC1-PC3 (score plot reported in Figure S3, in the Supplementary Materials), but in a transformed way. The strongest and the lightest beers all lie in the top part of the plot and they all belong to the non-grouped set (as in Figure 6b).

These samples represent the extremes in ABV, so their position is probably due to the fact that the approach is just able to detect their dissimilarity from the bulk of "ABV-average" samples.

brewed with lager yeasts, but more alcohol is obtained during the brewing process.

The Lagers Strong set was generally found split into two groups: four "low-ABV" and two "high-

closer to the Lagers than the three highest ABV samples (Fig.3a). On the contrary, in the NMR case, the Lagers Strong samples are all in the Lagers group and do not follow any ABV order (Fig.4). For both data fusion approaches, the RP by OPTICS (Fig.5a and Fig.6a) clearly highlights that the four low-ABV samples are more similar to the lagers (they belong to the Lagers group) but are also located closer to each other within the RP sequence. However, the separation between high- and low-ABV samples is much more evident in the PCA of the AM_Fus (Fig.6b) than in the mid-level data fusion score plot (Fig.5b). In AM_Fus, moving along PC1 from the Lagers

The colour trend naturally originates from the Vis dataset (Fig.2c). No trace of it was found in either the NIR or the NMR case. Both data fusion methods were able to recover this piece of information, even though the AM_Fus (Fig.6c) provides a clearer trend than the mid-level data fusion (Fig.5b).

Light samples set were slightly better retrieved by the mid-level data fusion approach.

It is also very promising that the Fused Adjacency Matrix approach can highlight small sub-groups (Fig.6c) which may be worth further investigation of their chemical/sensory characteristics. A deeper characterization of these sub-groups may, for instance, provide new inspiration in beer production, helping to define intersections between established and more general styles.

Table 1 to be inserted about here

In Sections from 3.1 to 3.6 we have graphically inspected and compared the information gathered by the different data blocks as depicted in the principal components space, with the aim of highlighting similarities and differences among them. This way of visually exploring the data easily allows spotting trends and peculiarities, but subjectivity and limited availability of metadata (i.e. additional information such as the beer style or the ABV content) can sometimes be a drawback.

In this work, the PCA spaces obtained from the different blocks (i.e. each single analytical platform, the mid-level fused data set and the AM_Fus data set, referred to as inter-block comparison) are compared by Procrustes Analysis (PA). Also, the data obtained from the different steps of the procedure, going from the raw data to the AMs for each single data set (which will be named AM_X, with the suffix X being Vis, NIR and NMR, in turn) have been compared by PA. The latter case is referred to as intra-block comparison. An overview of the results is given hereinafter, while the visual representation is reported in Figure S4, in the Supplementary Materials.
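To give a feeling for what a Procrustes comparison measures, the sketch below computes a disparity between two small point configurations after centring, scaling to unit size and optimal rotation. It is a simplified illustration under stated assumptions: it handles two-dimensional configurations only and allows rotations but not reflections, whereas the comparisons in this work involve score sets with several PCs. The function names and the toy coordinates are hypothetical.

```python
import math


def _centre_and_scale(config):
    """Centre a 2-D configuration and scale it to unit sum of squares."""
    n = len(config)
    mx = sum(p[0] for p in config) / n
    my = sum(p[1] for p in config) / n
    centred = [(x - mx, y - my) for x, y in config]
    norm = math.sqrt(sum(x * x + y * y for x, y in centred))
    return [(x / norm, y / norm) for x, y in centred]


def procrustes_disparity_2d(a, b):
    """Residual sum of squares after optimally rotating and scaling b onto a.

    Simplification: 2-D only, rotations only (no reflections).
    Values near 0 indicate the same shape; the upper bound is 1.
    """
    a_std = _centre_and_scale(a)
    b_std = _centre_and_scale(b)
    # Sums that determine the optimal rotation angle in two dimensions.
    dot = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a_std, b_std))
    cross = sum(bx * ay - by * ax for (ax, ay), (bx, by) in zip(a_std, b_std))
    # Best achievable agreement; with unit-size inputs this is also the
    # optimal scaling factor.
    agreement = math.hypot(dot, cross)
    return 1.0 - agreement ** 2


# Configuration b is a rotated, scaled and shifted copy of a; c is not.
a = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
theta = math.radians(30)
b = [(2 * (x * math.cos(theta) - y * math.sin(theta)) + 3,
      2 * (x * math.sin(theta) + y * math.cos(theta)) - 1) for x, y in a]
c = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 1.0)]
same_shape = procrustes_disparity_2d(a, b)
different_shape = procrustes_disparity_2d(a, c)
```

A disparity close to zero means that the two configurations carry the same geometric information up to translation, rotation and scale; larger values indicate that the sample arrangements differ.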

Inter-block comparisons were made, in pairs, using the PC scores of the Visible spectra (7 PCs), the NIR spectra (6 PCs), the NMR features (6 PCs), the mid-level fused data (5 PCs) and the Fused

giving too much importance to that source, while a too loose similarity would have meant that the information was either too reduced or not captured by the approach.

The effect of the different fusion steps was also assessed. These intra-block comparisons were made for each data block individually (using the same number of PCs as specified above), and the results are shown in Figure S4b. One interesting point is the transition from the distance information to its corresponding AM_X. The Euclidean distance D_Euc was consistently similar to the Euclidean AM_Euc, meaning that the "coded" AM version of the data keeps a large part of the original distance information. The same was observed with the Mahalanobis distance, although for the NMR case the similarity between D_Mah and AM_Mah was found to be lower (Fig.S4b). By inspecting the corresponding score plot it appears that this difference is due to a limited number of samples which

Table 1. Comparison summary (*ordered by increasing ABV)

Matrix (as in Figure 6). The dataset was normalized between zero and one to enhance its visual representation and interpretability.

Table 1 Comparison summary (*ordered by increasing ABV)
Visible NIR NMR (Fig.4)
Medium to low variable values in general.
Some sub-groups; contains the Light samples set as a subgroup.
Included in the Lagers group.
Low values in general.
Not grouped in RP. (Fig.5a) Grouped in PCA. (Fig.5b)
Not grouped in RP. (Fig.6a) Grouped in PCA. (Fig.6b)
Four low-ABV in the Lagers group, low-colour. (Fig.2a-b) Two high-ABV in the non-grouped set, mid-colour. (Fig.2a-b)
Three in the mixed group.
All in the Lagers group.
Four low-ABV in the Lagers group. (Fig.5a) Two high-ABV quite far in the non-grouped set. (Fig.5a)
Four low-ABV close to the Lagers group in PCA. (Fig.6b) Two high-ABV close to the Ales. (Fig.6b)
ABV trend
Not found.
Very well described by PC1. (Fig.3b)
Found in PCA (Fig.S2a); probably reflecting the sugar content.