APPLICATION OF DATA FUSION TECHNIQUES TO DIRECT GEOGRAPHICAL TRACEABILITY INDICATORS

A hierarchical data fusion approach has been developed, proposing Multivariate Curve Resolution (MCR) as a variable reduction tool. The case study presented concerns the characterization of soil samples from the Modena district. It was performed to understand, at a pilot study stage, the geographical variability of the zone before planning a representative soil sampling from which to derive geographical traceability models for Lambrusco wines. Soil samples were collected from four producers of Lambrusco wines, located in plain and hill areas. Depending on the extension of the sampled fields, the number of points collected varies from three to five and, for each point, five depth levels were considered. The different data blocks consisted of X-ray powder diffraction (XRDP) spectra, the concentrations of thirty-four metals and the ⁸⁷Sr/⁸⁶Sr isotopic abundance ratio, a very promising geographical traceability marker. A multi-step data fusion strategy was adopted. First, the metal concentrations dataset was weighted, concatenated with the values of the strontium isotopic ratio and compressed. The resolved components describe common patterns of variation of metal content and strontium isotopic ratio. The XRDP profiles were resolved into three main components that can be attributed to calcite, quartz and clay contributions. Then, a high-level data fusion approach was applied by combining the components arising from the previous data sets. The results show interesting links among the components arising from XRDP and the metals pattern, and indicate which of these the ⁸⁷Sr/⁸⁶Sr isotopic ratio variation is closest to. The combined information allowed the variability of the analyzed soil samples to be captured.


High-throughput methodologies, megavariate database, fast fingerprinting and profiling [3,7,9] or wavelet transform [4], i.e. models are built separately on the different data blocks and the derived latent variables (or meta-variables in a broad sense) are fused to obtain a final high-multivariate-model.

This last approach could be particularly effective for data blocks that are difficult to make comparable/commensurable, i.e. when a suitable preprocessing procedure is not available or does not completely solve the issue.
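The high-level strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this work: each block is reduced here with an SVD-based PCA as a stand-in for any block-wise model, the block matrices are synthetic random numbers, and the function names (`pca_scores`, `autoscale`, `high_level_fusion`) are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components):
    """Scores of a mean-centred PCA computed via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]


def autoscale(T):
    """Centre each column and scale it to unit standard deviation."""
    return (T - T.mean(axis=0)) / T.std(axis=0, ddof=1)


def high_level_fusion(blocks, n_components_per_block, n_final):
    # 1) model each block separately and keep only its latent variables
    block_scores = [pca_scores(X, k) for X, k in zip(blocks, n_components_per_block)]
    # 2) put the latent variables on a common scale and concatenate them sample-wise
    fused = np.hstack([autoscale(T) for T in block_scores])
    # 3) build the final model on the fused latent variables
    return fused, pca_scores(fused, n_final)


# Synthetic stand-ins for two blocks measured on the same 47 samples (assumed sizes)
rng = np.random.default_rng(0)
spectral_block = rng.random((47, 500))
composition_block = rng.random((47, 35))

fused, final_scores = high_level_fusion(
    [spectral_block, composition_block], n_components_per_block=[3, 2], n_final=2
)
```

Because only the block-wise latent variables are concatenated, the very different number of variables and measurement scales of the original blocks no longer drive the final model.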

To the best of our knowledge, the multivariate curve resolution (MCR) methodology has not yet been used as a data reduction technique for extracting data block information in high-level data fusion. The possibility of obtaining chemically meaningful components, e.g. components that can be characterized in terms of chemical concentration and spectral profiles, allows the correlation between the resolved profiles of the different analytical techniques to be better understood and highlighted in the data fusion process.

Most often, in the data fusion context, MCR has been used by combining the information acquired by the different analytical techniques in a multiset structure [10][11][12]. This is surely a sound approach; however, it may be suboptimal when the data sets to be fused all share the sample mode and each data block consists of a different kind of variable, e.g. metal contents and a spectral fingerprint for the same set of samples, but there is no varying condition for each sample, such as time of measurement, pH, or a second spectral dimension; in other words, when data augmentation is limited to variable concatenation and there is no real replicate information for the same sample to assist the resolution of the underlying components.

Here, we present a case study where MCR was used as a variable reduction tool for the development of a hierarchical data fusion model, in a study aimed at obtaining information about the geochemical variability of soil samples.

In particular, this work is part of a pilot study belonging to a project on the assessment of geographical traceability models for Lambrusco wines of protected designation of origin (PDO), a typical food product of the Province of Modena (Italy).

Food geographical traceability studies aim to establish the correlation between the soils of origin and the final products; hence, one of the main aspects to face is the representativeness of

The MCR data fusion approach was preferred to the multiset-based one, for the reason explained above, taking into account the great difference among data blocks in terms of number of variables and measurement scales.

Several examples are present in the literature of the identification of patterns of variation of pollutants or metal sources based on MCR [13,14], whereas this is the first time that an approach based on multivariate curve resolution is proposed to attempt a partial resolution of XRDP components. In particular, we were interested on the one hand in fully exploiting soil samples

The extension of the area (more than 90 km²) and the number of Lambrusco producers operating in it (more than four thousand) made it mandatory to develop a pilot sampling to evaluate, on a reduced scale, the variability of soils in the district, sampling conditions and operating procedures [15]. Thus, four long-chain producers were considered. Three of them, hereafter named A, B and D, are located in the plain region, where the majority of Lambrusco wine production takes place; the fourth one, producer C, is located in the hill area. For each producer, depending on the dimension of the field, from three to five cores were collected. To obtain information about both horizontal and vertical variability, each core was split into five aliquots 10 cm long, from 10 cm to 60 cm of depth. All depths were analyzed for the hill field and only the lower and upper aliquots for the plain ones, for a total of 47 samples.

Scientific (Bremen, Germany), was used for the determination of the following isotopes:

MCR is based on a bilinear decomposition of the data matrix [18,19] according to the model:

The model is calculated by an alternating least squares algorithm (MCR-ALS).
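For reference, the bilinear model referred to above is conventionally written as shown below. C and S are the concentration and spectra matrices named in the text; the symbols D (data matrix) and E (residuals), and the dimensions n (samples), m (variables) and k (components), are notation assumed here.

```latex
\mathbf{D} = \mathbf{C}\,\mathbf{S}^{\mathsf{T}} + \mathbf{E},
\qquad
\mathbf{D},\,\mathbf{E} \in \mathbb{R}^{n \times m},\quad
\mathbf{C} \in \mathbb{R}^{n \times k},\quad
\mathbf{S} \in \mathbb{R}^{m \times k}
```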

Since MCR, unlike PCA, is not an orthogonal decomposition, it needs constraints to resolve the system in such a way that the S (spectra) matrix corresponds to real chemical behavior.

Constraints can be applied to both the spectra (S) and the concentration (C) matrices in order to reduce the rotational ambiguity of the model, since the MCR-ALS solution is not unique.
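The alternating scheme with constraints on both C and S can be illustrated with the sketch below. It is a simplified illustration under stated assumptions, not the implementation used in this work: non-negativity is imposed by setting negative least-squares estimates to zero (a simple alternative to a true non-negative least squares solver), the data are synthetic and noise-free, a fixed number of iterations replaces a convergence test, and the name `mcr_als_clipped` and the symbol D for the data matrix are hypothetical.

```python
import numpy as np


def mcr_als_clipped(D, S0, n_iter=100):
    """Alternate least-squares updates of C and S so that D is approximated by C @ S.T.

    D  : (n_samples, n_variables) data matrix
    S0 : (n_variables, n_components) initial estimate of the profiles
    Non-negativity is imposed by clipping negative estimates to zero.
    """
    S = np.asarray(S0, dtype=float).copy()
    for _ in range(n_iter):
        # C step: least-squares solution of S @ C.T = D.T, negatives set to zero
        C = np.clip(np.linalg.lstsq(S, D.T, rcond=None)[0].T, 0.0, None)
        # S step: least-squares solution of C @ S.T = D, negatives set to zero
        S = np.clip(np.linalg.lstsq(C, D, rcond=None)[0].T, 0.0, None)
    E = D - C @ S.T
    lack_of_fit = 100.0 * np.sqrt((E ** 2).sum() / (D ** 2).sum())
    return C, S, lack_of_fit


# Synthetic, noise-free example: 47 samples, 200 variables, 3 components
rng = np.random.default_rng(0)
C_true = rng.random((47, 3))
S_true = rng.random((200, 3))
D = C_true @ S_true.T

C, S, lof = mcr_als_clipped(D, S0=rng.random((200, 3)))
```

Different starting estimates `S0` can lead to different, equally well-fitting pairs (C, S); this is the rotational ambiguity that the constraints are meant to reduce.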

Constraints can be considered as the translation into mathematical form of a characteristic of the investigated system.

Two types of constraints can be implemented in an MCR model: i) soft constraints, such as non-negativity, unimodality, selectivity and closure, which reduce rotational ambiguities; ii) hard constraints, based on physicochemical models able to describe the system under investigation, such as kinetic or equilibrium models, which reduce both rotational and intensity ambiguities at the same time.

In the resolution of a chemical system, non-negativity constraints are very common. Furthermore, within the family of constraints [20], others usefully adopted to reduce the ambiguities are: unimodality (i.e. the resolved profiles are forced to have a single maximum), closure (i.e. the total amount of the species within the system is constant) and selectivity (i.e. imposing the presence or absence of a species in a mixture or in a region of the spectrum).
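The three soft constraints defined above can each be written as a small correction applied to the current estimates inside an alternating least squares loop. The functions below are a minimal sketch of one simple way to do so; their names and signatures are hypothetical and they are not taken from any specific MCR package.

```python
import numpy as np


def apply_closure(C, total=1.0):
    """Rescale each row of C so that its components sum to `total`."""
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # leave all-zero rows untouched
    return total * C / row_sums


def apply_unimodality(profile):
    """Force a single maximum: values may not increase when moving away from the peak."""
    p = np.asarray(profile, dtype=float).copy()
    peak = int(np.argmax(p))
    for i in range(peak + 1, len(p)):    # right of the peak
        p[i] = min(p[i], p[i - 1])
    for i in range(peak - 1, -1, -1):    # left of the peak
        p[i] = min(p[i], p[i + 1])
    return p


def apply_selectivity(C, absent):
    """Set to zero the entries of C flagged as known-absent in the boolean mask."""
    return np.where(absent, 0.0, C)


closed = apply_closure(np.array([[1.0, 3.0], [2.0, 2.0]]))
unimodal = apply_unimodality([0, 2, 1, 3, 5, 2, 4, 1])
selective = apply_selectivity(
    np.array([[0.2, 0.8], [0.5, 0.5]]),
    np.array([[True, False], [False, False]]),
)
```

In the unimodality example the secondary bumps on either side of the main peak are flattened, so the corrected profile rises to its maximum and then never increases again.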

Here, we adopted soft constraints, such as non-negativity constraints both for concentrations and

The data fusion approach we adopted addresses two purposes: meaningful components are

The whole data fusion process is illustrated in Figure 1 and described hereafter.

The last part of the signals was cut at 79.99° 2θ, and the diffractograms were then preprocessed to reduce noise and background effects and to minimize horizontal shift [15]. The signals were finally arranged in a 47 × 4488 matrix called the "XRDP dataset". By means of MCR-ALS

the terms "clays", "calcite" and "quartz" will be associated with the resolved spectra profiles.

evaluation of initial estimates, applying non-negativity in both concentration and "spectra"

The

The high-level data fusion approach adopted for the analysis of data of different nature proved to