SIMULATION OF AN EXPERIMENTAL DATABASE OF INFRARED SPECTRA OF COMPLEX GASEOUS MIXTURES FOR DETECTING SPECIFIC SUBSTANCES. THE CASE OF DRUG PRECURSORS

Abstract This work is motivated by the need to develop suitable databases in absence of real experimental data, for instance when spectra measured with a newly developed instrumentation on real samples are not available yet. This notwithstanding, in fact, the realization of the physical project should be addressed by a starting database, also invaluable in order to test its effectiveness. In this article we face the issue of simulating gas mixtures spectra for the development of a new sensor for external cavity-quantum cascade laser photoacoustic spectroscopy (EC-QCLPAS) starting from literature FT-IR spectra of pure components: a dataset is realized suitable to realistically represent the ensemble of spectra of the gas mixtures of interest. The informative data deriving from the literature spectra were combined with the stochastic component extracted from a sample spectrum recorded with a prototype instrument, allowing us to build a matrix containing thousands of simulated spectra of gaseous mixtures, accounting for the presence of different components at different concentrations. Signal processing and experimental design techniques were used along the whole path leading to the dataset of simulated spectra. In particular, the goal of the construction of the database lies in the development of a final system to detect drug precursors in the vapor phase. The comparison of some EC-QCLPAS spectra with the corresponding simulated signals confirms the validity of the proposed approach.


INTRODUCTION
The increasing need for portable sensors suitable to detect dangerous or illegal chemical substances and requiring fast in situ analysis, forces to develop sophisticated systems, exploiting at best the huge potentialities recently made available by precision mechanics, optics, and electronics. The realization of advanced detection systems, however, is often based on sensors whose development is still in progress and whose response could be, therefore, at least partly unknown. For example, the implementation of a new sensor for the detection of a particular target molecule would require that its response to the target and to possible interfering molecules is already known, but it often happens that experimental data are not available yet, the actual development of the physical device being still in progress. A preliminary database on which to work could be also necessary when the chemical substances of interest are hard to handle, due to their toxicity, to hazards, or to legal issues.
The challenging task of building a suitable starting database can be faced by following different strategies, depending on the specific situation. In particular, in this article we face the issue of simulating gas mixtures spectra for the development of a new sensor for External Cavity-Quantum Cascade Laser Photoacoustic Spectroscopy (EC-QCLPAS) working in the Mid Infrared (MIR) spectral range. From a literature survey, a first possible way to obtain the spectral responses consists in calculating them by physico-chemical simulations based on the chemical structures. Ab initio and semi-empirical quantum mechanical methods are widely used to this purpose, constituting invaluable tools in modern vibrational spectroscopy. However, ab initio calculations are very time consuming and require powerful computers. For this reason, similar methods cannot be adopted routinely, even for compounds of moderate complexity, e.g., of 100 -300 Da: a much more practical approach lies in the use of semi-empirical methods [1][2][3]. A further alternative procedure has been followed by Babkov et al. [4], which simulated vibrational spectra by the method of fragments [5]. Such a method is applicable also to the simulation of the vibrational spectra of large molecules [6,7]. The choice of computing the spectra using quantum mechanical methods presents in all cases important drawbacks, mainly due to the approximations affecting calculations and to the noise issue: a proper simulation should be suitable to represent not only the chemical information (pure spectrum), but also the contribution of the instrumental noise to the final signal shape.
Moreover, in case there is the need to simulate the spectrum of mixtures of different substances, the excessive use of approximations could lead to spectral responses that are far from the real ones, not to mention the inherent computational load.
A literature survey has shown that a few attempts are reported in which the simulation of single gas spectra or spectra of specific gas mixtures takes also into account the contribution of instrumental noise and of background signal variations [8,9]. Huang et al. [10] developed a simulation model based on the infrared transmission theory and on the knowledge of background and interference spectra. Sulub et al. [11] used previously measured molar absorptivities and solvent displacement factors, computing synthetic spectra using experimentally recorded background signals. Bak proposed an interesting procedure based on Principal Component Regression to simulate spectra of a single gas (CO) at varying concentrations, temperatures, and pathlengths [12,13]. Other approaches have been followed by Corsi et al. [14], that developed a simulation approach to reveal VOC in air, and by Gao et al. [15], who proposed a method to simulate spectra of polluting gases in a complex environment.
In this work we deal with the issue of simulating spectra in the MIR range, for the development of a new EC-QCLPAS sensor [8,9,16], devoted to the identification of vapours of drug precursors (target molecules) in air, considering a high variety of possible environmental conditions [17,18].
The MIR spectral range has been chosen since most chemicals, such as the target gases selected for our purposes, exhibit strong characteristic rotovibrational absorption bands in this wavenumber interval [19]. Due to the need to account for the possible presence of a large number of chemical species mixed in varying amounts, we started from spectra of pure substances taken from literature databases, that were considered as "building blocks" of mixtures, to simulate the complex spectral profiles of interest which, to a first approximation, will be obtained by the sensor under development. In addition to the target molecules, i.e., the drug precursors, other components of the gaseous mixtures were the most common interfering species and the main components of air.
Reasonable concentration ranges were considered for each species.
Noteworthy, the bulk of the developed procedure can be adapted to deal with a number of different situations. Moreover, it should be also noticed that, although the gas sensor considered here operates in the MIR spectral range, the proposed simulation methodology can be easily transferred to any other kind of spectroscopic data.
The approach proposed here requires some basic issues to be addressed, mainly involving signal processing and experimental design methodologies [20]. The gas spectra were first denoised by means of a Fast Wavelet Transform (FWT) [21][22][23] based algorithm. Then, a procedure was developed to multiply the spectra of the pure components by the corresponding concentration values, in order to operate only on the actual absorption bands. The noise structure of the EC-QCLPAS was analyzed using a spectrum recorded on methanol with a prototype instrument, and added to the "clean" spectra of the simulated mixtures. This approach allowed us to build a matrix containing the simulated EC-QCLPAS spectra of thousands of gas mixtures, which was further used [24] for the definition of the optimal spectral working range and even for identification of the single most informative wavenumbers within this range. In order to preliminarily test the proposed approach, the whole procedure has been applied to simulate the experimental spectra acquired with prototypes of the instrument on two sample mixtures.

Literature FT-IR Spectra
Two literature databases, namely PNNL [25] (Pacific Northwest National Laboratory) and HITRAN [26,27] (HIgh-resolution TRANsmission molecular absorption database, Atomic and Molecular Physics Division, Harvard-Smithsonian Center for Astrophysics) were exploited for extraction of spectra. The most part of the spectral data was taken from the PNNL database, and consisted of FT-IR spectra acquired at 298 K, at wavenumbers ranging from 6500 to 600 cm -1 , 0.0603 cm -1 spectral resolution, corresponding to 97902 data. Moreover, a few spectra have been extracted from the HITRAN database, referred to 296 K, within a wavenumber range from 3174 to 749 cm -1 at a resolution of 0.076 cm -1 .
As a whole, for our specific needs, the spectra of 33 chemical species were considered, divided into three categories:  4 target spectra, namely 1-phenyl-2-propanone, acetic anhydride, ephedrine and safrole (concentration range: 0.02 -1 ppm)  20 interfering species (pollutants)  9 air components. Table 1 reports the list of the considered pollutants, together with the relevant upper concentration limits (the lower limit was set to 1 ppb for all of them), and Table 2 reports the air components, along with their typical and maximum concentration values, as inferred from literature [8,28]. The list of pollutants was defined taking into account both the similarities of the spectral signatures with those of target gases and the most likely environmental conditions.

Spectra measured with EC-QCL-PAS prototype instrument
Real gas spectra have been recorded by a prototype of the EC-QCL-PAS instrument, in order to estimate the actual noise contribution and to collect preliminary experimental spectra suitable to test the validity of the developed procedure. The laser photoacoustic (LPAS) sensor ( Figure 1 Continuous wave QCL should be kept at a constant temperature; in particular, the considered device was thermostated at 23 °C by water cooling system. The laser power is modulated mechanically, by a rotating chopper wheel, at a frequency of 60 Hz. The collimated and modulated laser beam travels through the cylindrically shaped photoacoustic cell, which is 9.5 cm in length, 4 mm in diameter and is sealed by two windows at the ends. The sample gas absorbs the modulated infrared radiation and it heats and cools down periodically. The generated pressure wave is detected by the cantilever, whose position is measured by the laser interferometer. The amplitude of the cantilever oscillations at the modulation frequency corresponds to the intensity of the photoacoustic signal.
A first spectrum was measured using the continuous wave laser source, at 1 bar pressure, on 90 ppm CH 3 OH diluted in nitrogen, and was used to estimate the noise structure of the EC-QCL-PAS signals. A second spectrum was measured on the same sample using the pulsed laser source. These two spectra were then used to estimate the corresponding values of the experimental correction factor, given by the cell response and absolute laser power coefficients, which is the constant multiplication factor necessary to convert the photoacoustic intensity into the corresponding absorbance values. The two experimental correction factors were then used to convert two spectra used to perform the preliminary test of the proposed approach. They were measured on:  a binary mixture composed by 1.6 ppm CH 3 OH and 1 ppm NH 3 , diluted in nitrogen at 1 bar pressure, measured with a spectral resolution equal to 0.1 cm -1 with the continuous wave laser source;  a quaternary mixture composed by H 2 O (10000 ppm), CO 2 (380 ppm), CH 3 OH (11 ppm) and safrole (11 ppm), diluted in nitrogen at 950 mbar pressure, measured with a spectral resolution equal to 0.5 cm -1 with the pulsed laser source.
Due to the low intensity of both laser sources at the spectra extremes, only the 1038-1098 cm -1 range was considered for further elaborations.

ALGORITHMS
All the calculations, both for the elaboration of the data and for the creation of the algorithms, were performed in Matlab 7 ® language, running on a desktop PC with Windows 7 -64 bit ® , equipped with an Intel Core ® i7-2600 CPU @ 3.40 GHz processor and 4.00 GB RAM. Moreover, some of the subroutines available in the Wavelet Toolbox ver. 4.6 (The MathWorks, Inc.) and in the PLS-Toolbox ver. 7.0 (Eigenvector Research, Inc.) were employed.

Importation and pre-processing of the literature spectra
The step by step procedure adopted for processing the spectra extracted from the literature databases is described in details hereafter.

Importing spectra
Spectra stored in different databases and recorded with different instruments possess different characteristics. For this reason, an algorithm to import spectral data with varying wavenumber ranges, resolutions and input file formats has been developed. All the imported literature spectra have been converted into a common format and structure. In order to denoise the database spectra at best, these have been imported using all the original points, exploiting in this manner the highest possible resolution.

Denoising
Different noise structures affect the literature spectra, and are also quite reasonably diverse from the noise structure of the spectra measured with the device under development. For this reason, it is mandatory to separate the useful spectroscopic information, that will be present also in the signals of the new instrumental device, from the noise contribution of the specific instrument used to record each literature spectrum. This goal has been achieved by using Wavelet Transform (WT) -based signal processing techniques, thanks to the ability of WT to map the analysed signal both in the original and in the relevant frequency domains at the same time. Furthermore, the use of various wavelets to decompose the signal into the WT domain allows a wide number of representations among which to choose the most effective one.
An ad-hoc developed Fast Wavelet Transform (FWT [21,23])-based algorithm operates on an individual signal by convolving it with two filters (called the High-pass and Low-pass decomposition wavelet filters, i.e., Hi_D and Low_D, respectively), and splitting it into two orthogonal subspaces: the vector of approximations, retaining only the low frequency content of the signal, and the vector of details, which collects the high frequency content, respectively. Being the two wavelet filters orthogonal to each other, the frequencies retained by Lo_D are not brought by Hi_D, and vice versa: they are fully complementary to each other, since the original signal can be perfectly reconstructed from the approximations and details vectors, by applying the proper couple of wavelet reconstruction filters. The decomposition procedure can be repeated to further decomposition levels, applying the same two filters to the approximations vector. In this way, sharp and coarse properties of the signal are disjointed and stored into different sub spaces (approximation and detail vectors, at different levels of decomposition). In the present application, the reconstruction into the original domain employed the approximations vector at a user-defined level of decomposition, suitable to effectively remove the noise: in this way, the pure absorbance component of the spectra was obtained, i.e., the spectra cleaned from the instrumental noise. To this aim, a proper function was written that, through an interactive interface, allowed us to easily find the optimal values of the calculation parameters, i.e., wavelet type and decomposition level, thanks to the direct visualization of the resulting signals, i.e., the original spectrum, the low-frequency contribution, and the high-frequency contribution.
Different wavelet filters were considered, namely wavelet functions belonging to the Daubechies, symlets and coiflets families. The most part of the spectra were denoised using a Daubechies wavelet function (db3) at the 3 rd level of decomposition. As an example, in Figure 2 a portion of the spectrum of benzene is reported, showing the original spectrum together with the denoised one (upper plot), and their difference (lower plot). The irregular shape of the signal reported in the lower plot of Figure 2 clearly confirms that the high frequency content extracted by FWT actually corresponds to a random noise affecting the spectrum, not bearing any useful information.

Standardising spectra
The next step consisted in the transformation of the denoised signals in order to obtain uniform datasets of standardized spectra at constant concentration (1 ppb), within the 1000-2500 cm -1 spectral range. To this aim, the smoothed spectra are elaborated by a Matlab function that resamples the spectra at the desired resolution, into a spectral window defined by the user. First, in order to obtain an output signal resembling the hypothetical spectrum resulting from the instrument, the original spectrum is convolved with a Gaussian function with mean value of 1 and standard deviation derived from the Full Width at Half Maximum (FWHM) of the EC-QCLPAS laser line, according to the equation: In this case FWHM was set to 0.1 cm -1 .
After convolution, the resulting signal is resampled at the desired wavenumber values by shifting a second order polynomial interpolating three subsequent points at a time.

Extraction of the Noise Structure of EC-QCL-PAS spectra
As already mentioned, the final version of the instrument is not available yet; however, extraction of the specific noise from a signal measured with an available prototype was possible. In order not to arbitrarily assume an homoscedastic nature of the noise, the dependence upon signal intensity was also computed. In this way, a proper noise contribution can be added to the preprocessed spectra of the mixtures.
The spectrum of CH 3 OH at 90 ppm concentration was exploited to this aim, and analyzed using the same FWT-based procedure that was previously used to denoise the literature spectra.
The noise structure was defined by the following procedure: 1.
identification of the optimal conditions, in terms of type of wavelet and decomposition level, to separate the informative signal (I) from the corresponding noise (N). In particular, This led us to use the following equation to estimate the noise as a function of the signal intensity: where, for each point i of the signal, the estimate of the standard deviation of noise (i) N ŝ is obtained by multiplying the corresponding intensity of the denoised signal I(i) by 1 b , i.e. the mean value of the b 1 coefficients of the regression models obtained for each number of intervals, n.
Then, the noise structure can be applied to a simulated spectrum P to give the corresponding final spectrum S, using the following equation: where, for each point i of the signal, (i) N ŝ is the noise estimated, by equation 2, from the intensity of the pure spectrum P(i), and r(i) is a randomly generated number drawn from the standard normal distribution. As an example, the result of the application of this noise structure to the smoothed spectra of the target molecules (1 ppm concentration) is reported in Figure 4.

Definition of the concentration domains
In order to simulate a spectral dataset suitable to take into account the complex variability of the  Table 3, leading to 499 mixtures of target species on the whole.
In view of the high number of considered pollutants, however taking into account that the simultaneous presence of a high number of them is unrealistic, only mixtures containing from 1 to 3 interfering species were considered. In order to realize a balanced design, 3 FDs have been developed, considering 13, 4 and 3 concentration levels for each combination of 1, 2 and 3 interfering species, respectively, which leads to the number of mixtures reported in Table 4. The whole number of mixtures of pollutants results equal to 34080. The consideration of all these mixtures is actually impractical, so that a subsampling procedure, which will be described in the following section, was adopted.
In order to properly define experimental domains for both targets and pollutants, it was important to cover quite wide concentration ranges, but at the same time it was also appropriate to consider a relatively high number of mixtures containing low concentrations, since low values are most likely to be found in real scenarios. To this aim, the experimental designs with the lower number of concentration levels (from 1 to 5) were built using a non-uniform spacing criterion for the concentrations, by defining a dyadic sequence such that, for example, for a 5 level FD in the range from 0 to 1, the non-uniform spacing leads to the following levels: 0, 1/8, 1/4, 1/2, 1.
For the air components a different planning scheme has been adopted, since in this case all the 9 components must be always included. Therefore, considering that the relevant concentrations  in order to guarantee that the most part of the range between TYP LIT and MAX LIT is covered, the highest concentration value falling within this range is forced to result ≥ 0.9 × MAX LIT ;  concentration values higher than 10 × MAX LIT are discarded;  in order to simulate anomalous though realistic situations, such as very high CO 2 concentrations that could be found, for instance, within a sealed shipping container, 2 to 5 concentration values > MAX LIT , and one value > 5 × MAX LIT have been considered;  unrealistic H 2 O concentration values (> 10 8 ppb) are not considered.

Final mixture concentration matrix
The final mixture concentration matrix was generated according to the following scheme ( Figure 5 This procedure was iterated 5 times, in order to build a final data matrix with 5000 different mixtures, which is supposed to give efficiently account for a real scenario.

Sigmoidal Weighting Function
Once corresponds to the percentile of the signal intensities at which the multiplication factor is equal to (c+1)/2, while a defines the slope of the sigmoidal function. SWF, whose effects are shown in Figure 6 for the high (Figure 6.a) and low (Figure 6.b) intensity values, needs being suitably tuned by properly setting the values of m and a. In particular, in Figure 6.b the difference between the adoption of a linear multiplication of the whole spectrum by a constant concentration factor (dotted lines) and the use of the appropriate SWF (solid lines) is well evident.

Simulated spectral data matrix
Each mixture spectrum of the simulated spectral data matrix is obtained by:  using equation (4) to multiply each unit concentration spectrum by the corresponding value in the concentration mixture matrix;  summing up all these spectra, based on the additive property of absorbance;  adding the noise estimated by equation (3).
The 5000 mixture spectra of the final dataset include air, pollutants, and targets, at the chosen concentration levels, in the proper mixing proportions.

TEST OF THE PROPOSED APPROACH
A preliminary experimental test was performed in order to verify the validity of the proposed procedure to simulate the spectra expected from the instrument going to be realized. The signals recorded with the EC-QCL-PAS prototype instruments on a binary mixture composed by CH 3 OH and NH 3 , and on a quaternary mixture composed by H 2 O, CO 2 , CH 3 OH and safrole, were used to this purpose. The two different mixtures were tested with two different laser setups, as it was described in Section 2.2. Moreover, for each laser source a spectrum collected with the same EC-QCL-PAS prototype instrument on 90 ppm CH 3 OH was compared with the corresponding simulated spectrum, in order to estimate the value of the constant multiplication factor necessary to convert the photoacoustic intensities into the corresponding absorbances. This constant multiplication factor was calculated as the ratio between the average of the absorbance values of the CH 3 OH simulated spectrum in the 1050-1060 cm -1 range and the average of the EC-QCL-PAS intensity values in the same spectral range. This choice was justified by the highest intensity values of the CH 3 OH spectrum in this range. The constant multiplication factor was then used to convert the EC-QCLPAS mixture spectrum to an absorbance plot that, in turn, was compared to the corresponding simulated spectrum. Figure 7 reports the comparison of the EC-QCL-PAS spectra (in black) with the corresponding simulated spectra (in gray), both for the binary NH 3 + CH 3 OH gas mixture (Figure 7.a) and for the quaternary H 2 O + CO 2 + CH 3 OH + safrole gas mixture (Figure 7.b). Both the simulated spectra show a satisfactory agreement with the experimentally measured ones, with respect to both the signal intensity and shape. In particular, the simulated spectrum of the binary mixture (Figure 7.a, spectral resolution 0.1 cm -1 ) is almost coincident with the corresponding EC-QCL-PAS spectrum, except for some very limited discrepancies in the intensity and position of the peaks with highest intensity (at about 1084.5, 1065.6 and 1046.4 cm -1 ) due to NH 3 . While the differences in the intensity values can be ascribed to the accuracy of the actual gas concentration and to the uneven distribution of the laser power curve, the slight shifts in the wavenumbers (max 0.2 cm -1 ) are likely due to the repeatability of the positioning of the laser source, which might cause some random changes in the absolute accuracy of the wavenumber values. A lower, though well acceptable performance was obtained for the simulation of the quaternary mixture (Figure 7.b, spectral resolution 0.5 cm -1 ). In this case, although the overall shape of the signal is quite well modeled, higher discrepancies are observed, especially in the range between 1060 and 1038 cm -1 . While the absorbance values of the simulated and of the measured spectra are very close to each other, differences can be observed in terms of spectral resolution. This is likely due to the fact that the laser used for the measurement of this gas mixture operated in pulsed mode, which typically broadens the linewidth of the laser itself. While the resolution used both for the acquisition of the experimental data and for the simulation of the spectra was equal to 0.5 cm -1 , the linewidth of the pulsed laser source is below 1 cm -1 : convolution of the laser linewidth with the actual spectrum occurs, which results in broadening of some sharp spectral features.
However, in general, also in view of the fact that the final instrumentation will employ a continuous wave laser source, the comparison between the simulated spectra and the corresponding EC-QCL-PAS experimental data confirms the validity of the proposed approach. This indicates that the dataset of 5000 simulated gas mixtures spectra constitutes a valuable tool to address properly the realisation of the final instrumentation.

CONCLUSIONS
In this work we faced the issue to build up a spectral database, specific of a given MIR instrument, [28] NIOSH -National Institute for Occupational Safety and Health, http://wwwn.cdc.gov/pubs/niosh.aspx, and therein cited references. Sauli Sinisalo received his Master's degree from the Department of Physics and Astronomy, University of Turku in 2011 from the field of Optics and Spectroscopy. Before graduating, he studied also at University of Washington, Seattle, in 2008 and started working at Gasera Ltd. in 2010 with gas analyzer development, strongly specializing in the laser based photoacoustic spectroscopy (LPAS). While closely following the cutting-edge laser technology and by applying diode-, QC-lasers and OPO's to photoacoustic trace gas detection he has also been co-authoring in a few scientific publications. He is currently a Client Partner at Gasera, responsible for managing of several EU-project contributions, client projects as well as managing product development.  Table 1 List of pollutants (interfering species) together with the relevant upper concentration limits. Table 2 Air components, with corresponding typical and maximum concentration values. Table 3 Definition of the number of mixtures of target molecules. Table 4 Definition of the number of mixtures of pollutants.       Comparison between simulated spectra (gray) and EC-QCL-PAS spectra (black) relative to: a) binary mixture of NH 3 (1 ppm) and CH 3 OH (1.6 ppm); b) quaternary mixture of H 2 O (380 ppm), CO 2 (10000 ppm), CH 3 OH (11 ppm) and safrole (11 ppm).   Table 3 N°