Mixed Mode Data Clustering: An Approach Based on Tetrachoric Correlations

In this paper we address the problem of clustering mixed mode data by assuming that the observed binary variables are generated from latent continuous variables. We perform a principal component analysis on the matrix of tetrachoric correlations, estimate the scores of each latent variable, and construct a data matrix with all continuous variables to be used in fully Gaussian mixture models or in k-means cluster analysis. The expected a posteriori (EAP) estimates can be computed by considering only a limited number of quadrature points. The main results of a simulation study and of an application to a real data set are reported.


Introduction
One possible approach to cluster analysis is the mixture maximum likelihood method, in which the data to be clustered are assumed to come from a finite mixture of populations. The method has been well developed and is much used for the case of normal populations. A main advantage of using Gaussian distributions is that a number of possible restrictions on the covariance matrices have been proposed in the literature (e.g., [1,3]) to deal with different local dependencies and, at the same time, to alleviate the problem of the number of parameters growing rapidly with the data dimension and with the number of clusters. A large range of Gaussian models is available, from the simple spherical one to the least parsimonious, in which all elements of the covariance matrix are allowed to vary across clusters.

Practical applications, however, often involve mixtures of categorical and continuous variables. Everitt [4] and Everitt and Merette [5] extended the normal model to deal with mixed mode data, but the computation involved in their model is so extensive that it is only feasible for data with very few categorical variables. Lawrence and Krzanowski [7] and Vermunt and Magidson [12] propose conditional Gaussian models with local dependencies. Local dependencies are specified only between pairs of categorical variables and between pairs of continuous variables, and are dealt with via joint multinomial and multivariate normal distributions. In the "Latent Gold" package [11] the dependence between a categorical and a continuous variable may be handled with a sort of "trick", by doubling the categorical variable and treating it also as a covariate. The estimated dependence, however, may not vary between groups. The mixture model for large data sets implemented in the SPSS package is also based on joint multinomial and Gaussian distributions and postulates local independence between a categorical and a continuous variable.
Here we address the problem of clustering data measured on different scales, allowing local dependencies also between a categorical and a continuous variable, by assuming that each observed categorical variable is generated from a latent continuous variable and by estimating the scores of these latent variables. In economics, these latent variables are called utility functions, and the assumption is that the response (which may be, for example, the presence or absence of a public service or a public utility) is determined by the crossing of certain thresholds in these functions (see, among others, [8]). Heckman [6] models whether or not American states have introduced fair-employment legislation and describes the corresponding latent response as the "sentiment" favoring fair-employment legislation. In genetics, the latent response is interpreted as the "liability" to develop a qualitative trait or phenotype. There are also examples of continuous variables which are sampled as binary (among others, bit data originating from electric voltages). Skrondal and Rabe-Hesketh [10], pp. 16-17, report various interpretations of these latent variables and also state that assuming a latent continuous variable may be useful regardless of whether the latent response can be given a real meaning.
This work represents the first step in the construction of fully Gaussian models for classification, in which correlations among variables may vary across groups and variable selection may also be handled differently in each group. Here we estimate the scores of each latent variable and obtain a data matrix with all continuous variables to be used in these models. An application shows that some benefits of using a data matrix with all continuous variables, instead of a mixed mode data matrix, may also be obtained in k-means cluster analysis.

From Binary Variables to Continuous Variables
The essential feature of the method described in this section is that the observed categorical variables are generated from underlying latent continuous variables according to the values of a set of thresholds. Here we formalize results for binary variables, but the theory may be extended to multinomial variables by estimating the matrix of polychoric correlations. Given p vectors of binary variables observed for a sample of size n, a contingency table is constructed for each pair of variables X_k and X_j, with cell frequencies denoted b_jk, c_jk and e_jk. The estimated threshold generating the variable X_k is the value h_k satisfying Φ(h_k) = (e_jk + c_jk)/n; for variable X_j it is the value h_j satisfying Φ(h_j) = (e_jk + b_jk)/n, where Φ is the standard normal cumulative distribution function. We then estimate the tetrachoric correlation coefficient r_jk, conditional on these thresholds, via maximum likelihood. The solution may be found iteratively or by using the approximate analytic solution (1). In tables with zero frequencies, zero values are set to 0.5. In a simulation study with 5,000 different data sets of size (100 × 6) generated from 10 multivariate normal populations, the estimator (1) has been shown to give better results than other estimators based on approximate analytic solutions of the likelihood function.

The (n × p) matrix of the scores of the p latent continuous variables is obtained with expected a posteriori (EAP) estimates. In order to obtain semiparametric estimates, we consider a model based on principal components rather than on factors (see, for example, [2] and [9] for EAP estimates obtained with a fully parametric model in which the thresholds, eigenvalues and eigenvectors associated with each factor are also estimated by maximizing the likelihood function). We perform a principal component analysis on the matrix of tetrachoric correlations (which does not require previous smoothing if the matrix is not positive definite) and consider the model

t_ij = Σ_{k=1}^p a_jk y_ik,

where t_ij is the score of principal component j for case i, the a_jk are the loadings (eigenvectors) and y_ik is the score of case i on the kth latent variable, associated with the observed categorical variable x_k through the threshold: x_ik = 1 if y_ik ≥ h_k and x_ik = 0 otherwise. As assumed for the threshold estimates, y ∼ N(0, I) and t ∼ N(0, Λ), where Λ is a diagonal matrix whose elements λ²_j = Σ_{k=1}^p a²_jk are equal to the eigenvalues. The EAP estimator of the jth principal component score is the mean of the posterior distribution of t_j given the observed responses, where w denotes the vector of known parameters (the thresholds and the eigenvectors).
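As a rough illustration of the steps just described, the following Python sketch estimates the thresholds from the marginal frequencies and the pairwise tetrachoric correlations by numerically maximizing the bivariate-normal likelihood (the iterative route mentioned above; the approximate analytic solution (1) is not reproduced here). Function names and implementation details are ours, not the authors'.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal, norm


def estimate_threshold(x):
    """Threshold h_k such that Phi(h_k) equals the observed proportion of zeros."""
    p0 = np.clip(np.mean(x == 0), 1e-6, 1 - 1e-6)
    return norm.ppf(p0)


def tetrachoric_ml(xk, xj, hk, hj):
    """ML estimate of the tetrachoric correlation, conditional on the thresholds.
    Zero cells of the 2x2 table are set to 0.5, as in the paper."""
    counts = np.array([np.sum((xk == 0) & (xj == 0)),
                       np.sum((xk == 0) & (xj == 1)),
                       np.sum((xk == 1) & (xj == 0)),
                       np.sum((xk == 1) & (xj == 1))], dtype=float)
    counts[counts == 0] = 0.5

    def negloglik(rho):
        # Cell probabilities under a standard bivariate normal with correlation rho
        p00 = multivariate_normal.cdf([hk, hj], mean=[0.0, 0.0],
                                      cov=[[1.0, rho], [rho, 1.0]])
        p01 = norm.cdf(hk) - p00          # x_k = 0, x_j = 1
        p10 = norm.cdf(hj) - p00          # x_k = 1, x_j = 0
        p11 = 1.0 - p00 - p01 - p10
        probs = np.clip([p00, p01, p10, p11], 1e-12, 1.0)
        return -np.sum(counts * np.log(probs))

    return minimize_scalar(negloglik, bounds=(-0.999, 0.999), method="bounded").x


def tetrachoric_pca(X):
    """Thresholds, tetrachoric correlation matrix and its eigendecomposition
    for an (n x p) binary data matrix X."""
    n, p = X.shape
    h = np.array([estimate_threshold(X[:, k]) for k in range(p)])
    R = np.eye(p)
    for k in range(p):
        for j in range(k + 1, p):
            R[k, j] = R[j, k] = tetrachoric_ml(X[:, k], X[:, j], h[k], h[j])
    lam2, A = np.linalg.eigh(R)           # eigenvalues lam_j^2 and loadings a_jk
    return h, R, lam2[::-1], A[:, ::-1]   # sorted from largest to smallest
```

In this sketch, column j of A holds the eigenvector a_j, so that λ²_j = Σ_{k=1}^p a²_jk, consistently with the model above.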
In the following, for economy of space, w is omitted. Let σ²_jk = λ²_j − a²_jk = Σ_{h≠k} a²_jh. After introducing a change of variable (handled separately when a_jk < 0) and assuming the binary variables x_k to be independent conditional on each component t_j, the likelihood of the observed responses given t_j factorizes over the p variables. We then consider S quadrature points and estimate the scores by numerical integration, where the t_sj are equally spaced points in [−z_j, z_j] with Φ(−z_j/λ_j) = 0.001, and φ(t_sj) is the density of each point under the N(0, λ²_j) curve times the interval size; the EAP estimate t̃_ij is the weighted average of the quadrature points, with weights proportional to φ(t_sj) times the conditional likelihood of the observed responses.
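Since the intermediate equations are not reproduced above, the following sketch computes the quadrature-based EAP estimates of the component scores directly from the joint normality of y and t_j implied by the model (so that y_k given t_j is normal with mean a_jk t_j / λ²_j and variance σ²_jk / λ²_j). It is an illustration under these assumptions, not necessarily the authors' exact formulation.

```python
import numpy as np
from scipy.stats import norm


def eap_component_scores(X, A, lam2, h, S=40):
    """Quadrature-based EAP estimates of the principal-component scores t_ij.

    X    : (n x p) binary data matrix
    A    : (p x p) loadings, column j holding the eigenvector a_j
    lam2 : eigenvalues lam_j^2 of the tetrachoric correlation matrix
    h    : estimated thresholds h_k
    S    : number of quadrature points
    """
    n, p = X.shape
    T = np.zeros((n, p))
    for j in range(p):
        lam = np.sqrt(lam2[j])
        z = lam * norm.ppf(0.999)                 # so that Phi(-z / lam) = 0.001
        t_s = np.linspace(-z, z, S)               # equally spaced quadrature points
        step = t_s[1] - t_s[0]
        prior = norm.pdf(t_s, scale=lam) * step   # N(0, lam^2) density times interval size

        # y_k | t_j is normal under the model y ~ N(0, I), t_j = sum_k a_jk y_ik
        a = A[:, j]
        mu = np.outer(t_s, a) / lam2[j]           # (S x p) conditional means
        sd = np.sqrt(1.0 - a ** 2 / lam2[j])      # conditional standard deviations
        p1 = np.clip(norm.sf((h - mu) / sd), 1e-12, 1 - 1e-12)   # P(x_k = 1 | t_s)

        for i in range(n):
            # conditional independence of the x_k given t_j
            lik = np.prod(np.where(X[i] == 1, p1, 1.0 - p1), axis=1)
            w = lik * prior
            T[i, j] = np.sum(t_s * w) / np.sum(w)  # posterior mean (EAP)
    return T
```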
Given the estimates t̃_ij, the EAP estimates ỹ_ik of the latent variables may then be obtained through analogous steps. The EAP estimator of the kth variable score is the mean of the posterior distribution of y_k. Let y⁺_ik denote the values y_ik ≥ h_k and y⁻_ik the values y_ik < h_k (the case a_jk < 0 is handled separately in (12)). Considering S quadrature points, we estimate the scores by numerical integration, where the y_sk are equally spaced points in [−z_j, h_k] when x_ik = 0 and in [h_k, z_j] when x_ik = 1, with Φ(−z_j) = 0.001, and φ(y_sk) is the density of each point under the N(0, 1) curve times the interval size.
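The body of Eq. (12) is not reproduced in the text above, so the following sketch implements one plausible reading of the "analogous steps": the quadrature grid follows the description (points in [−z, h_k] when x_ik = 0 and in [h_k, z] when x_ik = 1, with Φ(−z) = 0.001), while the estimated component scores are assumed to enter through t_j | y_k ∼ N(a_jk y_k, σ²_jk). Treat it as an assumption-laden illustration rather than the authors' exact estimator.

```python
import numpy as np
from scipy.stats import norm


def eap_latent_scores(X, T, A, lam2, h, S=40):
    """Quadrature-based EAP estimates of the latent-variable scores y_ik,
    given the estimated component scores T from eap_component_scores.

    Assumed form: the posterior weight of each quadrature point y_s is the
    standard normal density times the interval size, multiplied by the
    densities of the estimated t_ij under t_j | y_k ~ N(a_jk * y_s, sigma_jk^2)."""
    n, p = X.shape
    z = norm.ppf(0.999)                           # so that Phi(-z) = 0.001
    Y = np.zeros((n, p))
    for i in range(n):
        for k in range(p):
            lo, hi = ((h[k], z) if X[i, k] == 1 else (-z, h[k]))
            y_s = np.linspace(lo, hi, S)          # grid restricted by the observed x_ik
            step = y_s[1] - y_s[0]
            w = norm.pdf(y_s) * step              # N(0, 1) density times interval size
            for j in range(p):
                s_jk = np.sqrt(max(lam2[j] - A[k, j] ** 2, 0.0))   # sigma_jk
                if s_jk > 0:
                    w = w * norm.pdf(T[i, j], loc=A[k, j] * y_s, scale=s_jk)
            Y[i, k] = np.sum(y_s * w) / np.sum(w)  # posterior mean (EAP)
    return Y
```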

Main Results on a Simulation Study and on a Real Data Set
A simulation study is used to evaluate the accuracy of the tetrachoric correlations and of the score estimates. From 10 standard multivariate normal populations with correlation matrices P having equal off-diagonal elements ρ_rc, r ≠ c, ranging from 0.0 to 0.95, we generate 5,000 data sets (500 from each population) of size (100 × 6). We then dichotomize the 6 variables by imposing random thresholds drawn from a uniform distribution on the interval [−2, +2]. The mean absolute errors (MAEs) of the threshold estimates for each variable (averaged over the 5,000 data sets and the 100 observations of each set) are always less than 0.06. Considering "difficult variables", generated by thresholds outside the interval [−1, +1], the MAEs increase to 0.11. These less accurate estimates also lead to larger errors in the score estimates. Using (1), the MAEs obtained for the different values of ρ, averaged over the 500 data sets generated with each correlation matrix, the 100 observations of each set and the 15 correlation coefficients, seem particularly accurate for all values of ρ; mean errors also decrease as the true correlations among the variables increase.

Boxplots of the MAEs of the eigenvalues of the principal components, computed with respect to both the correlation matrix P used to generate the data and the correlation matrix R of the generated data, are reported in Fig. 1.

Fig. 1 Boxplots of the mean absolute errors of the eigenvalues plotted against the original correlations ρ_rc. In the left-hand boxes, errors are calculated between the eigenvalues of the tetrachoric correlation matrix and the eigenvalues of the matrix R of the generated data. In the right-hand boxes, errors are calculated between the eigenvalues of the tetrachoric correlation matrix and the eigenvalues of the matrix P used to generate the data. In the upper boxes errors are averaged over the six eigenvalues; in the lower boxes errors are calculated only for the first eigenvalue.

For values of ρ_rc not exceeding 0.8, the estimates of all the eigenvalues better recover the computed correlation matrix than the matrix used to generate the data. This is not true for the first eigenvalue: when it is large (and the correlations exceed 0.8), the estimates better recover the first eigenvalue of the matrix P. We then estimate the scores of each latent variable and of the principal components. The MAEs, averaged over the 500 data sets generated for each correlation matrix and over the six variables and the 100 observations of each set, show that as the correlation among the variables increases there is an improvement in the principal component estimates; on the contrary, results regarding the latent variables do not seem to depend on ρ. Estimates of the scores of the latent variables show improvements in average accuracy when the generated thresholds are close to zero, that is, close to the mean (and the median) of the latent variables.
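For concreteness, a minimal sketch of the data-generating scheme used in the simulation study (equicorrelated standard normal variables dichotomized at uniform random thresholds) is given below; it reproduces only the setup and the threshold MAE for a single replicate, with names chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)


def simulate_binary(n=100, p=6, rho=0.5):
    """One replicate of the simulation setup: a standard multivariate normal
    sample with equicorrelation rho, dichotomized at random thresholds
    drawn uniformly in [-2, +2]."""
    P = np.full((p, p), rho)
    np.fill_diagonal(P, 1.0)
    Y = rng.multivariate_normal(np.zeros(p), P, size=n)
    h = rng.uniform(-2.0, 2.0, size=p)
    X = (Y >= h).astype(int)              # x_ik = 1 when y_ik >= h_k
    return X, Y, h, P


X, Y, h, P = simulate_binary()
h_hat = norm.ppf(np.clip((X == 0).mean(axis=0), 1e-6, 1 - 1e-6))
print("threshold MAE:", np.abs(h_hat - h).mean())
```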
When the thresholds are outside the range [−1, +1], average errors are significantly larger. Average errors, however, are always less than the variance of each variable, and the results seem accurate enough. Table 1 reports the MAEs of the latent variable scores obtained in a further study. Here the 6 binary variables are obtained by generating a (5000 × 6) data set from the same zero-mean multivariate normal populations as before, but with fixed thresholds: −2, −0.5, 0, 0.2, 0.5, 2. Average errors (in the last row) show that the accuracy of the EAP estimates increases as the threshold approaches zero. On the other hand, considering the errors computed for different values of the true scores, we note that the minimum average errors (reported in bold) are obtained for values near the thresholds. The worst fits are obtained for large positive values when the threshold is −2 and for large negative values when the threshold is +2. For the variables with thresholds −0.2, 0, 0.2 and 0.5, the correlations between real and estimated scores are 0.74, 0.78, 0.78 and 0.75, respectively.

We then consider the internet advertisement data set from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). The features encode the geometry of the image as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The cluster membership of each image is known (the clusters are: advertisement or not advertisement). After removing instances with missing values and selecting the binary variables with relative frequencies higher than 0.1, we obtain a data set with 2,359 instances, 3 continuous variables and 10 binary variables. We perform a k-means cluster analysis and run the mixture model implemented in SPSS, first with mixed mode variables (normalizing the continuous variables to the interval [0, 1]) and then with all continuous variables (using the estimated scores of the binary ones). The classification error rate decreases from 33% to 30% with k-means and from 35% to 32% with the mixture model.
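The comparison on the advertisement data can be sketched as follows. This is a sketch only: the UCI data loading and preprocessing, and the SPSS mixture model, are not reproduced; scikit-learn's k-means stands in for the clustering step, and X_cont, X_bin, Y_scores and labels are assumed to hold the continuous features, the selected binary features, their EAP scores and the known memberships.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def kmeans_error_rate(Z, labels, seed=0):
    """Two-cluster k-means and the resulting classification error rate,
    accounting for the arbitrary labelling of the two clusters."""
    pred = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(Z)
    err = np.mean(pred != labels)
    return min(err, 1.0 - err)


def compare_matrices(X_cont, X_bin, Y_scores, labels):
    """Error rate with the mixed mode matrix versus the all-continuous matrix."""
    scaled = MinMaxScaler().fit_transform(X_cont)        # continuous variables in [0, 1]
    mixed = np.hstack([scaled, X_bin])                   # continuous + raw binary
    continuous = np.hstack([scaled, Y_scores])           # continuous + EAP scores
    return kmeans_error_rate(mixed, labels), kmeans_error_rate(continuous, labels)
```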

Concluding Remarks
Although it is clearly impossible to generalize from the results presented, it does appear that estimating the scores of the latent continuous variables generating the binary values may improve the clustering results and, above all, allows fully Gaussian models with different correlations among the variables in each group to be used for classification. This paper describes an initial investigation into the feasibility of estimating the scores of each latent continuous variable. In the literature, only EAP estimates of the most relevant factors have been presented, for the different aims of estimating composite items that are assumed to represent a particular set of constructs and of data reduction. Here the aim is to obtain a continuous data matrix with the same dimensions as the original one. Possible variations and improvements to the proposed method are relevant topics for future research. Future simulations will involve data generated from distributions other than the normal, to explore whether the EAP estimates also work well in these cases. Indeed, although the threshold estimates are based on the normal distribution and the t_ij and y_ik are assumed to be Gaussian, the EAP estimates are little affected by the choice of this distribution, since the loadings and the eigenvalues are not estimated by maximum likelihood.