For clustering objects, we often collect not only continuous variables, but binary attributes as well. This paper proposes a model-based clustering approach for mixed binary and continuous variables in which each binary attribute is generated by a latent continuous variable that is dichotomized at a suitable threshold value, and in which the scores of the latent variables are estimated from the binary data. In economics, such variables are called utility functions, and the assumption is that the binary attributes (the presence or absence of a public service or utility) are determined by low and high values of these functions. In genetics, the latent response is interpreted as the ‘liability’ to develop a qualitative trait or phenotype. The estimated scores of the latent variables, together with the observed continuous variables, allow the use of a multivariate Gaussian mixture model for clustering, instead of a mixture of discrete and continuous distributions. After describing the method, this paper presents results on both simulated and real data and compares the performance of the multivariate Gaussian mixture model with that of a mixture of joint multivariate and multinomial distributions. Results show that the multivariate Gaussian mixture model outperforms the mixture of joint multivariate and multinomial distributions for variables with different scales, both in terms of classification error rate and in the reproduction of the cluster means.
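The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator, which the abstract does not specify. The sketch assumes a single standard normal latent variable, recovers the threshold from the observed share of ones, and uses truncated-normal conditional means as stand-in latent scores. All function names are hypothetical.

```python
import random
from statistics import NormalDist

# Assumption: the latent variable is standard normal.
STD_NORMAL = NormalDist(0.0, 1.0)


def dichotomize(latent, threshold):
    """Binary attribute: 1 when the latent value exceeds the threshold."""
    return [1 if z > threshold else 0 for z in latent]


def estimate_threshold(binary):
    """Threshold implied by the share of ones under a standard normal latent."""
    p = sum(binary) / len(binary)
    # Clamp away from 0 and 1 so the inverse CDF stays defined.
    p = min(max(p, 1e-6), 1.0 - 1e-6)
    return STD_NORMAL.inv_cdf(1.0 - p)


def latent_scores(binary, threshold):
    """Truncated-normal conditional means, used here as stand-in scores."""
    density = STD_NORMAL.pdf(threshold)
    upper_tail = 1.0 - STD_NORMAL.cdf(threshold)
    score_one = density / upper_tail             # E[z | z > threshold]
    score_zero = -density / (1.0 - upper_tail)   # E[z | z <= threshold]
    return [score_one if b == 1 else score_zero for b in binary]


# Simulate a latent variable, observe only its dichotomized version,
# then recover the threshold and assign a score to each observation.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(5000)]
true_threshold = 0.5
binary = dichotomize(latent, true_threshold)
tau_hat = estimate_threshold(binary)
scores = latent_scores(binary, tau_hat)
```

With a single binary attribute the scores take only two values, so this shows the latent-threshold mechanism rather than a useful clustering input. In a full analysis, scores estimated jointly from several binary attributes would be stacked with the observed continuous variables and passed to a Gaussian mixture implementation such as scikit-learn's `GaussianMixture`.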
A latent variables approach for clustering mixed binary and continuous variables within a Gaussian mixture model / Morlini, Isabella. - In: ADVANCES IN DATA ANALYSIS AND CLASSIFICATION. - ISSN 1862-5347. - PRINT. - 6:1(2012), pp. 5-28. [10.1007/s11634-011-0101-z]
A latent variables approach for clustering mixed binary and continuous variables within a Gaussian mixture model
MORLINI, Isabella
2012
File | Size | Format | |
---|---|---|---|
Morlini Adac 2012.pdf (restricted access; type: author's revised version accepted for publication) | 714.42 kB | Adobe PDF | |
The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal licence, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) licence, unless otherwise indicated.
In case of copyright violation, contact Iris Support.