An Overall Index for Comparing Hierarchical Clusterings

In this paper we suggest a new index for measuring the distance between two hierarchical clusterings. This index can be decomposed into the contributions pertaining to each stage of the hierarchies. We show the relations of such components with the currently used criteria for comparing two partitions. We obtain a similarity index as the complement to one of the suggested distances and we propose its adjustment for agreement due to chance. We consider the extension of the proposed distance and similarity measures to more than two dendrograms and their use for the consensus of classification and variable selection in cluster analysis.


Introduction
In cluster analysis, one may be interested in comparing two or more hierarchical clusterings obtained for the same set of $n$ objects. Indeed, different clusterings may be obtained by using different linkages, different distances or different sets of variables. In the literature the most popular measures have been proposed for comparing two partitions obtained by cutting the trees at a certain stage of the two hierarchical procedures (Rand 1971; Fowlkes and Mallows 1983; Hubert and Arabie 1985; Meila 2007; Youness and Saporta 2010). Less attention has been devoted to the comparison of the global results of two hierarchical classifications, i.e. two dendrograms obtained for the same set of objects. Sokal and Rohlf (1962) have introduced the so-called cophenetic correlation coefficient (see also Rohlf 1982 and Lapointe and Legendre 1995). Baker (1974) has proposed the rank correlation between stages where pairs of objects combine in the tree for measuring the similarity between two hierarchical clusterings. Reilly et al. (2005) have discussed the use of Cohen's kappa in studying the agreement between two classifications.
In this work we suggest a new index for measuring the dissimilarity between two hierarchical clusterings. This index is a distance and can be decomposed into the contributions pertaining to each stage of the hierarchies. In Sect. 2 we define the new index for two dendrograms. We then present its properties and its decomposition with reference to each stage. Section 3 shows the relations of each component of the index with the currently used criteria for comparing two partitions. Section 4 considers the similarity index obtained as the complement to one of the suggested distances and shows that its single components obtained at each stage of the hierarchies can be related to the measure $B_k$ suggested by Fowlkes and Mallows (1983). This section also deals with the adjustment of the similarity index for agreement due to chance. Section 5 considers the extension of the overall index to more than two clusterings. Section 6 gives some concluding remarks.

The Index and Its Properties
Suppose we have two hierarchical clusterings of the same set of $n$ objects. Let us consider the $N = n(n-1)/2$ pairs of objects and let us define, for each non-trivial partition in $k$ groups ($k = 2, \ldots, n-1$), a binary variable $X_k$ with values $x_{ik} = 1$ if the objects in pair $i$ ($i = 1, \ldots, N$) are classified in the same cluster in the partition in $k$ groups and $x_{ik} = 0$ otherwise. A binary $N \times (n-2)$ matrix $X_g$ for each clustering $g$ ($g = 1, 2$) may be derived, in which the columns are the binary variables $X_k$. A global measure of dissimilarity between the two clusterings may be defined as follows:
$$Z = \frac{\|X_1 - X_2\|}{\|X_1\| + \|X_2\|} \tag{1}$$
where $\|A\| = \sum_i \sum_k |a_{ik}|$ is the $L_1$ norm of the matrix $A$. In expression (1), since the matrices involved take only binary values, the $L_1$ norm is equal to the square of the $L_2$ norm.
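The construction of the pair matrices and of the index can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that $Z$ is the $L_1$ norm of $X_1 - X_2$ divided by the sum of the $L_1$ norms of $X_1$ and $X_2$, and that each hierarchy is supplied as a dictionary mapping $k$ to the list of cluster labels of the $n$ objects. The names `pair_matrix`, `z_index` and the three toy trees are hypothetical.

```python
from itertools import combinations

def pair_matrix(hierarchy, n):
    """Binary N x (n-2) matrix: one row per object pair, one column per k."""
    return [[int(hierarchy[k][a] == hierarchy[k][b]) for k in range(2, n)]
            for a, b in combinations(range(n), 2)]

def z_index(h1, h2, n):
    """Z = ||X1 - X2|| / (||X1|| + ||X2||), with entrywise L1 norms."""
    x1, x2 = pair_matrix(h1, n), pair_matrix(h2, n)
    numerator = sum(abs(u - v) for r1, r2 in zip(x1, x2) for u, v in zip(r1, r2))
    denominator = sum(map(sum, x1)) + sum(map(sum, x2))
    return numerator / denominator

# Hypothetical toy hierarchies on n = 4 objects: {k: cluster label of each object}.
tree_a = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
tree_b = {2: [0, 1, 0, 1], 3: [0, 1, 0, 2]}
tree_c = {2: [0, 0, 0, 1], 3: [0, 0, 1, 2]}

print(z_index(tree_a, tree_a, 4))  # identical hierarchies: 0.0
print(z_index(tree_a, tree_b, 4))  # no pair joined in both trees: 1.0
print(z_index(tree_a, tree_c, 4))  # 3/7, about 0.4286
```

Because the matrices are indexed by pairs of objects, relabelling the clusters of either tree leaves the result unchanged.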
Index $Z$ has the following properties:
• It is bounded in [0,1].
• $Z = 0$ if and only if the two hierarchical clusterings are identical, and $Z = 1$ when the two clusterings have the maximum degree of dissimilarity, that is, when at every stage $k$ no pair of objects is joined in both clusterings (each pair placed in the same group in one clustering is placed in two different groups in the other).
• It is a distance, since it satisfies the conditions of non-negativity, identity, symmetry and triangular inequality (Zani 1986).
• The complement to 1 of $Z$ is a similarity measure, since it satisfies the conditions of non-negativity, normalization and symmetry.
• It does not depend on the group labels since it refers to pairs of objects.
• It may be decomposed in $(n-2)$ parts related to each pair of partitions in $k$ groups since:
$$Z = \sum_{k=2}^{n-1} Z_k = \sum_{k=2}^{n-1} \frac{\|X_{1k} - X_{2k}\|}{\|X_1\| + \|X_2\|} \tag{2}$$
where $X_{gk}$ denotes the column of $X_g$ corresponding to the partition in $k$ groups. The plot of $Z_k$ versus $k$ shows the distance between the two clusterings at each stage of the procedure.
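The decomposition by stage can be illustrated with a short Python sketch. It assumes that each component $Z_k$ is the number of pairs joined in exactly one of the two partitions in $k$ groups, divided by the same global denominator as $Z$. The function names and the toy hierarchies are hypothetical, and each hierarchy is again given as a dictionary from $k$ to a list of cluster labels.

```python
from itertools import combinations

def joined_pairs(labels):
    """Set of object pairs placed in the same cluster by one partition."""
    return {(a, b) for a, b in combinations(range(len(labels)), 2)
            if labels[a] == labels[b]}

def z_components(h1, h2):
    """Return {k: Z_k}; the values sum to the overall index Z."""
    stages = sorted(h1)
    j1 = {k: joined_pairs(h1[k]) for k in stages}
    j2 = {k: joined_pairs(h2[k]) for k in stages}
    # Global denominator ||X1|| + ||X2||: all joined pairs over all stages.
    denominator = sum(len(j1[k]) + len(j2[k]) for k in stages)
    # The symmetric difference counts pairs joined in exactly one partition.
    return {k: len(j1[k] ^ j2[k]) / denominator for k in stages}

# Hypothetical toy hierarchies on n = 4 objects.
tree_a = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
tree_b = {2: [0, 1, 0, 1], 3: [0, 1, 0, 2]}

components = z_components(tree_a, tree_b)
print(components)                # Z_2 = 2/3, Z_3 = 1/3
print(sum(components.values()))  # overall Z
```

Plotting the returned values against $k$ gives the stage-by-stage profile described in the text.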

The Comparison of Two Partitions in k Groups
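Let us consider the comparison between two partitions in $k$ groups obtained at a certain stage of the hierarchical procedures. The measurement of agreement between two partitions of the same set of objects is a well-known problem in the classification literature and different approaches have been suggested (see, e.g., Brusco and Steinley 2008; Denoeud 2008). In order to highlight the relation of the suggested index with the ones proposed in the literature, we present the so-called matching matrix $M_k = [m_{fj}]$, where $m_{fj}$ indicates the number of objects placed in cluster $f$ ($f = 1, \ldots, k$) according to the first partition and in cluster $j$ ($j = 1, \ldots, k$) according to the second partition (Table 1). We denote the row and column totals of $M_k$ by $m_{f\cdot}$ and $m_{\cdot j}$. Information in Table 1 can be collapsed in a $(2 \times 2)$ contingency table, showing the cluster membership of the object pairs in each of the two partitions (Table 2). The number of pairs which are placed in the same cluster according to both partitions is
$$T_k = \sum_{f=1}^{k} \sum_{j=1}^{k} \binom{m_{fj}}{2} = \frac{1}{2}\left[\sum_{f=1}^{k} \sum_{j=1}^{k} m_{fj}^2 - n\right] \tag{3}$$
The counts of pairs joined in each partition are:
$$P_k = \sum_{j=1}^{k} \binom{m_{\cdot j}}{2} = \frac{1}{2}\left[\sum_{j=1}^{k} m_{\cdot j}^2 - n\right] \tag{4}$$
$$Q_k = \sum_{f=1}^{k} \binom{m_{f\cdot}}{2} = \frac{1}{2}\left[\sum_{f=1}^{k} m_{f\cdot}^2 - n\right] \tag{5}$$
so that $Q_k$ refers to the first partition and $P_k$ to the second. The four cells of Table 2 are therefore $T_k$ (pairs joined in both partitions), $Q_k - T_k$ and $P_k - T_k$ (pairs joined in one partition only) and $U_k = N - P_k - Q_k + T_k$ (pairs joined in neither partition). The numerator of formula (2) with reference to the two partitions in $k$ groups can be expressed as a function of the previous quantities:
$$\|X_{1k} - X_{2k}\| = P_k + Q_k - 2T_k \tag{6}$$
The well-known Rand index (Rand 1971) computed for two partitions in $k$ groups is given by (see Warrens 2008, for the derivation of the Rand index in terms of the quantities in Table 2):
$$R_k = \frac{N - P_k - Q_k + 2T_k}{N} \tag{7}$$
Therefore, the numerator of $Z_k$ in (2) can be expressed as a function of the Rand index:
$$\|X_{1k} - X_{2k}\| = N(1 - R_k) \tag{8}$$
The information in Table 2 can also be summarized by a similarity index, e.g. the simple matching coefficient (Sokal and Michener 1958):
$$SM_k = \frac{T_k + U_k}{T_k + (P_k - T_k) + (Q_k - T_k) + U_k} = \frac{T_k + U_k}{N} \tag{9}$$
If the Rand index is formulated in terms of the quantities in Table 2 it is equivalent to the simple matching coefficient and can be written as:
$$R_k = \frac{T_k + U_k}{N} \tag{10}$$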
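The pair counts for a single stage can be computed directly from two label vectors, as in the following Python sketch. It assumes the usual binomial-coefficient definitions: $T_k$ sums $\binom{m_{fj}}{2}$ over the cells of the matching matrix, $Q_k$ and $P_k$ do the same over the cluster sizes of the first and second partition respectively, and $U_k = N - P_k - Q_k + T_k$. The function names and the example partitions are hypothetical.

```python
from collections import Counter
from math import comb

def pair_counts(labels1, labels2):
    """Return (N, T, P, Q, U) for two partitions of the same objects."""
    N = comb(len(labels1), 2)
    # Cells of the matching matrix: objects sharing both cluster labels.
    T = sum(comb(m, 2) for m in Counter(zip(labels1, labels2)).values())
    Q = sum(comb(m, 2) for m in Counter(labels1).values())  # first partition
    P = sum(comb(m, 2) for m in Counter(labels2).values())  # second partition
    U = N - P - Q + T  # pairs joined in neither partition
    return N, T, P, Q, U

def rand_index(labels1, labels2):
    """Rand index as the share of pairs treated alike by both partitions."""
    N, T, P, Q, U = pair_counts(labels1, labels2)
    return (T + U) / N

# Hypothetical partitions of n = 6 objects into k = 3 groups.
first = [0, 0, 1, 1, 2, 2]
second = [0, 0, 1, 2, 2, 1]

N, T, P, Q, U = pair_counts(first, second)
print(N, T, P, Q, U)              # 15 1 3 3 10
print(P + Q - 2 * T)              # pairs joined in exactly one partition: 4
print(rand_index(first, second))  # 11/15
```

In this example $P_k + Q_k - 2T_k = 4 = N(1 - R_k)$, which is the relation between the numerator of $Z_k$ and the Rand index.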

The Complement of the Index
Since $\|X_1\| = \sum_k Q_k$ and $\|X_2\| = \sum_k P_k$, the complement to 1 of $Z$ is:
$$S = 1 - Z = \frac{2\sum_k T_k}{\sum_k Q_k + \sum_k P_k} \tag{11}$$
Also the similarity index $S$ may be decomposed in $(n-2)$ parts $V_k$ related to each pair of partitions in $k$ groups:
$$S = \sum_k V_k = \sum_k \frac{2T_k}{\sum_l Q_l + \sum_l P_l} \tag{12}$$
The components $V_k$, however, are not similarity indices for each $k$ since they assume values $< 1$ even if the two partitions in $k$ groups are identical. For this reason, we consider the complement to 1 of each $Z_k$, with the distance at stage $k$ normalized by $\|X_{1k}\| + \|X_{2k}\|$, in order to obtain a single similarity index for each pair of partitions:
$$S_k = 1 - \frac{\|X_{1k} - X_{2k}\|}{\|X_{1k}\| + \|X_{2k}\|} \tag{13}$$
Expression (13) can be written as:
$$S_k = \frac{2T_k}{Q_k + P_k} \tag{14}$$
The index suggested by Fowlkes and Mallows (1983) for two partitions in $k$ groups in our notation is given by:
$$B_k = \frac{T_k}{\sqrt{P_k Q_k}} \tag{15}$$
The statistics $B_k$ and $S_k$ may be thought of as resulting from two different methods of scaling $T_k$ to lie in the unit interval. Furthermore, in $S_k$ and $B_k$ the pairs $U_k$ (see Table 2), which are not joined in either of the clusterings, are not considered as indicative of similarity. On the contrary, in the Rand index, the pairs $U_k$ are considered as indicative of similarity. With many clusters, $U_k$ must necessarily be large and the inclusion of this count makes $R_k$ tend to 1 for large $k$. How strongly the treatment of the pairs $U_k$ may influence the values of $R_k$ for different $k$, or the values of $R_k$ and $B_k$ for the same $k$, is illustrated in Wallace (1983). A similarity index between two partitions may be adjusted for agreement due to chance (Hubert and Arabie 1985; Albatineh et al. 2006; Warrens 2008). With reference to formula (13) the adjusted similarity index $AS_k$ has the form:
$$AS_k = \frac{S_k - E(S_k)}{\max(S_k) - E(S_k)} \tag{16}$$
Under the hypothesis of independence of the two partitions, the expectation of $T_k$ in Table 2 is:
$$E(T_k) = \frac{P_k Q_k}{N} \tag{17}$$
Therefore, the expectation of $S_k$ is given by:
$$E(S_k) = \frac{2 P_k Q_k}{N(Q_k + P_k)} \tag{18}$$
Given that $\max(S_k) = 1$, we obtain:
$$AS_k = \frac{\dfrac{2T_k}{Q_k + P_k} - \dfrac{2 P_k Q_k}{N(Q_k + P_k)}}{1 - \dfrac{2 P_k Q_k}{N(Q_k + P_k)}} \tag{19}$$
Simplifying terms, this reduces to:
$$AS_k = \frac{2(N T_k - P_k Q_k)}{N(Q_k + P_k) - 2 P_k Q_k} \tag{20}$$
The adjusted Rand index for two partitions in $k$ groups is given by (Warrens 2008):
$$AR_k = \frac{2(N T_k - P_k Q_k)}{N(P_k + Q_k) - 2 P_k Q_k} \tag{21}$$
and so $AS_k$ is equal to the Adjusted Rand Index.
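The stage-wise similarity measures can be compared numerically with the following Python sketch. It assumes $S_k = 2T_k/(P_k + Q_k)$, the Fowlkes and Mallows index $B_k = T_k/\sqrt{P_k Q_k}$, and the chance correction $(S_k - E(S_k))/(1 - E(S_k))$ with $E(T_k) = P_k Q_k/N$. The function name and the example partitions are hypothetical.

```python
from collections import Counter
from math import comb, sqrt

def stage_similarities(labels1, labels2):
    """Return (S_k, B_k, AS_k, AR_k) for two partitions of the same objects."""
    N = comb(len(labels1), 2)
    T = sum(comb(m, 2) for m in Counter(zip(labels1, labels2)).values())
    Q = sum(comb(m, 2) for m in Counter(labels1).values())
    P = sum(comb(m, 2) for m in Counter(labels2).values())
    # P and Q are at least 1 whenever each partition has fewer groups than objects.
    S = 2 * T / (P + Q)
    B = T / sqrt(P * Q)
    expected_S = 2 * P * Q / (N * (P + Q))  # uses E(T_k) = P_k * Q_k / N
    # The denominator vanishes only in the trivial one-cluster case P = Q = N.
    adjusted_S = (S - expected_S) / (1 - expected_S)
    adjusted_rand = 2 * (N * T - P * Q) / (N * (P + Q) - 2 * P * Q)
    return S, B, adjusted_S, adjusted_rand

# Hypothetical partitions of n = 6 objects into k = 3 groups.
first = [0, 0, 1, 1, 2, 2]
second = [0, 0, 1, 2, 2, 1]

S, B, AS, AR = stage_similarities(first, second)
print(S, B)    # both equal 1/3 here because P_k = Q_k
print(AS, AR)  # both equal 1/6, up to floating-point rounding
```

The last line checks on one example that the chance-adjusted $S_k$ coincides with the adjusted Rand index.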

Extension to More than Two Clusterings
When a set of $G$ ($G > 2$) hierarchical clusterings for the same set of objects is available, we may be interested in gaining insights into the relations between the different classifications. The index $Z$ defined in (1) may be applied to each pair of clusterings in order to produce a $G \times G$ distance matrix $\mathbf{Z} = [Z_{gh}]$, $g, h = 1, \ldots, G$. Furthermore, considering the index $S$ defined in (11) for each pair of dendrograms, we obtain a $G \times G$ similarity matrix $\mathbf{S} = [S_{gh}]$, $g, h = 1, \ldots, G$, that displays the proximities between each pair of classifications. Usually, the $G$ clusterings are obtained by applying different algorithms to the same data set. In this case, matrices $\mathbf{Z}$ and $\mathbf{S}$ may be useful in the context of the "consensus of classifications", i.e. the problem of reconciling clustering information coming from different methods (Gordon and Vichi 1998; Krieger and Green 1999). Clusterings with high distances (or low similarities) from all the others can be deleted before computing the single (consensus) clustering. Indexes $Z$ and $S$ can also be used for variable selection in cluster analysis (Fowlkes et al. 1988; Fraiman et al. 2008; Steinley and Brusco 2008). The inclusion of "noisy" variables can actually degrade the ability of clustering procedures to recover the true underlying structure. For a set of $p$ variables and a certain clustering method, we suggest different approaches.
First we may obtain the $p$ one-dimensional clusterings with reference to each single variable and then compute the $p \times p$ similarity matrix $\mathbf{S}$. The pairs of variables reflecting the same underlying structure show high similarity and can be used to obtain a multidimensional classification. On the contrary, the noisy variables should present a similarity with the other variables near to the expected value for chance agreement. We may select a subset of variables that best explains the classification into homogeneous groups. These variables help us to better understand the multivariate structure and suggest a dimension reduction that can be used in a new data set for the same problem (Fraiman et al. 2008).
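The pairwise similarity matrix used in this section, whether for $G$ clusterings or for $p$ one-dimensional clusterings, can be assembled as in the following Python sketch. It assumes that the similarity is the complement to 1 of the distance $Z$, that is, twice the number of pairs joined in both hierarchies over all stages divided by the total number of joined pairs. The function names and the three toy hierarchies are hypothetical.

```python
from itertools import combinations

def joined_pairs(labels):
    """Set of object pairs placed in the same cluster by one partition."""
    return {(a, b) for a, b in combinations(range(len(labels)), 2)
            if labels[a] == labels[b]}

def s_index(h1, h2):
    """Similarity S = 1 - Z between two hierarchies given as {k: labels}."""
    both = sum(len(joined_pairs(h1[k]) & joined_pairs(h2[k])) for k in h1)
    total = sum(len(joined_pairs(h1[k])) + len(joined_pairs(h2[k])) for k in h1)
    return 2 * both / total

def similarity_matrix(hierarchies):
    """Symmetric matrix of pairwise similarities with unit diagonal."""
    size = len(hierarchies)
    matrix = [[1.0] * size for _ in range(size)]
    for g, h in combinations(range(size), 2):
        matrix[g][h] = matrix[h][g] = s_index(hierarchies[g], hierarchies[h])
    return matrix

# Hypothetical toy hierarchies on n = 4 objects.
tree_a = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
tree_b = {2: [0, 1, 0, 1], 3: [0, 1, 0, 2]}
tree_c = {2: [0, 0, 0, 1], 3: [0, 0, 1, 2]}

S = similarity_matrix([tree_a, tree_b, tree_c])
row_means = [(sum(row) - 1.0) / (len(row) - 1) for row in S]
print(S)          # off-diagonal entries: 0, 4/7 and 2/7
print(row_means)  # mean similarity of each hierarchy to the others
```

In this toy example the second hierarchy has the lowest mean similarity to the others (1/7), so it would be the first candidate for deletion before a consensus is computed.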
A second approach consists in finding the similarities between clusterings obtained with subsets of variables (regarding, for example, different features). This approach is helpful in finding aspects that lead to similar partitions and subsets of variables that, on the contrary, lead to different clusterings.
A third way to proceed consists in finding the similarities between the "master" clustering obtained by considering all variables and the clusterings obtained by eliminating each single variable in turn, in order to highlight the "marginal" contribution of each variable to the master structure.
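The third approach can be sketched as follows. The paper does not fix a clustering method, so this illustration assumes a naive single-linkage procedure with Euclidean distances, and it measures the effect of removing a variable by the distance $Z$ between the master hierarchy and the reduced one. All function names and the toy data set are hypothetical.

```python
from itertools import combinations
from math import dist

def single_linkage(points):
    """Naive single linkage; returns {k: labels} for k = n-1, ..., 2."""
    n = len(points)
    clusters = [[i] for i in range(n)]
    stages = {}
    while len(clusters) > 2:
        # Closest pair of clusters under the minimum inter-point distance.
        _, i, j = min(
            (min(dist(points[a], points[b]) for a in ca for b in cb), i, j)
            for (i, ca), (j, cb) in combinations(enumerate(clusters), 2))
        clusters[i] += clusters[j]
        del clusters[j]  # j > i, so position i is unaffected
        labels = [0] * n
        for c, members in enumerate(clusters):
            for m in members:
                labels[m] = c
        stages[len(clusters)] = labels
    return stages

def joined_pairs(labels):
    return {(a, b) for a, b in combinations(range(len(labels)), 2)
            if labels[a] == labels[b]}

def z_index(h1, h2):
    """Distance Z between two hierarchies given as {k: labels}."""
    differ = sum(len(joined_pairs(h1[k]) ^ joined_pairs(h2[k])) for k in h1)
    total = sum(len(joined_pairs(h1[k])) + len(joined_pairs(h2[k])) for k in h1)
    return differ / total

def leave_one_out_distances(data):
    """Z between the master hierarchy and each hierarchy with one variable removed."""
    master = single_linkage(data)
    distances = []
    for v in range(len(data[0])):
        reduced = [tuple(x for c, x in enumerate(row) if c != v) for row in data]
        distances.append(z_index(master, single_linkage(reduced)))
    return distances

# Hypothetical data: 5 objects, 3 variables; the third variable is constant.
data = [(0.0, 0.0, 1.0), (1.0, 0.2, 1.0), (5.0, 0.1, 1.0),
        (6.5, 0.3, 1.0), (12.0, 0.0, 1.0)]
loo = leave_one_out_distances(data)
print(loo)
```

In this toy data set, removing the constant third variable leaves the hierarchy unchanged, so its distance from the master clustering is 0, while removing the first variable changes the very first merge and gives a positive distance.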

Concluding Remarks
In this paper we have introduced a new index to compare two hierarchical clusterings. This measure is a distance and it is appealing since it summarizes the dissimilarity in one number and can be decomposed into contributions relative to each pair of partitions. This "additive" feature is necessary for comparisons with other indices and for interpretability purposes. The complement to 1 of the suggested measure is a similarity index and it can also be expressed as a sum of the components with reference to each stage of the clustering procedure.
The new distance is a measure of dissimilarity of two sequences of partitions of $n$ objects into $2, 3, \ldots, n-2, n-1$ groups. The fact that these partitions come from successive cuts of two hierarchical trees is irrelevant. The partitions could also come from a sequence of non-hierarchical clusterings (obtained, e.g., by k-means methods with a different number of groups).
Further studies are needed in order to illustrate the performance of the suggested indices on both real and simulated data sets.