New Weighted Similarity Indexes for Market Segmentation Using Categorical Variables

In this paper we introduce new similarity indexes for binary and polytomous variables, employing the concept of "information content". In contrast to traditionally used similarity measures, we suggest taking into account the frequency of the categories of each attribute in the sample. This feature is useful when dealing with rare categories, since it makes sense to evaluate the pairwise presence of a rare category differently from the pairwise presence of a widespread one. We also propose a weighted index for dependent categorical variables. The suitability of the proposed measures from a marketing research perspective is shown using two real data sets.


Introduction
Consider a general setup in which $k$ categorical variables $X_s$ ($s = 1, \ldots, k$) with nominal scale are of interest and a categorical data set $X = (x_1', \ldots, x_n')'$ is collected from $n$ subjects $u_1, u_2, \ldots, u_n$. Let $x_i = (x_{i1}, \ldots, x_{ik})'$ be the profile of the $k$ attributes for the $i$th subject. The resemblance between two subjects $u_i$ and $u_j$ is typically measured by pairwise similarity indexes (see, e.g., Sneath and Sokal 1973 and, recently, Warrens 2008). Most of the similarity indexes available in the literature have been developed to deal with binary variables, and few measures have been proposed specifically for polytomous attributes. For these variables, distance functions like the Euclidean or the Manhattan are sometimes used, especially for classification purposes. However, the principal difficulty in dealing with nominal categorical data is typically the lack of a metric space in which data points are positioned, and the measured distances can be different when a different coding scheme is used for the variables (Zhang et al.
2006). In this paper we extend the original work of Zani (1982): we first allow the classical similarity measures to deal with polytomous variables, and we then consider the problem of weighting variables in computing similarities between subjects. We propose a criterion for weighting the pairwise presence of a category on the basis of Shannon's "information content" of its relative frequency in the sample. Both in marketing research and in other fields, it appears relevant to attach a higher weight to the pairwise presence of a rare category in the sample than to the pairwise presence of a widespread one. A similar criterion is used in correspondence analysis, where the effect of increasing the values corresponding to low-frequency categories relatively more than those corresponding to high-frequency categories is accomplished with the $\chi^2$ distance. Finally, we provide some numerical examples to illustrate the use of the indexes and we show the suitability of the proposed measures for market segmentation.

A Class of Similarity Indexes for Polytomous Variables
Consider $k$ categorical variables $X_s$ ($s = 1, \ldots, k$) with $h_s \ge 2$ categories. An easy way to numerically code the attributes is through the so-called "dummy variables". A binary variable is introduced for each category, so that the number of dummy variables is $h = \sum_{s=1}^{k} h_s$. With this coding scheme we obtain an $(n \times h)$ data matrix whose columns are the dummies $X_{s1}, \ldots, X_{sh_s}$ of each attribute, where $X_{sv}$ is the dummy variable for the $v$th category of the $s$th attribute ($s = 1, \ldots, k$; $v = 1, \ldots, h_s$). The observed value for the $i$th observation is $x_{isv} = 1$ if subject $u_i$ presents the $v$th category of the $s$th attribute, and $x_{isv} = 0$ otherwise. The frequency, in the sample, of the $v$th category of the $s$th attribute is $n_{sv} = \sum_{i=1}^{n} x_{isv}$ and the relative frequency is $f_{sv} = n_{sv}/n$. When the categorical variable is dichotomous, this coding scheme implies two dummy variables. Obviously $x_{is1} = 1 \Leftrightarrow x_{is2} = 0$ and $x_{is1} = 0 \Leftrightarrow x_{is2} = 1$, so the second dummy is superfluous; however, when dealing with mixed polytomous and dichotomous variables, the same coding is needed for both. To evaluate the similarity between subjects $u_i$ and $u_j$, we consider the $2 \times 2$ contingency table cross-classifying their $h$ dummy values. We call positive matches, or agreements, in $u_i$ and $u_j$ the $\alpha$ pairs 1-1, and disagreements the $\beta + \gamma$ pairs 1-0 and 0-1. The $\delta$ pairs 0-0 (negative matches) simply indicate that both $u_i$ and $u_j$ do not share the category corresponding to the dummy variable; they are useless in evaluating the similarity between two subjects, since their number only depends on the number of categories of the original categorical variables. The index

$S1_{ij} = \alpha / (\alpha + \beta + \gamma)$   (1)

is bounded in $[0,1]$ and has the following properties:
- $1 - S1_{ij} = (\beta + \gamma)/(\alpha + \beta + \gamma)$ is a distance; $\beta + \gamma$ is both the Manhattan and the squared Euclidean distance between $u_i$ and $u_j$ in the dummy variable coding;
- for binary variables, i.e., for $h_s = 2$, $s = 1, \ldots, k$, index $S1_{ij}$ becomes equivalent to the Rogers-Tanimoto index.
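As a minimal sketch (written in Python purely for illustration, since the paper's own analyses are in Matlab), index (1) can be computed directly from dummy-coded profiles. The toy profiles below follow the three-attribute example used later in the paper; the variable names are ours:

```python
# Toy dummy-coded profiles: three attributes with 3, 3 and 2 categories,
# one 0/1 dummy per category (h = 8 dummies in total).
x_i = [0, 0, 1,  1, 0, 0,  0, 1]
x_j = [0, 0, 1,  1, 0, 0,  1, 0]

# alpha: 1-1 pairs (positive matches); beta + gamma: 1-0 and 0-1 pairs
# (disagreements). The 0-0 pairs (delta) are ignored by S1.
alpha = sum(a == 1 and b == 1 for a, b in zip(x_i, x_j))
bg = sum(a != b for a, b in zip(x_i, x_j))

s1 = alpha / (alpha + bg)  # index (1)
print(s1)  # -> 0.5
```

Here $\beta + \gamma = 2$ is also the Manhattan distance between the two dummy profiles, in line with the first property above.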
We may obtain a more general index by introducing a weight for the disagreements (Gower and Legendre 1986):

$S_{ij}(w) = \alpha / (\alpha + w(\beta + \gamma))$   (2)

with $w > 0$. When $w = 0.5$ and $h_s = 2$, $s = 1, \ldots, k$, expression (2) is equivalent to the Sokal-Michener index. Given two subjects $u_i$ and $u_j$, the probability of an agreement in $X_{sv}$, in a Bernoulli trial, is $f_{sv}^2$. The weight given to an agreement in $X_{sv}$ should be a decreasing function of $f_{sv}^2$. Assuming $f_{sv}^2 > 0$, here we propose the weight $w_{sv} = \log(1/f_{sv}^2)$, which is a measure of the information content of an agreement in $X_{sv}$. For independent variables this measure is additive: if subjects $u_i$ and $u_j$ present two positive matches, in $X_{sv}$ and $X_{kl}$, then the joint weight is $w_{sv,kl} = w_{sv} + w_{kl}$. The choice of $w_{sv}$ is for interpretability purposes rather than for numerical ones: this weight is conceived in the light of information theory, and the criterion was first introduced by Burnaby (1970). We do not use the information entropy (see, e.g., MacKay 2002) $e_{sv} = f_{sv}^2 \log(1/f_{sv}^2)$, because it is not a decreasing function of $f_{sv}^2$. By weighting the pairwise positive matches with $w_{sv}$, we obtain the index

$E_{ij} = \sum_{s=1}^{k} \sum_{v=1}^{h_s} w_{sv} \, x_{isv} \, x_{jsv},$   (3)

which is equal to zero iff $u_i$ and $u_j$ do not share any positive match. However, (3) is not a similarity index, since the condition $E_{ii} = E_{jj} = 1$ for $i, j = 1, \ldots, n$ is not satisfied. A general expression for a similarity index based on (3) is

$S_{ij} = E_{ij} / (E_{ij} + F_{ij}),$   (4)

with $F_{ij} \ge 0$ depending on the number of disagreements in $u_i$ and $u_j$. We may define $F_{ij}$ in different ways. Let $\gamma_s^{(i,j)} = 1$ if $u_i$ and $u_j$ present a disagreement in attribute $X_s$ and $\gamma_s^{(i,j)} = 0$ otherwise, so that $\sum_{s=1}^{k} \gamma_s^{(i,j)} = 0.5(\beta + \gamma)$. In the first case, $F_{ij}$ is equal to the information content of the specific pairwise disagreements in $u_i$ and $u_j$, and expression (4) becomes index (5). Index (5) is equal to (1) when $h_s = q$ and $f_{sv} = 1/q$, for $s = 1, \ldots, k$ and $v = 1, \ldots, q$. However, for two couples of subjects having the same pairwise positive matches but different pairwise disagreements it may assume a different value. Given the categorical nature of the data, the evaluation
of the similarity should depend on the number of disagreements but not on the specific dummy variables in which they are present. In the second case, $F_{ij}$ is the information content of any dissimilarity, without considering the specific dummy variable in which the dissimilarity is present. For attribute $X_s$, the probability of a positive match is $\sum_{v=1}^{h_s} f_{sv}^2$, and thus the probability of a dissimilarity is its complement $1 - \sum_{v=1}^{h_s} f_{sv}^2$. With this second expression, index (4) becomes index (6), which assumes the same numerical value for every pair of subjects having the same positive matches, regardless of the specific dummy variables in which the disagreements are present. In the trivial case in which $h_s = q$ and $f_{sv} = 1/q$ for $s = 1, \ldots, k$ and $v = 1, \ldots, q$, (6) becomes equal to (2) when $w = 0.5 \log\frac{q}{q-1} / \log(q^2)$. In the third case, $F_{ij}$ is equal to the average of the information content of the pairwise positive matches in the variables which present disagreements in $u_i$ and $u_j$; thus $F_{ij}$ may be perceived as the average loss in information content due to the lack of positive matches in $u_i$ and $u_j$. For each attribute, the information content of a pairwise positive match in category $X_{sv}$ is $\log(1/f_{sv}^2)$, so the average information content is $\sum_{v=1}^{h_s} f_{sv} \log(1/f_{sv}^2)$; this quantity is also the average loss of information content due to the lack of a positive match in $X_s$. With this expression we obtain index (7), which satisfies the following properties:
(1) $S2_{ij} = 0$ iff $u_i$ and $u_j$ do not share any positive match, and $S2_{ij} = 1$ iff $u_i$ and $u_j$ share a positive match in each attribute.
(2) It is invariant to any permutation of the disagreements, provided that the disagreements are on the same attributes.
(3) In the trivial case in which $h_s = q$ and $f_{sv} = 1/q$, for $s = 1, \ldots, k$, $v = 1, \ldots, q$, it becomes equal to (2) with $w = 0.5$; this last index, when $h_s = 2$, $s = 1, \ldots, k$, is equivalent to the Sokal-Michener measure.
In order to take into account the possible association between variables, the information content $E_{ij}$ of the pairwise agreements between two subjects should be defined in terms of the frequencies of specific 'sequences' of agreements. Let $c_{ij}$ be the sequence of ones corresponding to pairwise agreements in $u_i$ and $u_j$, and let $f_r(c_{ij})$ be the relative frequency, in the sample, of observations holding the sequence $c_{ij}$. Consider, for example, three attributes having three, three and two categories, respectively. If the profile vectors of the dummy variables in $u_i$ and $u_j$ are $x_i' = [001\ 100\ 01]$ and $x_j' = [001\ 100\ 10]$, then $f_r(c_{ij})$ is the relative frequency, in the sample, of observations having the third category in the first attribute and the first category in the second attribute. Assume $f_r(c_{ij})^2$ to be the probability, in a Bernoulli trial, of sampling two subjects with sequence $c_{ij}$. The information content of the sequence of agreements in $u_i$ and $u_j$ is $L_{ij} = -2 \log f_r(c_{ij})$, with the convention $L_{ij} = 0$ if $u_i$ and $u_j$ do not have any positive match. We normalize $L_{ij}$ and introduce the new similarity index

$S3_{ij} = L_{ij} / (L_{ij} + M(L_{ij})),$

where $M(L_{ij})$ is the average information content of the possible agreements in the categories which present a dissimilarity in $u_i$ and $u_j$; $M(L_{ij})$ may be thought of as the average loss in information content due to the lack of pairwise agreements. Let $g$, with $1 \le g < k$, be the number of attributes in which $u_i$ and $u_j$ agree. For $g = 0$, we introduce the convention $S3_{ij} = 0$.
For $g = k$, $S3_{ij} = 1$. The count of observations having the same sequence of categories is $m_{ij} = n \, f_r(c_{ij})$. In the subsample of these $m_{ij}$ observations, we determine the relative frequencies of each particular sequence of agreements for the remaining $(k - g)$ attributes having a disagreement in $u_i$ and $u_j$. Since the number of categories in the $s$th attribute is $h_s$, the count of all possible sequences of agreements is $p = \prod h_s$, where the product is extended to the $(k - g)$ attributes having a disagreement in $u_i$ and $u_j$. Let
- $c(ij)_t$ be the $t$th sequence of agreements among the $(k - g)$ attributes having a disagreement in $u_i$ and $u_j$, $t = 1, \ldots, p$;
- $f_r(c(ij)_t)$ be the relative frequency of the sequence $c(ij)_t$ in the subsample of the $m_{ij}$ subjects having the sequence $c_{ij}$.
The information content of $c(ij)_t$ is $-2 \log f_r(c(ij)_t)$, and the average information content of the sequences $c(ij)_t$ is $M(L_{ij}) = \sum_{t=1}^{p} f_r(c(ij)_t) \left( -2 \log f_r(c(ij)_t) \right)$. With this expression, after algebraic simplification, index $S3_{ij}$ can be written in closed form. Using the conventions previously introduced, $S3_{ij} = 0$ iff $u_i$ and $u_j$ do not have any agreement, and $S3_{ij} = 1$ iff $u_i$ and $u_j$ have a positive match in all $k$ attributes.
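Since the displayed formulas (3)-(7) did not survive extraction here, the sketch below implements one reading of index (7): $S2_{ij} = E_{ij}/(E_{ij} + F_{ij})$ with the weights $w_{sv} = \log(1/f_{sv}^2)$ and the third choice of $F_{ij}$ (the average information loss in the disagreeing attributes). The function name and the data layout are ours, purely for illustration:

```python
from math import log

def s2(profiles, i, j):
    """One reading of index (7), for illustration only.
    profiles[i][s] holds the category of attribute s for subject i."""
    n, k = len(profiles), len(profiles[0])
    # Relative frequency f_sv of category v of attribute s in the sample.
    f = [{} for _ in range(k)]
    for row in profiles:
        for s, v in enumerate(row):
            f[s][v] = f[s].get(v, 0) + 1 / n
    E = F = 0.0
    for s in range(k):
        if profiles[i][s] == profiles[j][s]:         # positive match
            E += log(1 / f[s][profiles[i][s]] ** 2)  # w_sv = log(1/f_sv^2)
        else:                                         # disagreement: average loss
            F += sum(fv * log(1 / fv ** 2) for fv in f[s].values())
    return E / (E + F) if E + F > 0 else 0.0

# Uniform case: two binary attributes, every category with f_sv = 0.5.
# Property (3) says S2 then reduces to simple matching: 1 agreement out of 2.
profiles = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(s2(profiles, 0, 1))  # -> 0.5
```

The uniform example reproduces property (3) (equivalence with (2) at $w = 0.5$); with unequal frequencies, a shared rare category pushes $S2$ up more than a shared widespread one.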

Applications in Marketing Research and Discussion
In this section we try to gain insight into the characteristics of S1, S2 and S3 through applications in marketing research. All the analyses are performed in the Matlab environment (programs are available upon request). We also compare the proposed indexes with two popular similarity measures for polytomous variables: the Jaccard index ($Sd$) and the Hamming similarity index ($Sh$), where the indicator function $I\{\cdot\} \in \{0, 1\}$ denotes the truth of its argument. While $Sh$ is independent of the coding scheme, $Sd$ depends on the code. For binary variables indicating the presence or absence of a feature, we use the code 1 for presence and 0 for absence (so that the number of pairwise absences is not counted in $Sd$). For dichotomous variables in which the categories do not reflect the presence or absence of a feature, and for polytomous variables, we use the codes $1, 2, \ldots, h_s$. It is worth highlighting that $Sd$ differs from S1 both in the case of all polytomous variables and in the case of mixed dichotomous and polytomous variables. We refer to Boriah et al.
(2008) for a comparative study of the performances of a variety of similarity measures. The first data set consists of $k = 37$ observed features (technical specifications) of $n = 100$ satellite navigators. Seven variables have three categories and the other 30 variables are binary attributes (presence or absence). Some features, like a CD player, are very rare; some other features, like a touch screen and a GPS system, are very common (see Table 1). Since the number of positive matches and the number of negative matches in the binary variables are also equal for the two example objects discussed with Table 1, the degree of similarity evaluated by $Sd$, $Sh$ and S1 is identical. While $Sh$ and S1 increase with the number of shared categories, $Sd$ and S2 may also decrease as the number of shared categories increases. Due to the weights, S2 is the most variable index. $Sd$ may differ between couples of objects when the number of shared categories is equal but the number of pairwise absences differs; S2 may differ even when the number of shared categories and the number of pairwise absences are identical. Index S3 is not an increasing function of the number of positive matches, since agreements in variables which are strongly associated are "penalized".
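For reference, the Hamming similarity $Sh$ used in the comparison can be sketched as follows (the indicator form described above; the function name is ours). Because it compares the original category codes attribute by attribute, relabelling the categories leaves it unchanged, which is why $Sh$ does not depend on the coding scheme:

```python
def sh(x_i, x_j):
    """Hamming similarity: proportion of the k attributes on which two
    subjects share the same category (indicator I{x_is = x_js})."""
    k = len(x_i)
    return sum(a == b for a, b in zip(x_i, x_j)) / k

# Four attributes, agreement on three of them.
print(sh([1, 2, 1, 3], [1, 2, 2, 3]))  # -> 0.75
```

Recoding the same data (e.g. categories 1, 2, 3 relabelled as 'a', 'b', 'c') yields the same value, unlike $Sd$.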
The second data set consists of $k = 10$ features of $n = 106$ sparkling wines. Five features are dichotomous variables and the others are polytomous. For this data set we obtain partitions with the most common hierarchical methods applied to the complements to one of the indexes. In marketing research, a specific criterion for assessing the performance of similarity measures is the 'segment addressability' suggested by Helsen and Green (1991), related to the degree to which a clustering solution can be explained by variables controlled by marketing managers, helping to 'target' competitors. In Fig. 1, only two dendrograms are reported, for lack of space. In general, all the classifications based on S2 and S3 readily distinguish four main groups, while the other indexes show less ability to provide separation. Moreover, the classification based on S2 remains more stable, with respect to the different linkages, than those reached by the other measures. The four groups detected by S2 and S3 delineate specific segments of products and are easily interpretable also for their size (the smallest group comprises eight wines and the biggest one 44 wines). These segments are homogeneous with respect to the alcohol content and the sugar level: the $R^2$ statistic for these variables is always higher in the partitions reached with S2 and S3. Among the features used for classification, 'taste' and 'origin' have the rarest categories. In the four-group partition with S3, wines having the same category in these two attributes are classified into the same group. With $Sh$, there is no evidence of a clustering structure in three or four groups: a very small cluster remains isolated until the last aggregation steps.
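The partitions above are computed in Matlab; purely as an illustration of the procedure (hierarchical clustering on the complement to one of a similarity index), here is a sketch in Python/SciPy with a small hypothetical similarity matrix standing in for the pairwise index values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical 4x4 similarity matrix (e.g. pairwise S2 values).
S = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])

D = 1.0 - S                   # complement to one: a dissimilarity matrix
np.fill_diagonal(D, 0.0)      # squareform requires a zero diagonal
Z = linkage(squareform(D), method='average')  # average linkage
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                 # two clear groups: {1, 2} and {3, 4}
```

Any of the common linkages (single, complete, average, ...) can be passed via `method`, matching the "most common hierarchical methods" applied in the paper.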
In conclusion, the major advantage of these indexes is that they are able to handle mixed dichotomous and polytomous variables, and the weighted versions give more importance to agreements in rare categories. Index S3 is designed to take into account the possible associations between variables, and there do not appear to be other similarity measures directly focused on this goal.

Fig. 1
Fig. 1 Dendrograms obtained with the average linkage: S2 is used on the left, Sh on the right

Table 1
Relative frequencies of technical features in the satellite navigators
Among the 666 pairwise $\chi^2$ statistics between the variables, there are 178 values leading to the rejection of the null hypothesis of independence for $\alpha = 0.05$ and 110 for $\alpha = 0.01$. To analyze the behavior of the indexes, we consider the objects Kenwood DNX 7200 ($u_1$) and LG LAN 9600R ($u_2$). They share the same category in 18 attributes (14 dichotomous and 4 polytomous): they both have a DVD and a CD player, and they present the rarest category in two of the polytomous variables. The relative frequencies in the sample of the 18 categories shared by the two objects are: 0.02, 0.24, 0.22, 0.03, 0.92, 0.83, 0.61, 0.97, 0.91, 0.44, 0.95, 0.07, 0.02, 0.02, 0.6, 0.82, 0.57, 0.99. The values of the similarity indexes are: