Mr. Digvijay D. Desai.

Gower's General Similarity Coefficient

Gower's General Similarity Coefficient is one of the most popular measures of proximity for
mixed data types. For details of mixed data types click here.

Gower's General Similarity Coefficient sij compares two cases i and j, and is defined as follows:
sij = ( Σk wijk sijk ) / ( Σk wijk )

where: sijk denotes the contribution provided by the kth variable, and
wijk is usually 1 or 0 depending upon whether or not the comparison is
valid for the kth variable; if differential variable weights are specified
it is the weight of the kth variable or 0 if the comparison is not valid.

Note that the denominator Σk wijk divides the sum of the similarity scores by the number of
variables for which the comparison is valid or, if variable weights have been specified, by the
sum of the weights of those variables.
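
The weighted average defined above can be sketched in a few lines of Python. This is a minimal illustration, not Clustan's implementation; the function name `gower_combine` and the choice to raise an error when no comparison is valid are assumptions made for the example.

```python
def gower_combine(scores, weights):
    """Weighted mean of per-variable scores s_ijk using weights w_ijk.

    A weight of 0 marks a comparison that is not valid for that variable.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        # Assumption: the source does not say what happens in this case.
        raise ValueError("no valid comparisons for this pair of cases")
    return sum(w * s for s, w in zip(scores, weights)) / total_weight


# Three variables; the comparison on the third variable is not valid.
print(gower_combine([1.0, 0.5, 0.0], [1, 1, 0]))  # 0.75
```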

Ordinal and Continuous Variables


Gower defines the value of sijk for ordinal and continuous variables as follows:

sijk = 1 - | xik - xjk | /rk

where: rk is the range of values for the kth variable.

For continuous variables sijk ranges between 1, for identical values xik = xjk, and 0, for the two
extreme values, where | xik - xjk | = xmax - xmin = rk.
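
As a small illustration of this formula, the sketch below evaluates sijk for a few made-up values. The function name `range_similarity` and the sample numbers are hypothetical.

```python
def range_similarity(x_i, x_j, value_range):
    """s_ijk = 1 - |x_ik - x_jk| / r_k for an ordinal or continuous variable."""
    if value_range <= 0:
        raise ValueError("the range r_k must be positive")
    return 1.0 - abs(x_i - x_j) / value_range


# Hypothetical variable with minimum 10 and maximum 40, so r_k = 30.
print(range_similarity(10, 10, 30))  # identical values -> 1.0
print(range_similarity(10, 25, 30))  # half the range apart -> 0.5
print(range_similarity(10, 40, 30))  # the two extremes -> 0.0
```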

Binary Variables

For a binary variable (or dichotomous character), Gower defines the component of similarity and
the weight according to the table below, where + denotes that attribute k is "present" and -
denotes that attribute k is "absent".

Value of attribute k
Case i   +  +  -  -
Case j   +  -  +  -
sijk     1  0  0  0
wijk     1  1  1  0

Thus sijk = 1 if cases i and j both have attribute k "present" or 0
otherwise, and the weight wijk causes negative matches to be ignored. If negative matches are
not to be ignored, the variable should be specified as a nominal variable (see below).

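If all your variables are binary, then Gower's General Similarity Coefficient is equivalent to
Jaccard's Similarity Coefficient A/(A+B+C), where A counts the attributes present in both cases,
B and C count the attributes present in only one of the two cases, and D counts the negative
matches, which are ignored.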
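
The equivalence with Jaccard's coefficient can be checked numerically. The sketch below is an illustration under stated assumptions: cases are lists of 1 (present) and 0 (absent), and the function names and sample data are hypothetical.

```python
def gower_binary(case_i, case_j):
    """Gower's coefficient when every variable is binary (1 = present, 0 = absent)."""
    numerator = 0
    denominator = 0
    for a, b in zip(case_i, case_j):
        s = 1 if (a and b) else 0            # score 1 only for a positive match
        w = 0 if (not a and not b) else 1    # negative matches get weight 0
        numerator += w * s
        denominator += w
    # Note: undefined (division by zero) if every attribute is absent in both cases.
    return numerator / denominator


def jaccard(case_i, case_j):
    """Jaccard's coefficient A / (A + B + C)."""
    pairs = list(zip(case_i, case_j))
    a = sum(1 for x, y in pairs if x and y)
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    return a / (a + b + c)


case_i = [1, 1, 0, 0, 1]
case_j = [1, 0, 1, 0, 1]
print(gower_binary(case_i, case_j), jaccard(case_i, case_j))  # 0.5 0.5
```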

Nominal Variables
The value of sijk for nominal variables is 1 if xik = xjk, or 0 if xik ≠ xjk. Thus sijk = 1 if cases i and j
have the same "state" for attribute k, or 0 if they have different "states", and wijk = 1 if both cases
have observed states for attribute k.
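
A minimal sketch of the nominal rule follows. Representing a missing state as `None` is an assumption made for this example, as is the function name `nominal_component`.

```python
def nominal_component(x_i, x_j):
    """Return (s_ijk, w_ijk) for a nominal variable.

    Assumption: None marks a state that was not observed.
    """
    if x_i is None or x_j is None:
        return 0, 0  # comparison not valid, so the weight is 0
    return (1 if x_i == x_j else 0), 1


print(nominal_component(3, 3))     # (1, 1): same state
print(nominal_component(3, 5))     # (0, 1): different states
print(nominal_component(3, None))  # (0, 0): one state is missing
```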

Differential Variable Weights


It was noted above that the weight wijk for the comparison on the kth variable is usually 1 or 0.
However, if you assign differential weights to your variables in ClustanGraphics, then wijk is either
the weight of the kth variable or 0, depending upon whether the comparison is valid or not. This
allows larger weights to be given to important variables, or for another type of external scaling of
the variables to be specified.

If the weight of any variable is zero, then the variable is effectively ignored for the calculation of
proximities. Such variables are "masked" for clustering, but available for cluster profiling, to
assist in the interpretation of a resulting cluster analysis.
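
The effect of differential weights, including a zero weight that masks a variable, can be illustrated as follows. This is a sketch only; `weighted_gower` and the sample scores are hypothetical and do not reproduce ClustanGraphics internals.

```python
def weighted_gower(scores, valid, variable_weights):
    """Combine per-variable scores using differential variable weights.

    w_ijk is the weight of variable k when the comparison is valid, else 0.
    """
    numerator = 0.0
    denominator = 0.0
    for s, is_valid, weight in zip(scores, valid, variable_weights):
        w = weight if is_valid else 0.0
        numerator += w * s
        denominator += w
    return numerator / denominator


scores = [1.0, 0.0, 0.5]
valid = [True, True, True]
print(weighted_gower(scores, valid, [1, 1, 1]))  # equal weights -> 0.5
print(weighted_gower(scores, valid, [2, 1, 1]))  # first variable emphasised -> 0.625
print(weighted_gower(scores, valid, [1, 0, 1]))  # second variable masked -> 0.75
```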

General Distance Coefficients


If you specify mixed data types in ClustanGraphics and select Gower's Similarity Coefficient in
Compute/Proximities, your proximity matrix will be calculated according to the above definitions.

However, the clustering options available using Gower are restricted to those applicable to
similarity measures, and not to dissimilarities. Thus, for example, you will not be able to optimize
the Euclidean Sum of Squares without first transforming your proximities into distances. For
details, see the corresponding General Distance Coefficient.
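
This page does not state which transformation is used to turn a similarity into a distance. Two conversions that are commonly applied to similarities in the range 0 to 1 are sketched below purely as an assumption; the function names are hypothetical.

```python
import math


def distance_linear(s):
    """Assumed conversion: d = 1 - s."""
    return 1.0 - s


def distance_sqrt(s):
    """Assumed conversion: d = sqrt(1 - s)."""
    return math.sqrt(1.0 - s)


s = 0.75
print(distance_linear(s))  # 0.25
print(distance_sqrt(s))    # 0.5
```

Both map a similarity of 1 to a distance of 0 and a similarity of 0 to a distance of 1; they differ in how intermediate values are spread.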

Our implementation of Gower's General Similarity Coefficient is another example of the great
flexibility provided in Clustan software. Mixed data types frequently occur in social surveys and
databases, but you are unlikely to find that other software for cluster analysis or neural networks
adequately caters for such practical diversity.

Gower's General Similarity Coefficient has been available in Clustan since 1984, and in
ClustanGraphics since release 5 in 2001. A worked example of Gower's coefficient with
psychiatric data is given here.

ClustanGraphics allows you to run very powerful clustering algorithms on different data types with
or without missing values and differential case or variable weighting. Having read your data,
either specify your variable types using an Auto Script, or select Edit/Data Types and specify
them interactively using the following dialogue:

The example shown here illustrates four types of variables allowed in ClustanGraphics - binary,
nominal, ordinal and continuous, and two data transformations - range or z-scores. These apply
as follows:

Binary: Two codes other than missing, the higher code signifying "yes" or "present", the
lower code signifying "no" or "absent" (e.g. CreditAllowed, meaning whether the client
has credit terms).

Nominal: Integer codes having no logical numerical order (e.g. AccountType or
ClientSector).

Ordinal: Integer codes having a logical numerical order (e.g. VolumeLevel, by band).

Continuous: Wide range of numerical values on a continuous or semi-continuous
scale (e.g. InvoiceValue, or the actual value of the current contract).

When you have completed a cluster analysis with mixed data types, the results are easily and
flexibly presented in our cluster model dialogue, shown here.

Data Types
On first entry, ClustanGraphics examines your data and
tries to infer the type of each variable from whether its
values are integers and from how often each value occurs.
The inferred type may be correct; for example, a variable
with only two possible values is interpreted as binary.
Nominal and ordinal variables are both interpreted as
nominal, so you should change the type of any variable that
is in fact ordinal. To do this, click on the type cell and
select from the drop-down list (right).

Variable Transformations
ClustanGraphics allows you to transform ordinal or
continuous variables. The transformation options
are none, range or z-scores. Range divides each
value by the range of valid values, so that the
transformed values range between zero and 1.
z-scores transforms the values so that they have a
mean of zero and a standard deviation of 1. To
specify the transformation of any variable, click on the variable transform cell and select from the
drop-down list (right). More details of data transformations are here.
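
The two transformations can be sketched as below. The text describes range as dividing each value by the range; the sketch also subtracts the minimum first, which is an assumption made so that the result is guaranteed to lie between 0 and 1. Using the population standard deviation for z-scores is likewise an assumption, and the function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def range_transform(values):
    """Rescale values to lie between 0 and 1 (assumes the minimum is subtracted)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("range is zero; the variable is constant")
    return [(v - lo) / (hi - lo) for v in values]


def z_scores(values):
    """Rescale values to mean 0 and standard deviation 1 (population s.d. assumed)."""
    m = mean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("standard deviation is zero; the variable is constant")
    return [(v - m) / sd for v in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population s.d. 2
print(range_transform(values)[0], range_transform(values)[-1])  # 0.0 1.0
print(z_scores(values)[0], z_scores(values)[-1])                # -1.5 2.0
```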

Transformations are not available for binary or nominal variables. A binary variable is stored as a
present/absent score for each case (e.g. CreditAllowed is either true or false). Likewise, a
nominal variable is stored as a present/absent score for each category represented by an integer
code (e.g. ClientSector=5 is held as true for sector 5 and false for all other sector codes).
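
The present/absent representation of a nominal variable described above resembles what is often called one-hot encoding. The sketch below illustrates the idea only; it is not ClustanGraphics' storage format, and the function name and the sample ClientSector codes are hypothetical.

```python
def one_hot(codes):
    """Expand integer category codes into one present/absent flag per category."""
    categories = sorted(set(codes))
    rows = [[code == category for category in categories] for code in codes]
    return categories, rows


sector_codes = [5, 2, 5, 7]  # hypothetical ClientSector values for four cases
categories, rows = one_hot(sector_codes)
print(categories)  # [2, 5, 7]
print(rows[0])     # [False, True, False]: true for sector 5 only
```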

Variable Weights
With ClustanGraphics you can have different
weights for each variable. The standard default is
a weight of 1, so that all variables have equal
weight. If you want to give some variables more
emphasis than others you can specify differential
variable weights. To do this, click the variable
weight cell and type a new weight value (right).

Your current choice of weights can also be reviewed and changed in the Edit/Weights dialogue,
on the Edit menu.

Masking Variables
If you specify a weight of zero, the
variable will be masked from the cluster
analysis. In this case, the Edit/Data
Types dialogue will show the variable as masked, and its entries will be grayed (right). This is
helpful if you want to carry background variables that are "inactive", that is, not used for
clustering but nevertheless available for interpretation in cluster profiling.

Variable Names
The Edit/Data Types dialogue allows you to change the names of
variables. Simply click on a variable's name and edit it in situ (right).

Your current choice of variable names can also be reviewed and
changed in the Edit/Labels dialogue, on the Edit menu.

Variable Summaries
If you point the cursor at any variable and click the right mouse
button, a summary of the current parameters for that variable will
be displayed. This helps you check that you have selected the
correct type and transformation for the variable (right).

You can display a summary table for all your variables, by clicking
the Summary button. An abbreviated table of Data Types
specifications can be printed by clicking the Print button.

Confirming Data Types


When you click OK in the Edit/Data Types dialogue, you will be asked whether you wish the
changed specifications to be confirmed. At this point you can, if you wish, revert to the type
settings previously recorded; or you can update to the new settings entered into the dialogue.
Don't forget to save your ClustanGraphics file so that your changes will be correctly reproduced
when you next open your file.

You are now ready to run a cluster analysis on mixed data types. The current options are
hierarchical cluster analysis using Compute Proximities, Nearest Neighbours, k-Means Analysis
and Classify Cases. For further details, please refer to the file DataTypes.doc which
accompanies ClustanGraphics or view a worked example of Gower's Similarity Coefficient with
mixed data types here.

Example
This is a worked example of Gower's Similarity Coefficient, taken from Cluster Analysis, Third
Edition, by Brian S. Everitt, Arnold, London, 45-46.

Everitt illustrates the coefficient using the following data for five psychiatrically ill patients:

Case      Weight  Anxiety  Depression  Hallucination  Age
Patient1  120     1        1           1              1
Patient2  150     2        2           1              2
Patient3  110     3        2           2              3
Patient4  145     1        1           2              3
Patient5  120     1        1           2              1

The above data can be easily read by ClustanGraphics. Simply select the values in the table and
copy them to an Excel file, then click File/New/Data in ClustanGraphics and choose Excel
Spreadsheet as the file format to read the file and the headings and case labels.

Next, select Edit/Data Types and change the type specifications of Anxiety and Age to nominal.
Note here that this is possibly an incorrect definition, since these two variables appear to be
ordinal; however, we shall specify nominal to be consistent with the type definitions in Everitt's
example.

Click OK and accept the changed data type specifications. Note that it is not necessary to
transform the Weight variable because transformation by range is standard in Gower's coefficient.

Now select Prox/Compute, noting that ClustanGraphics has recognized that the variables
comprise mixed data types. Select Gower's Coefficient from the list of similarity and dissimilarity
coefficients available for mixed data types.

When you press OK the proximity matrix will be computed. You may also wish at this stage to
cluster the data hierarchically. To check the values for Gower's coefficient click View/Prox.

There are unfortunately two errors in the similarity matrix shown on page 46 of Everitt's book -
coefficients s25 and s45 are wrongly reported. You can easily check by hand that the correct
Gower similarity coefficients have been computed by ClustanGraphics.
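
For readers who prefer to check the arithmetic programmatically rather than by hand, the sketch below applies the definitions given earlier to the five patients. It is an independent illustration, not ClustanGraphics output. It assumes that Weight is continuous with range 150 - 110 = 40, that Anxiety and Age are nominal, and that Depression and Hallucination are binary with code 2 meaning "present" and code 1 meaning "absent". All names in the code are hypothetical.

```python
from itertools import combinations

# Columns: Weight, Anxiety, Depression, Hallucination, Age
PATIENTS = [
    (120, 1, 1, 1, 1),
    (150, 2, 2, 1, 2),
    (110, 3, 2, 2, 3),
    (145, 1, 1, 2, 3),
    (120, 1, 1, 2, 1),
]
# Assumed variable types (see the lead-in above).
TYPES = ("continuous", "nominal", "binary", "binary", "nominal")


def gower(case_i, case_j, types, ranges):
    """Gower's similarity for two cases with unit variable weights."""
    numerator = 0.0
    denominator = 0.0
    for x, y, kind, r in zip(case_i, case_j, types, ranges):
        if kind == "continuous":
            s, w = 1.0 - abs(x - y) / r, 1
        elif kind == "nominal":
            s, w = (1.0 if x == y else 0.0), 1
        else:  # binary: code 2 = present, code 1 = absent (assumption)
            present_x, present_y = (x == 2), (y == 2)
            s = 1.0 if (present_x and present_y) else 0.0
            w = 1 if (present_x or present_y) else 0  # ignore negative matches
        numerator += w * s
        denominator += w
    return numerator / denominator


# Range of each column; only used for the continuous variable.
ranges = [max(col) - min(col) for col in zip(*PATIENTS)]

similarities = {}
for i, j in combinations(range(len(PATIENTS)), 2):
    similarities[(i + 1, j + 1)] = gower(PATIENTS[i], PATIENTS[j], TYPES, ranges)

for (i, j), s in similarities.items():
    print(f"s{i}{j} = {s:.5f}")
```

Under these assumptions the two coefficients singled out above come to s25 = 0.25/5 = 0.05 and s45 = 2.375/4 = 0.59375.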