In statistics, Gower's distance is a measure of similarity between two mixed-type objects. Unlike Hamming distance or Euclidean distance, it can handle different types of data (binary, ordinal, continuous) within the same dataset: it normalizes the difference between the two objects on each variable and then computes a weighted average of these per-variable differences.
The measure was defined by Gower in 1971[1] and takes values between 0 and 1. It is particularly useful in cluster analysis, in neighbor-based methods such as the k-nearest neighbors algorithm, and in other multivariate statistical techniques.
Definition
For two objects $i$ and $j$ having $p$ descriptors, the similarity $S$ is defined as:

$$S_{ij}=\frac{\sum_{k=1}^{p}w_{ijk}s_{ijk}}{\sum_{k=1}^{p}w_{ijk}},$$
where the $w_{ijk}$ are non-negative weights usually set to $1$[2] and $s_{ijk}$ is the similarity between the two objects regarding their $k$-th variable. If the variable is binary or ordinal, the values of $s_{ijk}$ are 0 or 1, with 1 denoting equality. If the variable is continuous, $s_{ijk}=1-\frac{|x_{i}-x_{j}|}{R_{k}}$, with $x_{i}$ and $x_{j}$ being the values of the $k$-th variable for the two objects and $R_{k}$ being the range of the $k$-th variable, thus ensuring $0\leq s_{ijk}\leq 1$. As a result, the overall similarity $S_{ij}$ between two objects is the weighted average of the similarities calculated for all their descriptors.[3]
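The definition above can be illustrated with a minimal Python sketch. The function name `gower_similarity`, the `kinds` and `ranges` arguments, and the example records are hypothetical and chosen only for illustration. The sketch assumes every range $R_k$ is positive, and it takes the distance to be $1-S_{ij}$, a common convention that is not part of the definition given here.

```python
def gower_similarity(x, y, kinds, ranges, weights=None):
    """Gower similarity S_ij between two records x and y.

    kinds[k] is "continuous" (range-normalized score) or "categorical"
    (0/1 match score). ranges[k] is R_k for continuous variables and is
    ignored otherwise. All names here are illustrative assumptions.
    """
    p = len(kinds)
    if weights is None:
        weights = [1.0] * p  # the usual choice w_ijk = 1
    numerator = 0.0
    denominator = 0.0
    for k in range(p):
        if kinds[k] == "continuous":
            # s_ijk = 1 - |x_i - x_j| / R_k, assuming R_k > 0
            s = 1.0 - abs(x[k] - y[k]) / ranges[k]
        else:
            # binary or unordered category: 1 if equal, 0 otherwise
            s = 1.0 if x[k] == y[k] else 0.0
        numerator += weights[k] * s
        denominator += weights[k]
    return numerator / denominator


# Hypothetical records: (age in years, smoker flag, education level)
a = (30, True, "high")
b = (40, True, "low")
kinds = ("continuous", "categorical", "categorical")
ranges = (40.0, None, None)  # assumed: ages in the dataset span 40 years

s = gower_similarity(a, b, kinds, ranges)  # (0.75 + 1 + 0) / 3, about 0.583
d = 1.0 - s  # distance under the assumed convention d = 1 - S
print(s, d)
```

With unit weights the three per-variable scores are 0.75, 1 and 0, so the similarity is their plain average. Passing a different `weights` list changes only how much each variable contributes to that average.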
In its original exposition, the distance does not treat ordinal variables in a special manner. In the 1990s, first Kaufman and Rousseeuw[4] and later Podani[5] suggested extensions where the ordering of an ordinal feature is used. For example, Podani obtains relative rank differences as
$$s_{ijk}=1-\frac{|r_{i}-r_{j}|}{\max\{r\}-\min\{r\}}$$

with $r$ being the ranks corresponding to the ordered categories of the $k$-th variable.
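As a rough illustration of the rank-based score, the following Python sketch implements only the relative rank difference as written above, not any further refinements from the cited works. The function name `ordinal_similarity` and the three-level example scale are hypothetical, and returning 1 when only one rank is present is an assumption made to avoid division by zero.

```python
def ordinal_similarity(r_i, r_j, ranks):
    """Rank-based score 1 - |r_i - r_j| / (max r - min r)."""
    spread = max(ranks) - min(ranks)
    if spread == 0:
        # Only one category observed; treated as equal (assumption).
        return 1.0
    return 1.0 - abs(r_i - r_j) / spread


# Hypothetical ordered categories mapped to ranks 1..3
levels = {"low": 1, "medium": 2, "high": 3}
ranks = list(levels.values())

s_adjacent = ordinal_similarity(levels["low"], levels["medium"], ranks)  # 0.5
s_extreme = ordinal_similarity(levels["low"], levels["high"], ranks)     # 0.0
print(s_adjacent, s_extreme)
```

Under plain 0/1 matching, both pairs would score 0. The rank-based score instead rates adjacent categories as more similar than the two extremes.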
References
- Gower, John C (1971). "A general coefficient of similarity and some of its properties". Biometrics. 27 (4): 857–871. doi:10.2307/2528823. JSTOR 2528823. Retrieved 2024-06-03.
- Borg, Ingwer; Groenen, Patrick J. F. (2005). Modern multidimensional scaling: theory and applications (2nd ed.). New York [Heidelberg]: Springer. pp. 124–125. ISBN 978-0-387-25150-9.
- Legendre, Pierre; Legendre, Louis (2012). Numerical ecology (Third English ed.). Amsterdam: Elsevier. pp. 278–280. ISBN 978-0-444-53868-0.
- Kaufman, Leonard; Rousseeuw, Peter J. (1990). Finding groups in data: an introduction to cluster analysis. New York: Wiley. pp. 35–36. ISBN 9780471878766.
- Podani, János (May 1999). "Extending Gower's general coefficient of similarity to ordinal characters". Taxon. 48 (2): 331–340. Bibcode:1999Taxon..48..331P. doi:10.2307/1224438. JSTOR 1224438.