Distance and similarity are more or less opposite concepts.
Distance is a numerical measure of how different two objects (each described by certain variables) are. It is computed for pairs of objects (pairwise distance).
Different distance measures exist for different data types.
Distance
Scale 0 – \(\infty\)
0 – Two identical objects (no distance between them).
\(\infty\) – Two objects with infinite distance.
In practice, many distance measures are bounded, with a maximum distance of 1.
Denoted by \(D\) (for distance, or dissimilarity).
\(D = 1 - S\)
Similarity
Scale 0 – 1
0 – Two objects completely dissimilar (0%).
1 – Two objects completely similar (100%).
Denoted by \(S\) (for similarity).
\(S = 1 - D\)
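The relation between \(D\) and \(S\) can be tried out on a toy example. This is only a sketch: the values are made up, and dividing by the largest observed distance is just one possible way (an assumption here, not the only option) to bring an unbounded distance onto the 0 – 1 scale.

```r
# Hypothetical values for three objects described by one numeric variable
d_raw <- dist(c(a = 2, b = 5, c = 10))

# Rescale so that the largest distance equals 1 (assumed normalization)
d_norm <- d_raw / max(d_raw)

# Similarity is the complement of the normalized distance: S = 1 - D
s <- 1 - as.matrix(d_norm)
s
```

Each object has similarity 1 to itself, and the two most distant objects have similarity 0.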
Different distance measures
Dichotomous variables
Symmetrical – Simple matching distance
Asymmetrical – Jaccard index (binary distance)
Categorical variables
Hamming distance
Numeric continuous variables
Euclidean distance
Mahalanobis distance
Mixed data sets
Gower’s distance
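Two of the measures listed above can be sketched in base R. The data below are hypothetical and serve only as an illustration. Note that `stats::mahalanobis()` returns squared distances of each row to a given centre, so the square root is taken here.

```r
# Hypothetical categorical data: two objects described by four variables
obj1 <- c(material = "clay", colour = "red",   shape = "bowl", decor = "none")
obj2 <- c(material = "clay", colour = "black", shape = "bowl", decor = "incised")

# Hamming distance: number of variables on which the two objects differ
hamming <- sum(obj1 != obj2)
hamming                  # colour and decor differ
hamming / length(obj1)   # the same distance as a proportion of all variables

# Hypothetical numeric data: five objects, two correlated variables
m <- cbind(
  length_mm = c(120, 150, 90, 200, 140),
  weight_g  = c(15.2, 22.8, 9.5, 40.1, 30.0)
)

# Squared Mahalanobis distance of each object to the centroid,
# taking the covariance between the variables into account
d2 <- mahalanobis(m, center = colMeans(m), cov = cov(m))
d_mahal <- sqrt(d2)
d_mahal
```

Unlike the Euclidean distance, the Mahalanobis distance accounts for the correlation between variables, so it does not depend on the units of measurement.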
Binary distances
For TRUE/FALSE, 1/0, presence/absence (etc.) data
Symmetrical
Two presences as match.
Two absences as match.
If a trait is present in both objects, they are more similar. If a trait is absent in both objects, they are also more similar. For example, if biological sex is encoded in one variable with 0 for male and 1 for female, the variable is symmetrical.
Simple matching distance
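A minimal sketch of the simple matching distance on hypothetical presence/absence data. The helper `smd()` is defined here for illustration and is not a base R function. For 0/1 data the Manhattan distance counts mismatches, so dividing it by the number of traits gives the same result for all pairs at once.

```r
# Hypothetical presence/absence data: three objects, five traits
x <- rbind(
  a = c(1, 1, 0, 0, 1),
  b = c(1, 0, 0, 0, 1),
  c = c(0, 0, 1, 1, 0)
)

# Simple matching distance: proportion of traits on which two objects disagree
# (shared presences and shared absences both count as matches)
smd <- function(u, v) mean(u != v)

smd(x["a", ], x["b", ])  # 1 mismatch out of 5 traits
smd(x["a", ], x["c", ])  # all 5 traits differ

# The same for all pairs: number of mismatches divided by number of traits
d_smd <- dist(x, method = "manhattan") / ncol(x)
d_smd
```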
Asymmetrical
Two presences as match.
Two absences are ignored (not counted as a match).
If a trait is present in both objects, they are more similar. If a trait is absent in both objects, e.g. because it is undetermined or missing, this does not affect their similarity. This is more practical in archaeology.
Jaccard index, i.e. binary distance
dist(x, method = "binary")
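The call above can be checked against a hand calculation. The data are hypothetical, and `jaccard_dist()` is a helper written for this sketch only. It shows that traits absent in both objects do not enter the result.

```r
# Hypothetical presence/absence data: three objects, five traits
x <- rbind(
  a = c(1, 1, 0, 0, 1),
  b = c(1, 0, 0, 0, 1),
  c = c(0, 0, 1, 1, 0)
)

# Binary (Jaccard) distance for all pairs of rows
d_bin <- dist(x, method = "binary")
d_bin

# By hand: 1 - (traits present in both) / (traits present in at least one)
jaccard_dist <- function(u, v) {
  both   <- sum(u == 1 & v == 1)
  either <- sum(u == 1 | v == 1)
  1 - both / either
}

jaccard_dist(x["a", ], x["b", ])  # 2 shared presences, 3 traits present in a or b
```

Objects b and c share one absence (the second trait), which would count as a match in the simple matching distance but is ignored here.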
Distance between (continuous) numeric data
To remove effects of scale (different units etc.), variables should be scaled (normalized).
The resulting distance matrix is symmetrical: the lower triangle is the same as the upper triangle.
On the diagonal is the distance of each object to itself, i.e. 0.
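A short sketch of these steps, assuming a small made-up data set with two variables measured in different units. The object and column names are hypothetical.

```r
# Hypothetical measurements in different units (mm and g)
m <- data.frame(
  length_mm = c(120, 150, 90, 200),
  weight_g  = c(15.2, 22.8, 9.5, 40.1),
  row.names = c("obj1", "obj2", "obj3", "obj4")
)

# Scale each variable to mean 0 and standard deviation 1
m_scaled <- scale(m)

# Euclidean distance between all pairs of objects
d <- dist(m_scaled, method = "euclidean")

# dist() stores only the lower triangle; as.matrix() expands it
dm <- as.matrix(d)
isSymmetric(dm)  # lower and upper triangle are the same
diag(dm)         # distance of each object to itself
```

Without scaling, the variable with the larger numeric range (here the length in mm) would dominate the distances.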
Visualizing distance matrix
Package corrplot has a nice way of plotting heat maps.
library(corrplot)
# arg. is.corr set to FALSE, because we are not visualizing correlation matrix
corrplot::corrplot(as.matrix(d), is.corr = FALSE, type = "upper")
Resources
For a much more detailed overview of distance methods, see the tutorial on classification by Schmidt, S. C. et al., DOI: 10.5281/zenodo.6325372.