Clustering

  • Broad set of techniques for finding subgroups.
  • Observations within a group should be similar to one another, while observations in different groups should be as different as possible.

Clustering methods

  • Partitioning: k-means clustering
  • Agglomerative/divisive: hierarchical clustering
  • Model-based: mixture models

K-means clustering

  • Simple approach to partitioning a data set into K distinct, non-overlapping clusters.
  • K is specified beforehand.

James et al. 2013

K-means algorithm

  • \(K\) – number of clusters.

The process:

  1. Randomly assign each observation to one of the groups 1 to K;
  2. For each of the K clusters, compute its centroid (the vector of feature means);
  3. Reassign each observation to the cluster whose centroid is closest;
  4. Repeat steps 2 and 3 until the cluster assignments stop changing; the result is a local, not necessarily a global, optimum (a minimal sketch follows below).

James et al. 2013
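A minimal NumPy sketch of these four steps, assuming the standard Euclidean distance and that no cluster ever becomes empty during reassignment (all names are illustrative):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Illustrative k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of K clusters.
    labels = rng.integers(0, K, size=len(X))
    for _ in range(max_iter):
        # Step 2: each centroid is the mean of its cluster's observations
        # (assumes no cluster is empty).
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 3: reassign each observation to the nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once the assignments are stable (a local optimum).
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```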

Some properties of K-means partitioning

  • The standard k-means algorithm uses Euclidean distance (computed via the Pythagorean theorem).
  • Different random initializations can converge to different local optima, so the resulting clusters may vary between runs.

  • The number of clusters K must be specified in advance.

  • How to determine the optimal number of clusters? Two common heuristics, sketched below:

    • Elbow method
    • Silhouette method
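A hedged sketch of both heuristics with scikit-learn; the toy blobs and the range of candidate k values are made up for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Elbow method: plot inertia (within-cluster sum of squares) against k
    # and look for the "elbow" where it stops dropping sharply.
    # Silhouette method: pick the k with the highest average silhouette.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```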

Hierarchical clustering

  • The number of clusters is not specified beforehand.
  • Output is a dendrogram – a tree diagram visualizing how clusters are merged (or split) at each level.
  • Starts with a distance matrix.

Agglomerative

  • Builds the hierarchy of clusters from the bottom up until a single cluster is reached (see the sketch after these steps).
  1. Put each object in its own cluster;
  2. Merge the two closest clusters;
  3. Iterate until a single cluster encompassing all objects is reached.
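A minimal sketch with SciPy; the Ward linkage and the toy blobs are illustrative choices:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# linkage() starts from the pairwise distance matrix and repeatedly merges
# the two closest clusters; "ward" joins the pair whose merge least
# increases the within-cluster variance.
Z = linkage(X, method="ward")
dendrogram(Z)  # the resulting hierarchy visualized as a dendrogram
plt.show()
```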

Divisive

  • Divides a single large cluster down into individual objects (top-down); a toy sketch follows these steps.
  1. Put all objects into a single cluster;
  2. Split a cluster into two subclusters that are as dissimilar as possible;
  3. Iterate until all objects are in their own clusters.
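Common libraries ship only the agglomerative variant, so this toy sketch mimics a top-down split by repeatedly bisecting the largest cluster with 2-means (in the spirit of bisecting k-means); classical divisive algorithms such as DIANA split on dissimilarities instead:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_sketch(X, n_splits=2):
    """Toy top-down clustering: repeatedly bisect the largest cluster."""
    clusters = [np.arange(len(X))]  # step 1: a single all-encompassing cluster
    for _ in range(n_splits):
        # Pick the largest remaining cluster and remove it from the list.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        big = clusters.pop(i)
        # Step 2: divide it into two subclusters (2-means stands in for a
        # dissimilarity-based split here).
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[big])
        clusters += [big[halves == 0], big[halves == 1]]
    # Step 3: iterating len(X) - 1 times would leave every object on its own.
    return clusters
```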

Clustering methods comparison

K-means partitioning

  • The number of clusters is pre-specified.
  • Results may vary between runs (different local optima).
  • Works best when the groups are compact (hyper)spheres.

Hierarchical clustering

  • The number of clusters can be chosen after inspecting the dendrogram.
  • Results are stable (no random initialization).
  • Can recover groups of any shape, as the sketch below illustrates.
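A small experiment illustrating the shape assumption; the half-moons data and the single-linkage choice are illustrative:

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interlocking half-moons: clearly non-spherical groups.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
hc = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# K-means assumes roughly spherical clusters and tends to cut each moon in
# half; single-linkage hierarchical clustering follows the curved shapes.
print("k-means ARI:     ", round(adjusted_rand_score(y, km), 2))
print("hierarchical ARI:", round(adjusted_rand_score(y, hc), 2))
```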