Agglomerative hierarchical
clustering is a
clustering algorithm that builds a cluster hierarchy from the
*bottom-up*. It starts by adding a cluster for each of the data points
to be clustered, followed by iterative pair-wise merging of clusters
until only one cluster is left at the top of the hierarchy. The choice
of clusters to merge at each iteration is determined by a *distance
metric*. In this article, we implement this algorithm in C.

## Source code

The complete source code is available at:

https://github.com/gyaikhom/agglomerative-hierarchical-clustering.git

## The input

Since we are particularly interested in understanding the algorithm, we shall choose the simplest input format possible. The input file has the following format:

```
<number of items to cluster>
<label string>| <x-axis value> <y-axis value>
...
```

For instance, the following is a valid input. It contains 12 data points, where each data point is referred to by its label and has coordinates in the two-dimensional Euclidean plane. Note that we can adapt this input format to accommodate \(n\)-dimensional vectors.

```
12
A| 1.0 1.0
B| 2.0 1.0
C| 2.0 2.0
D| 4.0 5.0
E| 5.0 4.0
F| 5.0 5.0
G| 5.0 6.0
H| 6.0 5.0
I| 9.0 9.0
J| 10.0 9.0
K| 10.0 10.0
L| 11.0 9.0
```

After running the clustering algorithm, we get a cluster hierarchy over these points.

Normally, hierarchical clusters are displayed as dendrograms. However, we chose a representation that overlays the clusters on the items' two-dimensional coordinates, to make it easier to see cluster formation in relation to the positions of the items being clustered.

## The algorithm

The agglomerative hierarchical clustering algorithm is implemented as follows:

- Based on the *linkage criteria*, set the *distance function*.
- Parse the input file and read in the items to cluster.
- Initialise the cluster array and set the appropriate item counts.
- Calculate the Euclidean distances between item pairs.
- For each item in the input:
    - Add a leaf node that represents a root cluster.
    - Update the distances from this root cluster to its neighbours.
- While there is more than one root cluster:
    - Find the pair of root clusters with the minimum distance.
    - Merge them into a new root cluster.
    - Update the distances from the new root cluster to its neighbours.

## Representing the cluster

After all of the \(n\) items have been added to the cluster hierarchy,
we get \(n\) root clusters. By *root cluster*, we mean clusters that
have not yet been merged inside a higher-level cluster. Now, at every
iteration we find two root clusters and merge them into a higher-level
root cluster. The clusters that were merged are no longer treated as
root clusters. With every merger, the number of root clusters are
reduced by one. Hence, when we have finished building the cluster
hierarchy, we would have carried out \((n - 1)\) mergers.

Because we are merging two clusters at a time, the complete cluster hierarchy can be represented by a binary tree, where leaves represent the items and internal nodes represent mergers. Since there are \(n\) leaf nodes that represent items that were merged into clusters, there are \((n - 1)\) internal nodes that represent each of the mergers. Hence, the binary tree will have \((2n - 1)\) nodes. Furthermore, since the hierarchy grows by appending merger nodes, the binary tree can be represented by an array with \((2n - 1)\) elements.
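Since the final size of the tree is known in advance, the node array can be allocated once, up front. A sketch, with the node struct reduced to the essentials and assuming `NOT_USED` is defined as `0` so that `calloc` leaves every node unused:

```c
#include <stdlib.h>

typedef struct {
        int type;     /* NOT_USED (assumed 0), LEAF_NODE or A_MERGER */
        int is_root;  /* 1 while the cluster has not been merged */
} node_t;

/* Allocates the binary-tree array for n items: n leaves plus
 * (n - 1) internal merger nodes, i.e. 2n - 1 elements in total. */
node_t *alloc_nodes(int n)
{
        return calloc(2 * n - 1, sizeof(node_t));
}
```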

In the following figure, orange circles represent the items to be clustered. The green squares represent cluster nodes in the binary tree, which are stored in the cluster array. The integer value inside each square gives the array index of the cluster node. The clustering in this example was done with the single-linkage distance criterion (see below).

Based on this, we have the following data structure for representing the cluster:

```
struct cluster_s {
        int num_items;           /* number of items that were clustered */
        int num_clusters;        /* current number of root clusters */
        int num_nodes;           /* number of leaf and merged clusters */
        cluster_node_t *nodes;   /* leaf and merged clusters */
        float **distances;       /* distances between leaves */
};
```

In the above, `num_items` is fixed throughout the clustering and
stores the number of items that are clustered. Variable `num_clusters`,
on the other hand, stores the current number of root clusters. This
is set to `num_items` when the clustering begins and should have the
value `1` when clustering has finished. The total number of nodes
currently in the binary tree is stored in `num_nodes`. It starts
with a value of `0`, gets incremented with the addition of a leaf
cluster or a merger, and should equal \((2n - 1)\) when done. The
`nodes` variable stores the array representing the binary
tree. Finally, we calculate the adjacency matrix between the
`num_items` items and store it inside a two-dimensional matrix of
real numbers. These are Euclidean distances calculated from the \((x,
y)\) coordinates supplied in the input.

## Representing a cluster node

Each node in the cluster hierarchy array stores the following information:

```
struct cluster_node_s {
        int type;                  /* type of the cluster node */
        int is_root;               /* true if cluster hasn't merged with another */
        int height;                /* height of node from the bottom */
        coord_t centroid;          /* centroid of this cluster */
        char *label;               /* label of a leaf node */
        int *merged;               /* indexes of root clusters merged */
        int num_items;             /* number of leaf nodes inside new cluster */
        int *items;                /* array of leaf node indices inside merged clusters */
        neighbour_t *neighbours;   /* sorted linked list of distances to roots */
};
```

After the array representing the binary tree is initialised, the
`type` of all of the \((2n - 1)\) elements is marked as
`NOT_USED`. When an item is added for clustering, the corresponding
element's `type` is marked as `LEAF_NODE`. When an element represents a
merger, its `type` is marked as `A_MERGER`.

We use `is_root` to identify whether a cluster has already been merged
into a higher-level cluster. When a cluster node hasn't been merged,
`is_root` is set to `1`. When it is merged, the value is set to
`0`. To help with further analysis, we also store the *height* of the node
from the bottom of the binary tree. Note that all of the leaves have
height `0`. The centroid of the cluster is stored in `centroid`. This
is used during *centroidal linkage* (see below).

When an array element represents a leaf node, `label` stores the
corresponding label supplied in the input. This field is empty for
merger nodes. When an element represents a merger, `merged` stores the
array indices of the two nodes that were merged to generate this new
node. For nodes that represent either a leaf or a merger, `num_items`
stores the number of leaf items inside the cluster, with the
array indices of the corresponding leaf items stored in the `items`
array. Finally, the neighbourhood linked list `neighbours` stores the
distances between this root cluster and all of the other root clusters
in the current iteration.

```
struct neighbour_s {
        int target;                /* index of the cluster node representing the neighbour */
        float distance;            /* distance between the nodes */
        neighbour_t *next, *prev;  /* linked list entries */
};
```

Since we choose root cluster pairs with the minimum distance in every
iteration, we maintain the neighbourhood linked list as a *sorted
linked list*, where the head always stores the neighbour with the
least distance.
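Keeping the list sorted means each insertion walks the list to find the right position. A sketch of such an insertion (the function name is illustrative, not necessarily the one used in the repository):

```c
#include <stdlib.h>

typedef struct neighbour_s neighbour_t;
struct neighbour_s {
        int target;                /* index of the neighbouring cluster node */
        float distance;            /* distance between the nodes */
        neighbour_t *next, *prev;  /* linked list entries */
};

/* Inserts a neighbour into the doubly-linked list, keeping it sorted
 * by ascending distance so that the head always holds the neighbour
 * with the least distance; returns the (possibly new) head. */
neighbour_t *insert_sorted(neighbour_t *head, neighbour_t *node)
{
        neighbour_t *cur = head, *prev = NULL;
        while (cur != NULL && cur->distance <= node->distance) {
                prev = cur;
                cur = cur->next;
        }
        node->next = cur;
        node->prev = prev;
        if (cur != NULL)
                cur->prev = node;
        if (prev != NULL) {
                prev->next = node;
                return head;
        }
        return node;  /* inserted at the front: new head */
}
```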

In this specific implementation, we assume that the distance from
cluster \(i\) to cluster \(j\) and the distance from cluster \(j\) to
cluster \(i\) are the same. This allows us to simplify the
structure of the cluster nodes, so that the distance between root clusters
\((i, j)\) is always stored only once, in the `neighbours` linked
list of the array element with index \(\max(i, j)\).

## Choosing the minimum distance

There are four common *linkage criteria* to determine the distance
between a pair of clusters. These are:

- **Single linkage**: the distance between two root clusters *A* and *B* is the minimum distance between any leaf in *A* and any leaf in *B*.
- **Complete linkage**: the distance between two root clusters *A* and *B* is the maximum distance between any leaf in *A* and any leaf in *B*.
- **Average linkage**: the distance between two root clusters *A* and *B* is the average distance between the leaves in *A* and the leaves in *B*.
- **Centroidal linkage**: the distance between two root clusters *A* and *B* is the distance between the centroids of *A* and *B*.

In our implementation, we initialise the function pointer
`distance_fptr` to one of `single_linkage`, `complete_linkage`,
`average_linkage` or `centroid_linkage`, based on the command-line
parameter.
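To give a flavour of what such linkage functions compute, here is a sketch of single and complete linkage over the leaf items of two clusters, given the precomputed leaf-to-leaf distance matrix. The signature is illustrative; the repository's functions may take different parameters:

```c
#include <float.h>

/* Linkage function type: takes the leaf distance matrix and the leaf
 * index arrays (with their lengths) of the two clusters. */
typedef float (*distance_fptr_t)(float **d, const int a[], int m,
                                 const int b[], int n);

/* Minimum leaf-pair distance between the two clusters. */
static float single_linkage(float **d, const int a[], int m,
                            const int b[], int n)
{
        float best = FLT_MAX;
        for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j)
                        if (d[a[i]][b[j]] < best)
                                best = d[a[i]][b[j]];
        return best;
}

/* Maximum leaf-pair distance between the two clusters. */
static float complete_linkage(float **d, const int a[], int m,
                              const int b[], int n)
{
        float worst = 0.0f;
        for (int i = 0; i < m; ++i)
                for (int j = 0; j < n; ++j)
                        if (d[a[i]][b[j]] > worst)
                                worst = d[a[i]][b[j]];
        return worst;
}
```

Selecting the criterion then amounts to one assignment, e.g. `distance_fptr_t distance_fptr = single_linkage;`.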

Note here that, although *centroidal linkage* allows us to define a
more accurate center for each of the clusters, it is not always
possible to define such a centroid for some types of data points. For
instance, if we are clustering *words* using a distance metric, such
as the Damerau–Levenshtein distance,
we cannot define a centroid for a word cluster as it does not make
sense. A centroid for a cluster of data points in an \(n\)-dimensional
Euclidean space is itself a unique data point in the same space. This
is not the case for a cluster of words. In cases like these, we use
the other linkage criteria.

## Getting \(k\) clusters

We sometimes wish to generate \(k\) disjoint clusters from a
given set of \(n\) input items. There are two ways of extracting
\(k\) clusters using the algorithm described above. If all we are
interested in is \(k\) clusters, we can simply stop the merging
once the number of root clusters `num_clusters` equals \(k\). On
the other hand, if we are interested in extracting \(k\) clusters
where \(k\) could vary, we use a retrospective approach which uses
the complete cluster hierarchy tree, as described below.

To extract \(k\) disjoint clusters, what we must realise is that it is akin to going back in time to when we had \(k\) disjoint root clusters, before further mergers were made. But how far do we go back? Well, since every merger reduces the number of root clusters by one, we need to undo the last \((k - 1)\) mergers. Furthermore, since a merger increments the number of cluster nodes by one, we should discard the last \((k - 1)\) nodes of the array.

Of the remaining \((2n - k)\) nodes, we need to find the \(k\) root clusters. Since these were merged into one of the \((k - 1)\) nodes that we plan to discard, we can scan the array from index \((2n - 2)\) down to \((2n - k)\) and print out the child clusters that each node merged, if and only if the child node is not itself one of the discarded nodes.
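The retrospective extraction can be sketched as follows, with the merger record reduced to just the two child indices (an illustrative layout, not the repository's exact one), array indices 0-based, and \(2 \le k \le n\) assumed:

```c
#include <stdio.h>

/* Minimal merger record: the indices of the two clusters merged into
 * this node (meaningful for merger nodes only). */
typedef struct {
        int merged[2];
} mnode_t;

/* Collects the k root clusters by scanning the last (k - 1) mergers,
 * i.e. array indices 2n - 2 down to 2n - k, and keeping every child
 * that is not itself one of the discarded nodes.  Assumes 2 <= k <= n;
 * returns the number of clusters found (k on success). */
int k_clusters(const mnode_t nodes[], int n, int k, int out[])
{
        int i, j, found = 0;
        int first_discarded = 2 * n - k;  /* nodes[2n-k .. 2n-2] go */
        for (i = 2 * n - 2; i >= first_discarded; --i)
                for (j = 0; j < 2; ++j)
                        if (nodes[i].merged[j] < first_discarded)
                                out[found++] = nodes[i].merged[j];
        return found;
}
```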

## Acknowledgements

Further to the references on Wikipedia, this implementation was inspired by Matteo Matteucci's description of the algorithm.