Agglomerative hierarchical clustering

August 21, 2014

Agglomerative hierarchical clustering is a clustering algorithm that builds a cluster hierarchy from the bottom up. It starts by creating a cluster for each of the data points to be clustered, and then iteratively merges pairs of clusters until only one cluster is left at the top of the hierarchy. The pair of clusters to merge at each iteration is chosen based on a distance metric. In this article, we implement this algorithm in C.

Source code

The complete source code is available at:

https://github.com/gyaikhom/agglomerative-hierarchical-clustering.git

The input

Since we are particularly interested in understanding the algorithm, we shall choose the simplest input format possible. The input file has the following format:

<number of items to cluster>
<label string>| <x-axis value> <y-axis value>
...

For instance, the following is a valid input. It contains 12 data points, where each data point is referred to by its label and has coordinates in the two-dimensional Euclidean plane. Note that we can adapt this input format to accommodate \(n\)-dimensional vectors.

12
A| 1.0 1.0
B| 2.0 1.0
C| 2.0 2.0
D| 4.0 5.0
E| 5.0 4.0
F| 5.0 5.0
G| 5.0 6.0
H| 6.0 5.0
I| 9.0 9.0
J| 10.0 9.0
K| 10.0 10.0
L| 11.0 9.0
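Although the actual source has its own reader, a minimal sketch of parsing this format in C might look as follows. The item_t structure and the read_items helper are hypothetical names used only for illustration and do not necessarily match the actual source.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical representation of one input line; the actual source may differ. */
typedef struct item_s {
    char label[16];   /* label string, e.g. "A" */
    float x, y;       /* coordinates in the two-dimensional Euclidean plane */
} item_t;

/* Read the item count, then one "label| x y" line per item.
 * Returns a calloc'd array of items, or NULL on a malformed file. */
item_t *read_items(FILE *in, int *num_items)
{
    if (fscanf(in, "%d", num_items) != 1 || *num_items < 1)
        return NULL;
    item_t *items = calloc(*num_items, sizeof(item_t));
    if (!items)
        return NULL;
    for (int i = 0; i < *num_items; ++i)
        if (fscanf(in, " %15[^|]| %f %f",
                   items[i].label, &items[i].x, &items[i].y) != 3) {
            free(items);
            return NULL;
        }
    return items;
}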

After running the clustering algorithm, we get the following hierarchy:

[Figure: Example agglomerative hierarchical clustering]

Normally, hierarchical clusters are displayed as dendrograms. However, we chose the above representation to make it easier to see how the clusters form in relation to the two-dimensional coordinates of the items being clustered.

The algorithm

The agglomerative hierarchical clustering algorithm is implemented as follows (a sketch of this flow in C is given after the list):

  1. Based on the linkage criterion, set the distance function.
  2. Parse the input file and read in the items to cluster.
  3. Initialise the cluster array and set the appropriate item counts.
  4. Calculate the Euclidean distances between item pairs.
  5. For each item in the input:

    1. Add a leaf node that represents a root cluster.
    2. Update the distances from this root cluster to its neighbours.
  6. While there is more than one cluster:

    1. Find a pair of clusters with the minimum distance.
    2. Merge them into a root cluster.
    3. Update the distances from the new root cluster to its neighbours.
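The following is a rough sketch of this flow in C. The helper functions (set_linkage, init_cluster, add_leaf, and so on) are hypothetical names that mirror the numbered steps; they do not necessarily match the functions in the actual source, and step 2 (parsing) is assumed to have already produced the items array described earlier.

/* Sketch only: the helper functions below are hypothetical placeholders
 * that mirror the numbered steps, not the actual implementation. */
cluster_t *agglomerate(int num_items, item_t *items, char linkage)
{
    set_linkage(linkage);                         /* step 1: choose the distance function */
    cluster_t *cluster = init_cluster(num_items); /* step 3: initialise the cluster array */
    compute_distances(cluster, items);            /* step 4: pairwise Euclidean distances */
    for (int i = 0; i < num_items; ++i) {
        add_leaf(cluster, &items[i]);             /* step 5.1: leaf node as a root cluster */
        update_neighbours(cluster, i);            /* step 5.2: distances to the other roots */
    }
    while (cluster->num_clusters > 1) {
        int first, second;
        find_closest_pair(cluster, &first, &second); /* step 6.1: minimum-distance pair */
        int merged = merge(cluster, first, second);  /* step 6.2: new root cluster */
        update_neighbours(cluster, merged);          /* step 6.3: refresh its distances */
    }
    return cluster;
}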

Representing the cluster

After all of the \(n\) items have been added to the cluster hierarchy, we get \(n\) root clusters. By root cluster, we mean a cluster that has not yet been merged into a higher-level cluster. Now, at every iteration we find two root clusters and merge them into a higher-level root cluster. The clusters that were merged are no longer treated as root clusters. With every merger, the number of root clusters is reduced by one. Hence, when we have finished building the cluster hierarchy, we will have carried out \((n - 1)\) mergers.

Because we are merging two clusters at a time, the complete cluster hierarchy can be represented by a binary tree, where leaves represent the items and internal nodes represent mergers. Since there are \(n\) leaf nodes representing the items, and we carry out \((n - 1)\) mergers, there are \((n - 1)\) internal nodes, one for each merger. Hence, the binary tree will have \((2n - 1)\) nodes. Furthermore, since the hierarchy grows by appending merger nodes, the binary tree can be represented by an array with \((2n - 1)\) elements.
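For the 12-item example above, for instance, the array has \(2 \times 12 - 1 = 23\) elements: because the leaves are added before any merger, indices 0 to 11 hold the leaf nodes and indices 12 to 22 hold the 11 mergers.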

In the following figure, the orange circles represent the items to be clustered. The green squares represent cluster nodes in the binary tree, which are stored in the cluster array. The integer value inside each square gives the array index of the cluster node. The clustering in this example was done with the single-linkage criterion (see below).

[Figure: Hierarchical cluster as a binary tree]

Based on this, we have the following data structure for representing the cluster:

struct cluster_s {
    int num_items; /* number of items being clustered */
    int num_clusters; /* current number of root clusters */
    int num_nodes; /* number of leaf and merged clusters */
    cluster_node_t *nodes; /* leaf and merged clusters */
    float **distances; /* distance between leaves */
};

In the above, num_items is fixed throughout the clustering and stores the number of items that are clustered. Variable num_clusters, on the other hand, stores the current number of root clusters. This is set to num_items when the clustering begins and should have the value 1 when clustering has finished. The total number of nodes currently in the binary tree is stored in num_nodes. It starts with a value of 0, gets incremented with the addition of a leaf cluster or a merger, and should equal \((2n - 1)\) when done. The nodes variable stores the array representing the binary tree. Finally, we calculate the pairwise distance matrix between the num_items items and store it in a two-dimensional matrix of real numbers. These are Euclidean distances calculated from the \((x, y)\) coordinates supplied in the input.
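As an illustration of this last step, the distance matrix could be filled in as shown below; fill_euclidean_distances is a hypothetical helper, and item_t is the illustrative structure used for the parsed input earlier.

#include <math.h>

/* Illustrative sketch: fill the symmetric matrix of Euclidean distances
 * between every pair of input items. */
static void fill_euclidean_distances(float **distances, int num_items,
                                     const item_t items[])
{
    for (int i = 0; i < num_items; ++i)
        for (int j = i; j < num_items; ++j) {
            float dx = items[i].x - items[j].x;
            float dy = items[i].y - items[j].y;
            distances[i][j] = distances[j][i] = sqrtf(dx * dx + dy * dy);
        }
}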

Representing a cluster node

Each node in the cluster hierarchy array stores the following information:

struct cluster_node_s {
    int type; /* type of the cluster node */
    int is_root; /* true if cluster hasn't merged with another */
    int height; /* height of node from the bottom */
    coord_t centroid; /* centroid of this cluster */
    char *label; /* label of a leaf node */
    int *merged; /* indexes of root clusters merged */
    int num_items; /* number of leaf nodes inside this cluster */
    int *items; /* indices of the leaf nodes inside this cluster */
    neighbour_t *neighbours; /* sorted linked list of distances to roots */
};

After the array representing the binary tree is initialised, the type of each of the \((2n - 1)\) elements is marked as NOT_USED. When an item is added for clustering, the corresponding element's type is marked as LEAF_NODE. When an element represents a merger, its type is marked as A_MERGER.

We use is_root to identify whether a cluster has already been merged into a higher-level cluster. When a cluster node hasn't been merged, is_root is set to 1. When it is merged, the value is set to 0. To help with further analysis, we also store the height of the node from the bottom of the binary tree; all of the leaves have height 0. The centroid of the cluster is stored in centroid. This is used during centroidal linkage (see below).

When an array element represents a leaf node, label stores the corresponding label supplied in the input. This field is empty for merger nodes. When an element represents a merger, merged stores the array indices of the two nodes that were merged to generate this new node. For nodes that represent either a leaf or a merger, num_items stores the number of leaf items inside the cluster, with the array indices of the corresponding leaf items stored in the items array. Finally, the neighbourhood linked list neighbours stores the distance between this root cluster and all of the other root clusters in the current iteration.

struct neighbour_s {
    int target; /* the index of cluster node representing neighbour */
    float distance; /* distance between the nodes */
    neighbour_t *next, *prev; /* linked list entries */
};

Since we choose root cluster pairs with the minimum distance in every iteration, we maintain the neighbourhood linked list as a sorted linked list, where the head always stores the neighbour with the least distance.
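A sketch of how such a sorted insertion might be written is shown below; insert_neighbour is a hypothetical helper, and the actual source may organise this differently.

/* Hypothetical sketch: insert a neighbour into the doubly linked list at
 * *head, keeping the list sorted by ascending distance. */
static void insert_neighbour(neighbour_t **head, neighbour_t *n)
{
    neighbour_t *curr = *head, *prev = NULL;
    while (curr && curr->distance <= n->distance) { /* find the insertion point */
        prev = curr;
        curr = curr->next;
    }
    n->prev = prev;
    n->next = curr;
    if (curr)
        curr->prev = n;
    if (prev)
        prev->next = n;
    else
        *head = n; /* the new neighbour is now the closest one */
}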

In this specific implementation, we assume that the distance from cluster \(i\) to cluster \(j\) and the distance from cluster \(j\) to cluster \(i\) are the same. This allows us to simplify the structure of the cluster nodes, so that the distance between root clusters \(i\) and \(j\) is always stored only once, in the neighbours linked list of the array element with index \(\max(i, j)\).

Choosing the minimum distance

There are four common linkage criteria for determining the distance between a pair of clusters. These are:

  • Single linkage: the distance between two root clusters A and B is the minimum distance between any leaf in A and any leaf in B.
  • Complete linkage: the distance between two root clusters A and B is the maximum distance between any leaf in A and any leaf in B.
  • Average linkage: the distance between two root clusters A and B is the average of the distances between every leaf in A and every leaf in B.
  • Centroidal linkage: the distance between two root clusters A and B is the distance between the centroids of A and B.

In our implementation, we initialise the function pointer distance_fptr to one of single_linkage, complete_linkage, average_linkage or centroid_linkage based on the command line parameter.
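As an illustration, single linkage only needs the pre-computed leaf-to-leaf distances and the leaf indices of the two clusters. The signature below is an assumption made for this sketch and may differ from the actual source.

#include <float.h>

/* Sketch of single linkage: the smallest leaf-to-leaf distance between two
 * clusters, where a[] and b[] hold the leaf indices of the two root clusters
 * and distances is the matrix computed from the input. (Signature assumed.) */
static float single_linkage(float **distances, const int a[], const int b[],
                            int m, int n)
{
    float min = FLT_MAX;
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            if (distances[a[i]][b[j]] < min)
                min = distances[a[i]][b[j]];
    return min;
}

Complete linkage is the same scan keeping the largest distance instead, and average linkage accumulates the sum of all the pairwise distances and divides by \(m \times n\).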

Note here that, although centroidal linkage allows us to define a more accurate centre for each cluster, it is not possible to define such a centroid for some types of data points. For instance, if we are clustering words using a distance metric such as the Damerau–Levenshtein distance, a centroid for a word cluster simply does not make sense. A centroid for a cluster of data points in an \(n\)-dimensional Euclidean space is itself a point in the same space; this is not the case for a cluster of words. In cases like these, we use one of the other linkage criteria.

Getting \(k\) clusters

We sometimes wish to generate \(k\) disjoint clusters from a given set of \(n\) input items. There are two ways of extracting \(k\) clusters using the algorithm described above. If all we are interested in is a single value of \(k\), we can simply stop merging once the number of root clusters, num_clusters, equals \(k\). On the other hand, if we are interested in extracting \(k\) clusters where \(k\) could vary, we use a retrospective approach that uses the complete cluster hierarchy tree, as described below.

To extract \(k\) disjoint clusters retrospectively, we must realise that this is akin to going back in time to the point when we had \(k\) root clusters, before further mergers were made. But how far do we go back? Since every merger reduces the number of root clusters by one, we need to go back \((k - 1)\) mergers. Furthermore, since every merger appends one node to the cluster array, we should discard the last \((k - 1)\) nodes of the array.

Of the remaining \((2n - k)\) nodes, we need to find the \(k\) root clusters. Since each of these was merged into one of the \((k - 1)\) nodes that we plan to discard, we can scan the array from index \((2n - 2)\) down to \((2n - k)\) and print out the child clusters that each node merged, if and only if the child node is not itself one of the nodes to be discarded.
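A sketch of this retrospective extraction is given below; print_k_clusters and print_cluster_items are hypothetical names, and the bookkeeping follows the indices described above.

/* Hypothetical sketch: print the k root clusters obtained by undoing the
 * last (k - 1) mergers. With 2n - 1 nodes in total, every node with index
 * >= first_discarded is one of the discarded mergers. */
void print_k_clusters(cluster_t *cluster, int k)
{
    if (k < 2) { /* k = 1 is simply the root of the whole hierarchy */
        print_cluster_items(cluster, cluster->num_nodes - 1); /* hypothetical */
        return;
    }
    int first_discarded = cluster->num_nodes - (k - 1);
    for (int i = cluster->num_nodes - 1; i >= first_discarded; --i) {
        cluster_node_t *node = &cluster->nodes[i];
        for (int j = 0; j < 2; ++j) /* inspect the two children it merged */
            if (node->merged[j] < first_discarded)
                print_cluster_items(cluster, node->merged[j]);
    }
}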

Acknowledgements

Further to the references on Wikipedia, this implementation was inspired by Matteo Matteucci's description of the algorithm.
