Implementing the DBSCAN clustering algorithm

September 04, 2015

In this note, we implement the DBSCAN clustering algorithm. DBSCAN stands for Density-based spatial clustering of applications with noise. This implementation is based on the original paper by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, A density-based algorithm for discovering clusters in large spatial databases with noise [Proceedings of KDD-96, AAAI Press, 1996]. It will help to learn the concepts defined in this paper before studying this implementation.

The DBSCAN clustering algorithm is one of the few clustering algorithms that allows us to find clusters within clusters, like the ones shown below.

The algorithm essentially works like the flow of liquids on a terrain. It starts at a point on the terrain and flows over the terrain where there is least resistance. The resulting cluster is the area covered by the liquid. Now imagine different liquids with varying densities which cover the terrain in layers. Try it here:

Epsilon:
Min pts:
Clear points

The first input supplied to the DBSCAN clustering algorithm is the set of data points that we wish to cluster. These can be anything, as long as we can define a distance function between the points, which is the second input. For instance, if we wish to cluster points on the two-dimensional plane using proximity as the clustering property, we use a distance function such as the Euclidean distance. On the other hand, if we wish to cluster pixels of the same intensity on an image, we use an intensity distance function. The third input configures the sensitivity of the clustering algorithm. This affects the manner in which the algorithm decides if two data points are quite similar or too different. Finally, we specify another sensitivity factor that determines whether a new cluster should start at a given data point, depending on the density of nearby data points. These will become clearer in the following sections.

Let us begin with the set of data points to cluster, denoted as \(D\). In this implementation, we will maintain all of the supplied data points as an array. The algorithm will manipulate these data points by using the corresponding array indices. We shall parse the input supplied to the algorithm, and store the data points using the struct point_s data structure.

In this implementation, we are using three-dimensional data points; however, we can also use any \(n\)-dimensional data points as long as a distance function can be defined between any two points in \(D\). When the algorithms terminates successfully, cluster_id stores the cluster to which the data point belongs.

<Data structures>=
typedef struct point_s point_t;
struct point_s {
    double x, y, z;
    int cluster_id;
};

What are the possible values of cluster_id? At the beginning before running the algorithm, the cluster_id value for all of the data points are initialised to UNCLASSIFIED. This essentially means that the data point needs to be looked at as it currently does not belong to any cluster. When a data point is completely isolated and could not form a cluster, it is treated as noise. These data point are assigned a cluster_id that equals NOISE. Any non-negative value identifies the cluster to which the data point belongs.

<Constant definitions>=
#define UNCLASSIFIED -1
#define NOISE -2

If an UNCLASSIFIED data point starts a new cluster, it is considered to be a core data point. We shall use the following values to decide whether to start a new cluster with the data point currently under investigation, or to continue by adding the data point to the current cluster.

<Constant definitions>+=
#define CORE_POINT 1
#define NOT_CORE_POINT 0

We use the following when a function should return success or failure status. The value of FAILURE must not coincide with the previous return values.

<Constant definitions>+=
#define SUCCESS 0
#define FAILURE -3

One of the fundamental aspects of the DBSCAN algorithm is the process of finding a given data point's epsilon neighbours. Essentially, the epsilon neighbours of a data point \(p\) is the set of data points \(S \subset D\) such that \(\delta(p, q) \le \epsilon\) for \(q \in S\) and \(q \ne p\). Here, \(\delta\) is the distance function, and \(\epsilon\) is the distance sensitivity of the algorithm, as discussed in the introduction. The figure above shows example epsilon neighbours for two-dimensional data points using the Euclidean distance.

The DBSCAN algorithm processes members of the epsilon neighbours sequentially. Hence, we will use a single linked list data structure to store the epsilon neighbours of a data point. As shown below, each node in the single linked list points to a data point. The index field stores the index of the data point in the data points array.

<Data structures>+=
typedef struct node_s node_t;
struct node_s {
    unsigned int index;
    node_t *next;
};

We use the following function to create a new linked list node that points to the supplied data point.

<Function definitions>=
node_t *create_node(unsigned int index)
{
    node_t *n = (node_t *) calloc(1, sizeof(node_t));
    if (n == NULL)
        perror("Failed to allocate node.");
    else {
        n->index = index;
        n->next = NULL;
    }
    return n;
}

To store the epsilon neighbours of a data point, we use the struct epsilon_neighbours_s data structure. This encapsulates the single linked list of nodes, and augments it with the number of nodes in the linked list. The latter is used to check if the data point against which this epsilon neighbourhood was determined is a core data point. The head points to the start of the single linked list; the tail points to the end. We use the tail to append nodes at the end of the single linked list.

<Data structures>+=
typedef struct epsilon_neighbours_s epsilon_neighbours_t;
struct epsilon_neighbours_s {
    unsigned int num_members;
    node_t *head, *tail;
};

We use the following to append a node that points to the data point at index, at the end of the linked list encapsulated by the epsilon neighbours instance en.

<Function definitions>+=
int append_at_end(
     unsigned int index,
     epsilon_neighbours_t *en)
{
    node_t *n = create_node(index);
    if (n == NULL) {
        free(en);
        return FAILURE;
    }
    if (en->head == NULL) {
        en->head = n;
        en->tail = n;
    } else {
        en->tail->next = n;
        en->tail = n;
    }
    ++(en->num_members);
    return SUCCESS;
}

We now define a function that returns the epsilon neighbours of a data point. As discussed in the previous sections, epsilon neighbours are data points that are no further than epsilon from the supplied data point. The distance is determined by the supplied distance function. The input data that are supplied to the get_epsilon_neighbours function are the index of the data point that we wish to find the neighbours for, the array of data points points, the number of data points num_points in the array, the epsilon value to use as proximity cut-off and the distance function dist.

<Function definitions>+=
epsilon_neighbours_t *get_epsilon_neighbours(
    unsigned int index,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    double (*dist)(point_t *a, point_t *b))
{
    epsilon_neighbours_t *en = (epsilon_neighbours_t *)
        calloc(1, sizeof(epsilon_neighbours_t));
    if (en == NULL) {
        perror("Failed to allocate epsilon neighbours.");
        return en;
    }
    for (int i = 0; i < num_points; ++i) {
        if (i == index)
            continue;
        if (dist(&points[index], &points[i]) > epsilon)
            continue;
        else {
            if (append_at_end(i, en) == FAILURE) {
                destroy_epsilon_neighbours(en);
                en = NULL;
                break;
            }
        }
    }
    return en;
}

The following prints the epsilon neighbours to stdout.

<Function definitions>+=
void print_epsilon_neighbours(
    point_t *points,
    epsilon_neighbours_t *en)
{
    if (en) {
        node_t *h = en->head;
        while (h) {
            printf("(%lfm, %lf, %lf)\n",
                   points[h->index].x,
                   points[h->index].y,
                   points[h->index].z);
            h = h->next;
        }
    }
}

The following function destroys the epsilon neighbours by freeing up the container and encapsulated linked list nodes. We need this as new epsilon neighbours are created and destroyed as we process each of the data points.

As we can see, this is an inefficient way to manage memory. We should reuse already allocated nodes without freeing them. See memory management using memory areas, as described by Donald Knuth in The Stanford GraphBase: A Platform for Combinatorial Computing [New York: ACM Press, 1994]. In this implementation, we use the inefficient but simpler version.

<Function definitions>+=
void destroy_epsilon_neighbours(epsilon_neighbours_t *en)
{
    if (en) {
        node_t *t, *h = en->head;
        while (h) {
            t = h->next;
            free(h);
            h = t;
        }
        free(en);
    }
}

We shall now describe the main iteration of the DBSCAN algorithm. This function takes as input the array of data points points, the number of data points in the array num_points, the epsilon value for proximity sensitivity, the minimum number of neighbours minpts required to class a data point as a core data point, and the distance function dist.

At each iteration, we check if the data point is UNCLASSIFIED. If it is, we determine if a new cluster can be created starting with this data point. The expand function does this. If a cluster is started, i.e., the data point was classified as a core data point, expand returns CORE_POINT; otherwise, it returns NOT_CORE_POINT. If expand returns CORE_POINT, it also means that the next data point should start a new cluster, as the current cluster is complete. Recall the flow-of-liquid analogy that we discussed in the introduction.

<Function definitions>+=
void dbscan(
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b))
{
    unsigned int i, cluster_id = 0;
    for (i = 0; i < num_points; ++i) {
        if (points[i].cluster_id == UNCLASSIFIED) {
            if (expand(i, cluster_id, points,
                       num_points, epsilon, minpts,
                       dist) == CORE_POINT)
                ++cluster_id;
        }
    }
}

Let us now look at what the expand function does. It takes as input the index of the data point that we wish to consider. In case it turns out to be a core data point, it must know the cluster that it should either start, or join as a new member. This is supplied as cluster_id. Then it takes the array of data points points, the number of data points num_points, the epsilon sensitivity, the minpts for classification as core data point, and finally the distance function dist.

The function expand first identifies the epsilon neighbours of the supplied data point. It then checks if there are enough neighbours to classify the data point as a core data point. If it is insufficient, the data point is marked tentatively as NOISE and the function returns NO_CORE_POINT. Future iterations due to the processing of other data points might update this classification.

On the other hand, if the data point has sufficient epsilon neighbours, we start a new cluster and continue spreading the cluster (just like the flow of liquid) as long as the sensitivity conditions are satisfied. Once all of the data points that should belong to the new cluster have been added, the function returns CORE_POINT, which is an indication that the next core data point should start a new cluster.

<Function definitions>+=
int expand(
    unsigned int index,
    unsigned int cluster_id,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b))
{
    int return_value = NOT_CORE_POINT;
    epsilon_neighbours_t *seeds =
        get_epsilon_neighbours(index, points,
                               num_points, epsilon,
                               dist);
    if (seeds == NULL)
        return FAILURE;

    if (seeds->num_members < minpts)
        points[index].cluster_id = NOISE;
    else {
        points[index].cluster_id = cluster_id;
        <Add epsilon neighbours to the same cluster>
        <See how far the cluster spreads>
        return_value = CORE_POINT;
    }
    destroy_epsilon_neighbours(seeds);
    return return_value;
}

Make all of the epsilon neighbours of this data point also belong to the same cluster.

<Add epsilon neighbours to the same cluster>=
node_t *h = seeds->head;
while (h) {
    points[h->index].cluster_id = cluster_id;
    h = h->next;
}

Now process all of the epsilon neighbours and see how far the cluster should spread.

<See how far the cluster spreads>=
h = seeds->head;
while (h) {
    spread(h->index, seeds, cluster_id, points,
           num_points, epsilon, minpts, dist);
    h = h->next;
}

Now we present the final component of the algorithm, where we determine how far the current cluster should spread. The spread function takes as input the index of the epsilon neighbour being considered, the epsilon neighbours seeds of the core data point that started the current cluster, the current cluster identifier cluster_id, the array of data points points, the number of data points in the array num_points, the epsilon sensitivity, the minpts core data point criterion and finally the distance function dist.

<Function definitions>+=
int spread(
    unsigned int index,
    epsilon_neighbours_t *seeds,
    unsigned int cluster_id,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b))
{
    epsilon_neighbours_t *spread =
        get_epsilon_neighbours(index, points,
                       num_points, epsilon,
                       dist);
    if (spread == NULL)
        return FAILURE;
    <Process epsilon neighbours of neighbour>
    destroy_epsilon_neighbours(spread);
    return SUCCESS;
}

For every epsilon neighbour that the algorithm adds to the current cluster, it also processes their epsilon neighbours. If it turns out that the epsilon neighbour is itself a core data point, we process it's neighbours by appending them at the end of the current list of data points to be processed.

If the epsilon neighbour of a newly added core data point is classified as either NOISE or UNCLASSIFIED, they are added to the current cluster. Furthermore, if we encounter an UNCLASSIFIED data point, we append this at the end of the current core data point's epsilon neighbours seeds. Remember that this is the epsilon neighbours of the core point that started the current cluster. Hence, as the algorithms progresses seeds grows, as directly reachable data points are appended.

<Process epsilon neighbours of neighbour>=
if (spread->num_members >= minpts) {
    node_t *n = spread->head;
    point_t *d;
    while (n) {
        d = &points[n->index];
        if (d->cluster_id == NOISE ||
            d->cluster_id == UNCLASSIFIED) {
            if (d->cluster_id == UNCLASSIFIED) {
                if (append_at_end(n->index, seeds)
                    == FAILURE) {
                    destroy_epsilon_neighbours(spread);
                    return FAILURE;
                }
            }
            d->cluster_id = cluster_id;
        }
        n = n->next;
    }
}

What we need now is the definition of a distance function. The following function implements Euclidean distance between two three-dimensional data points. This can be use for clustering based on spatial proximity.

<Function definitions>+=
double euclidean_dist(point_t *a, point_t *b)
{
    return sqrt(pow(a->x - b->x, 2) +
            pow(a->y - b->y, 2) +
            pow(a->z - b->z, 2));
}

The following function parses the data points from file, and stores the data points inside the array points.

<Function definitions>+=
unsigned int parse_input(
    FILE *file,
    point_t **points,
    double *epsilon,
    unsigned int *minpts)
{
    unsigned int num_points, i = 0;
    fscanf(file, "%lf %u %u\n",
           epsilon, minpts, &num_points);
    point_t *p = (point_t *)
        calloc(num_points, sizeof(point_t));
    if (p == NULL) {
        perror("Failed to allocate points array");
        return 0;
    }
    while (i < num_points) {
          fscanf(file, "%lf %lf %lf\n",
                 &(p[i].x), &(p[i].y), &(p[i].z));
          p[i].cluster_id = UNCLASSIFIED;
          ++i;
    }
    *points = p;
    return num_points;
}

The following function prints out the data points and cluster identifier for that data point to stdout.

<Function definitions>+=
void print_points(
    point_t *points,
    unsigned int num_points)
{
    unsigned int i = 0;
    printf("Number of points: %u\n"
        " x     y     z     cluster_id\n"
        "-----------------------------\n"
        , num_points);
    while (i < num_points) {
          printf("%5.2lf %5.2lf %5.2lf: %d\n",
                 points[i].x,
                 points[i].y, points[i].z,
                 points[i].cluster_id);
          ++i;
    }
}

Finally, the main function that parses the input data points, runs the DBSCAN algorithm and then prints the clusters.

<The main program>=
int main(void) {
    point_t *points;
    double epsilon;
    unsigned int minpts;
    unsigned int num_points =
        parse_input(stdin, &points, &epsilon, &minpts);
    if (num_points) {
        dbscan(points, num_points, epsilon,
               minpts, euclidean_dist);
        printf("Epsilon: %lf\n", epsilon);
        printf("Minimum points: %u\n", minpts);
        print_points(points, num_points);
    }
    free(points);
    return 0;
}

Forward declaration of functions so that we can define and explain the functions out of compilation order.

<Function declarations>=
node_t *create_node(unsigned int index);
int append_at_end(
     unsigned int index,
     epsilon_neighbours_t *en);
epsilon_neighbours_t *get_epsilon_neighbours(
    unsigned int index,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    double (*dist)(point_t *a, point_t *b));
void print_epsilon_neighbours(
    point_t *points,
    epsilon_neighbours_t *en);
void destroy_epsilon_neighbours(epsilon_neighbours_t *en);
void dbscan(
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b));
int expand(
    unsigned int index,
    unsigned int cluster_id,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b));
int spread(
    unsigned int index,
    epsilon_neighbours_t *seeds,
    unsigned int cluster_id,
    point_t *points,
    unsigned int num_points,
    double epsilon,
    unsigned int minpts,
    double (*dist)(point_t *a, point_t *b));
double euclidean_dist(point_t *a, point_t *b);
double adjacent_intensity_dist(point_t *a, point_t *b);
unsigned int parse_input(
    FILE *file,
    point_t **points,
    double *epsilon,
    unsigned int *minpts);
void print_points(
    point_t *points,
    unsigned int num_points);

<Include libraries>=
#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

<dbscan>=
<Include libraries>
<Constant definitions>
<Data structures>
<Function declarations>
<Function definitions>
<The main program>

Source code and print

A printable version of this note is available here. You can download the source code for C and JavaScript at:

git clone https://github.com/gyaikhom/dbscan.git
comments powered by Disqus