SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method

Main point Sibson gives an O(n 2) algorithm for single-linkage clustering, and proves that this algorithm achieves the theoretically optimal lower time bound for obtaining a single-linkage dendrogram. This improves upon the naive O(n 3) implementation of single linkage clustering. A single linkage dendrogram is a tree, where each level of the tree corresponds to a different threshold dissimilarity measure h. The nodes of a dataset are grouped into " equivalence classes " c(h) at each level of the dendrogram, where two classes C i and C j are merged if there is a pair of " OTU's " (vertices) v i ∈ C i and v j ∈ C j such that the dissimilarity measure between v i and v j is less than h, or D(v i , v j) < h. For example, consider a set of 10 vertices v 1 ,. .. , v 10 for which the dissimilarity matrix D is given below, with D ij equal to the dissimilarity between v i and v j. Suppose we take four cutoff dissimilarity measures h 1 , h 2 , h 3 , h 4 and produce the dendrogram according to these thresholds. An example illustrating how the 10 vertices are grouped into equivalence classes at each level is shown in Figure 1. Since no dissimilarity is at or below 1, each vertex or " OTU " is its own equivalence class at the level corresponding to h 1 = 1. At the next level, however, we see that some classes have been merged together because several dissimilarity measures are below h 2 = 2. We can see that c(h 2) consists of 6 equivalence classes, c(h 3) has 3 equivalence classes, and c(h 4 = 4) aggregates all the vertices into one equivalence class. In single linkage clustering, the number of levels in the tree is determined by the nearest-neighbor criterion – at each level, at least one new merge is made between two clusters, and the merge is made for clusters C i and C j if the minimal distance between vertices v i ∈ C i and v j ∈ C j is the smallest such distance across all the clusters. In other words, the nearest neighbors between clusters C j and C i are found, and if these neighbors are closer than all the other nearest-neighbor pairs, then C i and C …