
This document has two parts:

  • the first part aims at clarifying relations between dissimilarity and similarity methods for hierarchical agglomerative clustering (HAC) and at explaining implementation choices in adjclust;

  • the second part describes the different types of dendrograms that are implemented in plot.chac.

In this document, we assume that we are given $n$ objects, $\{1, \ldots, n\}$, that have to be clustered using adjacency-constrained HAC (CHAC), that is, in such a way that only adjacent objects/clusters can be merged.

We refer to [5] for a comprehensive treatment of the applicability and interpretability of Ward’s hierarchical agglomerative clustering with or without contiguity constraints.

Notes on relations between similarity and dissimilarity implementation

Basic implementation of CHAC in adjclust

The basic implementation of adjclust takes, as an input, a kernel $k$ which is supposed to be symmetric and positive (in the kernel sense). If your data are in this format, then the constrained clustering can be performed with

fit <- adjClust(k, type = "similarity")

or with

fit <- adjClust(k, type = "similarity", h = h)

if, in addition, the kernel $k$ is supposed to have only null entries outside of a diagonal band of size $h$.

The implementation is the one described in [1].
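
As a self-contained illustration (a minimal sketch with simulated data; the toy matrix X and the object names are ours, not part of the package documentation), the basic usage is:

library(adjclust)

set.seed(1)
# A toy symmetric positive definite matrix (a valid kernel) on 10 objects:
# crossprod(X) is positive semi-definite and the small ridge makes it
# positive definite.
X <- matrix(rnorm(50), nrow = 5, ncol = 10)
k <- crossprod(X) + diag(0.1, 10)

fit <- adjClust(k, type = "similarity")
plot(fit)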

More advanced use of kernel or similarity matrices

Non-positive but normalized similarities

In this section, the available data set is a matrix $s$ that can either have only positive entries (in this case it is called a similarity) or both positive and non-positive entries. If, in addition, the matrix $s$ is normalized, i.e., $s(i,i) + s(j,j) - 2s(i,j) \geq 0$ for all $i,j = 1, \ldots, n$, then the algorithm implemented in adjclust can be applied directly, just as for a standard kernel (section 1). This section explains why this is the case.

The interpretation is similar to the kernel case, under the assumption that pairs of objects with small or strongly negative similarity values are less expected to be clustered together than pairs with large similarity values. The application of the method is justified by the fact that, for a given matrix $s$ described as above, we can find a $\lambda > 0$ such that the matrix $k_\lambda$ defined by $$\forall\, i,j = 1, \ldots, n, \qquad k_\lambda(i,j) = s(i,j) + \lambda \mathbb{1}_{\{i=j\}}$$ is a kernel (i.e., the matrix $k_\lambda = s + \lambda I$ is positive definite); indeed, this is the case for any $\lambda$ larger than the opposite of the smallest eigenvalue of $s$. [3] shows that the HAC obtained from the distance induced by the kernel $k_\lambda$ in its feature space and the HAC obtained from the ad hoc dissimilarity defined by $$\forall\, i,j = 1, \ldots, n, \qquad d(i,j) = \sqrt{s(i,i) + s(j,j) - 2s(i,j)}$$ are identical, except that all the merging levels are shifted by $\lambda$.

In conclusion, to address this case, the command lines that have to be used are the ones described in section 1.
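
As an illustration of this property (a minimal sketch, not taken from the package; the toy matrix and object names are ours), one can check numerically that shifting the diagonal of a normalized similarity by $\lambda$ leaves the merges unchanged and shifts the heights:

library(adjclust)

set.seed(2)
# A centered Gram matrix: normalized by construction
# (s(i,i) + s(j,j) - 2 s(i,j) = ||x_i - x_j||^2 >= 0),
# possibly with negative off-diagonal entries.
X <- matrix(rnorm(60), nrow = 6, ncol = 10)
s <- crossprod(scale(X, scale = FALSE))

# Any lambda larger than the opposite of the smallest eigenvalue of s
# makes s + lambda * I positive definite (a proper kernel).
lambda <- max(0, -min(eigen(s, symmetric = TRUE)$values)) + 1
k_lambda <- s + diag(lambda, ncol(s))

fit_s <- adjClust(s, type = "similarity")
fit_k <- adjClust(k_lambda, type = "similarity")

all.equal(fit_s$merge, fit_k$merge)   # same merges
fit_k$height - fit_s$height           # merging levels shifted by lambda (see [3])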

Non-normalized similarities

Suppose now that the data set is described by a matrix $s$ as in the previous section, except that this similarity matrix is not normalized, meaning that there is at least one pair $(i,j)$ such that $$2s(i,j) > s(i,i) + s(j,j).$$

The package then performs the following pre-transformation: a matrix $s^{*}$ is defined as $$\forall\, i,j = 1, \ldots, n, \qquad s^{*}(i,j) = s(i,j) + \lambda \mathbb{1}_{\{i=j\}}$$ for a $\lambda$ large enough to ensure that $s^{*}$ is normalized. In the package, $\lambda$ is chosen as $$\lambda := \epsilon + \max_{i,j} \left(2s(i,j) - s(i,i) - s(j,j)\right)_+$$ for a small $\epsilon > 0$. This case is justified by the property described in Section 2.1 (Non-positive but normalized similarities): shifting the diagonal entries of a similarity matrix does not change the HAC result, so the diagonal can be shifted until the matrix induces a proper ad hoc dissimilarity. The transformation affects only the heights (making them all positive), and the two command lines described in the first section of this note are still valid.
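
For illustration, the shift described above can be computed in a few lines of R (a sketch of the formula only, not the package's internal code; the function name normalization_shift is ours):

# lambda = eps + max over (i, j) of the positive part of
#          2 s(i,j) - s(i,i) - s(j,j)
normalization_shift <- function(s, eps = 1e-10) {
  gap <- 2 * s - outer(diag(s), diag(s), "+")
  eps + max(pmax(gap, 0))
}

set.seed(5)
s <- matrix(runif(25, -1, 1), 5, 5)
s <- (s + t(s)) / 2                           # symmetric but not normalized
s_star <- s + diag(normalization_shift(s), nrow(s))

# s_star is now normalized: all these quantities are >= 0
min(outer(diag(s_star), diag(s_star), "+") - 2 * s_star)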

Case of dissimilarity data

The original implementation of (unconstrained) HAC in stats::hclust takes as input a dissimilarity matrix. However, the implementation of adjclust is based on a kernel/similarity approach. We describe in this section how the dissimilarity case is handled.

Suppose that we are given a dissimilarity $d$ which satisfies:

  • $d$ has non-negative entries: $d(i,j) \geq 0$ for all $i,j = 1, \ldots, n$;

  • $d$ is symmetric: $d(i,j) = d(j,i)$ for all $i,j = 1, \ldots, n$;

  • $d$ has a null diagonal: $d(i,i) = 0$ for all $i = 1, \ldots, n$.

Any sequence of positive numbers $(a_i)_{i=1,\ldots,n}$ would provide a similarity $s$ for which $d$ is the ad hoc dissimilarity by setting $$\left\{ \begin{array}{l} s(i,i) = a_i\\ s(i,j) = \frac{1}{2} \left(a_i + a_j - d^2(i,j)\right) \end{array} \right.$$ By definition, such an $s$ is normalized and any choice for $(a_i)_{i=1,\ldots,n}$ yields the same clustering (since they all correspond to the same ad hoc dissimilarity). The arbitrary choice $a_i = 1$ for all $i = 1, \ldots, n$ has thus been made.

The basic and the sparse implementations are both available with, respectively,

fit <- adjClust(d, type = "dissimilarity")

and

fit <- adjClust(d, type = "dissimilarity", h = h)
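
As an illustration (a minimal sketch with toy data; the object names are ours), the conversion described above can be checked by comparing the dissimilarity and similarity entry points of adjClust:

library(adjclust)

set.seed(4)
d <- dist(matrix(rnorm(40), nrow = 10))   # toy dissimilarity on 10 objects

# Similarity induced by d with the arbitrary choice a_i = 1:
# s(i,j) = (1 + 1 - d^2(i,j)) / 2, so that s(i,i) = 1
s <- 1 - as.matrix(d)^2 / 2

fit_d <- adjClust(d, type = "dissimilarity")
fit_s <- adjClust(s, type = "similarity")

all.equal(fit_d$merge, fit_s$merge)       # same merges
all.equal(fit_d$height, fit_s$height)     # same heights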

Options for displaying the dendrogram

In this section, we suppose that we are given a Euclidean distance $d$ between objects (even though the results described in this section are not specific to this case, they are more easily described within this framework). Ward’s criterion, which is implemented in adjclust, aims at minimizing the Error Sum of Squares (ESS), which is equal to $$\mbox{ESS}(\mathcal{C}) = \sum_{C \in \mathcal{C}} \sum_{i \in C} d^2(i, g_C)$$ where $\mathcal{C}$ is the clustering and $g_C = \frac{1}{\mu_C} \sum_{i \in C} i$ is the center of gravity of the cluster $C$ with $\mu_C$ elements [6]. In the sequel, we will denote:

  • the within-cluster dispersion which, for a given cluster $C$, is equal to $I(C) = \sum_{i \in C} d^2(i, g_C)$. We can prove that $I(C) = \frac{1}{2\mu_C} \sum_{i,j \in C} d^2(i,j)$ (see [4] for instance);

  • the average within-cluster dispersion which is equal to $\frac{I(C)}{\mu_C}$ and corresponds to the cluster variance.

Usually, the results of standard HAC are displayed in the form of a dendrogram in which the height of each merge corresponds to the linkage criterion $$\delta(A,B) = I(A \cup B) - I(A) - I(B)$$ of that merge. This criterion corresponds to the increase in total dispersion (ESS) that occurs by merging the two clusters $A$ and $B$. However, for constrained HAC, there is no guarantee that this criterion is non-decreasing (see [2] for instance) and thus the dendrogram built using this method can contain reversals in its branches. This is the default option in plot.chac (corresponding to mode = "standard"). To provide dendrograms that are easier to interpret, alternative options have been implemented in the package: the first one is a simple correction of the standard method, and the three others are suggested by [3].
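
As a side remark (this identity is not stated above but is a standard property of Ward’s linkage in the Euclidean setting), the criterion can equivalently be written in terms of the centers of gravity of the two clusters: $$\delta(A,B) = I(A \cup B) - I(A) - I(B) = \frac{\mu_A \mu_B}{\mu_A + \mu_B}\, d^2(g_A, g_B),$$ which makes explicit that a merge is costly when the two clusters are large and their centers are far apart.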

In the sequel, we denote by $(m_t)_{t=1,\ldots,n-1}$ the series of linkage criterion values obtained during the clustering.

mode = "corrected"

This option simply corrects the heights by adding the minimal value making them non-decreasing. More precisely, if at a given step $t \in \{2, \ldots, n-1\}$ of the clustering we have $m_t < m_{t-1}$, then we define the corrected heights as $$\tilde{m}_{t'} = \left\{ \begin{array}{ll} m_{t'} & \textrm{if } t' < t\\ m_{t'} + (m_{t-1} - m_t) & \textrm{otherwise.} \end{array} \right.$$ This correction is performed iteratively for all decreasing merges, ensuring a visually increasing dendrogram.
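
The correction can be sketched in a few lines of R (this mirrors the description above but is not necessarily the package's internal code; the function name is ours): each decrease is compensated by shifting all subsequent heights upwards by the size of that decrease.

correct_heights <- function(m) {
  # add, at each step, the cumulated sizes of all previous decreases
  m + cumsum(c(0, pmax(m[-length(m)] - m[-1], 0)))
}

correct_heights(c(1, 3, 2, 5, 4))   # 1 3 3 6 6: non-decreasing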

mode = "total-disp"

This option represents the dendrogram using the total dispersion (that is, the objective function) at every level of the clustering. It can easily be proved that the total dispersion is equal to $\mbox{ESS}_t = \sum_{t' \leq t} m_{t'}$ and that this quantity is always non-decreasing. This is the quantity recommended by [2] to display the dendrogram.

mode = "within-disp"

This option represents a cluster-specific criterion by using the within-cluster dispersion of the two clusters being merged at every given step of the algorithm. It can be proved that this quantity is also non-decreasing, but it depends strongly on the cluster sizes, leading to a flattened dendrogram in most cases.

mode = "average-disp"

This last option addresses the dependency on cluster sizes of the previous method ("within-disp") by using the average within-cluster dispersion of the two clusters being merged at every given step of the algorithm. This criterion is also cluster specific but does not guarantee the absence of reversals in heights.
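
The five representations can be compared on a toy example (a minimal sketch; the simulated similarity matrix is ours, and each call below produces one dendrogram):

library(adjclust)

set.seed(3)
# toy positive definite similarity matrix on 10 objects
sim <- crossprod(matrix(rnorm(80), nrow = 8, ncol = 10)) + diag(0.1, 10)
fit <- adjClust(sim, type = "similarity")

modes <- c("standard", "corrected", "total-disp", "within-disp", "average-disp")
for (m in modes) plot(fit, mode = m)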

Relations with ‘hclust’ and ‘rioja’

As documented in [4], the call to hclust(..., method = "ward.D") implicitly assumes that ... is a squared distance matrix. As explained above, we did not make such an assumption, so hclust(d^2, method = "ward.D") and adjClust(d, type = "dissimilarity") give identical results when the ordering of the (unconstrained) clustering is compatible with the natural ordering of objects used as a constraint. In addition, since hclust(..., method = "ward.D2") uses $\sqrt{m_t}$ as the linkage, hclust(d, method = "ward.D2") and adjClust(d, type = "dissimilarity") give identical results for the merges, and the slot height of the first is the square root of the slot height of the second, again when the ordering of the (unconstrained) clustering is compatible with the natural ordering of objects used as a constraint.

Finally, rioja uses $\mbox{ESS}_t$ to display the heights of the dendrogram (because, as documented above, this quantity is non-decreasing in the Euclidean case, even for constrained clusterings). Hence, rioja::chclust(d, method = "coniss") and adjClust(d, type = "dissimilarity") give identical results for the merges, and the slot height of the first is the cumulative sum of the slot height of the second.
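
These relations can be checked numerically (a minimal sketch; it assumes that the rioja package is installed, and the toy 1-D data are ours):

library(adjclust)

set.seed(6)
x <- sort(rnorm(10))                      # ordered 1-D points
d <- dist(x)

fit_adj <- adjClust(d, type = "dissimilarity")

# Ward's HAC on the squared distances: identical results are expected when
# the unconstrained ordering is compatible with the natural ordering of the
# objects, as should be the case for ordered 1-D points.
fit_hc <- hclust(d^2, method = "ward.D")
all.equal(fit_hc$height, fit_adj$height)

# CONISS (constrained clustering from rioja): same merges, cumulated heights.
fit_con <- rioja::chclust(d, method = "coniss")
all.equal(fit_con$height, cumsum(fit_adj$height))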

References

[1] Ambroise C., Dehman A., Neuvial P., Rigaill G., and Vialaneix N. (2019). Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics. Algorithms for Molecular Biology, 14, 22.

[2] Grimm, E.C. (1987) CONISS: a FORTRAN 77 program for stratigraphically constrained cluster analysis by the method of incremental sum of squares. Computers & Geosciences, 13(1), 13-35.

[3] Miyamoto S., Abe R., Endo Y., Takeshita J. (2015) Ward method of hierarchical clustering for non-Euclidean similarity measures. In: Proceedings of the VIIth International Conference of Soft Computing and Pattern Recognition (SoCPaR 2015).

[4] Murtagh, F. and Legendre, P. (2014) Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31, 274-295.

[5] Randriamihamison N., Vialaneix N., and Neuvial P. (2020). Applicability and interpretability of Ward’s hierarchical agglomerative clustering with or without contiguity constraints. Journal of Classification, 38, 1-27.

[6] Ward, J.H. (1963) Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236-244.