Bastian Rieck and Heike Leitte Exploring and comparing clusterings of multivariate data sets using persistent homology
Abstract: Clustering algorithms support exploratory data analysis by grouping inputs that share similar features. Especially the clustering of unlabelled data is said to be a fiendishly difficult problem, because users not only have to choose a suitable clustering algorithm but also a suitable number of clusters. The known issues of existing clustering validity measures comprise instabilities in the presence of noise and restrictive assumptions about cluster shapes. In addition, they cannot evaluate individual clusters locally. We present a new measure for assessing and comparing different clusterings both on a global and on a local level. Our measure is based on the topological method of persistent homology, which is stable and unbiased towards cluster shapes. Based on our measure, we also describe a new visualization that displays similarities between different clusterings (using a global graph view) and supports their comparison on the individual cluster level (using a local glyph view). We demonstrate how our visualization helps detect different—but equally valid—clusterings of data sets from multiple application domains.