Skip to content

Examples

Clustering algorithms (clustbench)

We have ported the clustbench clustering benchmark (Gagolewski, 2022) to evaluate 105 datasets with a known ground truth and 27 methods using six partition metrics (see table below).

If present, we included noisy points during the clustering process, but ignored them when calculating the performance metrics.

We grouped datasets and methods according to their generator and/or software environment, hence writing modules able to run multiple methods on demand. The Clustering_conda.yml manifest makes use of this parametrization to specify the dataset/method/metric to be run from a benchmarking module.

Beyond Conda, we also designed EasyBuild and Apptainer execution environments to evaluate the impact of the software backend on benchmarking results, both in terms of algorithmic outcomes (e.g., clustering metrics) and computational performance.

Components

Stage Module Components Count
Data fcps atom, chainlink, engytime, hepta, lsun, target, tetra, twodiamonds, wingnut 9
graves graves1--graves12 12
other aggregation, aniso, blobs, circles, complex9v1--complex9v55 59
sipu a1, a2, a3, a4, dim032, dim064, dim128, dim256, g2--g6, s1--s4, unbalance, triangle1, triangle2 20
uci iris, wine, yeast 3
wut spiral, zigzag_outliers 2
Clustering fastcluster complete, ward, average, weighted, median, centroid 6
sklearn birch, kmeans, spectral, gm 4
agglomerative average, complete, ward 3
genieclust genie, gic, ica 3
fcps Minimax, MinEnergy, HDBSCAN_2/4/8, Diana, Fanny, Hardcl, Softcl, Clara, PAM 11
Metrics partition_metrics normalized_clustering_accuracy, adjusted_fm_score, adjusted_mi_score, adjusted_rand_score, fm_score, mi_score 6

Running clustbench with Omnibenchmark

To run the benchmark using Conda as a software backend:

git clone git@github.com:omnibenchmark/clustering_example.git
cd clustering_example

ob run Clustering_conda.yml