import numpy as np
import matplotlib.pyplot as plt
import beam
from beam.datasets import load_duo2018
from beam.cards import properties_for
from beam.mcda import (
run_from_registry,
critical_difference,
smaa,
smallest_weight_perturbation,
)Duo 2018 clustering benchmark, walkthrough
Goal
This vignette re-analyses the fourteen single-cell clustering methods benchmarked in Duo, Robinson and Soneson (2018) with the beam metric registry and the MCDA decision module. It loads the published method by dataset by metric tensor, pulls the metric cards, and pools each metric across the twelve datasets using the rule its card recommends. It then runs the MCDA pipeline and checks how stable the ranking is under different weightings, aggregation methods, and weight perturbations. It closes with a Demsar critical-difference diagram that asks whether the methods are separable on ARI at all.
Set-up
The five aggregation methods used below (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET) are wrapped from pymcdm. beam normalizes by metric card, then calls pymcdm on the normalized matrix and keeps the higher-is-better convention. The weighting schemes are beam’s own, since pymcdm’s reject the zeros beam’s normalization produces.
Load the tensor
load_duo2018 reads the bundled CSV into a frozen dataclass. The scores are a method by dataset by metric tensor with numpy.nan in the cells that were the literal string NA in the source. The loader does not impute or drop anything.
duo = load_duo2018()
print("scores shape:", duo.scores.shape)
print("methods:", ", ".join(duo.method_names))
print("datasets:", ", ".join(duo.dataset_names))
print("metrics:", ", ".join(duo.metric_ids))scores shape: (14, 12, 4)
methods: CIDR, FlowSOM, monocle, PCAHC, PCAKmeans, pcaReduce, RaceID2, RtsneKmeans, SAFE, SC3, SC3svm, Seurat, TSCAN, ascend
datasets: Koh, KohTCC, Kumar, KumarTCC, SimKumar4easy, SimKumar4hard, SimKumar8hard, Trapnell, TrapnellTCC, Zhengmix4eq, Zhengmix4uneq, Zhengmix8eq
metrics: ari, runtime, shannon_entropy_diff, nclust_deviation
Missing cells
Four metrics are recorded. ARI, runtime and Shannon entropy difference each have five missing cells; cluster-count deviation has 101 of 168 cells missing because several methods do not report a fixed cluster count for every dataset. With more than half its cells empty, cluster-count deviation cannot be pooled across datasets without an imputation choice that would drive the result, so this analysis uses the three well-populated metrics: ARI, runtime, and Shannon entropy difference.
for metric_id in duo.metric_ids:
n_missing = int(np.isnan(duo.tensor((metric_id,))[:, :, 0]).sum())
print(f"{metric_id:22s} missing cells: {n_missing:3d} / {14 * 12}")
analysis_metrics = ["ari", "runtime", "shannon_entropy_diff"]ari missing cells: 5 / 168
runtime missing cells: 5 / 168
shannon_entropy_diff missing cells: 5 / 168
nclust_deviation missing cells: 101 / 168
Pull the metric cards
properties_for looks up the card for each metric and exposes the fields the pipeline consumes: polarity, scale type, declared range, and the recommended cross-dataset aggregation. ARI is higher is better; runtime and Shannon entropy difference are lower is better. Runtime is a ratio metric that spans orders of magnitude, so its card recommends the geometric mean across datasets (Smith 1988). ARI and Shannon entropy difference take the arithmetic mean.
props = properties_for(analysis_metrics)
polarity = [p.polarity for p in props]
for p in props:
print(
f"{p.id:22s} polarity={p.polarity:16s} scale={p.scale_type:8s} "
f"range=({p.range_lower}, {p.range_upper}) "
f"agg_across_datasets={p.recommended_aggregation_across_datasets}"
)ari polarity=higher_is_better scale=interval range=(-1, 1) agg_across_datasets=arithmetic_mean
runtime polarity=lower_is_better scale=ratio range=(0, None) agg_across_datasets=geometric_mean
shannon_entropy_diff polarity=lower_is_better scale=ratio range=(0, 1) agg_across_datasets=arithmetic_mean
Pool across the twelve datasets
The MCDA pipeline works on a method by metric matrix, so the twelve datasets have to be pooled into one value per metric first. For each metric this code applies the rule the card declares, using a NaN-aware reduction so the five missing cells do not skew a method’s pooled score. ARI and Shannon entropy difference use a NaN-aware arithmetic mean; runtime uses a NaN-aware geometric mean (the mean of the logs, then exponentiated, over the observed datasets only).
The NaN handling lives here in the example, not in the core pipeline. beam.mcda.aggregate_across_datasets is not NaN-aware by design; pushing coverage-aware pooling into the core normalization, weighting and aggregation is future work tied to the heterogeneity module. Doing it at the example level keeps the choice explicit: a method’s pooled score is the average over the datasets where it was actually measured. The five gaps fall on SAFE (two datasets) and ascend (three datasets); every other method is observed on all twelve.
def pool_metric(metric_id, rule):
matrix = duo.tensor((metric_id,))[:, :, 0]
if rule == "arithmetic_mean":
return np.nanmean(matrix, axis=1)
if rule == "geometric_mean":
return np.exp(np.nanmean(np.log(matrix), axis=1))
raise ValueError(f"unsupported pooling rule {rule!r}")
pooled = np.column_stack(
[pool_metric(p.id, p.recommended_aggregation_across_datasets) for p in props]
)
print("pooled matrix shape:", pooled.shape)
print("pooled matrix all finite:", bool(np.isfinite(pooled).all()))pooled matrix shape: (14, 3)
pooled matrix all finite: True
One MCDA run, ontology-aware
run_from_registry pulls polarity, declared bounds and the per-metric normalization strategy from the cards, validates the requested aggregation against the declared scale types, and runs the full pipeline. The default here is equal weights with simple additive weighting (SAW).
result = run_from_registry(pooled, analysis_metrics, weights="equal", method="saw")
print(f"weighting={result.weighting} method={result.method}")
print(f"{'method':12s} composite rank")
order = np.argsort(result.ranks)
for i in order:
print(f"{duo.method_names[i]:12s} {result.composite[i]:.3f} {result.ranks[i]}")weighting=equal method=saw
method composite rank
Seurat 0.947 1
PCAKmeans 0.858 2
PCAHC 0.857 3
CIDR 0.832 4
monocle 0.828 5
RtsneKmeans 0.814 6
TSCAN 0.778 7
ascend 0.753 8
FlowSOM 0.725 9
SC3svm 0.672 10
SC3 0.632 11
pcaReduce 0.599 12
RaceID2 0.581 13
SAFE 0.549 14
Compare across weightings and aggregation methods
Four weightings (equal, entropy, std, critic) crossed with four aggregation methods (saw, topsis, vikor, promethee_ii) give sixteen runs. A method whose rank stays the same down a column is stable to these choices; a method whose rank moves is sensitive to them.
weightings = ["equal", "entropy", "std", "critic"]
methods = ["saw", "topsis", "vikor", "promethee_ii"]
combos = [(w, m) for w in weightings for m in methods]
ranks_grid = np.zeros((len(combos), len(duo.method_names)), dtype=int)
row_labels = []
for i, (w, m) in enumerate(combos):
r = run_from_registry(pooled, analysis_metrics, weights=w, method=m)
ranks_grid[i] = r.ranks
row_labels.append(f"{w} / {m}")The ranks as a heatmap, rank 1 (best) in green and rank 14 (worst) in red. A column of one colour means that method holds its rank across every configuration.
n_methods = len(duo.method_names)
fig, ax = plt.subplots(figsize=(9.0, 6.0))
im = ax.imshow(ranks_grid, cmap="RdYlGn_r", aspect="auto", vmin=1, vmax=n_methods)
ax.set_xticks(range(n_methods))
ax.set_xticklabels(duo.method_names, rotation=90)
ax.set_yticks(range(len(row_labels)))
ax.set_yticklabels(row_labels)
ax.set_xlabel("clustering method")
ax.set_ylabel("weighting / aggregation method")
ax.set_title("Rank by configuration (ARI higher better, runtime and entropy diff lower better)")
for i in range(ranks_grid.shape[0]):
for j in range(ranks_grid.shape[1]):
ax.text(j, i, ranks_grid[i, j], ha="center", va="center", fontsize=7, color="black")
cbar = fig.colorbar(im, ax=ax, ticks=[1, n_methods])
cbar.set_label("rank (1 = best)")
fig.tight_layout()
plt.show()top_per_combo = {duo.method_names[int(np.argmin(row))] for row in ranks_grid}
print("methods that are top-ranked in at least one of the 16 configurations:")
print(" " + ", ".join(sorted(top_per_combo)))methods that are top-ranked in at least one of the 16 configurations:
Seurat
Critical-difference diagram on ARI
The MCDA composite gives one ranking, but it does not say whether the methods are separable. The Demsar (2006) Friedman test ranks the methods on each dataset and asks whether the average ranks differ more than chance. The Nemenyi post-hoc gives the critical difference: two methods whose average ranks differ by less than it are not distinguishable from this data.
This runs on ARI alone, on the block of methods observed on all twelve datasets. SAFE and ascend have missing ARI cells, so they are dropped here to keep the block complete; the remaining twelve methods are observed on all twelve datasets.
ari = duo.tensor(("ari",))[:, :, 0]
complete = duo.complete_methods(("ari",))
block_idx = [i for i in range(n_methods) if complete[i]]
block = ari[block_idx, :]
block_names = [duo.method_names[i] for i in block_idx]
print(f"ARI block: {block.shape[0]} methods by {block.shape[1]} datasets, all observed: "
f"{bool(np.isfinite(block).all())}")
print("dropped (incomplete ARI):",
", ".join(duo.method_names[i] for i in range(n_methods) if not complete[i]))
cd = critical_difference(block, higher_is_better=True, tool_names=tuple(block_names))
print(f"\nFriedman statistic={cd.friedman_statistic:.2f} p-value={cd.friedman_pvalue:.2e}")
print(f"Nemenyi critical difference (alpha={cd.alpha}): {cd.critical_difference:.2f}")
print("average ranks (1 = best on ARI):")
for i in cd.order:
print(f" {block_names[i]:12s} {cd.average_ranks[i]:.2f}")ARI block: 12 methods by 12 datasets, all observed: True
dropped (incomplete ARI): SAFE, ascend
Friedman statistic=47.36 p-value=1.86e-06
Nemenyi critical difference (alpha=0.05): 4.81
average ranks (1 = best on ARI):
SC3 3.17
RtsneKmeans 4.79
Seurat 4.79
SC3svm 4.96
pcaReduce 5.46
PCAHC 5.83
monocle 6.79
PCAKmeans 7.21
TSCAN 7.50
CIDR 8.67
FlowSOM 9.21
RaceID2 9.62
The diagram draws each method at its average ARI rank, with rank 1 (best, highest ARI) on the left. The bars join methods whose average ranks lie within the critical difference, so the methods a bar spans are not separable on ARI at this sample size.
ranks_sorted = cd.average_ranks[cd.order]
names_sorted = [block_names[i] for i in cd.order]
fig, ax = plt.subplots(figsize=(9.0, 3.2))
y = 0.0
ax.scatter(ranks_sorted, np.full_like(ranks_sorted, y), color="black", zorder=3)
for x, name in zip(ranks_sorted, names_sorted):
ax.annotate(name, (x, y), xytext=(0, 8), textcoords="offset points",
ha="center", rotation=90, fontsize=8)
clique_y = -0.15
for k, clique in enumerate(cd.cliques):
positions = [float(cd.average_ranks[i]) for i in clique]
ax.plot([min(positions), max(positions)], [clique_y - 0.06 * k] * 2,
color="tab:blue", linewidth=3, solid_capstyle="round")
ax.set_xlabel("average rank on ARI (1 = best, higher ARI; ground truth direction marked)")
ax.set_yticks([])
ax.set_ylim(clique_y - 0.06 * len(cd.cliques) - 0.1, 0.6)
ax.set_title(f"Friedman p={cd.friedman_pvalue:.1e}, Nemenyi CD={cd.critical_difference:.2f}")
ax.invert_xaxis()
fig.tight_layout()
plt.show()SMAA confidence
smaa samples weight vectors from a Dirichlet over the three-metric simplex, runs the pipeline once per sample, and reports the confidence factor per method: the share of samples in which that method is top-ranked. A method with a confidence factor near 1 is top-ranked under almost any weighting.
smaa_report = smaa(pooled, polarity, n_samples=2000, method="topsis", seed=0)
print("confidence factor (share of weight samples ranked first), nonzero only:")
for i in np.argsort(-smaa_report.confidence_factor):
if smaa_report.confidence_factor[i] > 0:
print(f" {duo.method_names[i]:12s} {smaa_report.confidence_factor[i]:.3f}")confidence factor (share of weight samples ranked first), nonzero only:
Seurat 0.992
SC3 0.008
fig, ax = plt.subplots(figsize=(8.0, 3.5))
ax.bar(range(n_methods), smaa_report.confidence_factor, color="tab:green")
ax.set_xticks(range(n_methods))
ax.set_xticklabels(duo.method_names, rotation=90)
ax.set_xlabel("clustering method")
ax.set_ylabel("SMAA confidence factor (share top-ranked)")
ax.set_ylim(0, 1)
ax.set_title("SMAA over 2000 Dirichlet weight draws (method=topsis)")
fig.tight_layout()
plt.show()Smallest weight perturbation
smallest_weight_perturbation reports, under SAW, the smallest single-weight change that swaps any ranked-above pair. It also flags whether the top-ranked method is fragile, that is, whether some single weight can be moved by less than the fragility threshold (default 0.05) to push it off the top rank.
ts = smallest_weight_perturbation(pooled, polarity, weights="equal", method="saw")
top_method = duo.method_names[int(np.argmin(ts.base.ranks))]
print(f"top-ranked under equal weights / SAW: {top_method}")
print(f"top rank is fragile: {ts.top_rank_is_fragile}")
if ts.most_fragile_pair is not None:
p = ts.most_fragile_pair
print(
f"most fragile ordered pair: {duo.method_names[p.higher]} over "
f"{duo.method_names[p.lower]}, flips by changing the weight on "
f"{analysis_metrics[p.criterion]!r} by {p.delta:+.3f}"
)top-ranked under equal weights / SAW: Seurat
top rank is fragile: False
most fragile ordered pair: TSCAN over SC3svm, flips by changing the weight on 'runtime' by -0.001
End to end with beam.rank and beam.report
The sections above call the pipeline primitives one by one. The same analysis runs in three lines through the top-level API. beam.rank takes the method by dataset by metric tensor, pools it across datasets per each card’s rule (nan-aware), runs the MCDA pipeline, and runs the default sensitivity analysis on the same normalization context: SMAA, leave-one-metric-out, the smallest-weight perturbation, and, because the input is a tensor, leave-one-dataset-out.
scores = beam.Scores(
values=duo.tensor(tuple(analysis_metrics)),
tool_names=tuple(duo.method_names),
metric_ids=tuple(analysis_metrics),
dataset_names=duo.dataset_names,
layout="long",
)
run = beam.rank(scores, weights="equal", method="saw", seed=0)
print("top-ranked under equal weights / SAW:", run.top_tool)
print("matrix ranked (methods by metrics):", run.matrix.shape)top-ranked under equal weights / SAW: Seurat
matrix ranked (methods by metrics): (14, 3)
Leave-one-dataset-out asks how much the ranking leans on any single dataset. For each dataset dropped, the remaining eleven are pooled the same way, re-ranked, and compared to the base ranking. The per-method stability is the share of those runs in which the method keeps its base rank.
lodo = run.leave_one_dataset_out
print(f"datasets evaluated: {len(lodo.evaluated_datasets)} of {duo.scores.shape[1]}")
base_order = np.argsort(run.result.ranks)
print(f"{'method':12s} base rank rank held across leave-one-dataset-out runs")
for i in base_order:
print(f"{duo.method_names[i]:12s} {run.result.ranks[i]:>4d} {lodo.rank_stability[i] * 100:5.0f}%")
influential = lodo.dataset_names[lodo.most_influential_dataset]
print(f"\nmost influential dataset: {influential} (largest rank shift {lodo.max_rank_shift})")datasets evaluated: 12 of 12
method base rank rank held across leave-one-dataset-out runs
Seurat 1 100%
PCAKmeans 2 67%
PCAHC 3 67%
CIDR 4 50%
monocle 5 50%
RtsneKmeans 6 100%
TSCAN 7 100%
ascend 8 100%
FlowSOM 9 100%
SC3svm 10 100%
SC3 11 100%
pcaReduce 12 75%
RaceID2 13 75%
SAFE 14 100%
most influential dataset: KohTCC (largest rank shift 1)
n_eval = len(lodo.evaluated_datasets)
order = base_order
fig, ax = plt.subplots(figsize=(8.0, 4.0))
ax.barh(range(n_methods), lodo.rank_stability[order] * 100, color="tab:orange")
ax.set_yticks(range(n_methods))
ax.set_yticklabels([duo.method_names[i] for i in order])
ax.invert_yaxis()
ax.set_xlabel(f"rank held across {n_eval} leave-one-dataset-out runs (percent)")
ax.set_ylabel("clustering method (ordered by base rank)")
ax.set_xlim(0, 100)
ax.set_title("Leave-one-dataset-out rank stability (equal weights / SAW)")
fig.tight_layout()
plt.show()The funky heatmap shows the same run as a glyph table, with two robustness panels: the span of ranks each method takes across the leave-one-dataset-out runs, and the span across the five aggregations (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET) holding the weighting fixed. The brackets on the left group the methods the Friedman-Nemenyi test cannot separate on ARI. The two span panels differ: dropping a dataset barely moves the order, and Seurat holds rank one, but the choice of aggregation moves it considerably. On this benchmark the order depends more on how the metrics are combined than on which datasets are pooled.
from beam.mcda import critical_difference
from beam.reporting import funky_heatmap_from_run
ari_for_cd = duo.tensor(("ari",))[:, :, 0]
complete = ~np.isnan(ari_for_cd).any(axis=0)
cd = critical_difference(ari_for_cd[:, complete], "higher_is_better", tool_names=duo.method_names)
cliques = tuple(tuple(duo.method_names[i] for i in clique) for clique in cd.cliques)
funky_heatmap_from_run(
run,
cliques=cliques,
show_smaa=False,
show_aggregation_consensus=True,
title="Duo 2018: scores and rank robustness",
)beam.report writes the whole run to one self-contained HTML file: the ranking, the normalization diagnostics, every sensitivity output including the leave-one-dataset-out figure above, the critical-difference diagram, and the recommendation paragraph. ground_truth_tool outlines the method documented to rank first on the ranking figure.
beam.report(run, "duo2018_report.html", ground_truth_tool="Seurat")
print("wrote duo2018_report.html")wrote duo2018_report.html
Heterogeneity: where does the ARI variance live
Leave-one-dataset-out asks whether the pooled ranking hangs on any single dataset. A mixed-effects model asks a different question: how much of the score variation is a stable method effect and how much is the method-by-dataset interaction that a single ranking hides. beam.heterogeneity.mixed_effects fits score ~ method + (1 | dataset) on one metric in R’s lme4, with the method as a fixed effect and the dataset as a random intercept that absorbs how easy or hard each dataset is for every method alike. The fit needs the R toolchain, so the section below runs only when it is available.
from beam.heterogeneity import mixed_effects_from_matrix, r_available
ari = duo.tensor(("ari",))[:, :, 0]
if r_available():
me = mixed_effects_from_matrix(ari, duo.method_names, duo.dataset_names)
print(f"model: {me.formula}")
print(f"dataset shift (ICC): {me.icc_dataset:.2f} of the ARI variance")
print(f"residual share: {me.residual_share:.2f} (interaction confounded with noise at one run per cell)")
order = np.argsort(-me.method_effects)
print("\nmethod ARI marginal mean")
for i in order[:3]:
print(f"{me.method_names[i]:14s} {me.method_effects[i]:.3f} +/- {me.method_effect_se[i]:.3f}")
print("\nlargest interaction residuals (method, dataset, residual):")
for m, ds, res in me.top_outliers(3):
print(f" {m:10s} {ds:16s} {res:+.2f}")
else:
me = None
print("R with lme4 not available; skipping the mixed-effects fit.")
print("Provision it with envs/heterogeneity.yml (conda or mamba).")R with lme4 not available; skipping the mixed-effects fit.
Provision it with envs/heterogeneity.yml (conda or mamba).
if me is not None:
order = np.argsort(me.method_effects)
names = [me.method_names[i] for i in order]
means = me.method_effects[order]
ses = me.method_effect_se[order]
fig, ax = plt.subplots(figsize=(8.0, 4.0))
ax.errorbar(means, range(len(names)), xerr=ses, fmt="o", color="tab:blue", capsize=3)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_xlabel("ARI marginal mean over datasets (higher ranks first)")
ax.set_ylabel("clustering method (ordered by marginal mean)")
ax.set_title("Mixed-effects method effects on ARI, with standard errors")
fig.tight_layout()
plt.show()The dataset intercept takes about 0.71 of the ARI variance, so most of the spread is datasets being uniformly easy or hard rather than methods reordering. SC3 and Seurat lead the marginal means and are not separable from each other given the standard errors, which matches the critical-difference reading above. The residual share, the upper bound on the method-by-dataset interaction at one run per cell, concentrates on the two weakest methods (RaceID2 and FlowSOM) on a few simulated and four-group datasets, not on a reshuffling of the strong methods. See the mixed-effects explanation.
Heterogeneity: does a dataset feature reverse the ranking
The variance decomposition says how much interaction there is. The Bradley-Terry tree asks which dataset feature drives it. For one metric, each dataset becomes a set of pairwise method comparisons (the higher ARI wins), and psychotree::bttree splits the datasets by their features, so each leaf gets its own Bradley-Terry ranking and a parameter-stability test decides where a split is real. The dataset features (number of cells, number of true clusters, real vs simulated, family) ship next to the benchmark. The fit needs the R toolchain, so the section runs only when psychotree is available.
from beam.datasets import load_duo2018_features
from beam.heterogeneity import bradley_terry_tree, bttree_available
if bttree_available():
features = load_duo2018_features()
numeric, categorical = features.aligned_to(duo.dataset_names)
bt = bradley_terry_tree(
ari,
duo.method_names,
duo.dataset_names,
numeric_features=numeric,
categorical_features=categorical,
polarity="higher_is_better",
minsize=4,
)
print(f"split found: {bt.did_split}")
print(f"global ranking (top 3): {', '.join(bt.global_ranking()[:3])}")
print(bt.summary())
else:
bt = None
print("R with psychotree not available; skipping the Bradley-Terry tree.")
print("Provision it with envs/heterogeneity.yml (conda or mamba).")R with psychotree not available; skipping the Bradley-Terry tree.
Provision it with envs/heterogeneity.yml (conda or mamba).
On these 12 datasets the parameter-stability test finds no split, so the tree reduces to a single Bradley-Terry ranking led by SC3, the same order the net pairwise wins and the marginal means give. A dozen datasets is too few for the test to separate a feature-dependent regime from sampling noise, which is also why the critical-difference diagram and the variance decomposition both report a stable global ranking here. The tree is more informative on a benchmark with many datasets carrying real feature variation; the OpenProblems spatially_variable_genes vignette is where a split appears. See the Bradley-Terry explanation.
Recommendation
Under the default pipeline (equal weights, SAW) on ARI, runtime and Shannon entropy difference pooled across the twelve datasets, Seurat ranks first, and it holds that position across all sixteen weighting-by-method configurations tested above. The SMAA analysis puts Seurat top-ranked in about 99 percent of random weight draws, and the weight-perturbation check finds its top rank is not fragile: no single weight change within the threshold dislodges it. Leave-one-dataset-out adds the dataset axis to that picture: Seurat keeps its first rank in all twelve runs that drop one dataset, so the recommendation does not hang on any single dataset. On the composite, RaceID2 and SAFE rank near the bottom.
The critical-difference diagram qualifies this. On ARI alone, the per-dataset ranks favour SC3 (best average ARI rank), with Seurat tied close behind, and the Friedman test is clearly significant (p well below 0.05), so the methods are not all equivalent. But the Nemenyi critical difference is large relative to the rank spread, so most pairs sit inside overlapping cliques and are not separable on ARI at twelve datasets. The composite ranks Seurat first because it also pools strong runtime and entropy-difference scores, not because it is the single best on ARI. The informative reading is that Seurat is a stable top choice across these three metrics and these weightings, while the gap between the upper-middle methods on ARI alone is within the noise the test can resolve.
This matches the expected narrative for Seurat (stable top choice) and for RaceID2 (a consistent low performer). It diverges on FlowSOM: here FlowSOM sits mid-pack on the composite, not among the lowest, so the expected “RaceID2 and FlowSOM are consistent low performers” holds for RaceID2 but not for FlowSOM on this metric set and pooling.