Duo 2018 clustering benchmark, walkthrough

Author

Izaskun Mallona

Published

July 9, 2026

Set-up

The five aggregation methods used below (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET) are wrapped from pymcdm. beam normalizes each metric according to its metric card, then calls pymcdm on the normalized matrix and keeps the higher-is-better convention. The weighting schemes are beam’s own, because pymcdm’s weighting functions reject the zeros that beam’s normalization produces.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from IPython.display import display

import beam
from beam.datasets import load_duo2018
from beam.cards import properties_for
from beam.mcda import (
    run_from_registry,
    critical_difference,
    smaa,
    smallest_weight_perturbation,
)

Load the tensor

load_duo2018 reads the bundled CSV into a frozen dataclass. The scores are a method by dataset by metric tensor with numpy.nan in the cells that were the literal string NA in the source. The loader does not impute or drop anything.

duo = load_duo2018()
print("scores shape:", duo.scores.shape)
print("methods:", ", ".join(duo.method_names))
print("datasets:", ", ".join(duo.dataset_names))
print("metrics:", ", ".join(duo.metric_ids))

scores shape: (14, 12, 4)
methods: CIDR, FlowSOM, monocle, PCAHC, PCAKmeans, pcaReduce, RaceID2, RtsneKmeans, SAFE, SC3, SC3svm, Seurat, TSCAN, ascend
datasets: Koh, KohTCC, Kumar, KumarTCC, SimKumar4easy, SimKumar4hard, SimKumar8hard, Trapnell, TrapnellTCC, Zhengmix4eq, Zhengmix4uneq, Zhengmix8eq
metrics: ari, runtime, shannon_entropy_diff, nclust_deviation

Missing cells

Four metrics are recorded. ARI, runtime and Shannon entropy difference each have five missing cells; cluster-count deviation has 101 of 168 cells missing because several methods do not report a fixed cluster count for every dataset. With more than half its cells empty, cluster-count deviation cannot be pooled across datasets without an imputation choice that would drive the result, so this analysis uses the three well-populated metrics: ARI, runtime, and Shannon entropy difference.

for metric_id in duo.metric_ids:
    matrix = duo.tensor((metric_id,))[:, :, 0]  # methods by datasets for this metric
    n_missing = int(np.isnan(matrix).sum())
    print(f"{metric_id:22s} missing cells: {n_missing:3d} / {matrix.size}")

analysis_metrics = ["ari", "runtime", "shannon_entropy_diff"]

ari                    missing cells:   5 / 168
runtime                missing cells:   5 / 168
shannon_entropy_diff   missing cells:   5 / 168
nclust_deviation       missing cells: 101 / 168

Pull the metric cards

properties_for looks up the card for each metric and exposes the fields the pipeline consumes: polarity, scale type, declared range, and the recommended cross-dataset aggregation. ARI is higher is better; runtime and Shannon entropy difference are lower is better. Runtime is a ratio metric that spans orders of magnitude, so its card recommends the geometric mean across datasets (Smith 1988). ARI and Shannon entropy difference take the arithmetic mean.

props = properties_for(analysis_metrics)
polarity = [p.polarity for p in props]
for p in props:
    print(
        f"{p.id:22s}  polarity={p.polarity:16s}  scale={p.scale_type:8s}  "
        f"range=({p.range_lower}, {p.range_upper})  "
        f"agg_across_datasets={p.recommended_aggregation_across_datasets}"
    )

ari                     polarity=higher_is_better  scale=interval  range=(-1, 1)  agg_across_datasets=arithmetic_mean
runtime                 polarity=lower_is_better   scale=ratio     range=(0, None)  agg_across_datasets=geometric_mean
shannon_entropy_diff    polarity=lower_is_better   scale=ratio     range=(0, 1)  agg_across_datasets=arithmetic_mean

Pool across the twelve datasets

beam ranks on one value per method and metric, so the twelve datasets are pooled first with NaN-aware means: the arithmetic mean for ARI and Shannon entropy difference, the geometric mean for runtime.

A method’s pooled score is the average over the datasets where it was measured. The five gaps fall on SAFE (two datasets) and ascend (three datasets); every other method is observed on all twelve.

def pool_metric(metric_id, rule):
    matrix = duo.tensor((metric_id,))[:, :, 0]
    if rule == "arithmetic_mean":
        return np.nanmean(matrix, axis=1)
    if rule == "geometric_mean":
        return np.exp(np.nanmean(np.log(matrix), axis=1))
    raise ValueError(f"unsupported pooling rule {rule!r}")

pooled = np.column_stack(
    [pool_metric(p.id, p.recommended_aggregation_across_datasets) for p in props]
)
print("pooled matrix shape:", pooled.shape)
print("pooled matrix all finite:", bool(np.isfinite(pooled).all()))

pooled matrix shape: (14, 3)
pooled matrix all finite: True
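
The difference between the two pooling rules is easiest to see on a toy case. The sketch below uses made-up runtimes (not values from the benchmark) and the same NaN-aware expressions as pool_metric above: the arithmetic mean is dominated by the slowest dataset, while the geometric mean averages the orders of magnitude and treats the two hypothetical methods as equal.

```python
import numpy as np

# Hypothetical runtimes in seconds: two methods by three datasets.
# NaN marks a dataset on which the method was not measured.
runtime = np.array([
    [1.0, 10.0, 100.0],
    [2.0, np.nan, 50.0],
])

# Arithmetic mean: dominated by the slowest dataset.
arithmetic = np.nanmean(runtime, axis=1)
# Geometric mean: average in log space, then transform back.
geometric = np.exp(np.nanmean(np.log(runtime), axis=1))

print("arithmetic mean:", arithmetic)
print("geometric mean: ", geometric)
```

Both rules skip the missing cell, so the second method is averaged over two datasets and the first over three.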

card_data_consistency checks the pooled scores against what the cards declare: that every value lies inside the declared range and that the chance baseline and noise floor are plausible. It reads the raw scores, before normalization, so a card or data error (a value outside the declared range, a metric reported on the wrong scale) is named here.

from beam.mcda import card_data_consistency, registry_context

ctx = registry_context(analysis_metrics, "saw")
audit = card_data_consistency(
    pooled, ctx.polarity, ctx.bounds,
    baselines=ctx.baselines, targets=ctx.targets, noise_floors=ctx.noise_floors,
    metric_ids=analysis_metrics,
)
print("pooled scores consistent with the cards:", audit.ok)
for finding in audit.findings:
    print(" ", finding.severity, finding.message)

pooled scores consistent with the cards: True

One MCDA run, ontology-aware

run_from_registry pulls polarity, declared bounds and the per-metric normalization strategy from the cards, validates the requested aggregation against the declared scale types, and runs the full pipeline. The default here is equal weights with simple additive weighting (SAW).

result = run_from_registry(pooled, analysis_metrics, weights="equal", method="saw")

print(f"weighting={result.weighting}  method={result.method}")
print(f"{'method':12s}  composite  rank")
order = np.argsort(result.ranks)
for i in order:
    print(f"{duo.method_names[i]:12s}  {result.composite[i]:.3f}      {result.ranks[i]}")

weighting=equal  method=saw
method        composite  rank
Seurat        0.947      1
PCAKmeans     0.858      2
PCAHC         0.857      3
CIDR          0.832      4
monocle       0.828      5
RtsneKmeans   0.814      6
TSCAN         0.778      7
ascend        0.753      8
FlowSOM       0.725      9
SC3svm        0.672      10
SC3           0.632      11
pcaReduce     0.599      12
RaceID2       0.581      13
SAFE          0.549      14
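
For readers new to SAW, the idea can be sketched in a few lines. This is a generic illustration, not beam's pipeline: it assumes plain min-max normalization on the observed column range, whereas beam takes bounds and the normalization strategy from the metric cards. The function name saw_rank and the toy values are hypothetical.

```python
import numpy as np

def saw_rank(matrix, higher_is_better, weights=None):
    """Min-max normalize each column to [0, 1] with 1 = best, then take the weighted sum."""
    x = np.asarray(matrix, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)      # avoid dividing by zero on constant columns
    norm = (x - lo) / span
    flip = ~np.asarray(higher_is_better, dtype=bool)
    norm[:, flip] = 1.0 - norm[:, flip]         # lower-is-better columns are inverted
    if weights is None:
        weights = np.full(x.shape[1], 1.0 / x.shape[1])   # equal weights
    composite = norm @ np.asarray(weights, dtype=float)
    # Rank 1 goes to the highest composite; ties are broken by row order.
    ranks = np.empty(len(composite), dtype=int)
    ranks[np.argsort(-composite, kind="stable")] = np.arange(1, len(composite) + 1)
    return composite, ranks

# Hypothetical scores: column 0 is accuracy-like, column 1 is runtime-like.
toy = [[0.9, 10.0], [0.5, 1.0], [0.7, 100.0]]
composite, ranks = saw_rank(toy, higher_is_better=[True, False])
print(composite.round(3), ranks)
```

The first toy method ranks first because it is best on the first column and close to best on the second; the third is penalized for the slow runtime.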

Compare across weightings and aggregation methods

Four weightings (equal, entropy, std, critic) crossed with four aggregation methods (saw, topsis, vikor, promethee_ii) give sixteen runs. A method whose rank stays the same down a column is stable to these choices.

weightings = ["equal", "entropy", "std", "critic"]
methods = ["saw", "topsis", "vikor", "promethee_ii"]

combos = [(w, m) for w in weightings for m in methods]
ranks_grid = np.zeros((len(combos), len(duo.method_names)), dtype=int)
row_labels = []
for i, (w, m) in enumerate(combos):
    r = run_from_registry(pooled, analysis_metrics, weights=w, method=m)
    ranks_grid[i] = r.ranks
    row_labels.append(f"{w} / {m}")
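
As an illustration of a data-driven weighting, here is a minimal sketch of the textbook entropy weighting on a normalized matrix. It is an assumption that beam's "entropy" scheme follows this formula; the sketch only shows one way to handle the zero cells mentioned in the set-up, by treating 0 · log 0 as 0. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def entropy_weights(norm):
    """Textbook entropy weights for a non-negative matrix (rows = methods, columns = metrics)."""
    x = np.asarray(norm, dtype=float)
    col_sum = x.sum(axis=0)
    p = x / np.where(col_sum > 0, col_sum, 1.0)   # column-wise shares
    safe_p = np.where(p > 0, p, 1.0)              # log(1) = 0, so zero cells contribute nothing
    entropy = -(p * np.log(safe_p)).sum(axis=0) / np.log(x.shape[0])
    divergence = np.clip(1.0 - entropy, 0.0, None)
    if divergence.sum() == 0:                     # every column uniform: fall back to equal weights
        return np.full(x.shape[1], 1.0 / x.shape[1])
    return divergence / divergence.sum()

# Column 0 does not discriminate between methods; column 1 separates one method from the rest.
toy_norm = np.array([[0.5, 0.0], [0.5, 0.0], [0.5, 1.0]])
weights = entropy_weights(toy_norm)
print(weights.round(3))
```

A metric on which all methods score alike carries no information and receives a weight near zero; the weight moves to the metric that separates the methods.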

The ranks as a heatmap, rank 1 (best) in green and rank 14 (worst) in red. A column of one colour means that method holds its rank across every configuration.

from beam import plot

n_methods = len(duo.method_names)
plot.rank_heatmap(
    ranks_grid,
    row_names=row_labels,
    col_names=duo.method_names,
    row_label="weighting / aggregation method",
    col_label="clustering method",
    title="Rank by configuration",
)

top_per_combo = {duo.method_names[int(np.argmin(row))] for row in ranks_grid}
print("methods that are top-ranked in at least one of the 16 configurations:")
print("  " + ", ".join(sorted(top_per_combo)))

methods that are top-ranked in at least one of the 16 configurations:
  Seurat

Critical-difference diagram on ARI

The MCDA composite gives one ranking, but it does not say whether the methods are separable. The Friedman test, used as Demšar (2006) recommends, ranks the methods on each dataset and asks whether the average ranks differ more than chance would produce. The Nemenyi post-hoc test gives the critical difference: two methods whose average ranks differ by less than it are not distinguishable from this data.

This runs only on ARI, on the methods observed on all twelve datasets. SAFE and ascend have missing ARI cells, so they are dropped; the other twelve methods are complete.

ari = duo.tensor(("ari",))[:, :, 0]
complete = duo.complete_methods(("ari",))
block_idx = [i for i in range(n_methods) if complete[i]]
block = ari[block_idx, :]
block_names = [duo.method_names[i] for i in block_idx]
print(f"ARI block: {block.shape[0]} methods by {block.shape[1]} datasets, all observed: "
      f"{bool(np.isfinite(block).all())}")
print("dropped (incomplete ARI):",
      ", ".join(duo.method_names[i] for i in range(n_methods) if not complete[i]))

cd = critical_difference(block, higher_is_better=True, tool_names=tuple(block_names))
print(f"\nFriedman statistic={cd.friedman_statistic:.2f}  p-value={cd.friedman_pvalue:.2e}")
print(f"Nemenyi critical difference (alpha={cd.alpha}): {cd.critical_difference:.2f}")
print("average ranks (1 = best on ARI):")
for i in cd.order:
    print(f"  {block_names[i]:12s}  {cd.average_ranks[i]:.2f}")

ARI block: 12 methods by 12 datasets, all observed: True
dropped (incomplete ARI): SAFE, ascend

Friedman statistic=47.36  p-value=1.86e-06
Nemenyi critical difference (alpha=0.05): 4.81
average ranks (1 = best on ARI):
  SC3           3.17
  RtsneKmeans   4.79
  Seurat        4.79
  SC3svm        4.96
  pcaReduce     5.46
  PCAHC         5.83
  monocle       6.79
  PCAKmeans     7.21
  TSCAN         7.50
  CIDR          8.67
  FlowSOM       9.21
  RaceID2       9.62

The diagram draws each method at its average ARI rank, with rank 1 (best, highest ARI) on the left. The bars join methods whose average ranks lie within the critical difference, so the methods a bar spans are not separable on ARI at this sample size.

from beam import plot

plot.critical_difference(cd)

Pairwise superiority on ARI

The critical-difference diagram says which methods are separable, but not by how much or how often one method outperforms another. pairwise_superiority compares the methods two at a time across the datasets, with the ARI noise floor (0.01) as the equivalence band: a method counts as outperforming another on a dataset only when the difference clears that floor. The probability of superiority is the fraction of datasets on which one method outperforms the other, an effect size that does not depend on which other methods are in the comparison.

from beam.mcda import pairwise_superiority

ari_floor = properties_for(["ari"])[0].noise_floor or 0.0
sup = pairwise_superiority(ari, "higher_is_better", rope=ari_floor, method_names=duo.method_names)
top_method = duo.method_names[sup.order[0]]
print(f"highest standing: {top_method} ({sup.standing[sup.order[0]]:.2f})")
print(f"method pairs the sign test cannot separate: {len(sup.equivalent_pairs)} of {len(sup.per_pair)}")

highest standing: SC3 (0.80)
method pairs the sign test cannot separate: 60 of 91

Most pairs are not distinguishable: across the twelve datasets the ARI differences are often inside the 0.01 floor, so neither method outperforms the other decisively. Beyond the method with the highest standing, the rest are largely interchangeable on ARI at this dataset count.
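
The counting behind a probability of superiority with an equivalence band can be sketched as follows. This is a generic illustration with made-up ARI values; how beam treats ties or missing cells internally is not shown in this walkthrough, so the choices below (ties reported separately, datasets with a missing value skipped) are assumptions.

```python
import numpy as np

def prob_superiority(a, b, rope):
    """Share of datasets where a beats b, where they tie within the band, and where b beats a."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    both = ~(np.isnan(a) | np.isnan(b))       # use only datasets where both were measured
    diff = a[both] - b[both]
    n = diff.size
    wins = int((diff > rope).sum())
    losses = int((diff < -rope).sum())
    ties = n - wins - losses                  # differences inside the equivalence band
    return wins / n, ties / n, losses / n

# Hypothetical ARI values for two methods on four datasets.
method_a = [0.90, 0.805, 0.60, np.nan]
method_b = [0.80, 0.800, 0.70, 0.50]
p_win, p_tie, p_loss = prob_superiority(method_a, method_b, rope=0.01)
print(p_win, p_tie, p_loss)
```

The second dataset differs by 0.005, which is inside the 0.01 band, so it counts for neither method.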

Pairwise transitivity on ARI

The ranking is one order of the methods, but the pairwise majorities behind it need not agree with any single order. pairwise_transitivity reads the same pairwise relation and asks whether one order is consistent with it: it reports the method preferred to every other one, the circular triads (a method outperforms a second, the second a third, the third the first), and whether the relation is transitive.

from beam.mcda import pairwise_transitivity

trans = pairwise_transitivity(sup)
choice = duo.method_names[trans.condorcet_choice] if trans.condorcet_choice is not None else "none"
print(f"method preferred to every other one: {choice}")
print(f"transitive: {trans.is_transitive}; circular triads: {trans.n_circular_triads} of {trans.n_triads}")

method preferred to every other one: SC3
transitive: False; circular triads: 1 of 364

SC3 is preferred to every other method one at a time, yet the relation still contains one circular triad among the methods below it, so no single order agrees with all the pairwise majorities. The matrix below orders the methods by how many others they outperform. A transitive relation fills the upper triangle and leaves the lower one empty; a red cell below the diagonal marks a method that outperforms one ranked above it, which can only happen inside a cycle.

plot.pairwise_majority(trans)
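
A Condorcet choice and a cycle can coexist, as the result above shows. The toy relation below reproduces that situation with four hypothetical methods; the counting function is a plain sketch, not beam's implementation.

```python
from itertools import combinations
from math import comb

def count_circular_triads(beats):
    """beats[i][j] is True when method i beats method j in the pairwise majority."""
    circular = 0
    for i, j, k in combinations(range(len(beats)), 3):
        forward = beats[i][j] and beats[j][k] and beats[k][i]
        backward = beats[j][i] and beats[k][j] and beats[i][k]
        if forward or backward:
            circular += 1
    return circular

def condorcet_choice(beats):
    """Index of the method that beats every other one, or None."""
    n = len(beats)
    for i in range(n):
        if all(beats[i][j] for j in range(n) if j != i):
            return i
    return None

# Method 0 beats everyone; methods 1, 2, 3 form a cycle 1 > 2 > 3 > 1.
T, F = True, False
toy_beats = [
    [F, T, T, T],
    [F, F, T, F],
    [F, F, F, T],
    [F, T, F, F],
]
n_cycles = count_circular_triads(toy_beats)
winner = condorcet_choice(toy_beats)
print(f"condorcet choice: {winner}; circular triads: {n_cycles} of {comb(4, 3)}")
```

With fourteen methods there are comb(14, 3) = 364 triads to check, which is the denominator printed above.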

Bayesian comparison on ARI

The tests above report a p-value: the probability of a split at least as extreme as the observed one if the two methods scored the same. bayesian_sign_comparison reads the same pairwise relation and reports, for each pair, the posterior probability that one method is practically better, that the two are practically equivalent within the floor, and that the other is practically better.

from beam.mcda import bayesian_sign_comparison

bayes = bayesian_sign_comparison(sup)
runner = duo.method_names[bayes.order[1]]
pair = next(p for p in bayes.per_pair if {p.a, p.b} == {bayes.order[0], bayes.order[1]})
p_top = pair.p_a_better if pair.a == bayes.order[0] else pair.p_b_better
print(f"P({top_method} practically better than {runner}): {p_top:.2f}")
print(f"P(the two are practically equivalent): {pair.p_equivalent:.2f}")

P(SC3 practically better than SC3svm): 0.07
P(the two are practically equivalent): 0.93

SC3 is practically better than the methods with the lowest standing with probability near 1. Against the method with the next-highest standing the posterior favors equivalence within the floor, and most pairs stay inconclusive at the 0.95 threshold. The heatmap shows the posterior probability that the row method scores higher than the column method.

plot.bayesian_comparison(bayes)
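
One common formulation of a Bayesian sign test places a Dirichlet posterior on the three outcome shares (first method better, equivalent within the floor, second method better) and reports how often each share is the largest. The sketch below follows that formulation with made-up counts; whether beam uses this exact prior or sampler is an assumption, and the prior strength and the small epsilon are illustrative choices.

```python
import numpy as np

def bayesian_sign(n_a, n_rope, n_b, prior_strength=1.0, n_samples=20000, seed=0):
    """Posterior probability that each outcome (a better, equivalent, b better) is the most likely."""
    rng = np.random.default_rng(seed)
    # Prior pseudo-count on the equivalence region; the epsilon keeps every alpha positive.
    alpha = np.array([n_a, n_rope + prior_strength, n_b], dtype=float) + 1e-6
    theta = rng.dirichlet(alpha, size=n_samples)     # samples of the three outcome shares
    largest = theta.argmax(axis=1)
    return tuple(np.bincount(largest, minlength=3) / n_samples)

# Hypothetical counts over twelve datasets: one win each way, ten differences inside the floor.
p_a, p_eq, p_b = bayesian_sign(n_a=1, n_rope=10, n_b=1)
print(f"P(a better)={p_a:.3f}  P(equivalent)={p_eq:.3f}  P(b better)={p_b:.3f}")
```

When most differences sit inside the floor, the posterior mass lands on equivalence, which is the pattern reported for the two methods with the highest standing.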

SMAA confidence

smaa samples weight vectors from a Dirichlet over the three-metric simplex, runs the pipeline once per sample, and reports the confidence factor per method: the share of samples in which that method is top-ranked. A method with a confidence factor near 1 is top-ranked under almost any weighting.

smaa_report = smaa(pooled, polarity, n_samples=2000, method="topsis", seed=0)
print("confidence factor (share of weight samples ranked first), nonzero only:")
for i in np.argsort(-smaa_report.confidence_factor):
    if smaa_report.confidence_factor[i] > 0:
        print(f"  {duo.method_names[i]:12s}  {smaa_report.confidence_factor[i]:.3f}")

confidence factor (share of weight samples ranked first), nonzero only:
  Seurat        0.992
  SC3           0.008

fig, ax = plt.subplots(figsize=(8.0, 3.5))
ax.bar(range(n_methods), smaa_report.confidence_factor, color="tab:green")
ax.set_xticks(range(n_methods))
ax.set_xticklabels(duo.method_names, rotation=90)
ax.set_xlabel("clustering method")
ax.set_ylabel("SMAA confidence factor (share top-ranked)")
ax.set_ylim(0, 1)
ax.set_title("SMAA over 2000 Dirichlet weight draws (method=topsis)")
fig.tight_layout()
plt.show()
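
The sampling loop behind this kind of analysis is short. The sketch below is a generic illustration on a made-up, already normalized matrix, using a weighted sum in place of TOPSIS to stay brief; it is not beam's smaa, and the function name is hypothetical.

```python
import numpy as np

def first_rank_share(norm, n_samples=2000, seed=0):
    """Share of random weight vectors under which each method has the highest weighted sum."""
    rng = np.random.default_rng(seed)
    x = np.asarray(norm, dtype=float)                      # methods by metrics, 1 = best
    weights = rng.dirichlet(np.ones(x.shape[1]), size=n_samples)   # uniform on the simplex
    composite = weights @ x.T                              # one row of composites per weight draw
    winners = composite.argmax(axis=1)
    return np.bincount(winners, minlength=x.shape[0]) / n_samples

# Hypothetical normalized scores for three methods on three metrics.
toy_norm = [
    [0.90, 0.8, 0.7],   # strong on every metric
    [0.50, 0.6, 0.4],   # dominated by the first method
    [0.95, 0.1, 0.2],   # best on the first metric only
]
share = first_rank_share(toy_norm)
print(share)
```

A dominated method can never be first under any non-negative weighting, so its share is exactly zero; a method that is best on a single metric wins only in the small corner of the simplex where that metric carries almost all the weight.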

Smallest weight perturbation

smallest_weight_perturbation reports, under SAW, the smallest change to a single weight that reverses the order of any pair of methods. It also flags whether the top-ranked method is fragile, that is, whether some single weight can be moved by less than the fragility threshold (default 0.05) to push it off the top rank.

ts = smallest_weight_perturbation(pooled, polarity, weights="equal", method="saw")
top_method = duo.method_names[int(np.argmin(ts.base.ranks))]
print(f"top-ranked under equal weights / SAW: {top_method}")
print(f"top rank is fragile: {ts.top_rank_is_fragile}")
if ts.most_fragile_pair is not None:
    p = ts.most_fragile_pair
    print(
        f"most fragile ordered pair: {duo.method_names[p.higher]} over "
        f"{duo.method_names[p.lower]}, flips by changing the weight on "
        f"{analysis_metrics[p.criterion]!r} by {p.delta:+.3f}"
    )

top-ranked under equal weights / SAW: Seurat
top rank is fragile: False
most fragile ordered pair: TSCAN over SC3svm, flips by changing the weight on 'runtime' by -0.001
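
Under SAW the flip point of a pair has a closed form, because the composite is linear in the weights. The sketch below assumes that one weight is changed while the others stay fixed and are not renormalized; beam may handle renormalization differently, so treat the numbers as illustrative. The matrix and the function name are hypothetical.

```python
import numpy as np

def smallest_flip(norm, weights):
    """Return (higher, lower, criterion, delta) for the smallest single-weight change that flips a pair."""
    x = np.asarray(norm, dtype=float)
    w = np.asarray(weights, dtype=float)
    s = x @ w                                   # SAW composite per method
    best = None
    for a in range(len(s)):
        for b in range(len(s)):
            if s[a] <= s[b]:
                continue                        # only pairs where a is ranked above b
            for j in range(x.shape[1]):
                d = x[a, j] - x[b, j]
                if d == 0:
                    continue                    # this weight cannot change the gap
                delta = -(s[a] - s[b]) / d      # solves gap + delta * d = 0
                if best is None or abs(delta) < abs(best[3]):
                    best = (a, b, j, delta)
    return best

toy_norm = [[1.0, 0.2], [0.6, 0.5], [0.0, 1.0]]
higher, lower, criterion, delta = smallest_flip(toy_norm, weights=[0.5, 0.5])
print(f"method {higher} over method {lower} flips by moving weight {criterion} by {delta:+.3f}")
```

A negative delta means the higher-ranked method relies on that criterion: lowering its weight closes the gap.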

End to end with beam.rank

The sections above call the pipeline primitives one by one. The same analysis runs in a few lines through the top-level API. beam.rank takes the method by dataset by metric tensor, pools it across datasets by each card’s rule (NaN-aware), runs the MCDA pipeline, and runs the default sensitivity analysis on the same normalization context: SMAA, leave-one-metric-out, the smallest-weight perturbation, and, because the input is a tensor, leave-one-dataset-out.

scores = beam.Scores(
    values=duo.tensor(tuple(analysis_metrics)),
    tool_names=tuple(duo.method_names),
    metric_ids=tuple(analysis_metrics),
    dataset_names=duo.dataset_names,
    layout="long",
)
run = beam.rank(scores, weights="equal", method="saw", seed=0)
print("top-ranked under equal weights / SAW:", run.top_tool)
print("matrix ranked (methods by metrics):", run.matrix.shape)

top-ranked under equal weights / SAW: Seurat
matrix ranked (methods by metrics): (14, 3)

Leave-one-dataset-out asks how much the ranking leans on any single dataset. For each dataset dropped, the remaining eleven are pooled the same way, re-ranked, and compared to the base ranking. The per-method stability is the share of those runs in which the method keeps its base rank.

lodo = run.leave_one_dataset_out
print(f"datasets evaluated: {len(lodo.evaluated_datasets)} of {duo.scores.shape[1]}")
base_order = np.argsort(run.result.ranks)
print(f"{'method':12s}  base rank  rank held across leave-one-dataset-out runs")
for i in base_order:
    print(f"{duo.method_names[i]:12s}  {run.result.ranks[i]:>4d}       {lodo.rank_stability[i] * 100:5.0f}%")

influential = lodo.dataset_names[lodo.most_influential_dataset]
print(f"\nmost influential dataset: {influential} (largest rank shift {lodo.max_rank_shift})")

datasets evaluated: 12 of 12
method        base rank  rank held across leave-one-dataset-out runs
Seurat           1         100%
PCAKmeans        2          67%
PCAHC            3          67%
CIDR             4          50%
monocle          5          50%
RtsneKmeans      6         100%
TSCAN            7         100%
ascend           8         100%
FlowSOM          9         100%
SC3svm          10         100%
SC3             11         100%
pcaReduce       12          75%
RaceID2         13          75%
SAFE            14         100%

most influential dataset: KohTCC (largest rank shift 1)

plot.dataset_stability(run)
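
The leave-one-dataset-out loop can be sketched on a single metric. The example uses a made-up methods-by-datasets matrix and a plain NaN-aware mean; the full analysis above pools three metrics and runs the MCDA step, which this sketch leaves out.

```python
import numpy as np

def ranks_desc(scores):
    """Rank 1 for the highest score; ties broken by row order."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def leave_one_dataset_out(matrix):
    """matrix: methods by datasets, higher is better, NaN = not measured."""
    base = ranks_desc(np.nanmean(matrix, axis=1))
    held = np.zeros(matrix.shape[0])
    for d in range(matrix.shape[1]):
        kept = np.delete(matrix, d, axis=1)               # drop one dataset
        held += ranks_desc(np.nanmean(kept, axis=1)) == base
    return base, held / matrix.shape[1]                   # share of runs in which the rank held

# Hypothetical scores for three methods on three datasets.
toy = np.array([
    [0.9, 0.8, 0.9],
    [0.5, 0.9, 0.6],
    [0.6, 0.4, 0.7],
])
base_ranks, stability = leave_one_dataset_out(toy)
print(base_ranks, stability.round(2))
```

Dropping the second toy dataset swaps the second and third methods, so each of them holds its base rank in two of the three runs, the same pattern as the 67% rows above.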

The funky heatmap shows the same run as a glyph table, with two robustness panels: the span of ranks each method takes across the leave-one-dataset-out runs, and the span across the five aggregations (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET) holding the weighting fixed. The brackets on the left group the methods the Friedman-Nemenyi test cannot separate on ARI. The two span panels differ: dropping a dataset barely moves the order, and Seurat holds rank one, but the choice of aggregation moves it considerably. On this benchmark the order depends more on how the metrics are combined than on which datasets are pooled.

from beam.mcda import critical_difference
from beam.reporting import funky_heatmap_from_run

ari_for_cd = duo.tensor(("ari",))[:, :, 0]
# Keep all fourteen methods here and drop the datasets with a missing ARI cell instead.
complete_datasets = ~np.isnan(ari_for_cd).any(axis=0)
cd = critical_difference(
    ari_for_cd[:, complete_datasets], higher_is_better=True, tool_names=tuple(duo.method_names)
)
cliques = tuple(tuple(duo.method_names[i] for i in clique) for clique in cd.cliques)

funky_heatmap_from_run(
    run,
    cliques=cliques,
    show_smaa=False,
    show_aggregation_consensus=True,
    title="Duo 2018: scores and rank robustness",
)

Dataset concordance

The pooled ranking averages over the twelve datasets. dataset_concordance ranks the methods within each dataset and compares every pair of per-dataset orderings with Kendall tau-b. A high mean says the pooled ranking stands in for the individual datasets; a low mean says it does not, and the per-dataset orderings diverge. The diagnostic needs no replicates and assumes nothing about the datasets being interchangeable, so it measures the heterogeneity rather than asking how many datasets would be enough.

conc = run.dataset_concordance
names = conc.dataset_names
print(f"mean agreement across datasets (Kendall tau-b): {conc.mean_pairwise_tau:.2f}")
print("least typical dataset:", names[conc.most_idiosyncratic_dataset])
print("mutually consistent groups:",
      [tuple(names[d] for d in g) for g in conc.concordant_groups])
print("where methods depart most from their own average rank:")
for cell in conc.notable_cells[:5]:
    side = "lower" if cell.deviation > 0 else "higher"
    print(f"  {conc.tool_names[cell.tool]} on {names[cell.dataset]}: "
          f"rank {cell.rank}, {side} than its mean {cell.mean_rank:.1f}")

mean agreement across datasets (Kendall tau-b): 0.39
least typical dataset: Zhengmix4eq
mutually consistent groups: [('Kumar', 'KumarTCC', 'SimKumar4easy', 'Trapnell', 'TrapnellTCC'), ('Zhengmix4eq', 'Zhengmix4uneq', 'Zhengmix8eq')]
where methods depart most from their own average rank:
  CIDR on Zhengmix4eq: rank 13, lower than its mean 5.5
  FlowSOM on SimKumar4easy: rank 3, higher than its mean 10.5
  CIDR on Zhengmix4uneq: rank 12, lower than its mean 5.5
  CIDR on Zhengmix8eq: rank 12, lower than its mean 5.5
  PCAHC on Zhengmix4eq: rank 10, lower than its mean 4.5

The datasets group into the Koh, the Kumar and Sim, and the Zhengmix families, which is how the source studies built the data. The agreement matrix shows that structure directly.

plot.dataset_concordance(run)

The companion view locates where the disagreement comes from. Each cell is a method’s rank on a dataset minus its mean rank, so a strong positive cell marks a method that places lower than usual on that dataset. A few cells carry most of the spread: some methods collapse on the harder simulated datasets while sitting mid-table on average.

plot.dataset_struggle(run)

Specification curve

rank_sensitivity runs every combination of the weighting, the aggregation and the dataset, and splits the rank variance between them. specification_curve reads the same grid and lists the rankings, then reports how often the top method holds. The choice-only grid (weighting by aggregation, on the pooled matrix) and the choice-plus-data grid (the dataset joins it) answer different questions: the first asks whether the recommendation survives the modeling choices, the second whether it survives a change of dataset.

from beam.mcda import rank_sensitivity, specification_curve

rs = rank_sensitivity(
    scores.values,
    run.context.polarity,
    normalization=list(run.context.normalization),
    bounds=list(run.context.bounds),
    baselines=list(run.context.baselines),
    targets=list(run.context.targets),
    missing="worst",
    tool_names=duo.method_names,
    dataset_names=duo.dataset_names,
)
curve = specification_curve(rs)
dom = curve.tool_names[curve.most_frequent_top_tool]
print(f"specifications (weighting x aggregation x dataset): {curve.n_specifications}")
print(f"{dom} ranks first in {curve.most_frequent_top_fraction * 100:.0f}% of them")
print(f"distinct methods reaching the top: {curve.n_distinct_top_tools}")

# Named pooled_rs so the pooled score matrix from earlier is not overwritten.
pooled_rs = rank_sensitivity(
    run.matrix,
    run.context.polarity,
    normalization=list(run.context.normalization),
    bounds=list(run.context.bounds),
    baselines=list(run.context.baselines),
    targets=list(run.context.targets),
    tool_names=duo.method_names,
)
pooled_curve = specification_curve(pooled_rs)
pdom = pooled_curve.tool_names[pooled_curve.most_frequent_top_tool]
print(
    f"choices only ({pooled_curve.n_specifications} combinations): "
    f"{pdom} first in {pooled_curve.most_frequent_top_fraction * 100:.0f}%"
)

specifications (weighting x aggregation x dataset): 240
Seurat ranks first in 50% of them
distinct methods reaching the top: 6
choices only (20 combinations): Seurat first in 100%

The pooled grid keeps Seurat first in every weighting-by-aggregation combination, so the top rank does not depend on the modeling choice. Adding the dataset as a factor lowers that fraction to 50%, because ranking on a single dataset is where the order moves. The curve below plots the top method’s rank across the full grid, sorted from its best rank to its worst.

plot.specification_curve(curve)
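
The two specification counts printed above are consistent with a full cross of four weightings, five aggregations and twelve datasets. That factorization is an inference from the printed numbers, and the "comet" identifier below is assumed by analogy with the other method names; the sketch only enumerates the grid.

```python
from itertools import product

weightings = ["equal", "entropy", "std", "critic"]
# "comet" is an assumed identifier; the other four appear in the grid earlier in the walkthrough.
aggregations = ["saw", "topsis", "vikor", "promethee_ii", "comet"]
n_datasets = 12

choice_grid = list(product(weightings, aggregations))                     # modeling choices only
full_grid = list(product(weightings, aggregations, range(n_datasets)))   # choices plus dataset

print(len(choice_grid), "choice-only specifications")
print(len(full_grid), "choice-plus-data specifications")
```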

Blind analysis

A benchmarker who can see which method is which while choosing the weighting and the metric set has room to tune the choices toward a method they expect to rank first. beam.blind masks the method names and shuffles the rows under a seed, so the pipeline is fixed on opaque labels; beam.unblind restores the names afterward. The ranking does not change, because beam ranks on the score values and unblinding only renames the rows. What blinding adds is a record: the seal fingerprint goes into the run manifest.

from beam import blind, unblind

blinded, seal = blind(scores, seed=0)
print("blinded labels:", blinded.tool_names[:3], "...")

blind_run = beam.rank(blinded, weights="equal", method="saw", seed=0, sensitivity=False)
restored = unblind(blind_run, seal)
named_run = beam.rank(scores, weights="equal", method="saw", seed=0, sensitivity=False)

named = dict(zip(named_run.tool_names, named_run.result.ranks))
unblinded = dict(zip(restored.tool_names, restored.result.ranks))
print("ranking identical after unblinding:", named == unblinded)
print("top method after unblinding:", restored.top_tool)
print("blinding fingerprint in manifest:", blind_run.manifest["blinding"]["seal_sha256"][:12])

blinded labels: ('method_01', 'method_02', 'method_03') ...
ranking identical after unblinding: True
top method after unblinding: Seurat
blinding fingerprint in manifest: ab94772c1d60
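
The mechanics of blinding are simple enough to sketch with the standard library. This is a hypothetical illustration of the idea (seeded shuffle, opaque labels, a hash of the mapping as the seal), not beam's implementation; the label format mirrors the output above, but the hashing scheme is an assumption.

```python
import hashlib
import json
import random

def blind_names(names, seed):
    """Shuffle the names under a seed and map opaque labels to them."""
    order = list(range(len(names)))
    random.Random(seed).shuffle(order)
    mapping = {f"method_{pos + 1:02d}": names[idx] for pos, idx in enumerate(order)}
    # The seal is a fingerprint of the mapping; it can be recorded before unblinding.
    payload = json.dumps(mapping, sort_keys=True).encode("utf-8")
    seal = hashlib.sha256(payload).hexdigest()
    return mapping, seal

def unblind_names(labels, mapping):
    """Restore the real names for a list of opaque labels."""
    return [mapping[label] for label in labels]

names = ["CIDR", "SC3", "Seurat", "TSCAN"]
mapping, seal = blind_names(names, seed=0)
restored = unblind_names(sorted(mapping), mapping)
print("labels:", sorted(mapping))
print("seal:", seal[:12])
```

Because the seal depends only on the names and the seed, rerunning with the same inputs reproduces it, which is what makes it usable as a record.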

Mixed-effects variance decomposition

Leave-one-dataset-out asks whether the pooled ranking depends on any single dataset. A mixed-effects model asks a different question: how much of the score variation is a stable method effect and how much is the method-by-dataset interaction that a single ranking does not show. beam.heterogeneity.mixed_effects fits score ~ method + (1 | dataset) on one metric in R’s lme4, with the method as a fixed effect and the dataset as a random intercept that absorbs how easy or hard each dataset is for every method alike. The fit needs the R toolchain, so the section below runs only when it is available.

from beam.heterogeneity import mixed_effects_from_matrix, r_available

ari = duo.tensor(("ari",))[:, :, 0]

if r_available():
    me = mixed_effects_from_matrix(ari, duo.method_names, duo.dataset_names)
    print(f"model: {me.formula}")
    print(f"dataset shift (ICC): {me.icc_dataset:.2f} of the ARI variance")
    print(f"residual share:      {me.residual_share:.2f} (interaction confounded with noise at one run per cell)")
    order = np.argsort(-me.method_effects)
    print("\nmethod          ARI marginal mean")
    for i in order[:3]:
        print(f"{me.method_names[i]:14s}  {me.method_effects[i]:.3f} +/- {me.method_effect_se[i]:.3f}")
    print("\nlargest interaction residuals (method, dataset, residual):")
    for m, ds, res in me.top_outliers(3):
        print(f"  {m:10s} {ds:16s} {res:+.2f}")
else:
    me = None
    print("R with lme4 not available; skipping the mixed-effects fit.")
    print("Provision it with envs/heterogeneity.yml (conda or mamba).")

model: score ~ method + (1 | dataset)
dataset shift (ICC): 0.71 of the ARI variance
residual share:      0.29 (interaction confounded with noise at one run per cell)

method          ARI marginal mean
SC3             0.853 +/- 0.083
Seurat          0.847 +/- 0.083
SC3svm          0.823 +/- 0.083

largest interaction residuals (method, dataset, residual):
  RaceID2    SimKumar4hard    -0.50
  RaceID2    SimKumar8hard    -0.43
  RaceID2    Zhengmix4eq      +0.42

if me is not None:
    display(plot.model_effects(me, xlabel="ARI marginal mean over datasets (higher ranks first)"))

The dataset intercept takes about 0.71 of the ARI variance, so most of the spread is datasets being uniformly easy or hard rather than methods reordering. SC3 and Seurat have the highest marginal means and are not separable from each other given the standard errors, which matches the critical-difference reading above. The residual share, the upper bound on the method-by-dataset interaction at one run per cell, concentrates on the two lowest-scoring methods (RaceID2 and FlowSOM) on a few simulated and four-group datasets, not on a reshuffling of the higher-scoring methods. See the mixed-effects explanation.
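
For intuition about the ICC, the dataset share can be estimated from a complete methods-by-datasets table with the classical two-way ANOVA moment estimators. This is a sketch under the assumption of a balanced table with no missing cells; the fit above uses lme4, which also handles the missing ARI cells, so the two need not agree on the real data. The function name and toy tables are hypothetical.

```python
import numpy as np

def dataset_icc(scores):
    """Share of variance from the dataset intercept in score ~ method + (1 | dataset)."""
    y = np.asarray(scores, dtype=float)          # methods by datasets, no NaN
    k, n = y.shape
    grand = y.mean()
    method_mean = y.mean(axis=1, keepdims=True)
    dataset_mean = y.mean(axis=0, keepdims=True)
    resid = y - method_mean - dataset_mean + grand
    ms_error = (resid ** 2).sum() / ((k - 1) * (n - 1))
    ms_dataset = k * ((dataset_mean - grand) ** 2).sum() / (n - 1)
    var_dataset = max((ms_dataset - ms_error) / k, 0.0)   # truncate negative estimates at zero
    return var_dataset / (var_dataset + ms_error)

# Purely additive toy table: datasets shift every method alike, so the ICC is 1.
method_effect = np.array([0.1, 0.0, -0.1])[:, None]
dataset_effect = np.array([0.2, 0.5, 0.8, 0.3])[None, :]
icc_additive = dataset_icc(method_effect + dataset_effect)

# Pure reordering: the two methods swap places and the dataset means are equal, so the ICC is 0.
icc_crossing = dataset_icc([[1.0, 0.0], [0.0, 1.0]])
print(icc_additive, icc_crossing)
```

The two toy tables are the extremes of the reading above: datasets that are uniformly easy or hard push the share toward 1, and methods that reorder between datasets push it toward 0.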

Bradley-Terry tree on dataset features

The variance decomposition says how much interaction there is. The Bradley-Terry tree asks which dataset feature drives it. For one metric, each dataset becomes a set of pairwise method comparisons (the higher ARI is the preferred method), and psychotree::bttree splits the datasets by their features, so each leaf gets its own Bradley-Terry ranking and a parameter-stability test decides where a split is real. The dataset features (number of cells, number of true clusters, real vs simulated, family) ship next to the benchmark. The fit needs the R toolchain, so the section runs only when psychotree is available.

from beam.datasets import load_duo2018_features
from beam.heterogeneity import bradley_terry_tree, bttree_available

if bttree_available():
    features = load_duo2018_features()
    numeric, categorical = features.aligned_to(duo.dataset_names)
    bt = bradley_terry_tree(
        ari,
        duo.method_names,
        duo.dataset_names,
        numeric_features=numeric,
        categorical_features=categorical,
        polarity="higher_is_better",
        minsize=4,
    )
    print(f"split found: {bt.did_split}")
    print(f"global ranking (top 3): {', '.join(bt.global_ranking()[:3])}")
    print(bt.summary())
else:
    bt = None
    print("R with psychotree not available; skipping the Bradley-Terry tree.")
    print("Provision it with envs/heterogeneity.yml (conda or mamba).")

split found: False
global ranking (top 3): SC3, RtsneKmeans, Seurat
The Bradley-Terry tree found no dataset feature that splits the method ranking at alpha 0.05 over 12 datasets, so the ranking is reported as one Bradley-Terry model over all of them, led by SC3. With this many datasets the split test has few observations to work with, the same small-sample limit the critical-difference diagram shows; a benchmark with more datasets is where a split can appear.

On these 12 datasets the parameter-stability test finds no split, so the tree reduces to a single Bradley-Terry ranking led by SC3, the same order the net pairwise preferences and the marginal means give. A dozen datasets is too few for the test to separate a feature-dependent regime from sampling noise, which is also why the critical-difference diagram and the variance decomposition both report a stable global ranking here. The tree is more informative on a benchmark with many datasets carrying real feature variation; the OpenProblems spatially_variable_genes vignette is where a split appears. See the Bradley-Terry explanation.
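
The single Bradley-Terry model that the tree falls back to can be fitted with a short iterative update. The sketch below uses the standard minorization-maximization update on a made-up win matrix; it illustrates the model only and is not what psychotree runs internally.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of datasets on which method i beat method j. Returns strengths summing to 1."""
    w = np.asarray(wins, dtype=float)
    n_ij = w + w.T                           # comparisons per pair (the diagonal stays zero)
    total_wins = w.sum(axis=1)
    p = np.ones(len(w))
    for _ in range(n_iter):
        denom = (n_ij / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom               # MM update for the strengths
        p /= p.sum()                         # fix the scale
    return p

# Hypothetical results for three methods, four datasets per pair.
toy_wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strength = bradley_terry(toy_wins)
print(strength.round(3), "order:", np.argsort(-strength))
```

The estimated probability that method i beats method j is strength[i] / (strength[i] + strength[j]); a tree split would fit one such model per leaf.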

Rank sensitivity: data or analyst choice

The sections above measure three things separately: leave-one-dataset-out (does the pooled order depend on one dataset), the weighting-by-aggregation grid (does the order depend on the modeling choice), and the mixed-effects ICC (how much score variance is the dataset). rank_sensitivity puts all three on one scale. It ranks the methods on each dataset, under every weighting and aggregation, and splits each method’s rank variance into the dataset, the weighting, the aggregation and their interactions.

from beam.mcda import rank_sensitivity

rs = rank_sensitivity(
    duo.tensor(tuple(analysis_metrics)),
    ctx.polarity,
    normalization=list(ctx.normalization),
    bounds=list(ctx.bounds),
    baselines=list(ctx.baselines),
    targets=list(ctx.targets),
    missing="worst",
    tool_names=duo.method_names,
    dataset_names=duo.dataset_names,
)
print(f"{rs.n_combinations} combinations")
print(f"  dataset:      {rs.dataset_share:.3f} of the rank variance")
print(f"  weighting:    {rs.weighting_share:.3f}")
print(f"  aggregation:  {rs.aggregation_share:.3f}")
print(f"  interactions: {rs.interaction_share:.3f}")
print(f"  most influential factor: {rs.most_influential_factor}")

240 combinations
  dataset:      0.734 of the rank variance
  weighting:    0.035
  aggregation:  0.033
  interactions: 0.198
  most influential factor: dataset

plot.rank_sensitivity(rs)

The dataset carries the most, close to the 0.71 the mixed-effects ICC put on the ARI variance, reached here by an exact factorial decomposition of the ranks rather than a random-effects model of the scores. This is not in tension with the stable leave-one-dataset-out result above: that asks whether the pooled order survives dropping one dataset, and it does, because pooling averages over the per-dataset disagreement. The dataset share measures that disagreement directly, by ranking on each dataset on its own. The decomposition filled the partially observed cells with the worst score (missing="worst"), a stated choice the shares should be read against.
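
A factorial decomposition of this kind can be sketched for one method's ranks on a full dataset-by-weighting-by-aggregation grid. The code below computes the three main-effect shares and lumps everything else into the interactions; it is an illustration on a made-up grid, and it is an assumption that beam reports the shares in the same way.

```python
import numpy as np

def rank_variance_shares(ranks):
    """ranks[d, w, a]: one method's rank on dataset d under weighting w and aggregation a."""
    r = np.asarray(ranks, dtype=float)
    grand = r.mean()
    total = ((r - grand) ** 2).sum()
    names = ("dataset", "weighting", "aggregation")
    if total == 0:
        return {name: 0.0 for name in names + ("interactions",)}
    shares = {}
    for axis, name in enumerate(names):
        other = tuple(i for i in range(3) if i != axis)
        level_means = r.mean(axis=other)             # mean rank at each level of this factor
        cells_per_level = r.size / r.shape[axis]
        shares[name] = cells_per_level * ((level_means - grand) ** 2).sum() / total
    shares["interactions"] = 1.0 - sum(shares.values())
    return shares

# Hypothetical grid: the rank moves by 1 per dataset and by 10 per weighting, additively.
d = np.arange(3.0)[:, None, None]
w = np.arange(2.0)[None, :, None]
toy_ranks = d + 10.0 * w + np.zeros((3, 2, 2))
shares = rank_variance_shares(toy_ranks)
print({name: round(value, 3) for name, value in shares.items()})
```

On a full grid the main-effect sums of squares are orthogonal, so the shares add up to one and an additive toy example leaves nothing for the interactions.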

The shares above are pooled over the methods. The per-method version splits the same variance one method at a time. It separates a method whose rank depends on the dataset from one that depends on the weighting or the aggregation. The span next to each bar is the difference between the method’s best and worst rank, read alongside its shares.

plot.rank_sensitivity_by_tool(rs)

Recommendation

Under the default pipeline (equal weights, SAW) on ARI, runtime and Shannon entropy difference pooled across the twelve datasets, Seurat ranks first, and it holds that position across all sixteen weighting-by-method configurations tested above. The SMAA analysis puts Seurat top-ranked in about 99 percent of random weight draws, and the weight-perturbation check finds its top rank is not fragile: no single weight change within the threshold dislodges it. Leave-one-dataset-out agrees: Seurat keeps its first rank in all twelve runs that drop one dataset, so the recommendation does not depend on any single dataset. On the composite, RaceID2 and SAFE rank at the bottom.

The critical-difference diagram qualifies this. On ARI alone, the per-dataset ranks favour SC3 (best average ARI rank), with RtsneKmeans and Seurat tied behind it, and the Friedman test is clearly significant (p well below 0.05), so the methods are not all equivalent. But the Nemenyi critical difference is large relative to the rank spread, so most pairs sit inside overlapping cliques and are not separable on ARI at twelve datasets. The composite ranks Seurat first because it also pools strong runtime and entropy-difference scores, not because it ranks first on ARI alone. Read together, Seurat is a stable top choice across these three metrics and these weightings, while the gaps between the upper-middle methods on ARI alone are smaller than the difference the test can resolve at twelve datasets.

This matches the expected narrative for Seurat (stable top choice) and for RaceID2 (a consistent low performer). It diverges on FlowSOM: here FlowSOM sits mid-pack on the composite, not among the lowest, so the expected “RaceID2 and FlowSOM are consistent low performers” holds for RaceID2 but not for FlowSOM on this metric set and pooling.