Cross-benchmark meta-analysis: do single-cell integration benchmarks disagree about the methods, the data, or the benchmarker?
Author
Izaskun Mallona
Published
May 29, 2026
Goal
Single-cell data-integration benchmarks reach different conclusions about which method to use. This vignette asks where the disagreement comes from: the methods, the datasets, or the benchmarker’s own analysis choices (which metrics, how they are scaled, how they are weighted). It runs the meta-analysis on three benchmarks that publish reusable per-method scores on a shared metric set, and shows that re-ranking them all with one consistent rule removes much of the disagreement.
The benchmarks, and why these four
Four single-cell integration benchmarks publish reusable per-method scores on the shared scIB metric family (ARI, ASW, kBET, LISI) for an overlapping set of classical methods:
scIB (Luecken et al., Nature Methods 2022, DOI 10.1038/s41592-021-01336-8), scores from the scib-reproducibility repository.
OpenProblems batch integration (Nature Biotechnology 2025, DOI 10.1038/s41587-025-02694-w), the bundled CC-BY scores.
Tran et al. (Genome Biology 2020, DOI 10.1186/s13059-019-1850-9), scores from the paper’s supplementary Tables S4 and S7.
Tyler, Guccione and Schadt (bioRxiv 2021.11.15.468733 v2, 2023-10-26, preprint only as of 2026-05-28), scores from Extended Data Table 2 of the preprint. Tyler covers three of the five methods (harmony, scanorama, liger) and three of the four metrics (ARI, ASW, kBET; the cLISI Tyler reports is a different quantity from the iLISI the others report), so it enters the four-source set as a partial-coverage block.
The five methods common to the first three benchmarks are combat, harmony, fastMNN, scanorama and LIGER. Tyler’s per-source polarity differs on kBET (Tyler reports raw rejection rate, lower is better; the others report 1-rejection or its scaled form, higher is better); the loader handles this per source so the within-cell ranking is consistent. Benchmarks excluded: Tran’s repository appeared empty until the scores were found in its supplement (a repository check is not sufficient); scIB-E (Genome Biology 2025) shares the scIB metrics but its methods are deep-learning ones (scVI, scANVI and loss-function variants) with no overlap with the classical five; the Communications Biology 2025 reference-informed evaluation covers one dataset with a bespoke RBET metric and overlaps only on combat and scanorama; BatchBench and sc_mixology use off-family metrics; spatial and cross-species benchmarks use different methods. The full provenance, licences and the dataset crosswalk are in src/beam/data/README.md.
Each benchmark reported its own ranking of the five methods, using its own metric set, scaling and weighting (Tran’s Table S7 final rank, scIB’s 0.6 biological / 0.4 batch weighted overall, OpenProblems’ mean of scaled scores).
print("reported rank of each method (1 best):")print(f" {'method':10s}", " ".join(f"{b:>12s}"for b in BENCH))for m in CANON:print(f" {m:10s}", " ".join(f"{published[b][m]:>12d}"for b in BENCH))pub_vectors = {b: np.array([published[b][m] for m in CANON]) for b in BENCH}pub_rhos = [spearmanr(pub_vectors[a], pub_vectors[b]).correlation for a, b in combinations(BENCH, 2)]print(f"\nmean cross-benchmark Spearman of the reported ranks: {np.mean(pub_rhos):.2f}")
reported rank of each method (1 best):
method Tran scIB OpenProblems
combat 5 2 1
harmony 1 3 2
fastmnn 4 1 3
scanorama 3 4 5
liger 2 5 4
mean cross-benchmark Spearman of the reported ranks: -0.10
combat ranks first in one benchmark and last in another; the reported rankings barely correlate.
beam’s consistent re-ranking
beam re-ranks all three from their raw scores with one rule: four shared metrics, equal weight, ranked within the common methods so the per-benchmark scale does not matter. The dataset axis is pooled per benchmark.
cell = defaultdict(list)for b, d, m, mk, r inzip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True): cell[(b, m)].append(r)beam_mean = {(b, m): float(np.mean(cell[(b, m)])) for b in BENCH for m in CANON}beam_rank = {b: dict(zip(CANON, rankdata([beam_mean[(b, m)] for m in CANON], method="ordinal"), strict=True)) for b in BENCH}beam_vectors = {b: np.array([beam_mean[(b, m)] for m in CANON]) for b in BENCH}beam_rhos = [spearmanr(beam_vectors[a], beam_vectors[b]).correlation for a, b in combinations(BENCH, 2)]print(f"mean cross-benchmark Spearman, beam consistent ranking: {np.mean(beam_rhos):.2f}")print(f"reported: {np.mean(pub_rhos):+.2f} beam: {np.mean(beam_rhos):+.2f} change: {np.mean(beam_rhos) - np.mean(pub_rhos):+.2f}")
Agreement rises from near zero to moderate. Much of the disagreement was in the benchmarker’s analysis choices, not the methods: when the metrics, scaling and weighting are held fixed, the benchmarks line up much more closely.
Subway plot: reported ranks and the consensus
The bump chart shows each method’s reported rank in the three benchmarks (left of the divider) and the beam consensus (right), derived by pooling the consistent per-benchmark ranks.
consensus =dict(zip( CANON, rankdata([np.mean([beam_mean[(b, m)] for b in BENCH]) for m in CANON], method="ordinal"), strict=True,))columns = [f"{b}\n(reported)"for b in BENCH] + ["beam\nconsensus"]ranks = np.array([[published[b][m] for b in BENCH] + [int(consensus[m])] for m in CANON])fig = rank_bump(tuple(CANON), tuple(columns), ranks, divider_after=2, title="Reported method ranks across three benchmarks, and the beam consensus",)display(fig)
Harmony holds the top rank consistently; combat, reported anywhere from first to last, settles to the bottom on the shared metrics; the consensus order is harmony, liger, fastMNN, scanorama, combat.
Funky heatmaps per benchmark
The glyph tables show why the reported orders differ: the per-metric pattern over the five methods is not the same across benchmarks. Each circle is the method’s standing on that metric within the benchmark (larger is better), coloured by the scIB biological and batch groups. The per-metric column maximum shifts across benchmarks for the same method, which is what drives the disagreement.
groups = ["bio", "bio", "batch", "batch"] # ARI, ASW | kBET, LISIn =len(CANON)for b in BENCH: norm = np.zeros((n, len(METRICS)))for j, mk inenumerate(METRICS): ranks_bm = [np.mean([r for bb, _d, mm, mkk, r inzip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True) if bb == b and mm == m and mkk == mk] or [np.nan]) for m in CANON] norm[:, j] =1.0- (np.array(ranks_bm) -1.0) / (n -1) # best rank -> 1 composite = np.nanmean(norm, axis=1) order = rankdata(-composite, method="ordinal") fig = funky_heatmap( norm, tuple(CANON), tuple(METRICS), composite, order, metric_groups=tuple(groups), title=f"{b}: standing of the common methods per metric", ) display(fig)
How much of the variance is the benchmark
A cross-benchmark variance decomposition puts a number on the split. The model is score ~ method + (1 | benchmark) + (1 | benchmark:dataset) + (1 | method:benchmark), where the method-by-benchmark component is the disagreement attributable to the benchmark rather than the method. The fit pools all four sources (the three full-coverage benchmarks plus Tyler 2023 on the three methods and three metrics it covers). The fit needs R’s lme4, so the chunk runs only when it is available.
from beam.heterogeneity import r_available, source_variance_decompositionmethods, datasets, benchmarks, scores = ib.mean_rank_records()if r_available(): rep = source_variance_decomposition(methods, datasets, benchmarks, scores)print(f"benchmarks in the fit: {rep.n_benchmarks}, datasets: {rep.n_datasets}, observations: {rep.n_obs}")print(f"method-by-benchmark share: {rep.method_benchmark_share:.2f}")print(f"residual (method-by-dataset plus noise): {rep.residual_share:.2f}")else:print("R with lme4 not available; skipping the variance decomposition.")print("Provision it with envs/heterogeneity.yml.")
R with lme4 not available; skipping the variance decomposition.
Provision it with envs/heterogeneity.yml.
Tyler 2023 raises the method-by-benchmark share above the three-source value. Their in silico runs carry a controlled biological signal with known ground truth, which is exactly the regime in which their preprint argues unsupervised methods erase real biology. Adding it to the pool gives a sharper estimate of the share of variance that depends on which benchmarker did the analysis rather than which method was run.
Same data, two pipelines: the human pancreas
The pooled decomposition above mixes pipeline differences with dataset differences, because the first three benchmarks barely share datasets and Tyler’s in silico runs share none of them. The one confirmed overlap removes the data confound: Tran’s Dataset 4 is built from Muraro, Segerstolpe, Baron, Wang and Xin, the same five studies scIB’s pancreas task uses. Running both pipelines on this shared block reads off the pipeline effect with no data variability in the way.
from beam.datasets import load_pancreas_contrastpc = load_pancreas_contrast()tran_top, scib_top = pc.top_method()print(f"top method on Tran D4: {tran_top}")print(f"top method on scIB pancreas: {scib_top}")print(f"Spearman of mean ranks (5 methods): {pc.spearman():+.2f}")print()print("mean rank per method (1 best, 5 worst):")for i, m inenumerate(pc.methods):print(f" {m:10s} Tran D4 {pc.tran_mean_rank[i]:.2f} scIB pancreas {pc.scib_mean_rank[i]:.2f}")
top method on Tran D4: harmony
top method on scIB pancreas: harmony
Spearman of mean ranks (5 methods): +0.46
mean rank per method (1 best, 5 worst):
combat Tran D4 4.50 scIB pancreas 4.00
harmony Tran D4 1.50 scIB pancreas 1.50
fastmnn Tran D4 4.00 scIB pancreas 2.00
scanorama Tran D4 3.00 scIB pancreas 3.50
liger Tran D4 2.00 scIB pancreas 4.00
Both pipelines agree on the leader on the shared data, but the rest of the order disagrees more than the pooled Spearman in the previous section would predict. LIGER ranks second on Tran D4 (it took first on kBET and LISI in Tran’s per-metric ranks) and tied last on scIB pancreas, where the same metrics put it near the bottom. The disagreement is not the data; it is in the pipeline.
Per-benchmark fragility: which top method is closest to flipping
Within each benchmark, the smallest single-weight change that swaps the top-ranked method tells the reader how much the recommendation hangs on the analyst’s weighting. beam uses the Triantaphyllou-Sanchez closed-form weight delta on the mean-rank matrix; SAW with equal weights is the baseline so the perturbation is interpretable as “how much weight off one metric is needed to flip the top”.
from beam.mcda import smallest_weight_perturbationpolarity = ("lower_is_better",) *4print("smallest single-weight change that flips the top method:")print(f" {'benchmark':12s}{'top':10s}{'flipped to':10s}{'metric':6s} abs delta")for bench in BENCH: _, metrics_b, matrix = ib.method_metric_matrix(bench) rep = smallest_weight_perturbation(matrix, polarity=polarity, weights="equal", method="saw") top_idx =int(rep.base.ranks.argmin()) p = rep.top_rank_perturbationif p isNone:print(f" {bench:12s}{CANON[top_idx]:10s}{'-':10s}{'-':6s} none found")else: challenger = CANON[p.lower]print(f" {bench:12s}{CANON[top_idx]:10s}{challenger:10s} "f"{metrics_b[p.criterion]:6s}{p.absolute_delta:.3f}" )
smallest single-weight change that flips the top method:
benchmark top flipped to metric abs delta
Tran harmony liger ASW 0.239
scIB fastmnn liger LISI 0.902
OpenProblems fastmnn harmony ASW 0.114
OpenProblems is the most fragile of the three: a 0.11 absolute change to the ASW weight (out of 0.25 equal-weighted baseline) flips fastMNN to harmony. Tran’s harmony lead needs a 0.24 ASW shift before LIGER takes top. scIB’s fastMNN lead is the most stable: a 0.90 shift on LISI is needed before LIGER takes top, well past the [0, 1] feasible range and so practically out of reach. Each benchmark sits at a different point on the fragility axis, even though the analysis rule is the same.
Limits
The published-rank chart and the within-five-methods Spearman comparison are computed on the first three benchmarks because Tyler does not publish an overall ranking (its paper is a methodological critique, not a recommendation). The variance decomposition pools all four sources. Numbers stay indicative, not precise: a Spearman over five methods is coarse, four benchmarks is still few for a sharp method-by-benchmark variance estimate, and the reported rankings are reconstructed (Tran’s final rank directly, scIB’s 0.6/0.4 overall and OpenProblems’ mean score recomputed). For the pooled comparison the benchmarks also use mostly different datasets, so part of that disagreement is genuine data variability, not the benchmarker; the human pancreas section removes that confound on the one block where it can be removed. With those caveats, we showcase how standardizing the decision making with beam removes much of the apparent disagreement between these benchmarks results.
Source Code
---title: "Cross-benchmark meta-analysis: do single-cell integration benchmarks disagree about the methods, the data, or the benchmarker?"author: "Izaskun Mallona"date: todayformat: html: theme: cosmo toc: true toc-location: left embed-resources: true code-tools: true fig-width: 7 fig-height: 4---## GoalSingle-cell data-integration benchmarks reach different conclusions about which method to use. This vignette asks where the disagreement comes from: the methods, the datasets, or the benchmarker's own analysis choices (which metrics, how they are scaled, how they are weighted). It runs the meta-analysis on three benchmarks that publish reusable per-method scores on a shared metric set, and shows that re-ranking them all with one consistent rule removes much of the disagreement.## The benchmarks, and why these fourFour single-cell integration benchmarks publish reusable per-method scores on the shared scIB metric family (ARI, ASW, kBET, LISI) for an overlapping set of classical methods:- scIB (Luecken et al., Nature Methods 2022, DOI [10.1038/s41592-021-01336-8](https://doi.org/10.1038/s41592-021-01336-8)), scores from the scib-reproducibility repository.- OpenProblems batch integration (Nature Biotechnology 2025, DOI [10.1038/s41587-025-02694-w](https://doi.org/10.1038/s41587-025-02694-w)), the bundled CC-BY scores.- Tran et al. (Genome Biology 2020, DOI [10.1186/s13059-019-1850-9](https://doi.org/10.1186/s13059-019-1850-9)), scores from the paper's supplementary Tables S4 and S7.- Tyler, Guccione and Schadt (bioRxiv 2021.11.15.468733 v2, 2023-10-26, preprint only as of 2026-05-28), scores from Extended Data Table 2 of the preprint. Tyler covers three of the five methods (harmony, scanorama, liger) and three of the four metrics (ARI, ASW, kBET; the cLISI Tyler reports is a different quantity from the iLISI the others report), so it enters the four-source set as a partial-coverage block.The five methods common to the first three benchmarks are combat, harmony, fastMNN, scanorama and LIGER. Tyler's per-source polarity differs on kBET (Tyler reports raw rejection rate, lower is better; the others report 1-rejection or its scaled form, higher is better); the loader handles this per source so the within-cell ranking is consistent. Benchmarks excluded: Tran's repository appeared empty until the scores were found in its supplement (a repository check is not sufficient); scIB-E (Genome Biology 2025) shares the scIB metrics but its methods are deep-learning ones (scVI, scANVI and loss-function variants) with no overlap with the classical five; the Communications Biology 2025 reference-informed evaluation covers one dataset with a bespoke RBET metric and overlaps only on combat and scanorama; BatchBench and sc_mixology use off-family metrics; spatial and cross-species benchmarks use different methods. The full provenance, licences and the dataset crosswalk are in `src/beam/data/README.md`.```{python}%matplotlib inlineimport numpy as npfrom collections import defaultdictfrom itertools import combinationsfrom scipy.stats import spearmanr, rankdatafrom IPython.display import displayfrom beam.datasets import load_integration_benchmarks, load_integration_published_ranksfrom beam.reporting import rank_bump, funky_heatmapCANON = ["combat", "harmony", "fastmnn", "scanorama", "liger"]BENCH = ["Tran", "scIB", "OpenProblems"]METRICS = ["ARI", "ASW", "kBET", "LISI"]ib = load_integration_benchmarks()published = load_integration_published_ranks()print("records:", len(ib.rank), "| benchmarks:", sorted(set(ib.benchmark)))```## The reported rankings disagreeEach benchmark reported its own ranking of the five methods, using its own metric set, scaling and weighting (Tran's Table S7 final rank, scIB's 0.6 biological / 0.4 batch weighted overall, OpenProblems' mean of scaled scores).```{python}print("reported rank of each method (1 best):")print(f" {'method':10s}", " ".join(f"{b:>12s}"for b in BENCH))for m in CANON:print(f" {m:10s}", " ".join(f"{published[b][m]:>12d}"for b in BENCH))pub_vectors = {b: np.array([published[b][m] for m in CANON]) for b in BENCH}pub_rhos = [spearmanr(pub_vectors[a], pub_vectors[b]).correlation for a, b in combinations(BENCH, 2)]print(f"\nmean cross-benchmark Spearman of the reported ranks: {np.mean(pub_rhos):.2f}")```combat ranks first in one benchmark and last in another; the reported rankings barely correlate.## beam's consistent re-rankingbeam re-ranks all three from their raw scores with one rule: four shared metrics, equal weight, ranked within the common methods so the per-benchmark scale does not matter. The dataset axis is pooled per benchmark.```{python}cell = defaultdict(list)for b, d, m, mk, r inzip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True): cell[(b, m)].append(r)beam_mean = {(b, m): float(np.mean(cell[(b, m)])) for b in BENCH for m in CANON}beam_rank = {b: dict(zip(CANON, rankdata([beam_mean[(b, m)] for m in CANON], method="ordinal"), strict=True)) for b in BENCH}beam_vectors = {b: np.array([beam_mean[(b, m)] for m in CANON]) for b in BENCH}beam_rhos = [spearmanr(beam_vectors[a], beam_vectors[b]).correlation for a, b in combinations(BENCH, 2)]print(f"mean cross-benchmark Spearman, beam consistent ranking: {np.mean(beam_rhos):.2f}")print(f"reported: {np.mean(pub_rhos):+.2f} beam: {np.mean(beam_rhos):+.2f} change: {np.mean(beam_rhos) - np.mean(pub_rhos):+.2f}")```Agreement rises from near zero to moderate. Much of the disagreement was in the benchmarker's analysis choices, not the methods: when the metrics, scaling and weighting are held fixed, the benchmarks line up much more closely.## Subway plot: reported ranks and the consensusThe bump chart shows each method's reported rank in the three benchmarks (left of the divider) and the beam consensus (right), derived by pooling the consistent per-benchmark ranks.```{python}consensus =dict(zip( CANON, rankdata([np.mean([beam_mean[(b, m)] for b in BENCH]) for m in CANON], method="ordinal"), strict=True,))columns = [f"{b}\n(reported)"for b in BENCH] + ["beam\nconsensus"]ranks = np.array([[published[b][m] for b in BENCH] + [int(consensus[m])] for m in CANON])fig = rank_bump(tuple(CANON), tuple(columns), ranks, divider_after=2, title="Reported method ranks across three benchmarks, and the beam consensus",)display(fig)```Harmony holds the top rank consistently; combat, reported anywhere from first to last, settles to the bottom on the shared metrics; the consensus order is harmony, liger, fastMNN, scanorama, combat.## Funky heatmaps per benchmarkThe glyph tables show why the reported orders differ: the per-metric pattern over the five methods is not the same across benchmarks. Each circle is the method's standing on that metric within the benchmark (larger is better), coloured by the scIB biological and batch groups. The per-metric column maximum shifts across benchmarks for the same method, which is what drives the disagreement.```{python}groups = ["bio", "bio", "batch", "batch"] # ARI, ASW | kBET, LISIn =len(CANON)for b in BENCH: norm = np.zeros((n, len(METRICS)))for j, mk inenumerate(METRICS): ranks_bm = [np.mean([r for bb, _d, mm, mkk, r inzip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True) if bb == b and mm == m and mkk == mk] or [np.nan]) for m in CANON] norm[:, j] =1.0- (np.array(ranks_bm) -1.0) / (n -1) # best rank -> 1 composite = np.nanmean(norm, axis=1) order = rankdata(-composite, method="ordinal") fig = funky_heatmap( norm, tuple(CANON), tuple(METRICS), composite, order, metric_groups=tuple(groups), title=f"{b}: standing of the common methods per metric", ) display(fig)```## How much of the variance is the benchmarkA cross-benchmark variance decomposition puts a number on the split. The model is `score ~ method + (1 | benchmark) + (1 | benchmark:dataset) + (1 | method:benchmark)`, where the method-by-benchmark component is the disagreement attributable to the benchmark rather than the method. The fit pools all four sources (the three full-coverage benchmarks plus Tyler 2023 on the three methods and three metrics it covers). The fit needs R's lme4, so the chunk runs only when it is available.```{python}from beam.heterogeneity import r_available, source_variance_decompositionmethods, datasets, benchmarks, scores = ib.mean_rank_records()if r_available(): rep = source_variance_decomposition(methods, datasets, benchmarks, scores)print(f"benchmarks in the fit: {rep.n_benchmarks}, datasets: {rep.n_datasets}, observations: {rep.n_obs}")print(f"method-by-benchmark share: {rep.method_benchmark_share:.2f}")print(f"residual (method-by-dataset plus noise): {rep.residual_share:.2f}")else:print("R with lme4 not available; skipping the variance decomposition.")print("Provision it with envs/heterogeneity.yml.")```Tyler 2023 raises the method-by-benchmark share above the three-source value. Their in silico runs carry a controlled biological signal with known ground truth, which is exactly the regime in which their preprint argues unsupervised methods erase real biology. Adding it to the pool gives a sharper estimate of the share of variance that depends on which benchmarker did the analysis rather than which method was run.## Same data, two pipelines: the human pancreasThe pooled decomposition above mixes pipeline differences with dataset differences, because the first three benchmarks barely share datasets and Tyler's in silico runs share none of them. The one confirmed overlap removes the data confound: Tran's Dataset 4 is built from Muraro, Segerstolpe, Baron, Wang and Xin, the same five studies scIB's pancreas task uses. Running both pipelines on this shared block reads off the pipeline effect with no data variability in the way.```{python}from beam.datasets import load_pancreas_contrastpc = load_pancreas_contrast()tran_top, scib_top = pc.top_method()print(f"top method on Tran D4: {tran_top}")print(f"top method on scIB pancreas: {scib_top}")print(f"Spearman of mean ranks (5 methods): {pc.spearman():+.2f}")print()print("mean rank per method (1 best, 5 worst):")for i, m inenumerate(pc.methods):print(f" {m:10s} Tran D4 {pc.tran_mean_rank[i]:.2f} scIB pancreas {pc.scib_mean_rank[i]:.2f}")```Both pipelines agree on the leader on the shared data, but the rest of the order disagrees more than the pooled Spearman in the previous section would predict. LIGER ranks second on Tran D4 (it took first on kBET and LISI in Tran's per-metric ranks) and tied last on scIB pancreas, where the same metrics put it near the bottom. The disagreement is not the data; it is in the pipeline.```{python}import matplotlib.pyplot as pltfig, ax = plt.subplots(figsize=(6, 3.2))xs = np.arange(len(pc.methods))ax.bar(xs -0.18, pc.tran_mean_rank, width=0.34, label="Tran D4")ax.bar(xs +0.18, pc.scib_mean_rank, width=0.34, label="scIB pancreas")ax.set_xticks(xs)ax.set_xticklabels(pc.methods)ax.set_ylabel("mean rank across 4 metrics (lower ranks first)")ax.set_xlabel("method")ax.set_title("Same pancreas data, two pipelines")ax.invert_yaxis()ax.legend(loc="upper right")fig.tight_layout()plt.show()```## Per-benchmark fragility: which top method is closest to flippingWithin each benchmark, the smallest single-weight change that swaps the top-ranked method tells the reader how much the recommendation hangs on the analyst's weighting. beam uses the Triantaphyllou-Sanchez closed-form weight delta on the mean-rank matrix; SAW with equal weights is the baseline so the perturbation is interpretable as "how much weight off one metric is needed to flip the top".```{python}from beam.mcda import smallest_weight_perturbationpolarity = ("lower_is_better",) *4print("smallest single-weight change that flips the top method:")print(f" {'benchmark':12s}{'top':10s}{'flipped to':10s}{'metric':6s} abs delta")for bench in BENCH: _, metrics_b, matrix = ib.method_metric_matrix(bench) rep = smallest_weight_perturbation(matrix, polarity=polarity, weights="equal", method="saw") top_idx =int(rep.base.ranks.argmin()) p = rep.top_rank_perturbationif p isNone:print(f" {bench:12s}{CANON[top_idx]:10s}{'-':10s}{'-':6s} none found")else: challenger = CANON[p.lower]print(f" {bench:12s}{CANON[top_idx]:10s}{challenger:10s} "f"{metrics_b[p.criterion]:6s}{p.absolute_delta:.3f}" )```OpenProblems is the most fragile of the three: a 0.11 absolute change to the ASW weight (out of 0.25 equal-weighted baseline) flips fastMNN to harmony. Tran's harmony lead needs a 0.24 ASW shift before LIGER takes top. scIB's fastMNN lead is the most stable: a 0.90 shift on LISI is needed before LIGER takes top, well past the [0, 1] feasible range and so practically out of reach. Each benchmark sits at a different point on the fragility axis, even though the analysis rule is the same.## LimitsThe published-rank chart and the within-five-methods Spearman comparison are computed on the first three benchmarks because Tyler does not publish an overall ranking (its paper is a methodological critique, not a recommendation). The variance decomposition pools all four sources. Numbers stay indicative, not precise: a Spearman over five methods is coarse, four benchmarks is still few for a sharp method-by-benchmark variance estimate, and the reported rankings are reconstructed (Tran's final rank directly, scIB's 0.6/0.4 overall and OpenProblems' mean score recomputed). For the pooled comparison the benchmarks also use mostly different datasets, so part of that disagreement is genuine data variability, not the benchmarker; the human pancreas section removes that confound on the one block where it can be removed. With those caveats, we showcase how standardizing the decision making with beam removes much of the apparent disagreement between these benchmarks results.