Cross-benchmark meta-analysis: do single-cell integration benchmarks disagree about the methods, the data, or the benchmarker?

Author

Izaskun Mallona

Published

May 29, 2026

Goal

Single-cell data-integration benchmarks reach different conclusions about which method to use. This vignette asks where the disagreement comes from: the methods, the datasets, or the benchmarker’s own analysis choices (which metrics, how they are scaled, how they are weighted). It runs the meta-analysis on three benchmarks that publish reusable per-method scores on a shared metric set, and shows that re-ranking them all with one consistent rule removes much of the disagreement.

The benchmarks, and why these four

Four single-cell integration benchmarks publish reusable per-method scores on the shared scIB metric family (ARI, ASW, kBET, LISI) for an overlapping set of classical methods:

  • scIB (Luecken et al., Nature Methods 2022, DOI 10.1038/s41592-021-01336-8), scores from the scib-reproducibility repository.
  • OpenProblems batch integration (Nature Biotechnology 2025, DOI 10.1038/s41587-025-02694-w), the bundled CC-BY scores.
  • Tran et al. (Genome Biology 2020, DOI 10.1186/s13059-019-1850-9), scores from the paper’s supplementary Tables S4 and S7.
  • Tyler, Guccione and Schadt (bioRxiv 2021.11.15.468733 v2, 2023-10-26, preprint only as of 2026-05-28), scores from Extended Data Table 2 of the preprint. Tyler covers three of the five methods (harmony, scanorama, liger) and three of the four metrics (ARI, ASW, kBET; the cLISI Tyler reports is a different quantity from the iLISI the others report), so it enters the four-source set as a partial-coverage block.

The five methods common to the first three benchmarks are combat, harmony, fastMNN, scanorama and LIGER. Tyler’s per-source polarity differs on kBET (Tyler reports raw rejection rate, lower is better; the others report 1-rejection or its scaled form, higher is better); the loader handles this per source so the within-cell ranking is consistent. Benchmarks excluded: Tran’s repository appeared empty until the scores were found in its supplement (a repository check is not sufficient); scIB-E (Genome Biology 2025) shares the scIB metrics but its methods are deep-learning ones (scVI, scANVI and loss-function variants) with no overlap with the classical five; the Communications Biology 2025 reference-informed evaluation covers one dataset with a bespoke RBET metric and overlaps only on combat and scanorama; BatchBench and sc_mixology use off-family metrics; spatial and cross-species benchmarks use different methods. The full provenance, licences and the dataset crosswalk are in src/beam/data/README.md.

%matplotlib inline
import numpy as np
from collections import defaultdict
from itertools import combinations
from scipy.stats import spearmanr, rankdata
from IPython.display import display

from beam.datasets import load_integration_benchmarks, load_integration_published_ranks
from beam.reporting import rank_bump, funky_heatmap

CANON = ["combat", "harmony", "fastmnn", "scanorama", "liger"]
BENCH = ["Tran", "scIB", "OpenProblems"]
METRICS = ["ARI", "ASW", "kBET", "LISI"]

ib = load_integration_benchmarks()
published = load_integration_published_ranks()
print("records:", len(ib.rank), "| benchmarks:", sorted(set(ib.benchmark)))
records: 403 | benchmarks: ['OpenProblems', 'Tran', 'Tyler', 'scIB']

The reported rankings disagree

Each benchmark reported its own ranking of the five methods, using its own metric set, scaling and weighting (Tran’s Table S7 final rank, scIB’s 0.6 biological / 0.4 batch weighted overall, OpenProblems’ mean of scaled scores).

print("reported rank of each method (1 best):")
print(f"  {'method':10s}", "  ".join(f"{b:>12s}" for b in BENCH))
for m in CANON:
    print(f"  {m:10s}", "  ".join(f"{published[b][m]:>12d}" for b in BENCH))

pub_vectors = {b: np.array([published[b][m] for m in CANON]) for b in BENCH}
pub_rhos = [spearmanr(pub_vectors[a], pub_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"\nmean cross-benchmark Spearman of the reported ranks: {np.mean(pub_rhos):.2f}")
reported rank of each method (1 best):
  method             Tran          scIB  OpenProblems
  combat                5             2             1
  harmony               1             3             2
  fastmnn               4             1             3
  scanorama             3             4             5
  liger                 2             5             4

mean cross-benchmark Spearman of the reported ranks: -0.10

combat ranks first in one benchmark and last in another; the reported rankings barely correlate.

beam’s consistent re-ranking

beam re-ranks all three from their raw scores with one rule: four shared metrics, equal weight, ranked within the common methods so the per-benchmark scale does not matter. The dataset axis is pooled per benchmark.

cell = defaultdict(list)
for b, d, m, mk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True):
    cell[(b, m)].append(r)
beam_mean = {(b, m): float(np.mean(cell[(b, m)])) for b in BENCH for m in CANON}
beam_rank = {b: dict(zip(CANON, rankdata([beam_mean[(b, m)] for m in CANON], method="ordinal"), strict=True)) for b in BENCH}

beam_vectors = {b: np.array([beam_mean[(b, m)] for m in CANON]) for b in BENCH}
beam_rhos = [spearmanr(beam_vectors[a], beam_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"mean cross-benchmark Spearman, beam consistent ranking: {np.mean(beam_rhos):.2f}")
print(f"reported: {np.mean(pub_rhos):+.2f}   beam: {np.mean(beam_rhos):+.2f}   change: {np.mean(beam_rhos) - np.mean(pub_rhos):+.2f}")
mean cross-benchmark Spearman, beam consistent ranking: 0.50
reported: -0.10   beam: +0.50   change: +0.60

Agreement rises from near zero to moderate. Much of the disagreement was in the benchmarker’s analysis choices, not the methods: when the metrics, scaling and weighting are held fixed, the benchmarks line up much more closely.

Subway plot: reported ranks and the consensus

The bump chart shows each method’s reported rank in the three benchmarks (left of the divider) and the beam consensus (right), derived by pooling the consistent per-benchmark ranks.

consensus = dict(zip(
    CANON,
    rankdata([np.mean([beam_mean[(b, m)] for b in BENCH]) for m in CANON], method="ordinal"),
    strict=True,
))
columns = [f"{b}\n(reported)" for b in BENCH] + ["beam\nconsensus"]
ranks = np.array([[published[b][m] for b in BENCH] + [int(consensus[m])] for m in CANON])
fig = rank_bump(
    tuple(CANON), tuple(columns), ranks, divider_after=2,
    title="Reported method ranks across three benchmarks, and the beam consensus",
)
display(fig)

Harmony holds the top rank consistently; combat, reported anywhere from first to last, settles to the bottom on the shared metrics; the consensus order is harmony, liger, fastMNN, scanorama, combat.

Funky heatmaps per benchmark

The glyph tables show why the reported orders differ: the per-metric pattern over the five methods is not the same across benchmarks. Each circle is the method’s standing on that metric within the benchmark (larger is better), coloured by the scIB biological and batch groups. The per-metric column maximum shifts across benchmarks for the same method, which is what drives the disagreement.

groups = ["bio", "bio", "batch", "batch"]  # ARI, ASW | kBET, LISI
n = len(CANON)
for b in BENCH:
    norm = np.zeros((n, len(METRICS)))
    for j, mk in enumerate(METRICS):
        ranks_bm = [np.mean([r for bb, _d, mm, mkk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True) if bb == b and mm == m and mkk == mk] or [np.nan]) for m in CANON]
        norm[:, j] = 1.0 - (np.array(ranks_bm) - 1.0) / (n - 1)  # best rank -> 1
    composite = np.nanmean(norm, axis=1)
    order = rankdata(-composite, method="ordinal")
    fig = funky_heatmap(
        norm, tuple(CANON), tuple(METRICS), composite, order,
        metric_groups=tuple(groups), title=f"{b}: standing of the common methods per metric",
    )
    display(fig)

How much of the variance is the benchmark

A cross-benchmark variance decomposition puts a number on the split. The model is score ~ method + (1 | benchmark) + (1 | benchmark:dataset) + (1 | method:benchmark), where the method-by-benchmark component is the disagreement attributable to the benchmark rather than the method. The fit pools all four sources (the three full-coverage benchmarks plus Tyler 2023 on the three methods and three metrics it covers). The fit needs R’s lme4, so the chunk runs only when it is available.

from beam.heterogeneity import r_available, source_variance_decomposition

methods, datasets, benchmarks, scores = ib.mean_rank_records()
if r_available():
    rep = source_variance_decomposition(methods, datasets, benchmarks, scores)
    print(f"benchmarks in the fit: {rep.n_benchmarks}, datasets: {rep.n_datasets}, observations: {rep.n_obs}")
    print(f"method-by-benchmark share: {rep.method_benchmark_share:.2f}")
    print(f"residual (method-by-dataset plus noise): {rep.residual_share:.2f}")
else:
    print("R with lme4 not available; skipping the variance decomposition.")
    print("Provision it with envs/heterogeneity.yml.")
R with lme4 not available; skipping the variance decomposition.
Provision it with envs/heterogeneity.yml.

Tyler 2023 raises the method-by-benchmark share above the three-source value. Their in silico runs carry a controlled biological signal with known ground truth, which is exactly the regime in which their preprint argues unsupervised methods erase real biology. Adding it to the pool gives a sharper estimate of the share of variance that depends on which benchmarker did the analysis rather than which method was run.

Same data, two pipelines: the human pancreas

The pooled decomposition above mixes pipeline differences with dataset differences, because the first three benchmarks barely share datasets and Tyler’s in silico runs share none of them. The one confirmed overlap removes the data confound: Tran’s Dataset 4 is built from Muraro, Segerstolpe, Baron, Wang and Xin, the same five studies scIB’s pancreas task uses. Running both pipelines on this shared block reads off the pipeline effect with no data variability in the way.

from beam.datasets import load_pancreas_contrast

pc = load_pancreas_contrast()
tran_top, scib_top = pc.top_method()
print(f"top method on Tran D4:           {tran_top}")
print(f"top method on scIB pancreas:     {scib_top}")
print(f"Spearman of mean ranks (5 methods): {pc.spearman():+.2f}")
print()
print("mean rank per method (1 best, 5 worst):")
for i, m in enumerate(pc.methods):
    print(f"  {m:10s}  Tran D4 {pc.tran_mean_rank[i]:.2f}    scIB pancreas {pc.scib_mean_rank[i]:.2f}")
top method on Tran D4:           harmony
top method on scIB pancreas:     harmony
Spearman of mean ranks (5 methods): +0.46

mean rank per method (1 best, 5 worst):
  combat      Tran D4 4.50    scIB pancreas 4.00
  harmony     Tran D4 1.50    scIB pancreas 1.50
  fastmnn     Tran D4 4.00    scIB pancreas 2.00
  scanorama   Tran D4 3.00    scIB pancreas 3.50
  liger       Tran D4 2.00    scIB pancreas 4.00

Both pipelines agree on the leader on the shared data, but the rest of the order disagrees more than the pooled Spearman in the previous section would predict. LIGER ranks second on Tran D4 (it took first on kBET and LISI in Tran’s per-metric ranks) and tied last on scIB pancreas, where the same metrics put it near the bottom. The disagreement is not the data; it is in the pipeline.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3.2))
xs = np.arange(len(pc.methods))
ax.bar(xs - 0.18, pc.tran_mean_rank, width=0.34, label="Tran D4")
ax.bar(xs + 0.18, pc.scib_mean_rank, width=0.34, label="scIB pancreas")
ax.set_xticks(xs)
ax.set_xticklabels(pc.methods)
ax.set_ylabel("mean rank across 4 metrics (lower ranks first)")
ax.set_xlabel("method")
ax.set_title("Same pancreas data, two pipelines")
ax.invert_yaxis()
ax.legend(loc="upper right")
fig.tight_layout()
plt.show()

Per-benchmark fragility: which top method is closest to flipping

Within each benchmark, the smallest single-weight change that swaps the top-ranked method tells the reader how much the recommendation hangs on the analyst’s weighting. beam uses the Triantaphyllou-Sanchez closed-form weight delta on the mean-rank matrix; SAW with equal weights is the baseline so the perturbation is interpretable as “how much weight off one metric is needed to flip the top”.

from beam.mcda import smallest_weight_perturbation

polarity = ("lower_is_better",) * 4
print("smallest single-weight change that flips the top method:")
print(f"  {'benchmark':12s} {'top':10s} {'flipped to':10s}  {'metric':6s}  abs delta")
for bench in BENCH:
    _, metrics_b, matrix = ib.method_metric_matrix(bench)
    rep = smallest_weight_perturbation(matrix, polarity=polarity, weights="equal", method="saw")
    top_idx = int(rep.base.ranks.argmin())
    p = rep.top_rank_perturbation
    if p is None:
        print(f"  {bench:12s} {CANON[top_idx]:10s} {'-':10s}  {'-':6s}  none found")
    else:
        challenger = CANON[p.lower]
        print(
            f"  {bench:12s} {CANON[top_idx]:10s} {challenger:10s}  "
            f"{metrics_b[p.criterion]:6s}  {p.absolute_delta:.3f}"
        )
smallest single-weight change that flips the top method:
  benchmark    top        flipped to  metric  abs delta
  Tran         harmony    liger       ASW     0.239
  scIB         fastmnn    liger       LISI    0.902
  OpenProblems fastmnn    harmony     ASW     0.114

OpenProblems is the most fragile of the three: a 0.11 absolute change to the ASW weight (out of 0.25 equal-weighted baseline) flips fastMNN to harmony. Tran’s harmony lead needs a 0.24 ASW shift before LIGER takes top. scIB’s fastMNN lead is the most stable: a 0.90 shift on LISI is needed before LIGER takes top, well past the [0, 1] feasible range and so practically out of reach. Each benchmark sits at a different point on the fragility axis, even though the analysis rule is the same.

Limits

The published-rank chart and the within-five-methods Spearman comparison are computed on the first three benchmarks because Tyler does not publish an overall ranking (its paper is a methodological critique, not a recommendation). The variance decomposition pools all four sources. Numbers stay indicative, not precise: a Spearman over five methods is coarse, four benchmarks is still few for a sharp method-by-benchmark variance estimate, and the reported rankings are reconstructed (Tran’s final rank directly, scIB’s 0.6/0.4 overall and OpenProblems’ mean score recomputed). For the pooled comparison the benchmarks also use mostly different datasets, so part of that disagreement is genuine data variability, not the benchmarker; the human pancreas section removes that confound on the one block where it can be removed. With those caveats, we showcase how standardizing the decision making with beam removes much of the apparent disagreement between these benchmarks results.