Cross-benchmark meta-analysis

Author

Izaskun Mallona

Published

July 9, 2026

Goal

Single-cell data-integration benchmarks reach different conclusions about which method to use. This vignette asks where the disagreement comes from: the methods, the datasets, or the benchmarker’s own analysis choices (which metrics, how they are scaled, how they are weighted). It runs the meta-analysis on benchmarks that publish reusable per-method scores on a shared metric set, and shows that re-ranking them all with one consistent rule removes part of the disagreement.

The benchmarks

Five single-cell integration benchmarks are pooled here. Four publish reusable per-method scores on the shared scIB metric family (ARI, ASW, kBET, LISI); the fifth, BatchBench, scores on its own batch and cell-type entropy and enters on the within-benchmark rank scale.

Benchmark	Paper	Methods covered	Metrics covered	Scores from
scIB	Luecken et al., Nature Methods 2022, 10.1038/s41592-021-01336-8	all five	ARI, ASW, kBET, LISI	scib-reproducibility repository
OpenProblems batch integration	OpenProblems consortium, Nature Biotechnology 2025, 10.1038/s41587-025-02694-w	all five	ARI, ASW, kBET, LISI	bundled CC-BY scores
Tran et al.	Tran et al., Genome Biology 2020, 10.1186/s13059-019-1850-9	all five	ARI, ASW, kBET, LISI	supplementary Tables S4 and S7
Tyler et al.	Tyler, Guccione and Schadt, bioRxiv 2023, 10.1101/2021.11.15.468733	harmony, scanorama, liger	ARI, ASW, kBET	Extended Data Table 2
BatchBench	Chazarra-Gil et al., Nucleic Acids Research 2021, 10.1093/nar/gkab004	combat, harmony, fastMNN, scanorama	batch entropy, cell-type entropy	Supplementary Table 1, provided by the author
Shen 2026 (discrimination axis only)	Shen, He and Guan, PLOS Comput Biol 2026, 10.1371/journal.pcbi.1014008	ten semi-supervised and unsupervised DL methods (overlap: harmony, scanorama only)	own ARI, NMI, ASW and others	authors’ repository, pinned commit

Tyler is a preprint as of 2026-05-28 and covers three of the five methods and three of the four metrics (the cLISI it reports is a different quantity from the iLISI the others report), so it enters as a partial-coverage block. Its kBET is the raw rejection rate (lower is better), the opposite polarity to the others; the loader handles this per source so the within-cell ranking is consistent. BatchBench covers four of the five methods (it has no LIGER) on two entropy metrics: batch entropy, where higher is better (batches better mixed), and cell-type entropy, where lower is better (a higher value means cell types are more blurred). Ruben Chazarra-Gil provided the scores by personal communication. The metrics differ from the scIB family, so BatchBench enters on the within-benchmark rank scale, the same common currency as the others.
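The polarity handling can be illustrated with a minimal sketch. The function name `rank_within_cell` and the scores are hypothetical toy values, not beam's loader or any benchmark's data. The point is only that the same result reported as an acceptance rate (higher is better) or as a rejection rate (lower is better) gives identical within-cell ranks once the polarity is declared per source.

```python
def rank_within_cell(scores, higher_is_better):
    """Rank the methods of one cell, 1 best; assumes no tied scores."""
    # Sort best-first: descending when higher is better, ascending otherwise.
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: position for position, method in enumerate(ordered, start=1)}


# Toy values (hypothetical): one kBET-like result in two polarities.
acceptance = {"harmony": 0.8, "scanorama": 0.5, "liger": 0.6}
rejection = {"harmony": 0.2, "scanorama": 0.5, "liger": 0.4}

ranks_accept = rank_within_cell(acceptance, higher_is_better=True)
ranks_reject = rank_within_cell(rejection, higher_is_better=False)
print(ranks_accept)
print(ranks_reject)
```

Ranking inside each cell is also what lets sources with different metric scales share one currency: only the order within the cell survives.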

The sixth source, Shen 2026, is not pooled with the first five. Its methods overlap the classical five only on harmony and scanorama, too few to rank-harmonize; the same disjointness excludes scIB-E. Rather than being dropped, it enters a later section on a method-agnostic axis, dataset discrimination, which any benchmark can report whatever methods it ran.

The five methods common to the first three benchmarks are combat, harmony, fastMNN, scanorama and LIGER. The excluded benchmarks and the reasons, with provenance, licenses and the dataset crosswalk, are in src/beam/data/README.md.

%matplotlib inline
import numpy as np
from collections import defaultdict
from itertools import combinations
from scipy.stats import spearmanr, rankdata
from IPython.display import display

from beam import plot
from beam.datasets import load_integration_benchmarks, load_integration_published_ranks
from beam.reporting import funky_heatmap

CANON = ["combat", "harmony", "fastmnn", "scanorama", "liger"]
BENCH = ["Tran", "scIB", "OpenProblems"]
METRICS = ["ARI", "ASW", "kBET", "LISI"]

ib = load_integration_benchmarks()
published = load_integration_published_ranks()
print("records:", len(ib.rank), "| benchmarks:", sorted(set(ib.benchmark)))

records: 643 | benchmarks: ['BatchBench', 'OpenProblems', 'Tran', 'Tyler', 'scIB']

The reported rankings

Each benchmark reported its own ranking of the five methods, using its own metric set, scaling and weighting (Tran’s Table S7 final rank, scIB’s 0.6 biological / 0.4 batch weighted overall, OpenProblems’ mean of scaled scores).

print("reported rank of each method (1 best):")
print(f"  {'method':10s}", "  ".join(f"{b:>12s}" for b in BENCH))
for m in CANON:
    print(f"  {m:10s}", "  ".join(f"{published[b][m]:>12d}" for b in BENCH))

pub_vectors = {b: np.array([published[b][m] for m in CANON]) for b in BENCH}
pub_rhos = [spearmanr(pub_vectors[a], pub_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"\nmean cross-benchmark Spearman of the reported ranks: {np.mean(pub_rhos):.2f}")

reported rank of each method (1 best):
  method             Tran          scIB  OpenProblems
  combat                5             2             1
  harmony               1             3             2
  fastmnn               4             1             3
  scanorama             3             4             5
  liger                 2             5             4

mean cross-benchmark Spearman of the reported ranks: -0.10

combat ranks first in OpenProblems and last in Tran; the reported rankings barely correlate (mean Spearman -0.10).
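The mean Spearman can be checked by hand from the printed table. A minimal sketch, using only the ranks shown there and the textbook shortcut for Spearman's rho on untied ranks, 1 - 6 * sum(d^2) / (n * (n^2 - 1)); the helper name `spearman_untied` is illustrative.

```python
from itertools import combinations

# Reported ranks copied from the table above, in the order
# combat, harmony, fastmnn, scanorama, liger.
reported = {
    "Tran": [5, 1, 4, 3, 2],
    "scIB": [2, 3, 1, 4, 5],
    "OpenProblems": [1, 2, 3, 5, 4],
}


def spearman_untied(x, y):
    """Spearman's rho for two rankings without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


pair_rho = {
    (a, b): spearman_untied(reported[a], reported[b])
    for a, b in combinations(reported, 2)
}
for pair, rho in pair_rho.items():
    print(pair, f"{rho:+.2f}")
mean_rho = sum(pair_rho.values()) / len(pair_rho)
print(f"mean: {mean_rho:+.2f}")
```

The pairs come out at -0.60 (Tran, scIB), -0.30 (Tran, OpenProblems) and +0.60 (scIB, OpenProblems): scIB and OpenProblems agree moderately with each other, and Tran disagrees with both.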

beam’s consistent re-ranking

beam re-ranks all three from their raw scores with one rule: four shared metrics, equal weight, ranked within the common methods so the per-benchmark scale does not matter. The datasets are pooled per benchmark.

cell = defaultdict(list)
for b, d, m, mk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True):
    cell[(b, m)].append(r)
beam_mean = {(b, m): float(np.mean(cell[(b, m)])) for b in BENCH for m in CANON}
beam_rank = {b: dict(zip(CANON, rankdata([beam_mean[(b, m)] for m in CANON], method="ordinal"), strict=True)) for b in BENCH}

beam_vectors = {b: np.array([beam_mean[(b, m)] for m in CANON]) for b in BENCH}
beam_rhos = [spearmanr(beam_vectors[a], beam_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"mean cross-benchmark Spearman, beam consistent ranking: {np.mean(beam_rhos):.2f}")
print(f"reported: {np.mean(pub_rhos):+.2f}   beam: {np.mean(beam_rhos):+.2f}   change: {np.mean(beam_rhos) - np.mean(pub_rhos):+.2f}")

mean cross-benchmark Spearman, beam consistent ranking: 0.50
reported: -0.10   beam: +0.50   change: +0.60

Agreement rises from near zero (-0.10) to moderate (+0.50). Much of the disagreement came from the benchmarker’s analysis choices, not the methods.

Subway plot: reported ranks and the consensus

The bump chart shows each method’s reported rank in the three benchmarks (left of the divider) and the beam consensus (right), derived by pooling the consistent per-benchmark ranks.

consensus = dict(zip(
    CANON,
    rankdata([np.mean([beam_mean[(b, m)] for b in BENCH]) for m in CANON], method="ordinal"),
    strict=True,
))
columns = [f"{b}\n(reported)" for b in BENCH] + ["beam\nconsensus"]
ranks = np.array([[published[b][m] for b in BENCH] + [int(consensus[m])] for m in CANON])
fig = plot.rank_bump(
    tuple(CANON), tuple(columns), ranks, divider_after=2,
    title="Reported method ranks across three benchmarks, and the beam consensus",
)
display(fig)

Harmony ranks first in the consensus; combat, reported anywhere from first to last, settles at the bottom on the shared metrics. The consensus order is harmony, liger, fastMNN, scanorama, combat.

Funky heatmaps per benchmark

The glyph tables show why the reported orders differ: the per-metric pattern over the five methods is not the same across benchmarks. Each circle is the method’s standing on that metric within the benchmark (larger is better), coloured by the scIB biological and batch groups. The per-metric column maximum shifts across benchmarks for the same method, which is what causes the disagreement.

groups = ["bio", "bio", "batch", "batch"]  # ARI, ASW | kBET, LISI
n = len(CANON)
for b in BENCH:
    norm = np.zeros((n, len(METRICS)))
    for j, mk in enumerate(METRICS):
        ranks_bm = [np.mean([r for bb, _d, mm, mkk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True) if bb == b and mm == m and mkk == mk] or [np.nan]) for m in CANON]
        norm[:, j] = 1.0 - (np.array(ranks_bm) - 1.0) / (n - 1)  # best rank -> 1
    composite = np.nanmean(norm, axis=1)
    order = rankdata(-composite, method="ordinal")
    fig = funky_heatmap(
        norm, tuple(CANON), tuple(METRICS), composite, order,
        metric_groups=tuple(groups), title=f"{b}: standing of the common methods per metric",
    )
    display(fig)
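The rescaling inside the loop above maps a mean rank onto a 0 to 1 standing. A minimal sketch of that one line in isolation (the function name `standing` is illustrative):

```python
def standing(mean_rank, n_methods):
    """Map a mean rank in [1, n_methods] to [0, 1]; the best rank maps to 1."""
    return 1.0 - (mean_rank - 1.0) / (n_methods - 1)


for r in (1.0, 2.5, 5.0):
    print(r, standing(r, n_methods=5))
```

With five methods, a mean rank of 1 gives 1.0, a mean rank of 2.5 gives 0.625 and a mean rank of 5 gives 0.0. Because the ranks already encode each metric's polarity, a larger circle always means a better standing.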

A fifth source: BatchBench

BatchBench scores the methods with two entropy metrics rather than the scIB family: batch entropy, where a higher value means the batches are better mixed, and cell-type entropy, where a higher value means the cell types are more blurred, so a lower value is better for biology. Ruben Chazarra-Gil provided the scores by personal communication. The two metrics show the batch-mixing against biology tradeoff directly.

import matplotlib.pyplot as plt

from beam.datasets import load_batchbench

bb = load_batchbench()
bi = bb.metric_ids.index("batch_entropy")
ci = bb.metric_ids.index("cell_type_entropy")
batch_mean = np.nanmean(bb.scores[:, :, bi], axis=1)
cell_mean = np.nanmean(bb.scores[:, :, ci], axis=1)

fig, ax = plt.subplots(figsize=(6.5, 4.5))
ax.scatter(batch_mean, cell_mean, color="tab:blue")
for x, y, name in zip(batch_mean, cell_mean, bb.method_names, strict=True):
    ax.annotate(name, (x, y), xytext=(4, 0), textcoords="offset points", fontsize=8)
ax.set_xlabel("batch entropy (higher = batches better mixed)")
ax.set_ylabel("cell-type entropy (higher = cell types more blurred)")
ax.set_title("BatchBench: batch mixing against biology")
fig.tight_layout()
plt.show()

A method low and to the right mixes batches well while keeping the cell types separate. harmony sits high and to the right: it mixes batches well but blurs the cell types more than the others. A benchmark that weights biology equally then ranks it below fastMNN, which is what BatchBench does.

Benchmark share of the variance

A cross-benchmark variance decomposition puts a number on the split. The model is score ~ method + (1 | benchmark) + (1 | benchmark:dataset) + (1 | method:benchmark), where the method-by-benchmark component is the disagreement attributable to the benchmark rather than the method. The chunk fits it twice, with the four scIB-family sources and again with BatchBench added, so the effect of the fifth source is visible. The fit needs R’s lme4, so the chunk runs only when it is available.

from beam.heterogeneity import r_available, source_variance_decomposition

methods, datasets, benchmarks, scores = ib.mean_rank_records()
if r_available():
    def share(keep):
        idx = [i for i in range(len(benchmarks)) if keep(benchmarks[i])]
        rep = source_variance_decomposition(
            [methods[i] for i in idx], [datasets[i] for i in idx],
            [benchmarks[i] for i in idx], [scores[i] for i in idx],
        )
        return rep.method_benchmark_share
    s4 = share(lambda b: b != "BatchBench")
    s5 = share(lambda b: True)
    print(f"method-by-benchmark share: {s4:.2f} without BatchBench, {s5:.2f} with it")
else:
    print("R with lme4 not available; skipping the variance decomposition.")
    print("Provision it with envs/heterogeneity.yml.")

method-by-benchmark share: 0.23 without BatchBench, 0.42 with it

Adding BatchBench raises the method-by-benchmark share, from about 0.23 to about 0.42. Its two entropy metrics weight biological conservation and batch mixing equally, and on that footing it disagrees with the scIB-family benchmarks about harmony, so more of the spread now sits in the benchmark rather than the method. Part of the rise is also that BatchBench ranks four methods rather than five, the same scale difference Tyler has, so read it as a clear direction rather than an exact number. The fits are singular (with so few benchmarks one or more variance components are estimated at zero); this does not stop the share from being computed, and it is one more reason to read the value as a direction.
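How a share is read off a set of variance components can be sketched as follows. This assumes the share is the component's variance divided by the sum of all components including the residual; beam's `method_benchmark_share` may be defined differently. The component values are hypothetical, not output from the fit, and the function name `variance_shares` is illustrative.

```python
def variance_shares(components):
    """Divide each variance component by the total variance."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: value / total for name, value in components.items()}


# Hypothetical variance components. The benchmark term is exactly zero,
# as happens for one or more components in a singular fit.
toy = {
    "benchmark": 0.0,
    "benchmark:dataset": 0.2,
    "method:benchmark": 0.5,
    "residual": 1.3,
}
shares = variance_shares(toy)
print({name: round(s, 2) for name, s in shares.items()})
```

A component estimated at zero simply contributes a zero share; the remaining shares are still defined, which is why a singular fit does not block the computation.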

One coherent ranking: network meta-analysis

The variance decomposition says how much of the spread is the benchmark. A network meta-analysis instead pools all five sources into one ranking, the way clinical research ranks treatments that were never all tried in one trial. Each method is a treatment, each (benchmark, dataset) block is a study, and the within-study effect of a method is its mean rank over the metrics with a standard deviation across them. netmeta combines the direct and the indirect evidence, and reports a P-score per method (higher is better) plus heterogeneity and inconsistency statistics. The fit needs R’s netmeta, so the chunk runs only when it is available.

from beam.heterogeneity import netmeta_available, network_meta_analysis

treatment, study, mean, sd, n = ib.network_arms()
if netmeta_available():
    nma = network_meta_analysis(treatment, study, mean, sd, n)
    print(f"studies: {nma.n_studies}, treatments: {nma.n_treatments}, comparisons: {nma.n_comparisons}")
    print(f"ranking (best first): {', '.join(nma.ranking())}")
    print("P-score per method (higher is better):")
    for m, p in sorted(zip(nma.treatments, nma.pscore), key=lambda x: -x[1]):
        print(f"  {m:10s} {p:.2f}")
    print(f"I-squared: {nma.i2:.2f}")
    print(f"inconsistency Q: {nma.q_inconsistency:.1f} on {nma.df_inconsistency:.0f} df, p = {nma.pval_inconsistency:.1e}")
else:
    print("R with netmeta not available; skipping the network meta-analysis.")
    print("Provision it with envs/heterogeneity.yml.")

studies: 51, treatments: 5, comparisons: 307
ranking (best first): liger, harmony, fastmnn, scanorama, combat
P-score per method (higher is better):
  liger      0.83
  harmony    0.83
  fastmnn    0.58
  scanorama  0.25
  combat     0.00
I-squared: 0.67
inconsistency Q: 103.0 on 19 df, p = 1.5e-13

With BatchBench in the pool, harmony no longer ranks first on its own: it ties with liger at the top, fastMNN follows, and combat ranks last. The I-squared is high and the inconsistency Q is significant: the direct and the indirect evidence for the same method pairs do not line up.
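A P-score is the mean, over all competitors, of the one-sided probability that a method is better than each of them. A minimal sketch under simplifying assumptions: the effect estimates and the single shared standard error below are hypothetical, whereas a fitted network supplies a standard error for each pairwise contrast. Smaller effects are better, as for mean ranks, and the function name `p_scores` is illustrative.

```python
from statistics import NormalDist


def p_scores(effects, se_diff, lower_is_better=True):
    """Mean one-sided probability that each treatment beats every other one."""
    phi = NormalDist().cdf
    sign = 1.0 if lower_is_better else -1.0
    scores = {}
    for i, theta_i in effects.items():
        probs = [
            # P(treatment i is better than treatment j), normal approximation
            phi(sign * (theta_j - theta_i) / se_diff)
            for j, theta_j in effects.items()
            if j != i
        ]
        scores[i] = sum(probs) / len(probs)
    return scores


# Hypothetical pooled mean ranks (lower is better), not the fitted estimates.
toy_effects = {"A": 1.8, "B": 2.0, "C": 3.1, "D": 4.2}
ps = p_scores(toy_effects, se_diff=0.4)
for name, p in sorted(ps.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {p:.2f}")
```

P-scores average 0.5 across the treatments by construction, so a value near 0.00 means the method is beaten by every competitor with near certainty, and two methods with close estimates end up with close P-scores.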

Same data, two pipelines: the human pancreas

The pooled decomposition above mixes pipeline differences with dataset differences, because the first three benchmarks barely share datasets and Tyler’s in silico runs share none of them. The one confirmed overlap removes the data confound: Tran’s Dataset 4 is built from Muraro, Segerstolpe, Baron, Wang and Xin, the same five studies scIB’s pancreas task uses. Running both pipelines on this shared block reads off the pipeline effect with no data variability in the way.

from beam.datasets import load_pancreas_contrast

pc = load_pancreas_contrast()
tran_top, scib_top = pc.top_method()
print(f"top method on Tran D4:           {tran_top}")
print(f"top method on scIB pancreas:     {scib_top}")
print(f"Spearman of mean ranks (5 methods): {pc.spearman():+.2f}")
print()
print("mean rank per method (1 best, 5 worst):")
for i, m in enumerate(pc.methods):
    print(f"  {m:10s}  Tran D4 {pc.tran_mean_rank[i]:.2f}    scIB pancreas {pc.scib_mean_rank[i]:.2f}")

top method on Tran D4:           harmony
top method on scIB pancreas:     harmony
Spearman of mean ranks (5 methods): +0.46

mean rank per method (1 best, 5 worst):
  combat      Tran D4 4.50    scIB pancreas 4.00
  harmony     Tran D4 1.50    scIB pancreas 1.50
  fastmnn     Tran D4 4.00    scIB pancreas 2.00
  scanorama   Tran D4 3.00    scIB pancreas 3.50
  liger       Tran D4 2.00    scIB pancreas 4.00

Both pipelines agree on the top-ranked method on the shared data, but the rest of the order disagrees more than the pooled Spearman in the previous section would predict. LIGER ranks second on Tran D4 (it took first on kBET and LISI in Tran’s per-metric ranks) and tied last on scIB pancreas, where the same metrics put it near the bottom. The disagreement is not the data; it is in the pipeline.
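The +0.46 can be reproduced from the printed mean ranks. A minimal sketch, assuming the two-decimal values in the output above preserve the order and the tie; Spearman's rho is computed as the Pearson correlation of average ranks, which handles the combat and LIGER tie on scIB pancreas. The helper names are illustrative.

```python
from math import sqrt

# Mean ranks copied from the output above, in the order
# combat, harmony, fastmnn, scanorama, liger.
tran_d4 = [4.50, 1.50, 4.00, 3.00, 2.00]
scib_pancreas = [4.00, 1.50, 2.00, 3.50, 4.00]


def average_ranks(values):
    """Rank from smallest to largest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


rho = spearman(tran_d4, scib_pancreas)
print(f"{rho:+.2f}")
```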

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3.2))
xs = np.arange(len(pc.methods))
ax.bar(xs - 0.18, pc.tran_mean_rank, width=0.34, label="Tran D4")
ax.bar(xs + 0.18, pc.scib_mean_rank, width=0.34, label="scIB pancreas")
ax.set_xticks(xs)
ax.set_xticklabels(pc.methods)
ax.set_ylabel("mean rank across 4 metrics (lower ranks first)")
ax.set_xlabel("method")
ax.set_title("Same pancreas data, two pipelines")
ax.invert_yaxis()
ax.legend(loc="upper right")
fig.tight_layout()
plt.show()

Within each benchmark, the smallest single-weight change that swaps the top-ranked method measures how much the recommendation depends on the analyst’s weighting. beam uses the Triantaphyllou-Sanchez closed-form weight delta on the mean-rank matrix; simple additive weighting (SAW) with equal weights is the baseline, so the perturbation reads as “how much weight must move on one metric to flip the top”.

from beam.mcda import smallest_weight_perturbation

polarity = ("lower_is_better",) * 4
print("smallest single-weight change that flips the top method:")
print(f"  {'benchmark':12s} {'top':10s} {'flipped to':10s}  {'metric':6s}  abs delta")
for bench in BENCH:
    _, metrics_b, matrix = ib.method_metric_matrix(bench)
    rep = smallest_weight_perturbation(matrix, polarity=polarity, weights="equal", method="saw")
    top_idx = int(rep.base.ranks.argmin())
    p = rep.top_rank_perturbation
    if p is None:
        print(f"  {bench:12s} {CANON[top_idx]:10s} {'-':10s}  {'-':6s}  none found")
    else:
        challenger = CANON[p.lower]
        print(
            f"  {bench:12s} {CANON[top_idx]:10s} {challenger:10s}  "
            f"{metrics_b[p.criterion]:6s}  {p.absolute_delta:.3f}"
        )

smallest single-weight change that flips the top method:
  benchmark    top        flipped to  metric  abs delta
  Tran         harmony    liger       ASW     0.239
  scIB         fastmnn    liger       LISI    0.902
  OpenProblems fastmnn    harmony     ASW     0.114

OpenProblems is the most fragile of the three: a 0.11 absolute change to the ASW weight (against an equal-weight baseline of 0.25) flips the top from fastMNN to harmony. Tran’s harmony lead needs a 0.24 ASW shift before LIGER takes the top. scIB’s fastMNN lead is the most stable: LIGER takes the top only after a 0.90 shift on LISI, which from a 0.25 baseline would push the weight outside the feasible [0, 1] range and is therefore out of reach. Each benchmark sits at a different point on the fragility axis, even though the analysis rule is the same.
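The idea behind the weight delta can be sketched in a simplified form. This is not beam's `smallest_weight_perturbation` and not the full Triantaphyllou-Sanchez procedure: it changes one weight without renormalizing the others and solves for the delta at which the top method ties with a challenger. The matrix is hypothetical and the function name `smallest_flip` is illustrative.

```python
def smallest_flip(matrix, weights):
    """Smallest single-weight change that ties the top method with a challenger.

    matrix[i][k] is the mean rank of method i on metric k (lower is better).
    Returns (abs_delta, challenger_index, metric_index, signed_delta).
    """
    totals = [sum(w * a for w, a in zip(weights, row)) for row in matrix]
    top = min(range(len(matrix)), key=totals.__getitem__)
    best = None
    for j, row in enumerate(matrix):
        if j == top:
            continue
        for k in range(len(weights)):
            gap = matrix[top][k] - row[k]
            if gap == 0:
                continue  # this metric cannot separate the two methods
            # Solve totals[top] + delta * a[top][k] == totals[j] + delta * a[j][k]
            delta = (totals[j] - totals[top]) / gap
            if best is None or abs(delta) < best[0]:
                best = (abs(delta), j, k, delta)
    return best


# Hypothetical mean-rank matrix: 3 methods x 4 metrics.
toy = [
    [1.5, 2.0, 3.0, 1.5],
    [2.0, 1.0, 2.5, 3.0],
    [3.0, 3.5, 1.0, 2.0],
]
equal = [0.25] * 4
abs_delta, challenger, metric, delta = smallest_flip(toy, equal)
print(challenger, metric, round(abs_delta, 3))
```

A delta larger than the weight can absorb while staying in [0, 1] is infeasible in this reading, which is the sense in which a very large delta marks a stable lead.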

Specification curve per benchmark

The fragility delta is one number per benchmark. The specification curve lists the whole multiverse of analyst choices: every weighting by every aggregation on the mean-rank matrix. rank_sensitivity runs the grid and specification_curve reports how often the top method holds. The benchmarks differ here too: on Tran and scIB the top method holds across all 20 combinations, while on OpenProblems it does not.

from beam.mcda import rank_sensitivity, specification_curve

print("specification curve per benchmark (weighting x aggregation on the mean-rank matrix):")
for bench in BENCH:
    _, _, matrix = ib.method_metric_matrix(bench)
    curve = specification_curve(rank_sensitivity(matrix, polarity, tool_names=CANON))
    dom = curve.tool_names[curve.most_frequent_top_tool]
    print(f"  {bench:12s} {dom:10s} first in {curve.most_frequent_top_fraction * 100:3.0f}% of "
          f"{curve.n_specifications} combinations, {curve.n_distinct_top_tools} method(s) reach the top")

specification curve per benchmark (weighting x aggregation on the mean-rank matrix):
  Tran         harmony    first in 100% of 20 combinations, 1 method(s) reach the top
  scIB         fastmnn    first in 100% of 20 combinations, 1 method(s) reach the top
  OpenProblems harmony    first in  55% of 20 combinations, 3 method(s) reach the top
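What a specification curve counts can be sketched on a small grid. The sketch below is not beam's grid of 20 combinations: it crosses three weightings with two aggregations on a hypothetical mean-rank matrix and tallies which method comes first under each specification. All names and values are illustrative.

```python
from collections import Counter
from math import prod

# Hypothetical mean-rank matrix (lower is better); the columns stand for
# ARI, ASW, kBET, LISI. The values are not taken from any benchmark.
toy = {
    "method_a": [1.0, 1.5, 3.0, 2.5],
    "method_b": [3.0, 2.5, 1.0, 1.0],
    "method_c": [2.0, 2.0, 2.0, 3.0],
}

weightings = {
    "equal": [0.25, 0.25, 0.25, 0.25],
    "bio_heavy": [0.3, 0.3, 0.2, 0.2],    # 0.6 biological / 0.4 batch
    "batch_heavy": [0.2, 0.2, 0.3, 0.3],  # 0.4 biological / 0.6 batch
}

aggregations = {
    # weighted arithmetic mean of the ranks (simple additive weighting)
    "saw": lambda row, w: sum(wi * a for wi, a in zip(w, row)),
    # weighted geometric mean of the ranks (weighted product)
    "wpm": lambda row, w: prod(a ** wi for wi, a in zip(w, row)),
}

tops = Counter()
for w in weightings.values():
    for agg in aggregations.values():
        totals = {m: agg(row, w) for m, row in toy.items()}
        tops[min(totals, key=totals.get)] += 1  # lowest aggregate rank wins

n_specs = sum(tops.values())
winner, count = tops.most_common(1)[0]
print(f"{winner} first in {100 * count / n_specs:.0f}% of {n_specs} combinations, "
      f"{len(tops)} method(s) reach the top")
```

In this toy grid the top method changes with the weighting but not with the aggregation, the kind of attribution the per-method split makes for a real benchmark.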

On OpenProblems the top method does not hold across the choices, so the per-method split shows which methods carry that variance and whether it is the weighting or the aggregation. The four metrics (ARI, ASW, kBET, LISI) are the columns being combined, the criteria of the decision, not a factor in the split. The split has no factor for the dataset either. Each cell of this matrix is already a mean rank over the benchmark’s datasets, so the datasets are collapsed and the bars carry only the weighting, the aggregation and their interaction.

op_matrix = ib.method_metric_matrix("OpenProblems")[2]
op_rs = rank_sensitivity(op_matrix, polarity, tool_names=CANON)
plot.rank_sensitivity_by_tool(op_rs, title="OpenProblems: what moves each method's rank")

Blind analysis

A blind analysis fixes the weighting and the aggregation before the method names are revealed, so the choices cannot be tuned toward a method known to lead. These benchmarks rank on the mean-rank scale rather than through metric cards, so the demonstration uses the low-level beam.mcda.run with explicit polarity. beam.blind relabels and shuffles the rows; the seal maps the labels back. The order is the same once unblinded.

from beam import blind, Scores
from beam.mcda import run as mcda_run

_, metrics_b, matrix = ib.method_metric_matrix("Tran")
scores = Scores(
    values=matrix, tool_names=tuple(CANON), metric_ids=tuple(metrics_b),
    dataset_names=None, layout="wide",
)
blinded, seal = blind(scores, seed=0)

named = mcda_run(scores.values, polarity, weights="equal", method="saw")
blind_res = mcda_run(blinded.values, polarity, weights="equal", method="saw")
order_named = [scores.tool_names[i] for i in named.ranks.argsort()]
order_unblinded = [seal.true_name(blinded.tool_names[i]) for i in blind_res.ranks.argsort()]
print("blinded labels:", blinded.tool_names)
print("order, named:     ", order_named)
print("order, unblinded: ", order_unblinded)
print("same order after unblinding:", order_named == order_unblinded)

blinded labels: ('method_1', 'method_2', 'method_3', 'method_4', 'method_5')
order, named:      ['harmony', 'liger', 'scanorama', 'fastmnn', 'combat']
order, unblinded:  ['harmony', 'liger', 'scanorama', 'fastmnn', 'combat']
same order after unblinding: True
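The relabel, shuffle and unseal steps can be sketched without beam. The function names `blind_rows` and `rank_order`, the tool names and the scores are all hypothetical; the sketch only shows that an equal-weight ranking computed on blinded rows maps back to the same order.

```python
import random


def blind_rows(names, rows, seed):
    """Shuffle the rows and replace the names with neutral labels.

    Returns the blinded labels, the shuffled rows and a seal that maps
    each label back to the true name.
    """
    order = list(range(len(names)))
    random.Random(seed).shuffle(order)
    labels = [f"method_{i + 1}" for i in range(len(names))]
    seal = {label: names[src] for label, src in zip(labels, order)}
    return labels, [rows[src] for src in order], seal


def rank_order(labels, rows):
    """Order the labels by equal-weight mean rank, best (lowest) first."""
    means = {label: sum(row) / len(row) for label, row in zip(labels, rows)}
    return sorted(means, key=means.get)


# Hypothetical mean ranks per metric (lower is better), distinct row means.
names = ["alpha", "beta", "gamma", "delta"]
rows = [
    [3.0, 4.0, 3.5, 4.0],
    [1.0, 2.0, 1.5, 1.0],
    [2.0, 1.0, 2.5, 3.0],
    [4.0, 3.0, 2.5, 2.0],
]

labels, blinded_rows, seal = blind_rows(names, rows, seed=0)
order_named = rank_order(names, rows)
order_unblinded = [seal[label] for label in rank_order(labels, blinded_rows)]
print("same order after unblinding:", order_named == order_unblinded)
```

The order is preserved because the aggregation never looks at the labels or the row positions. With tied means, the tie-break would also have to be independent of row order for this to hold.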

A sixth source: dataset discrimination

Every analysis above compares method orders, so it needs shared methods. Shen, He and Guan (2026, 10.1371/journal.pcbi.1014008) score ten methods, five semi-supervised (scANVI, scGEN, STACAS, scDREAMER, ItClust) and five unsupervised (Seurat, scVI, Harmony, scanorama, scCRAFT), on six datasets, each under annotation scenarios that degrade label supervision from full to none. Only harmony and scanorama overlap the classical five, too few to rank-harmonize. A property defined per dataset is still available: how much each dataset separates the methods it scores. Any benchmark can report it for every dataset, whatever methods it ran. See the data README for provenance and the dataset discrimination explanation.

dataset_discrimination computes it from the scores. beam treats each (dataset, scenario) pair as one unit and reports two values per unit: the spread of the pooled method scores, the effect size, and Kendall’s W over the unit’s method-by-metric matrix, the consistency of the order across metrics. It complements dataset_concordance: concordance asks whether datasets agree on the order, discrimination asks whether a dataset separates the methods at all.

from beam.datasets import load_openproblems, load_semisupervised_integration
from beam.mcda import dataset_discrimination, difficulty_concordance

shen = load_semisupervised_integration()
disc = dataset_discrimination(shen.scores, shen.polarity, dataset_ids=shen.unit_names)
print(f"{len(shen.unit_names)} units (6 datasets x annotation scenarios), {len(shen.method_names)} methods")
print(f"mean discrimination (spread): {disc.mean_spread:.3f}, mean concordance (W): {disc.mean_kendall_w:.2f}")
print(f"most discriminating: {disc.most_discriminating}")
print(f"least discriminating: {disc.least_discriminating}")
display(plot.dataset_discrimination(disc, top=15))

104 units (6 datasets x annotation scenarios), 10 methods
mean discrimination (spread): 0.079, mean concordance (W): 0.27
most discriminating: macaque/randomly_wrong_70
least discriminating: human_immune/missing_and_mixing_at_edge_70

The strongest discriminators are the corrupted-label scenarios (randomly_wrong_70 and the heavily missing ones): wrong labels separate the methods that use them from the unsupervised ones. The weakest are the mild perturbations, where the methods score alike and the unit cannot rank them. On the high-spread units the colour is dark, so the metrics agree on the order there.
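Kendall's W, the consistency measure used here, can be sketched for a single unit. The version below is the textbook coefficient without a tie correction, which beam may or may not apply; each row holds one metric's ranking of the methods, and the rankings are hypothetical.

```python
def kendall_w(rank_matrix):
    """Kendall's W for m raters (rows) ranking n items (columns), without ties."""
    m, n = len(rank_matrix), len(rank_matrix[0])
    rank_sums = [sum(row[j] for row in rank_matrix) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    # Squared deviations of the per-item rank sums from their mean.
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical rankings of four methods by three metrics.
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
partial = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
opposed = [[1, 2, 3, 4], [4, 3, 2, 1]]

for name, ranks in (("agree", agree), ("partial", partial), ("opposed", opposed)):
    print(name, round(kendall_w(ranks), 2))
```

W is 1 when every metric orders the methods identically and 0 when the rank sums are all equal, so a mean W of 0.27 indicates that the metrics mostly disagree on the order within a unit.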

Difficulty concordance: classical versus deep learning

Splitting the methods into two categories, classical and deep learning, separates two kinds of hardness. difficulty_concordance correlates the per-category difficulty profiles across the datasets. A high concordance places the hardness in the data; a low one places it in the kind of method.

dl = {"scVI", "scANVI", "scGEN", "scDREAMER", "scCRAFT", "ItClust"}
families = ["DL" if m in dl else "classical" for m in shen.method_names]
shen_conc = difficulty_concordance(shen.scores, shen.polarity, families, dataset_ids=shen.unit_names)
print(f"Shen (annotation-scenario hardness): concordance {shen_conc.concordance[0, 1]:+.2f}")
print(f"  families disagree most on: {shen_conc.most_divergent_dataset}")

op = load_openproblems("batch_integration")
op_metrics = [m for m in op.metric_ids if m != "hvg_overlap"]
op_tensor = op.tensor(tuple(op_metrics))
op_pol = tuple("higher_is_better" for _ in op_metrics)
op_dl = {"scvi", "scanvi", "scalex", "scgpt_finetuned", "scgpt_zeroshot",
         "geneformer", "scprint", "uce", "scimilarity"}
op_families = ["DL" if m in op_dl else "classical" for m in op.method_names]
op_conc = difficulty_concordance(op_tensor, op_pol, op_families, dataset_ids=op.dataset_names)
print(f"OpenProblems (intrinsic-complexity hardness): concordance {op_conc.concordance[0, 1]:+.2f}")
display(plot.difficulty_concordance(shen_conc, title="Shen 2026: do DL and classical find the same units hard?"))

Shen (annotation-scenario hardness): concordance +0.32
  families disagree most on: macaque/randomly_wrong_70
OpenProblems (intrinsic-complexity hardness): concordance +0.89

The two benchmarks differ because their datasets are hard for different reasons. On OpenProblems the hardness is dataset complexity, and the two categories agree (Spearman about 0.89): a dataset hard for classical methods is hard for the deep-learning ones. On Shen the hardness is annotation quality, and they agree weakly (about 0.32): degraded labels affect only the methods that use them. On the hardest units the deep-learning methods score below the classical ones; on the easy units they score above. Dataset complexity is shared; a benchmark that stresses what one category depends on is hard only for that category, which a single cross-method ranking would not show.

A controlled contrast inside scIB

The OpenProblems and Shen contrasts compare the two categories across different method sets. The scIB source harmonized at the top of this vignette removes that confound: it scores deep-learning methods (scVI, scANVI, scGen, DESC, SAUCIE, trVAE) on the same five datasets and four metrics as the classical five already used, so the only thing that changes between classical and deep learning is the method. load_scib_integration_families returns both blocks; the deep-learning scores are extracted from the same scib-reproducibility table by reduce_scib_dl.py.

from beam.datasets import load_scib_integration_families

fam = load_scib_integration_families()
scib_conc = difficulty_concordance(fam.scores, fam.polarity, fam.families, dataset_ids=fam.dataset_names)
print(f"scIB (intrinsic complexity, identical data for both families): concordance {scib_conc.concordance[0, 1]:+.2f}")

shared = sorted({m.lower() for m in fam.method_names} & {m.lower() for m in shen.method_names})
print("methods now shared with Shen:", shared)
display(plot.difficulty_concordance(scib_conc, title="scIB: do DL and classical find the same datasets hard?"))

scIB (intrinsic complexity, identical data for both families): concordance +0.90
methods now shared with Shen: ['harmony', 'scanorama', 'scanvi', 'scgen', 'scvi']

On scIB the two categories agree (about 0.90), as on OpenProblems and unlike Shen, and here the datasets and metrics are held fixed so that only the method changes. The low Shen concordance is a property of annotation degradation, not of being a deep-learning method. Extracting these methods also lifts the deep-learning overlap: scVI, scANVI and scGen now join harmony and scanorama as methods shared with Shen, where before no deep-learning method was carried from scIB at all.

Limits

The published-rank chart and the within-five-methods Spearman comparison are computed on the first three benchmarks because Tyler does not publish an overall ranking (its paper is a methodological critique, not a recommendation). The variance decomposition is fitted on the four scIB-family sources and again with BatchBench added. Numbers stay indicative, not precise: a Spearman over five methods is coarse, four or five benchmarks is still few for a sharp method-by-benchmark variance estimate, and the reported rankings are reconstructed (Tran’s final rank directly, scIB’s 0.6/0.4 overall and OpenProblems’ mean score recomputed). For the pooled comparison the benchmarks also use mostly different datasets, so part of that disagreement is genuine data variability, not the benchmarker; the human pancreas section removes that confound on the one block where it can be removed. With those caveats, standardizing the decision making with beam removes part of the apparent disagreement between these benchmarks, but it does not close it: the network meta-analysis finds a significant residual inconsistency even under the shared rule.

--- title: "Cross-benchmark meta-analysis" author: "Izaskun Mallona" date: today format: html: theme: cosmo toc: true toc-location: left embed-resources: true code-tools: true fig-width: 7 fig-height: 4 --- ## Goal Single-cell data-integration benchmarks reach different conclusions about which method to use. This vignette asks where the disagreement comes from: the methods, the datasets, or the benchmarker's own analysis choices (which metrics, how they are scaled, how they are weighted). It runs the meta-analysis on benchmarks that publish reusable per-method scores on a shared metric set, and shows that re-ranking them all with one consistent rule removes part of the disagreement. ## The benchmarks Five single-cell integration benchmarks are pooled here. Four publish reusable per-method scores on the shared scIB metric family (ARI, ASW, kBET, LISI); the fifth, BatchBench, scores on its own batch and cell-type entropy and enters on the within-benchmark rank scale. | Benchmark | Paper | Methods covered | Metrics covered | Scores from | |---|---|---|---|---| | scIB | Luecken et al., Nature Methods 2022, [10.1038/s41592-021-01336-8](https://doi.org/10.1038/s41592-021-01336-8) | all five | ARI, ASW, kBET, LISI | scib-reproducibility repository | | OpenProblems batch integration | OpenProblems consortium, Nature Biotechnology 2025, [10.1038/s41587-025-02694-w](https://doi.org/10.1038/s41587-025-02694-w) | all five | ARI, ASW, kBET, LISI | bundled CC-BY scores | | Tran et al. | Tran et al., Genome Biology 2020, [10.1186/s13059-019-1850-9](https://doi.org/10.1186/s13059-019-1850-9) | all five | ARI, ASW, kBET, LISI | supplementary Tables S4 and S7 | | Tyler et al. 
| Tyler, Guccione and Schadt, bioRxiv 2023, [10.1101/2021.11.15.468733](https://doi.org/10.1101/2021.11.15.468733) | harmony, scanorama, liger | ARI, ASW, kBET | Extended Data Table 2 | | BatchBench | Chazarra-Gil et al., Nucleic Acids Research 2021, [10.1093/nar/gkab004](https://doi.org/10.1093/nar/gkab004) | combat, harmony, fastMNN, scanorama | batch entropy, cell-type entropy | Supplementary Table 1, provided by the author | | Shen 2026 (discrimination axis only) | Shen, He and Guan, PLOS Comput Biol 2026, [10.1371/journal.pcbi.1014008](https://doi.org/10.1371/journal.pcbi.1014008) | ten semi-supervised and unsupervised DL methods (overlap: harmony, scanorama only) | own ARI, NMI, ASW and others | authors' repository, pinned commit | Tyler is a preprint as of 2026-05-28 and covers three of the five methods and three of the four metrics (the cLISI it reports is a different quantity from the iLISI the others report), so it enters as a [partial-coverage](../../docs/explanations/missing-data.md) block. Its kBET is the raw rejection rate (lower is better), the opposite polarity to the others; the loader handles this per source so the within-cell ranking is consistent. BatchBench covers four of the five methods (it has no LIGER) on two entropy metrics: batch entropy, where higher is better (batches better mixed), and cell-type entropy, where lower is better (a higher value means cell types are more blurred). Ruben Chazarra-Gil provided the scores by personal communication. The metrics differ from the scIB family, so BatchBench enters on the within-benchmark rank scale, the same common currency as the others. The sixth source, Shen 2026, is not pooled with the first five. Its deep-learning methods overlap the classical five only on harmony and scanorama, too few to rank-harmonize, the same disjointness that excludes scIB-E. 
Rather than being dropped it enters the last section on a method-agnostic axis, dataset discrimination, which any benchmark can report whatever methods it ran.

The five methods common to the first three benchmarks are combat, harmony, fastMNN, scanorama and LIGER. The excluded benchmarks and the reasons, with provenance, licenses and the dataset crosswalk, are in `src/beam/data/README.md`.

```{python}
%matplotlib inline
import numpy as np
from collections import defaultdict
from itertools import combinations
from scipy.stats import spearmanr, rankdata
from IPython.display import display
from beam import plot
from beam.datasets import load_integration_benchmarks, load_integration_published_ranks
from beam.reporting import funky_heatmap

CANON = ["combat", "harmony", "fastmnn", "scanorama", "liger"]
BENCH = ["Tran", "scIB", "OpenProblems"]
METRICS = ["ARI", "ASW", "kBET", "LISI"]

ib = load_integration_benchmarks()
published = load_integration_published_ranks()
print("records:", len(ib.rank), "| benchmarks:", sorted(set(ib.benchmark)))
```

## The reported rankings

Each benchmark reported its own ranking of the five methods, using its own metric set, [scaling](../../docs/explanations/normalization-and-scales.md) and [weighting](../../docs/explanations/weighting-schemes.md) (Tran's Table S7 final rank, scIB's 0.6 biological / 0.4 batch weighted overall, OpenProblems' mean of scaled scores).

```{python}
print("reported rank of each method (1 best):")
print(f" {'method':10s}", " ".join(f"{b:>12s}" for b in BENCH))
for m in CANON:
    print(f" {m:10s}", " ".join(f"{published[b][m]:>12d}" for b in BENCH))

pub_vectors = {b: np.array([published[b][m] for m in CANON]) for b in BENCH}
pub_rhos = [spearmanr(pub_vectors[a], pub_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"\nmean cross-benchmark Spearman of the reported ranks: {np.mean(pub_rhos):.2f}")
```

combat ranks first in one benchmark and last in another; the reported rankings barely correlate.

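
A Spearman correlation over five methods moves in coarse steps, which is worth keeping in mind when reading the mean above. The sketch below illustrates this on made-up rank vectors, not on the benchmarks' data; `spearman_no_ties` is a hypothetical helper that applies the textbook no-ties formula, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).

```python
from itertools import permutations


def spearman_no_ties(a, b):
    """Spearman's rho for two rankings without ties (textbook formula)."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


reference = (1, 2, 3, 4, 5)       # hypothetical ranking of five methods
one_swap = (2, 1, 3, 4, 5)        # two neighbouring methods trade places
reversed_order = (5, 4, 3, 2, 1)  # complete disagreement

print("one adjacent swap:", spearman_no_ties(reference, one_swap))
print("full reversal:    ", spearman_no_ties(reference, reversed_order))

# Every value rho can take against the reference when n = 5.
levels = sorted({round(spearman_no_ties(reference, p), 10) for p in permutations(reference)})
print(len(levels), "distinct values between", levels[0], "and", levels[-1])
```

With five methods a single adjacent swap already costs 0.1, so a difference of that size between two mean correlations should not be over-read.
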

## beam's consistent re-ranking

beam re-ranks all three from their raw scores with one rule: four shared metrics, equal weight, ranked within the common methods so the per-benchmark scale does not matter. The datasets are pooled per benchmark.

```{python}
cell = defaultdict(list)
for b, d, m, mk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True):
    cell[(b, m)].append(r)

beam_mean = {(b, m): float(np.mean(cell[(b, m)])) for b in BENCH for m in CANON}
beam_rank = {b: dict(zip(CANON, rankdata([beam_mean[(b, m)] for m in CANON], method="ordinal"), strict=True)) for b in BENCH}
beam_vectors = {b: np.array([beam_mean[(b, m)] for m in CANON]) for b in BENCH}
beam_rhos = [spearmanr(beam_vectors[a], beam_vectors[b]).correlation for a, b in combinations(BENCH, 2)]
print(f"mean cross-benchmark Spearman, beam consistent ranking: {np.mean(beam_rhos):.2f}")
print(f"reported: {np.mean(pub_rhos):+.2f} beam: {np.mean(beam_rhos):+.2f} change: {np.mean(beam_rhos) - np.mean(pub_rhos):+.2f}")
```

Agreement rises from near zero to moderate. Much of the disagreement was in the benchmarker's analysis choices, not the methods.

## Subway plot: reported ranks and the consensus

The bump chart shows each method's reported rank in the three benchmarks (left of the divider) and the beam consensus (right), derived by pooling the consistent per-benchmark ranks.


```{python}
consensus = dict(zip(
    CANON,
    rankdata([np.mean([beam_mean[(b, m)] for b in BENCH]) for m in CANON], method="ordinal"),
    strict=True,
))
columns = [f"{b}\n(reported)" for b in BENCH] + ["beam\nconsensus"]
ranks = np.array([[published[b][m] for b in BENCH] + [int(consensus[m])] for m in CANON])
fig = plot.rank_bump(
    tuple(CANON), tuple(columns), ranks,
    divider_after=2,
    title="Reported method ranks across three benchmarks, and the beam consensus",
)
display(fig)
```

Harmony ranks first consistently; combat, reported anywhere from first to last, settles to the bottom on the shared metrics; the consensus order is harmony, liger, fastMNN, scanorama, combat.

## Funky heatmaps per benchmark

The [glyph tables](../../docs/explanations/funky-heatmaps-and-robustness.md) show why the reported orders differ: the per-metric pattern over the five methods is not the same across benchmarks. Each circle is the method's standing on that metric within the benchmark (larger is better), coloured by the scIB biological and batch groups. The per-metric column maximum shifts across benchmarks for the same method, which is what causes the disagreement.


```{python}
groups = ["bio", "bio", "batch", "batch"]  # ARI, ASW | kBET, LISI
n = len(CANON)
for b in BENCH:
    norm = np.zeros((n, len(METRICS)))
    for j, mk in enumerate(METRICS):
        ranks_bm = [
            np.mean([r for bb, _d, mm, mkk, r in zip(ib.benchmark, ib.dataset, ib.method, ib.metric, ib.rank, strict=True)
                     if bb == b and mm == m and mkk == mk] or [np.nan])
            for m in CANON
        ]
        norm[:, j] = 1.0 - (np.array(ranks_bm) - 1.0) / (n - 1)  # best rank -> 1
    composite = np.nanmean(norm, axis=1)
    order = rankdata(-composite, method="ordinal")
    fig = funky_heatmap(
        norm, tuple(CANON), tuple(METRICS), composite, order,
        metric_groups=tuple(groups),
        title=f"{b}: standing of the common methods per metric",
    )
    display(fig)
```

## A fifth source: BatchBench

BatchBench scores the methods with two entropy metrics rather than the scIB family: batch entropy, where a higher value means the batches are better mixed, and cell-type entropy, where a higher value means the cell types are more blurred, so a lower value is better for biology. Ruben Chazarra-Gil provided the scores by personal communication. The two metrics show the batch-mixing against biology tradeoff directly.


```{python}
import matplotlib.pyplot as plt
from beam.datasets import load_batchbench

bb = load_batchbench()
bi = bb.metric_ids.index("batch_entropy")
ci = bb.metric_ids.index("cell_type_entropy")
batch_mean = np.nanmean(bb.scores[:, :, bi], axis=1)
cell_mean = np.nanmean(bb.scores[:, :, ci], axis=1)

fig, ax = plt.subplots(figsize=(6.5, 4.5))
ax.scatter(batch_mean, cell_mean, color="tab:blue")
for x, y, name in zip(batch_mean, cell_mean, bb.method_names, strict=True):
    ax.annotate(name, (x, y), xytext=(4, 0), textcoords="offset points", fontsize=8)
ax.set_xlabel("batch entropy (higher = batches better mixed)")
ax.set_ylabel("cell-type entropy (higher = cell types more blurred)")
ax.set_title("BatchBench: batch mixing against biology")
fig.tight_layout()
plt.show()
```

A method low and to the right mixes batches well while keeping the cell types separate. harmony sits high and to the right: it mixes batches well but blurs the cell types more than the others. A benchmark that weights biology equally then ranks it below fastMNN, which is what BatchBench does.

## Benchmark share of the variance

A cross-benchmark [variance decomposition](../../docs/explanations/attribution-synthesis.md) puts a number on the split. The model is `score ~ method + (1 | benchmark) + (1 | benchmark:dataset) + (1 | method:benchmark)`, where the method-by-benchmark component is the disagreement attributable to the benchmark rather than the method. The chunk fits it twice, with the four scIB-family sources and again with BatchBench added, so the effect of the fifth source is visible. The fit needs R's lme4, so the chunk runs only when it is available.

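
Written out, the lme4 formula above corresponds to the random-effects model below. The symbols are introduced here for illustration only, and the last line is one plausible reading of the method-by-benchmark share (random-effect and residual variances in the denominator); how `source_variance_decomposition` defines it is not stated in this vignette, so treat that line as an assumption.

$$
\begin{aligned}
y_{mbd} &= \mu + \alpha_m + u_b + v_{bd} + w_{mb} + \varepsilon_{mbd}, \\
u_b &\sim \mathcal{N}(0, \sigma^2_{\text{bench}}), \quad
v_{bd} \sim \mathcal{N}(0, \sigma^2_{\text{data}}), \quad
w_{mb} \sim \mathcal{N}(0, \sigma^2_{\text{m:b}}), \quad
\varepsilon_{mbd} \sim \mathcal{N}(0, \sigma^2_{\varepsilon}), \\
\text{share}_{\text{m:b}} &= \frac{\sigma^2_{\text{m:b}}}{\sigma^2_{\text{bench}} + \sigma^2_{\text{data}} + \sigma^2_{\text{m:b}} + \sigma^2_{\varepsilon}}.
\end{aligned}
$$

Here $y_{mbd}$ is the score of method $m$ on dataset $d$ of benchmark $b$, $\alpha_m$ is the fixed method effect, and $u_b$, $v_{bd}$ and $w_{mb}$ are the random intercepts for `benchmark`, `benchmark:dataset` and `method:benchmark`.
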

```{python}
from beam.heterogeneity import r_available, source_variance_decomposition

methods, datasets, benchmarks, scores = ib.mean_rank_records()
if r_available():
    def share(keep):
        idx = [i for i in range(len(benchmarks)) if keep(benchmarks[i])]
        rep = source_variance_decomposition(
            [methods[i] for i in idx],
            [datasets[i] for i in idx],
            [benchmarks[i] for i in idx],
            [scores[i] for i in idx],
        )
        return rep.method_benchmark_share

    s4 = share(lambda b: b != "BatchBench")
    s5 = share(lambda b: True)
    print(f"method-by-benchmark share: {s4:.2f} without BatchBench, {s5:.2f} with it")
else:
    print("R with lme4 not available; skipping the variance decomposition.")
    print("Provision it with envs/heterogeneity.yml.")
```

Adding BatchBench raises the method-by-benchmark share, from about 0.23 to about 0.42. Its two entropy metrics weight biological conservation and batch mixing equally, and on that footing it disagrees with the scIB-family benchmarks about harmony, so more of the spread now sits in the benchmark rather than the method. Part of the rise is also that BatchBench ranks four methods rather than five, the same scale difference Tyler has, so read it as a clear direction rather than an exact number. The fits are singular (with so few benchmarks one or more variance components are estimated at zero); this does not stop the share from being computed, and it is one more reason to read the value as a direction.

## One coherent ranking: network meta-analysis

The variance decomposition says how much of the spread is the benchmark. A network meta-analysis instead pools all five sources into one ranking, the way clinical research ranks treatments that were never all tried in one trial. Each method is a treatment, each (benchmark, dataset) block is a study, and the within-study effect of a method is its mean rank over the metrics with a standard deviation across them.
netmeta combines the direct and the indirect evidence, and reports a P-score per method (higher is better) plus heterogeneity and inconsistency statistics. The fit needs R's netmeta, so the chunk runs only when it is available.

```{python}
from beam.heterogeneity import netmeta_available, network_meta_analysis

treatment, study, mean, sd, n = ib.network_arms()
if netmeta_available():
    nma = network_meta_analysis(treatment, study, mean, sd, n)
    print(f"studies: {nma.n_studies}, treatments: {nma.n_treatments}, comparisons: {nma.n_comparisons}")
    print(f"ranking (best first): {', '.join(nma.ranking())}")
    print("P-score per method (higher is better):")
    for m, p in sorted(zip(nma.treatments, nma.pscore), key=lambda x: -x[1]):
        print(f" {m:10s} {p:.2f}")
    print(f"I-squared: {nma.i2:.2f}")
    print(f"inconsistency Q: {nma.q_inconsistency:.1f} on {nma.df_inconsistency:.0f} df, p = {nma.pval_inconsistency:.1e}")
else:
    print("R with netmeta not available; skipping the network meta-analysis.")
    print("Provision it with envs/heterogeneity.yml.")
```

With BatchBench in the pool, harmony no longer ranks first on its own: it ties with liger at the top, fastMNN follows, and combat ranks last. The I-squared is high and the inconsistency Q is significant: the direct and the indirect evidence for the same method pairs do not line up.

## Same data, two pipelines: the human pancreas

The pooled decomposition above mixes pipeline differences with dataset differences, because the first three benchmarks barely share datasets and Tyler's in silico runs share none of them. The one confirmed overlap removes the data confound: Tran's Dataset 4 is built from Muraro, Segerstolpe, Baron, Wang and Xin, the same five studies scIB's pancreas task uses. Running both pipelines on this shared block reads off the pipeline effect with no data variability in the way.


```{python}
from beam.datasets import load_pancreas_contrast

pc = load_pancreas_contrast()
tran_top, scib_top = pc.top_method()
print(f"top method on Tran D4: {tran_top}")
print(f"top method on scIB pancreas: {scib_top}")
print(f"Spearman of mean ranks (5 methods): {pc.spearman():+.2f}")
print()
print("mean rank per method (1 best, 5 worst):")
for i, m in enumerate(pc.methods):
    print(f" {m:10s} Tran D4 {pc.tran_mean_rank[i]:.2f} scIB pancreas {pc.scib_mean_rank[i]:.2f}")
```

Both pipelines agree on the top-ranked method on the shared data, but the rest of the order disagrees more than the pooled Spearman computed earlier would predict. LIGER ranks second on Tran D4 (it took first on kBET and LISI in Tran's per-metric ranks) and tied last on scIB pancreas, where the same metrics put it near the bottom. The disagreement is not in the data; it is in the pipeline.

```{python}
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3.2))
xs = np.arange(len(pc.methods))
ax.bar(xs - 0.18, pc.tran_mean_rank, width=0.34, label="Tran D4")
ax.bar(xs + 0.18, pc.scib_mean_rank, width=0.34, label="scIB pancreas")
ax.set_xticks(xs)
ax.set_xticklabels(pc.methods)
ax.set_ylabel("mean rank across 4 metrics (lower ranks first)")
ax.set_xlabel("method")
ax.set_title("Same pancreas data, two pipelines")
ax.invert_yaxis()
ax.legend(loc="upper right")
fig.tight_layout()
plt.show()
```

Within each benchmark, the smallest single-weight change that swaps the top-ranked method measures how much the recommendation depends on the analyst's weighting. beam uses the Triantaphyllou-Sanchez closed-form weight delta on the mean-rank matrix; [SAW](../../docs/explanations/aggregation-methods.md) with equal weights is the baseline so the perturbation is interpretable as "how much weight off one metric is needed to flip the top".

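
The idea behind a closed-form weight delta can be sketched on a toy weighted sum. This is a simplified illustration, not beam's `smallest_weight_perturbation`: it assumes a benefit matrix (higher is better), changes one weight without renormalizing the others, and uses made-up numbers. `smallest_flip` is a hypothetical helper.

```python
def smallest_flip(matrix, weights):
    """Smallest single-weight change that ties the top row with a challenger.

    matrix[i][k] is a benefit score (higher is better) of method i on metric k.
    Returns (abs_delta, delta, challenger_index, metric_index), or None when no
    feasible change exists. Assumption: the other weights are left untouched.
    """
    totals = [sum(w * a for w, a in zip(weights, row)) for row in matrix]
    top = max(range(len(matrix)), key=totals.__getitem__)
    best = None
    for j, row in enumerate(matrix):
        if j == top:
            continue
        gap = totals[top] - totals[j]       # lead of the top method
        for k in range(len(weights)):
            diff = row[k] - matrix[top][k]  # challenger's edge on metric k
            if diff == 0:
                continue                    # this weight cannot close the gap
            delta = gap / diff              # solves P_top + d*a_top,k = P_j + d*a_j,k
            if not 0.0 <= weights[k] + delta <= 1.0:
                continue                    # weight would leave [0, 1]
            if best is None or abs(delta) < best[0]:
                best = (abs(delta), delta, j, k)
    return best


# Hypothetical scores: three methods on two metrics, equal weights.
toy = [[0.9, 0.5], [0.6, 0.9], [0.2, 0.4]]
result = smallest_flip(toy, [0.5, 0.5])
print(result)
```

A small absolute delta means the top method holds its place only for a narrow band of weights; a delta that would push the weight outside [0, 1] means no single-weight change can unseat it.
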

```{python}
from beam.mcda import smallest_weight_perturbation

polarity = ("lower_is_better",) * 4
print("smallest single-weight change that flips the top method:")
print(f" {'benchmark':12s} {'top':10s} {'flipped to':10s} {'metric':6s} abs delta")
for bench in BENCH:
    _, metrics_b, matrix = ib.method_metric_matrix(bench)
    rep = smallest_weight_perturbation(matrix, polarity=polarity, weights="equal", method="saw")
    top_idx = int(rep.base.ranks.argmin())
    p = rep.top_rank_perturbation
    if p is None:
        print(f" {bench:12s} {CANON[top_idx]:10s} {'-':10s} {'-':6s} none found")
    else:
        challenger = CANON[p.lower]
        print(
            f" {bench:12s} {CANON[top_idx]:10s} {challenger:10s} "
            f"{metrics_b[p.criterion]:6s} {p.absolute_delta:.3f}"
        )
```

OpenProblems is the most fragile of the three: a 0.11 absolute change to the ASW weight (against the 0.25 equal-weight baseline) moves the top spot from fastMNN to harmony. Tran's harmony lead needs a 0.24 ASW shift before LIGER takes the top. scIB's fastMNN lead is the most stable: a 0.90 shift on LISI is needed before LIGER takes the top, which would push that weight outside the feasible [0, 1] range and so is practically out of reach. Each benchmark sits at a different point on the fragility axis, even though the analysis rule is the same.

## Specification curve per benchmark

The fragility delta is one number per benchmark. The specification curve lists the whole multiverse of analyst choices: every weighting by every aggregation on the mean-rank matrix. [`rank_sensitivity`](../../docs/explanations/rank-sensitivity.md) runs the grid and [`specification_curve`](../../docs/explanations/rank-sensitivity.md#the-specification-curve) reports how often the top method holds. The benchmarks differ here too: on Tran and scIB the top method holds across all 20 combinations, while on OpenProblems it does not.

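
As a rough picture of what such a grid does, the sketch below crosses a few weightings with two aggregations on a made-up standing matrix and counts which method comes out on top. The grid, the method names and the numbers are all hypothetical; beam's `rank_sensitivity` uses its own set of weightings and aggregations.

```python
from collections import Counter
from math import prod

# Hypothetical standing matrix: rows are methods, columns are
# ARI, ASW, kBET, LISI; higher is better.
methods = ["A", "B", "C"]
standing = [
    [0.9, 0.8, 0.3, 0.4],
    [0.6, 0.6, 0.7, 0.7],
    [0.2, 0.3, 0.9, 0.8],
]

# Assumed analyst choices, for illustration only.
weightings = {
    "equal": [0.25, 0.25, 0.25, 0.25],
    "bio_heavy": [0.4, 0.4, 0.1, 0.1],
    "batch_heavy": [0.1, 0.1, 0.4, 0.4],
}
aggregations = {
    "weighted_sum": lambda row, w: sum(wi * x for wi, x in zip(w, row)),
    "weighted_product": lambda row, w: prod(x ** wi for wi, x in zip(w, row)),
}

# One specification = one weighting combined with one aggregation.
tops = Counter()
for w in weightings.values():
    for agg in aggregations.values():
        scores = [agg(row, w) for row in standing]
        tops[methods[scores.index(max(scores))]] += 1

n_specs = len(weightings) * len(aggregations)
leader, count = tops.most_common(1)[0]
print(f"{leader} is first in {count}/{n_specs} specifications; {len(tops)} method(s) reach the top")
```

When several methods reach the top across the grid, as in this toy case, the recommendation depends on the analyst's choices rather than on the scores alone.
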

```{python}
from beam.mcda import rank_sensitivity, specification_curve

print("specification curve per benchmark (weighting x aggregation on the mean-rank matrix):")
for bench in BENCH:
    _, _, matrix = ib.method_metric_matrix(bench)
    curve = specification_curve(rank_sensitivity(matrix, polarity, tool_names=CANON))
    dom = curve.tool_names[curve.most_frequent_top_tool]
    print(f" {bench:12s} {dom:10s} first in {curve.most_frequent_top_fraction * 100:3.0f}% of "
          f"{curve.n_specifications} combinations, {curve.n_distinct_top_tools} method(s) reach the top")
```

On OpenProblems the top method does not hold across the choices, so the per-method split shows which methods carry that variance and whether it is the weighting or the aggregation. The four metrics (ARI, ASW, kBET, LISI) are the columns being combined, the criteria of the decision, not a factor in the split. The split has no factor for the dataset either. Each cell of this matrix is already a mean rank over the benchmark's datasets, so the datasets are collapsed and the bars carry only the weighting, the aggregation and their interaction.

```{python}
op_matrix = ib.method_metric_matrix("OpenProblems")[2]
op_rs = rank_sensitivity(op_matrix, polarity, tool_names=CANON)
plot.rank_sensitivity_by_tool(op_rs, title="OpenProblems: what moves each method's rank")
```

## Blind analysis

A [blind analysis](../../docs/explanations/analysis-blinding.md) fixes the weighting and the aggregation before the method names are revealed, so the choices cannot be tuned toward a method known to lead. These benchmarks rank on the mean-rank scale rather than through [metric cards](../../docs/explanations/cards-and-pipeline.qmd), so the demonstration uses the low-level [`beam.mcda.run`](../../docs/reference/run.qmd) with explicit polarity. `beam.blind` relabels and shuffles the rows; the seal maps the labels back. The order is the same once unblinded.

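
The relabel, shuffle and seal steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the implementation of `beam.blind`; `blind_rows`, `ranking`, the neutral labels and the numbers are hypothetical.

```python
import random


def blind_rows(names, rows, seed=0):
    """Shuffle the rows and replace the names with neutral labels.

    Returns the labels, the shuffled rows, and a seal that maps each label
    back to the true name.
    """
    rng = random.Random(seed)
    order = list(range(len(names)))
    rng.shuffle(order)
    labels = [f"tool_{i + 1}" for i in range(len(names))]
    blinded_rows = [rows[i] for i in order]
    seal = {label: names[i] for label, i in zip(labels, order)}
    return labels, blinded_rows, seal


def ranking(labels, rows):
    """Order the labels by equal-weight mean rank, best (lowest) first."""
    return [lab for _, lab in sorted((sum(r) / len(r), lab) for r, lab in zip(rows, labels))]


names = ["combat", "harmony", "fastmnn"]
mean_ranks = [[3.0, 2.5], [1.0, 1.5], [2.0, 2.0]]  # made-up values, lower is better

labels, blinded, seal = blind_rows(names, mean_ranks, seed=0)
order_named = ranking(names, mean_ranks)
order_unblinded = [seal[lab] for lab in ranking(labels, blinded)]
print("same order after unblinding:", order_named == order_unblinded)
```

Because the analysis only ever sees the neutral labels, any choice made before unsealing cannot favour a method by name.
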

```{python}
from beam import blind, Scores
from beam.mcda import run as mcda_run

_, metrics_b, matrix = ib.method_metric_matrix("Tran")
scores = Scores(
    values=matrix,
    tool_names=tuple(CANON),
    metric_ids=tuple(metrics_b),
    dataset_names=None,
    layout="wide",
)
blinded, seal = blind(scores, seed=0)
named = mcda_run(scores.values, polarity, weights="equal", method="saw")
blind_res = mcda_run(blinded.values, polarity, weights="equal", method="saw")
order_named = [scores.tool_names[i] for i in named.ranks.argsort()]
order_unblinded = [seal.true_name(blinded.tool_names[i]) for i in blind_res.ranks.argsort()]
print("blinded labels:", blinded.tool_names)
print("order, named: ", order_named)
print("order, unblinded: ", order_unblinded)
print("same order after unblinding:", order_named == order_unblinded)
```

## A sixth source: dataset discrimination

Every analysis above compares method orders, so it needs shared methods. Shen, He and Guan (2026, [10.1371/journal.pcbi.1014008](https://doi.org/10.1371/journal.pcbi.1014008)) score ten methods, five semi-supervised (scANVI, scGEN, STACAS, scDREAMER, ItClust) and five unsupervised (Seurat, scVI, Harmony, scanorama, scCRAFT), on six datasets, each under annotation scenarios that degrade label supervision from full to none. Only harmony and scanorama overlap the classical five, too few to rank-harmonize.

A property defined per dataset is still available: how much each dataset separates the methods it scores. Any benchmark can report it for every dataset, whatever methods it ran. See the data README for provenance and the [dataset discrimination explanation](../../docs/explanations/dataset-concordance-and-discrimination.md#dataset-discrimination). [`dataset_discrimination`](../../docs/reference/dataset_discrimination.qmd) computes it from the scores.
beam treats each (dataset, scenario) pair as one unit and reports two values per unit: the spread of the pooled method scores, the effect size, and Kendall's W over the unit's method-by-metric matrix, the consistency of the order across metrics. It complements [`dataset_concordance`](../../docs/explanations/dataset-concordance-and-discrimination.md): concordance asks whether datasets agree on the order, discrimination asks whether a dataset separates the methods at all.

```{python}
from beam.datasets import load_openproblems, load_semisupervised_integration
from beam.mcda import dataset_discrimination, difficulty_concordance

shen = load_semisupervised_integration()
disc = dataset_discrimination(shen.scores, shen.polarity, dataset_ids=shen.unit_names)
print(f"{len(shen.unit_names)} units (6 datasets x annotation scenarios), {len(shen.method_names)} methods")
print(f"mean discrimination (spread): {disc.mean_spread:.3f}, mean concordance (W): {disc.mean_kendall_w:.2f}")
print(f"most discriminating: {disc.most_discriminating}")
print(f"least discriminating: {disc.least_discriminating}")
display(plot.dataset_discrimination(disc, top=15))
```

The strongest discriminators are the corrupted-label scenarios (`randomly_wrong_70` and the heavily missing ones): wrong labels separate the methods that use them from the unsupervised ones. The weakest are the mild perturbations, where the methods score alike and the unit cannot rank them. On the high-spread units the colour is dark, so the metrics agree on the order there.

### Difficulty concordance: classical versus deep learning

Splitting the methods into two categories, classical and deep learning, separates two kinds of hardness. [`difficulty_concordance`](../../docs/reference/difficulty_concordance.qmd) correlates the per-category difficulty profiles across the datasets. A high concordance places the hardness in the data; a low one places it in the kind of method.

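
To make the idea concrete, the sketch below builds one difficulty profile per category and correlates the two. It assumes that difficulty is one minus the category's mean score on a dataset and that the correlation is a Spearman without ties; both are assumptions for illustration, and `difficulty_concordance` may define them differently. The method names and scores are made up.

```python
def ranks(values):
    """Ranks starting at 1, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman(a, b):
    """Spearman's rho between two tie-free profiles."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# toy_scores[method][dataset], higher is better (hypothetical numbers).
toy_scores = {
    "classical_1": [0.9, 0.7, 0.5, 0.3],
    "classical_2": [0.8, 0.6, 0.6, 0.2],
    "dl_1": [0.6, 0.8, 0.4, 0.3],
    "dl_2": [0.7, 0.8, 0.5, 0.1],
}
toy_family = {"classical_1": "classical", "classical_2": "classical", "dl_1": "DL", "dl_2": "DL"}


def difficulty_profile(fam):
    """Per-dataset difficulty of one category: 1 - mean score of its methods."""
    members = [m for m in toy_scores if toy_family[m] == fam]
    n_datasets = len(next(iter(toy_scores.values())))
    return [1.0 - sum(toy_scores[m][d] for m in members) / len(members) for d in range(n_datasets)]


classical_profile = difficulty_profile("classical")
dl_profile = difficulty_profile("DL")
toy_concordance = spearman(classical_profile, dl_profile)
print(f"concordance: {toy_concordance:+.2f}")
```

In this toy case the two categories disagree only on which of the first two datasets is easier, so the concordance stays high; a low value would arise when the datasets one category finds hard are easy for the other.
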

```{python}
dl = {"scVI", "scANVI", "scGEN", "scDREAMER", "scCRAFT", "ItClust"}
families = ["DL" if m in dl else "classical" for m in shen.method_names]
shen_conc = difficulty_concordance(shen.scores, shen.polarity, families, dataset_ids=shen.unit_names)
print(f"Shen (annotation-scenario hardness): concordance {shen_conc.concordance[0, 1]:+.2f}")
print(f" families disagree most on: {shen_conc.most_divergent_dataset}")

op = load_openproblems("batch_integration")
op_metrics = [m for m in op.metric_ids if m != "hvg_overlap"]
op_tensor = op.tensor(tuple(op_metrics))
op_pol = tuple("higher_is_better" for _ in op_metrics)
op_dl = {"scvi", "scanvi", "scalex", "scgpt_finetuned", "scgpt_zeroshot", "geneformer", "scprint", "uce", "scimilarity"}
op_families = ["DL" if m in op_dl else "classical" for m in op.method_names]
op_conc = difficulty_concordance(op_tensor, op_pol, op_families, dataset_ids=op.dataset_names)
print(f"OpenProblems (intrinsic-complexity hardness): concordance {op_conc.concordance[0, 1]:+.2f}")
display(plot.difficulty_concordance(shen_conc, title="Shen 2026: do DL and classical find the same units hard?"))
```

The two benchmarks differ because their datasets are hard for different reasons. On OpenProblems the hardness is dataset complexity, and the two categories agree (Spearman about 0.89): a dataset hard for classical methods is hard for the deep-learning ones. On Shen the hardness is annotation quality, and they agree weakly (about 0.32): degraded labels affect only the methods that use them. On the hardest units the deep-learning methods score below the classical ones; on the easy units they score above. Dataset complexity is shared; a benchmark that stresses what one category depends on is hard only for that category, which a single cross-method ranking would not show.

### A controlled contrast inside scIB

The OpenProblems and Shen contrasts compare the two categories across different method sets.
The scIB source harmonized at the top of this vignette removes that confound: it scores deep-learning methods (scVI, scANVI, scGen, DESC, SAUCIE, trVAE) on the same five datasets and four metrics as the classical five already used, so the only thing that changes between classical and deep learning is the method. `load_scib_integration_families` returns both blocks; the deep-learning scores are extracted from the same `scib-reproducibility` table by `reduce_scib_dl.py`.

```{python}
from beam.datasets import load_scib_integration_families

fam = load_scib_integration_families()
scib_conc = difficulty_concordance(fam.scores, fam.polarity, fam.families, dataset_ids=fam.dataset_names)
print(f"scIB (intrinsic complexity, identical data for both families): concordance {scib_conc.concordance[0, 1]:+.2f}")
shared = sorted({m.lower() for m in fam.method_names} & {m.lower() for m in shen.method_names})
print("methods now shared with Shen:", shared)
display(plot.difficulty_concordance(scib_conc, title="scIB: do DL and classical find the same datasets hard?"))
```

On scIB the two categories agree (about 0.90), as on OpenProblems and unlike Shen, even though the datasets and metrics are held fixed and only the method changes. The low Shen concordance is a property of annotation degradation, not of being a deep-learning method. Extracting these methods also lifts the deep-learning overlap: scVI, scANVI and scGen now join harmony and scanorama as methods shared with Shen, where before no deep-learning method was carried from scIB at all.

## Limits

The published-rank chart and the within-five-methods Spearman comparison are computed on the first three benchmarks because Tyler does not publish an overall ranking (its paper is a methodological critique, not a recommendation). The variance decomposition pools the four scIB-family sources and, in its second fit, BatchBench as well.
Numbers stay indicative, not precise: a Spearman over five methods is coarse, four or five benchmarks is still few for a sharp method-by-benchmark variance estimate, and the reported rankings are reconstructed (Tran's final rank directly, scIB's 0.6/0.4 overall and OpenProblems' mean score recomputed). For the pooled comparison the benchmarks also use mostly different datasets, so part of that disagreement is genuine data variability, not the benchmarker; the human pancreas section removes that confound on the one block where it can be removed. With those caveats, standardizing the decision-making with beam removes part of the apparent disagreement between these benchmarks, but it does not close it: the network meta-analysis finds a significant residual inconsistency even under the shared rule.