OpenProblems: MCDA and heterogeneity on a real single-cell benchmark

Author

Izaskun Mallona

Published

July 9, 2026

Provenance of the bundled tables

beam does not ship the raw single-cell data. It ships small derived long-format tables (src/beam/data/openproblems_batch_integration.csv, openproblems_svg.csv, openproblems_svg_features.csv), reduced once from the CC-BY-4.0 results JSON in the openproblems-bio/website repository at the pinned commit 76ce7f288da591b1b19c32cbfe8ce50bc3706ece. The control and baseline methods were dropped at vendoring time. Provenance and license are recorded in src/beam/data/README.md. Cite the OpenProblems consortium (Nature Biotechnology 2025, DOI 10.1038/s41587-025-02694-w).

SHA=76ce7f288da591b1b19c32cbfe8ce50bc3706ece
BASE=https://raw.githubusercontent.com/openproblems-bio/website/$SHA/results
for task in batch_integration spatially_variable_genes; do
  for f in results metric_info dataset_info method_info; do
    curl -sf -o "$task.$f.json" "$BASE/$task/data/$f.json"
  done
done

import csv, json

# beam loads these exact file names (see beam.datasets._OP_TASKS).
OUTPUT = {
    "batch_integration": "openproblems_batch_integration.csv",
    "spatially_variable_genes": "openproblems_svg.csv",
}

def reduce_task(task):
    # Context managers close the JSON files once they are parsed.
    with open(f"{task}.results.json") as fh:
        results = json.load(fh)
    with open(f"{task}.method_info.json") as fh:
        method_info = json.load(fh)
    baselines = {m["method_id"] for m in method_info if m["is_baseline"]}
    rows = []
    for r in results:
        if r["method_id"] in baselines:        # drop controls and baselines
            continue
        for metric_id, value in r["metric_values"].items():
            score = "" if value in ("NA", None) else value   # the source "NA" becomes empty
            rows.append((r["method_id"], r["dataset_id"], metric_id, score))
    rows.sort()
    with open(OUTPUT[task], "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["method_id", "dataset_id", "metric_id", "score"])
        w.writerows(rows)

# The spatial dataset features are parsed from the dataset id <source>/<technology>/<name>.
def svg_features():
    with open("spatially_variable_genes.results.json") as fh:
        datasets = sorted({r["dataset_id"] for r in json.load(fh)})
    with open("openproblems_svg_features.csv", "w", newline="") as fh:
        w = csv.writer(fh)
        w.writerow(["dataset_id", "technology", "organism", "condition"])
        for d in datasets:
            _, technology, name = d.split("/")
            organism = next((o for o in ("human", "mouse", "drosophila") if name.startswith(o)), "other")
            condition = "cancer" if ("cancer" in name or "melanoma" in name) else "noncancer"
            w.writerow([d, technology, organism, condition])

The metric directions and the reference DOIs on the metric cards come from each task’s metric_info.json (every metric here has maximize set to true, so higher is better for all of them).

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display

import beam
from beam.datasets import load_openproblems, load_openproblems_svg_features

Batch integration: many metrics, the MCDA layer

The batch integration task has 19 methods, 6 datasets and 13 scIB metrics, all read from the registry. Coverage is uneven: the hvg_overlap column and some method-by-dataset cells are sparse, surfaced as NaN. We drop the sparse hvg_overlap metric and keep the methods that are observed on at least one dataset for every remaining metric, so the cross-dataset pooling is defined.

op = load_openproblems("batch_integration")
metrics = [m for m in op.metric_ids if m != "hvg_overlap"]
tensor = op.tensor(tuple(metrics))
keep = (~np.isnan(tensor).all(axis=1)).all(axis=1)
methods = [m for m, k in zip(op.method_names, keep) if k]
tensor = tensor[keep]
print(f"{len(methods)} methods, {len(metrics)} metrics, {len(op.dataset_names)} datasets")

14 methods, 12 metrics, 6 datasets
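The keep mask is compact, so a plain-Python restatement on a toy tensor may help. This is a sketch only: it assumes the (method, dataset, metric) axis order that the indexing above implies, and the helper name and the toy values are made up for illustration.

```python
import math

NAN = float("nan")

def methods_with_full_metric_coverage(tensor):
    """One bool per method: True when every metric is observed on at least one dataset.

    tensor[m][d][k] is the score of method m on dataset d for metric k.
    """
    keep = []
    for per_method in tensor:
        n_metrics = len(per_method[0])
        covered = [
            any(not math.isnan(per_dataset[k]) for per_dataset in per_method)
            for k in range(n_metrics)
        ]
        keep.append(all(covered))
    return keep

toy = [
    # method 0: both metrics are seen on at least one dataset
    [[0.8, NAN], [NAN, 0.6]],
    # method 1: metric 1 is never observed, so its pooled value is undefined
    [[0.7, NAN], [0.9, NAN]],
]
print(methods_with_full_metric_coverage(toy))  # [True, False]
```

A method is dropped as soon as one metric has no observed dataset at all; single missing cells are tolerated because the pooling uses a NaN-aware mean.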

card_data_consistency checks the scores against what the 12 cards declare, before any normalization. The scIB metrics are scaled to [0, 1] on the platform, so the audit is clean here.

from beam.mcda import card_data_consistency, registry_context

ctx = registry_context(metrics, "saw")
pooled_native = np.nanmean(tensor, axis=1)
audit = card_data_consistency(
    pooled_native, ctx.polarity, ctx.bounds,
    baselines=ctx.baselines, targets=ctx.targets, noise_floors=ctx.noise_floors,
    metric_ids=metrics,
)
print("scores consistent with the cards:", audit.ok)
for finding in audit.findings:
    print(" ", finding.severity, finding.message)

scores consistent with the cards: True

We use equal weights and TOPSIS.

scores = beam.Scores(
    values=tensor,
    tool_names=tuple(methods),
    metric_ids=tuple(metrics),
    dataset_names=op.dataset_names,
    layout="long",
)
result = beam.rank(scores, weights="equal", method="topsis")
beam_order = np.argsort(result.result.ranks)
print("beam (equal weights, TOPSIS), top 5:")
for i in beam_order[:5]:
    print(f"  {result.tool_names[i]}")

beam (equal weights, TOPSIS), top 5:
  combat
  harmonypy
  harmony
  scvi
  scalex

OpenProblems builds its own leaderboard from the mean of the per-metric scaled scores. Pooling the raw scores the same way gives a different order from the TOPSIS ranking above, so the top method depends on the aggregation rule, not on the data alone.

mean_score = np.nanmean(np.nanmean(tensor, axis=1), axis=1)
mean_order = np.argsort(-mean_score)
print("mean of scores (OpenProblems-style), top 5:")
for i in mean_order[:5]:
    print(f"  {methods[i]}")
print()
print("beam top method:", result.tool_names[beam_order[0]])
print("mean-of-scores top method:", methods[mean_order[0]])

mean of scores (OpenProblems-style), top 5:
  scanvi
  combat
  scvi
  harmonypy
  harmony

beam top method: combat
mean-of-scores top method: scanvi

The two rules disagree on the top-ranked method: a benchmark recommendation is partly a decision about how the criteria are combined.
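To make the two rules concrete, here is a minimal sketch of textbook TOPSIS next to a plain mean. It is not beam's implementation: it uses vector normalization, whereas beam takes the normalization from the cards, and the function name and the toy scores are assumptions for illustration. All columns are treated as higher is better.

```python
import math

def topsis_closeness(matrix, weights):
    """Closeness to the ideal point for each row; every column is higher-is-better."""
    n_cols = len(matrix[0])
    # Vector-normalize each column (guarding an all-zero column), then weight it.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_cols)] for row in matrix]
    ideal = [max(row[j] for row in weighted) for j in range(n_cols)]
    anti = [min(row[j] for row in weighted) for j in range(n_cols)]
    closeness = []
    for row in weighted:
        d_best = math.dist(row, ideal)   # distance to the best value on every metric
        d_worst = math.dist(row, anti)   # distance to the worst value on every metric
        total = d_best + d_worst
        closeness.append(d_worst / total if total else 0.5)
    return closeness

toy = {"a": [0.9, 0.2, 0.8], "b": [0.6, 0.7, 0.6], "c": [0.3, 0.9, 0.4]}  # made-up scores
names = list(toy)
closeness = topsis_closeness([toy[n] for n in names], [1 / 3] * 3)
mean_score = [sum(toy[n]) / 3 for n in names]
print("TOPSIS order:", sorted(names, key=lambda n: -closeness[names.index(n)]))
print("mean order:  ", sorted(names, key=lambda n: -mean_score[names.index(n)]))
```

The mean lets a high score on one metric offset a low score on another at a fixed rate; TOPSIS measures Euclidean distance to the best and worst observed profiles, so the two can order the same matrix differently.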

from beam import plot

# The outlined bar is the method the platform's mean-of-scores rule ranks first.
plot.ranking(result, ground_truth_tool=methods[mean_order[0]])

Funky heatmap with rank robustness

beam adds three panels to the funky heatmap: a worth panel with the ARI marginal mean and its 95 percent mixed-effects interval (overlapping intervals mean two methods are not separable), the leave-one-dataset-out rank span, and the SMAA rank-acceptability bar. The row order and circle sizes depend on the normalization, which beam takes from the cards rather than min-max.

from beam.heterogeneity import mixed_effects_from_matrix, r_available
from beam.reporting import funky_heatmap_from_run

bio = {
    "ari", "nmi", "asw_label", "isolated_label_f1", "isolated_label_asw",
    "cell_cycle_conservation", "hvg_overlap", "clisi",
}
groups = ["bio" if m in bio else "batch" for m in result.metric_ids]

worth = worth_ci = None
if r_available():
    ari_slice = op.tensor(("ari",))[keep, :, 0]
    me = mixed_effects_from_matrix(ari_slice, methods, op.dataset_names)
    by_name = {name: i for i, name in enumerate(me.method_names)}
    worth = np.array(
        [me.method_effects[by_name[m]] if m in by_name else np.nan for m in result.tool_names]
    )
    worth_ci = np.array(
        [1.96 * me.method_effect_se[by_name[m]] if m in by_name else np.nan for m in result.tool_names]
    )

funky = funky_heatmap_from_run(
    result,
    metric_groups=groups,
    worth=worth,
    worth_ci=worth_ci,
    worth_label="ARI marginal mean\n(mixed-effects, 95% CI)",
    show_aggregation_consensus=False,
    title="OpenProblems batch integration: scores and rank robustness",
)
funky

With only six datasets the leave-one-dataset-out spans are wide: the top methods shift by several ranks when a single dataset is dropped, while the bottom methods stay put. The worth panel agrees: the ARI marginal-mean intervals of the top methods overlap, so they are not separable. The SMAA bar puts the top ranks on several methods rather than one.

Mixed-effects variance decomposition on ARI

A mixed-effects model on one metric splits the score variation into a stable method effect and the method-by-dataset interaction. We run it on ARI. The fit needs R’s lme4, so the chunk runs only when it is available.

from beam.heterogeneity import mixed_effects_from_matrix, r_available

ari = op.tensor(("ari",))[:, :, 0]
ari_keep = ~np.isnan(ari).all(axis=1)
ari_methods = [m for m, k in zip(op.method_names, ari_keep) if k]
ari = ari[ari_keep]

if r_available():
    me = mixed_effects_from_matrix(ari, ari_methods, op.dataset_names)
    print(f"dataset shift (ICC): {me.icc_dataset:.2f} of the ARI variance")
    print(f"residual share:      {me.residual_share:.2f}")
    top = np.argsort(-me.method_effects)[:3]
    print("highest ARI marginal means:")
    for i in top:
        print(f"  {me.method_names[i]:14s} {me.method_effects[i]:.3f} +/- {me.method_effect_se[i]:.3f}")
else:
    print("R with lme4 not available; skipping the mixed-effects fit.")
    print("Provision it with envs/heterogeneity.yml.")

dataset shift (ICC): 0.57 of the ARI variance
residual share:      0.43
highest ARI marginal means:
  scanvi         0.679 +/- 0.066
  bbknn          0.661 +/- 0.066
  scvi           0.628 +/- 0.066

Metric validity

The scIB score weights two metric groups, biological conservation at 0.6 and batch correction at 0.4, on the premise that they measure different things. beam.mcda.metric_validity checks that premise against the scores (Campbell and Fiske 1959). It treats each method-by-dataset cell as one observation, orients every metric to higher-is-better from the cards, and correlates the metrics with Spearman rank correlation. Metrics in the same group should agree (convergent); metrics in different groups should agree less (discriminant). See the validity explanation.

from beam.cards import polarities_for
from beam.mcda import metric_validity

validity = metric_validity(
    tensor,
    polarities_for(metrics),
    groups,
    metric_ids=list(metrics),
)
print("within-group agreement (convergent):")
for g, r in validity.convergent_by_group.items():
    print(f"  {g:6s} {r:.2f}")
print(f"between-group agreement (discriminant): {validity.mean_discriminant:.2f}")
print(f"convergent {validity.mean_convergent:.2f} > discriminant {validity.mean_discriminant:.2f}: {validity.discriminant_ok}")
print()
print("metrics that lean toward the other group:")
for name, group, within, between, nearest in validity.crossloading_metrics:
    print(f"  {name:20s} {group:6s} within {within:.2f} vs between {between:.2f}, leans {nearest}")

within-group agreement (convergent):
  bio    0.45
  batch  0.24
between-group agreement (discriminant): 0.30
convergent 0.38 > discriminant 0.30: True

metrics that lean toward the other group:
  asw_batch            batch  within 0.26 vs between 0.30, leans bio
  asw_label            bio    within 0.36 vs between 0.38, leans batch
  graph_connectivity   batch  within 0.34 vs between 0.59, leans bio
  kbet                 batch  within 0.18 vs between 0.31, leans bio

The grouping holds, but barely. Metrics in the same group agree a little more (0.38 on average) than metrics in different groups (0.30). The biological metrics agree more with each other (0.45) than the batch metrics do (0.24). Three batch metrics, graph_connectivity most of all, agree more with the biological group than with their own. So the two groups overlap on this data, and the 0.6 / 0.4 weighting treats them as more separate than the scores show.
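The convergent and discriminant averages can be sketched in a few lines of standard-library Python. This illustrates the idea described above and is not beam.mcda.metric_validity itself: the function names and the toy columns are made up, every metric is assumed to be already oriented to higher is better, and missing cells are not handled.

```python
from itertools import combinations
from statistics import fmean

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def convergent_discriminant(columns, groups):
    """Mean Spearman correlation over same-group and over different-group metric pairs."""
    within, between = [], []
    for a, b in combinations(columns, 2):
        rho = spearman(columns[a], columns[b])
        (within if groups[a] == groups[b] else between).append(rho)
    return fmean(within), fmean(between)

# One value per method-by-dataset cell (made-up numbers).
columns = {
    "bio_1": [1, 2, 3, 4, 5, 6], "bio_2": [1, 2, 3, 4, 6, 5],
    "batch_1": [3, 1, 2, 6, 4, 5], "batch_2": [3, 1, 2, 6, 5, 4],
}
groups = {"bio_1": "bio", "bio_2": "bio", "batch_1": "batch", "batch_2": "batch"}
within, between = convergent_discriminant(columns, groups)
print(f"convergent {within:.2f}, discriminant {between:.2f}")
```

A grouping is supported when the convergent mean clearly exceeds the discriminant mean; on the real scores above the margin is only 0.08.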

Metric reliability

Validity asks whether the bio/batch split is the right axis. Reliability asks whether each side of that split reads as one scale. beam.mcda.metric_reliability reports standardized Cronbach’s alpha per group from the same oriented Spearman correlations (Cronbach 1951). A group above the conventional 0.7 cutoff reads as one reliable scale; below it the group is a looser collection. See the reliability explanation.

from beam.mcda import metric_reliability

reliability = metric_reliability(
    tensor,
    polarities_for(metrics),
    groups,
    metric_ids=list(metrics),
)
for g, alpha in reliability.alpha_by_group.items():
    k = reliability.k_by_group[g]
    r_bar = reliability.mean_inter_item_by_group[g]
    print(f"  {g:6s} alpha {alpha:.2f} over {k} metrics, mean inter-item r {r_bar:.2f}")
print()
print("groups below the 0.7 cutoff:", [g for g, _ in reliability.low_reliability_groups])
print("alpha if a batch metric is dropped:")
batch_alpha = reliability.alpha_by_group["batch"]
for name, group, alpha_without in reliability.alpha_if_dropped:
    if group == "batch":
        gain = "raises" if alpha_without > batch_alpha else "lowers"
        print(f"  drop {name:18s} alpha -> {alpha_without:.2f} ({gain})")

  bio    alpha 0.85 over 7 metrics, mean inter-item r 0.45
  batch  alpha 0.62 over 5 metrics, mean inter-item r 0.24

groups below the 0.7 cutoff: ['batch']
alpha if a batch metric is dropped:
  drop asw_batch          alpha -> 0.55 (lowers)
  drop graph_connectivity alpha -> 0.46 (lowers)
  drop ilisi              alpha -> 0.47 (lowers)
  drop kbet               alpha -> 0.62 (raises)
  drop pcr                alpha -> 0.67 (raises)

The biological group holds together (alpha 0.85 over seven metrics); the batch group falls short of the cutoff (alpha 0.62 over five). Dropping pcr is the only batch removal that clearly raises the batch alpha (to 0.67); dropping kbet leaves it essentially unchanged at 0.62. So pcr is the batch metric least consistent with the rest of its group. This matches the validity result: the bio/batch split holds only weakly, with bio a more coherent scale than batch.
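Standardized alpha depends only on the number of items k and the mean inter-item correlation, alpha = k r / (1 + (k - 1) r). The sketch below plugs in the k and the rounded r printed above; the function name is illustrative and this is not beam's code.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha for k items with mean inter-item correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# k and the two-decimal mean inter-item r are copied from the printed output.
for group, k, mean_r in [("bio", 7, 0.45), ("batch", 5, 0.24)]:
    print(f"{group:6s} alpha ~ {standardized_alpha(k, mean_r):.2f}")
```

This prints 0.85 for bio and 0.61 for batch. The batch value differs from the reported 0.62 in the second decimal, presumably because the printed r is rounded before it is plugged in.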

Metric dimensionality

A high alpha can mean one factor measured consistently, or several factors in a group long enough to average out to a high alpha. beam.mcda.metric_dimensionality separates the two by counting the factors in each group, with parallel analysis on the same oriented Spearman correlations (Horn 1965, Glorfeld 1995). See the dimensionality explanation.

from beam.mcda import metric_dimensionality

dimensionality = metric_dimensionality(
    tensor,
    polarities_for(metrics),
    groups,
    metric_ids=list(metrics),
)
for g in dimensionality.k_by_group:
    k = dimensionality.k_by_group[g]
    pc1 = dimensionality.pc1_explained_by_group[g]
    n_factors = dimensionality.parallel_components_by_group[g]
    print(f"  {g:6s} {n_factors} factor(s) over {k} metrics, first component explains {pc1:.2f}")
print()
print("read as one factor:", list(dimensionality.unidimensional_groups))

  bio    2 factor(s) over 7 metrics, first component explains 0.54
  batch  1 factor(s) over 5 metrics, first component explains 0.42

read as one factor: ['batch']

Dimensionality and reliability point in opposite directions here. The biological group has the higher alpha but carries two factors, so its 0.85 is partly the size of the group rather than one quantity. The batch group has the lower alpha but is one factor, weakly tracked.

Batch-integration rank sensitivity

The OpenProblems leaderboard pools by the mean of the scaled scores; beam can pool any of several ways. rank_sensitivity asks how much that choice matters against the data. It runs every combination of weighting, aggregation and dataset, and splits the rank variance. COMET is left out of the aggregations here: it builds characteristic objects whose count grows fast with the number of criteria, so it is slow on a 12-metric task.

from beam.mcda import rank_sensitivity

rs = rank_sensitivity(
    tensor,
    ctx.polarity,
    methods=["saw", "topsis", "vikor", "promethee_ii"],
    normalization=list(ctx.normalization),
    bounds=list(ctx.bounds),
    baselines=list(ctx.baselines),
    targets=list(ctx.targets),
    missing="worst",
    tool_names=methods,
    dataset_names=op.dataset_names,
)
print(f"{rs.n_combinations} combinations")
print(f"  dataset:      {rs.dataset_share:.3f} of the rank variance")
print(f"  weighting:    {rs.weighting_share:.3f}")
print(f"  aggregation:  {rs.aggregation_share:.3f}")
print(f"  interactions: {rs.interaction_share:.3f}")
print(f"  most influential factor: {rs.most_influential_factor}")

96 combinations
  dataset:      0.508 of the rank variance
  weighting:    0.041
  aggregation:  0.156
  interactions: 0.295
  most influential factor: dataset

The dataset is still the largest single factor, but unlike on M4 it is not the only one that matters: the aggregation choice and the dataset-by-choice interaction carry a share too. With 12 metrics that can disagree, how the metrics are combined changes the order more than it does on the two-metric M4 task.
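One way to read the shares is as a sum-of-squares split over a balanced grid: each factor's share is the variance of its level means, and whatever is left belongs to the interactions. The sketch below does this for one tool's rank on a toy grid. It is an assumed decomposition for illustration, not necessarily the one rank_sensitivity uses, and the function name, the weighting labels and the ranks are made up.

```python
from itertools import product
from statistics import fmean

def rank_variance_shares(grid, factor_names):
    """Split the variance of a balanced grid into main-effect shares plus a remainder.

    grid maps a tuple of factor levels to one tool's rank under that combination.
    """
    values = list(grid.values())
    grand = fmean(values)
    ss_total = sum((v - grand) ** 2 for v in values)
    shares = {}
    for position, name in enumerate(factor_names):
        by_level = {}
        for key, value in grid.items():
            by_level.setdefault(key[position], []).append(value)
        # Between-level sum of squares for this factor.
        ss = sum(len(cell) * (fmean(cell) - grand) ** 2 for cell in by_level.values())
        shares[name] = ss / ss_total
    shares["interactions"] = 1.0 - sum(shares.values())
    return shares

# Toy grid: 3 datasets x 2 weightings x 2 aggregations.
grid = {}
for dataset, weighting, aggregation in product(range(3), ("equal", "entropy"), ("saw", "topsis")):
    rank = 1 + dataset                      # the dataset moves the rank most
    if dataset == 2 and aggregation == "topsis":
        rank += 1                           # the aggregation matters on one dataset only
    grid[(dataset, weighting, aggregation)] = rank

shares = rank_variance_shares(grid, ("dataset", "weighting", "aggregation"))
for name, share in shares.items():
    print(f"{name:13s} {share:.3f}")
```

In the toy grid the aggregation changes the rank on a single dataset, so part of its influence lands in the interaction share rather than in its main effect, the same pattern as the 0.295 interaction share above.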

Splitting the shares one method at a time shows which owe their rank movement to the dataset and which to the weighting or aggregation; the span beside each bar is that method’s best-to-worst rank range.

plot.rank_sensitivity_by_tool(rs)

specification_curve reads the same grid as a list of rankings. Because the aggregation carries a real share here, the choice-only multiverse is less unanimous than on M4: the top method does not hold in every weighting-by-aggregation combination.

from beam.mcda import specification_curve

curve = specification_curve(rs)
dom = curve.tool_names[curve.most_frequent_top_tool]
print(f"choices plus dataset ({curve.n_specifications} combinations): "
      f"{dom} first in {curve.most_frequent_top_fraction * 100:.0f}%, "
      f"{curve.n_distinct_top_tools} methods reach the top")

pooled = specification_curve(
    rank_sensitivity(
        result.matrix, ctx.polarity,
        methods=["saw", "topsis", "vikor", "promethee_ii"],
        normalization=list(ctx.normalization), bounds=list(ctx.bounds),
        baselines=list(ctx.baselines), targets=list(ctx.targets),
        missing="worst", tool_names=methods,
    )
)
pdom = pooled.tool_names[pooled.most_frequent_top_tool]
print(f"choices only ({pooled.n_specifications} combinations): "
      f"{pdom} first in {pooled.most_frequent_top_fraction * 100:.0f}%")

plot.specification_curve(curve)

choices plus dataset (96 combinations): combat first in 35%, 9 methods reach the top
choices only (16 combinations): combat first in 56%

Dataset concordance

Pooling averages over the six batch-integration datasets. dataset_concordance re-ranks the methods within each dataset on its own and correlates every pair of those orderings with Kendall tau-b. A high mean means the pooled order represents the datasets; a low mean means it papers over their disagreement. No replicates are needed and the datasets need not be exchangeable.

conc = result.dataset_concordance
names = conc.dataset_names
print(f"mean agreement across datasets (Kendall tau-b): {conc.mean_pairwise_tau:.2f}")
print("least typical dataset:", names[conc.most_idiosyncratic_dataset])
print("mutually consistent groups:",
      [tuple(names[d] for d in g) for g in conc.concordant_groups])
print("where methods depart most from their own average rank:")
for cell in conc.notable_cells[:5]:
    side = "lower" if cell.deviation > 0 else "higher"
    print(f"  {conc.tool_names[cell.tool]} on {names[cell.dataset]}: "
          f"rank {cell.rank}, {side} than its mean {cell.mean_rank:.1f}")

mean agreement across datasets (Kendall tau-b): 0.49
least typical dataset: cellxgene_census/gtex_v9
mutually consistent groups: [('cellxgene_census/gtex_v9',), ('cellxgene_census/immune_cell_atlas',)]
where methods depart most from their own average rank:
  scimilarity on cellxgene_census/gtex_v9: rank 12, lower than its mean 6.5
  scimilarity on cellxgene_census/immune_cell_atlas: rank 1, higher than its mean 6.5
  scanvi on cellxgene_census/gtex_v9: rank 7, lower than its mean 4.5
  scanvi on cellxgene_census/immune_cell_atlas: rank 2, higher than its mean 4.5
  combat on cellxgene_census/gtex_v9: rank 1, higher than its mean 2.5
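The mean pairwise agreement can be sketched with a small Kendall tau-b in standard-library Python. The per-dataset rankings below are made up and the helper name is illustrative; beam's own computation may treat ties and missing methods differently.

```python
from itertools import combinations
import math

def kendall_tau_b(x, y):
    """Kendall tau-b between two rank vectors, with the usual correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue            # tied in both: drops out of both denominator terms
        if dx == 0:
            ties_x += 1         # tied in x only
        elif dy == 0:
            ties_y += 1         # tied in y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Rank of four methods within each of three datasets (made-up values).
ranks = {"d1": [1, 2, 3, 4], "d2": [1, 2, 4, 3], "d3": [4, 3, 2, 1]}
taus = [kendall_tau_b(ranks[a], ranks[b]) for a, b in combinations(ranks, 2)]
mean_tau = sum(taus) / len(taus)
print(f"mean pairwise tau-b: {mean_tau:.2f}")
```

Here d1 and d2 nearly agree while d3 reverses the order, so the mean is pulled below zero: a pooled ranking over these three datasets would represent none of them well.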

plot.dataset_concordance(result)

plot.dataset_struggle(result)

Blind analysis

A blind analysis fixes the weighting, the aggregation and the metric set before the method names are revealed. beam.blind masks the names and shuffles the rows; beam.unblind restores them. The order does not change, and the seal fingerprint goes into the manifest.

from beam import blind, unblind

blinded, seal = blind(scores, seed=0)
blind_run = beam.rank(blinded, weights="equal", method="topsis", seed=0, sensitivity=False)
restored = unblind(blind_run, seal)
named = beam.rank(scores, weights="equal", method="topsis", seed=0, sensitivity=False)
print("ranking identical after unblinding:",
      dict(zip(named.tool_names, named.result.ranks))
      == dict(zip(restored.tool_names, restored.result.ranks)))
print("top method after unblinding:", restored.top_tool)
print("blinding fingerprint:", blind_run.manifest["blinding"]["seal_sha256"][:12])

ranking identical after unblinding: True
top method after unblinding: combat
blinding fingerprint: 14d2dba2852d
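The mechanics can be sketched as mask, shuffle, fingerprint, restore. This is not how beam.blind is implemented: the function names, the label format and the scores are assumptions made for illustration.

```python
import hashlib
import json
import random

def blind_names(names, seed):
    """Shuffle the rows and swap names for neutral labels; return labels, row order and a seal."""
    rng = random.Random(seed)
    order = list(range(len(names)))
    rng.shuffle(order)
    labels = [f"tool_{i:02d}" for i in range(len(names))]
    key = {labels[pos]: names[src] for pos, src in enumerate(order)}
    # The fingerprint lets a later reader check that the key was not altered.
    fingerprint = hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()
    return labels, order, {"key": key, "sha256": fingerprint}

def unblind_names(labels, seal):
    """Map neutral labels back to the original names."""
    return [seal["key"][label] for label in labels]

names = ["combat", "harmony", "scvi", "scanvi"]
scores = {"combat": 0.71, "harmony": 0.64, "scvi": 0.66, "scanvi": 0.69}  # made-up values

labels, order, seal = blind_names(names, seed=0)
blinded_scores = {label: scores[names[src]] for label, src in zip(labels, order)}
blind_ranking = sorted(blinded_scores, key=blinded_scores.get, reverse=True)
named_ranking = sorted(scores, key=scores.get, reverse=True)
print(unblind_names(blind_ranking, seal) == named_ranking)  # True
```

Because the ranking depends only on the scores, relabeling and shuffling the rows cannot change it; what blinding protects is the analyst's choices, not the arithmetic.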

Pairwise superiority on ARI

pairwise_superiority compares the methods two at a time on ARI across the six datasets, with the ARI noise floor as the equivalence band. It reports how often one method outperforms another: an effect size to read alongside the significance that the mixed-effects intervals give.

from beam.mcda import pairwise_superiority

ari_kept = op.tensor(("ari",))[keep, :, 0]
ari_floor = registry_context(["ari"], "saw").noise_floors[0] or 0.0
sup = pairwise_superiority(ari_kept, "higher_is_better", rope=ari_floor, method_names=methods)
print(f"highest standing: {methods[sup.order[0]]} ({sup.standing[sup.order[0]]:.2f})")
print(f"pairs the sign test cannot separate: {len(sup.equivalent_pairs)} of {len(sup.per_pair)}")

highest standing: scanvi (0.87)
pairs the sign test cannot separate: 73 of 91

With six datasets the comparison is coarse: most pairs do not reach significance, matching the wide mixed-effects intervals and the leave-one-dataset-out spans above.
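The coarseness follows from the arithmetic of a sign test on six paired observations. The sketch below is a textbook two-sided exact sign test with an equivalence band; whether pairwise_superiority uses exactly this variant is an assumption, and the function names are illustrative.

```python
from math import comb

def count_outcomes(a, b, rope):
    """Count datasets where a beats b, or b beats a, by more than the equivalence band."""
    diffs = [x - y for x, y in zip(a, b)]
    wins = sum(d > rope for d in diffs)
    losses = sum(d < -rope for d in diffs)
    return wins, losses

def sign_test_p(wins, losses):
    """Two-sided exact sign test p-value; differences inside the band are dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

for wins in range(6, 2, -1):
    print(f"{wins} wins, {6 - wins} losses: p = {sign_test_p(wins, 6 - wins):.3f}")
```

With six datasets only a clean sweep gets below 0.05 (p = 0.031); five wins out of six already gives p = 0.219, and any difference that falls inside the band reduces the count further.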

pairwise_transitivity reads the same pairwise relation and reports whether one order is consistent with it.

from beam.mcda import pairwise_transitivity

trans = pairwise_transitivity(sup)
print(f"transitive: {trans.is_transitive}; circular triads: {trans.n_circular_triads} of {trans.n_triads}")

transitive: False; circular triads: 1 of 364
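The 364 is the number of three-method subsets of 14 methods, and a circular triad is one where the pairwise relation forms a loop. A minimal sketch of the count, with a made-up four-method relation and an illustrative function name (not beam's code):

```python
from itertools import combinations
from math import comb

def circular_triads(beats):
    """Count triples whose pairwise relation forms a cycle.

    beats[a][b] is True when method a outperforms method b.
    """
    cycles = 0
    for a, b, c in combinations(range(len(beats)), 3):
        forward = beats[a][b] and beats[b][c] and beats[c][a]
        backward = beats[b][a] and beats[c][b] and beats[a][c]
        if forward or backward:
            cycles += 1
    return cycles

print(comb(14, 3))  # 364 triads among 14 methods

# Toy relation: 0 beats 1, 1 beats 2, 2 beats 0, and all three beat 3.
toy_beats = [
    [False, True, False, True],
    [False, False, True, True],
    [True, False, False, True],
    [False, False, False, False],
]
print(circular_triads(toy_beats), "of", comb(4, 3))  # 1 of 4
```

A single cycle among 364 triads means the pairwise relation is almost, but not exactly, consistent with one linear order.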

The matrix below orders the methods by how many others they outperform. A transitive relation fills the upper triangle; a red cell below the diagonal marks a method that outperforms one ranked above it, which only happens inside a cycle.

plot.pairwise_majority(trans)

On the probability scale, bayesian_sign_comparison gives for each pair the posterior probability that one method is practically better. With six datasets most pairs stay inconclusive at the 0.95 threshold.

from beam.mcda import bayesian_sign_comparison

bayes = bayesian_sign_comparison(sup)
decisive = sum(1 for p in bayes.per_pair if p.decision != "inconclusive")
print(f"pairs with a decisive posterior at 0.95: {decisive} of {len(bayes.per_pair)}")
plot.bayesian_comparison(bayes)

pairs with a decisive posterior at 0.95: 18 of 91

Spatially variable genes: many datasets, the Bradley-Terry tree

The spatially variable genes task has 14 methods, 50 spatial datasets and one correlation metric (higher is better). The 50 datasets carry real feature variation: the spatial assay technology, the organism, and a cancer or non-cancer condition, all parsed from the dataset id. This is the input the Bradley-Terry tree needs.

from collections import Counter

svg = load_openproblems("spatially_variable_genes")
svg_features = load_openproblems_svg_features()
correlation = svg.tensor(("correlation",))[:, :, 0]
_, categorical = svg_features.aligned_to(svg.dataset_names)
print(f"{len(svg.method_names)} methods, {len(svg.dataset_names)} datasets")
print("technologies:", dict(Counter(categorical["technology"])))

14 methods, 50 datasets
technologies: {'post_xenium': 2, 'visium': 20, 'dbitseq': 6, 'merfish': 5, 'seqfish': 1, 'slideseqv2': 5, 'starmap': 2, 'stereoseq': 5, 'slidetags': 4}

The mean correlation per method per assay technology already shows the structure: no method ranks highest on every assay.

import warnings

technologies = sorted(set(categorical["technology"]))
tech_array = np.array(categorical["technology"])
heat = np.full((len(svg.method_names), len(technologies)), np.nan)
for ti, tech in enumerate(technologies):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        heat[:, ti] = np.nanmean(correlation[:, tech_array == tech], axis=1)

method_order = np.argsort(-np.nanmean(heat, axis=1))
plot.score_heatmap(
    heat[method_order],
    row_names=[svg.method_names[i] for i in method_order],
    col_names=technologies,
    row_label="spatially-variable-genes method (ordered by mean correlation)",
    col_label="spatial assay technology",
    value_label="mean correlation (higher is better)",
    title="Mean correlation per method per assay (the column maximum moves between methods)",
)

The tree turns each dataset into pairwise method comparisons on the correlation score, then splits the datasets by their features so each leaf has its own ranking, putting a statistical test behind the pattern the heatmap shows. The fit needs R’s psychotree.

from beam.heterogeneity import bradley_terry_tree, bttree_available

if bttree_available():
    bt = bradley_terry_tree(
        correlation,
        svg.method_names,
        svg.dataset_names,
        categorical_features=categorical,
        polarity="higher_is_better",
        minsize=6,
    )
    print(f"split found: {bt.did_split}")
    print(f"pooled ranking, top 3: {', '.join(bt.global_ranking()[:3])}")
    print(bt.summary())
else:
    bt = None
    print("R with psychotree not available; skipping the Bradley-Terry tree.")
    print("Provision it with envs/heterogeneity.yml.")

split found: True
pooled ranking, top 3: spark_x, nnsvg, gpcounts
The Bradley-Terry tree splits the 50 datasets on organism, technology into 4 leaves; the global ranking is led by spark_x. The pooled recommendation does not hold everywhere: in a leaf of 10 datasets the ranking is led by nnsvg; in a leaf of 22 datasets the ranking is led by spanve.

The plot shows the number of datasets in each leaf and the method that ranks first there.

if bt is not None:
    display(plot.bradley_terry_leaves(bt))

The pooled ranking has one method at the top, but the leaves disagree: which method ranks first depends on the organism and the spatial assay technology, the two features the tree splits on. A single ranking does not show this heterogeneity. See the Bradley-Terry explanation.
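Inside each leaf the ranking comes from a Bradley-Terry model fitted to pairwise wins. The sketch below fits the worths for a single leaf with the classic MM (Zermelo) iteration. It does not reproduce psychotree's recursive partitioning or its estimation code; the function name and the win counts are made up for illustration.

```python
def bradley_terry_worths(wins, n_iter=200):
    """Fit Bradley-Terry worths with the MM (Zermelo) iteration.

    wins[i][j] is the number of comparisons in which item i beat item j.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j]) for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom else p[i])
        norm = sum(new_p)
        p = [value / norm for value in new_p]   # worths are identified up to scale
    return p

# Toy leaf: three methods, ten comparisons per pair (made-up counts).
toy_wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
worths = bradley_terry_worths(toy_wins)
print([round(w, 3) for w in worths])
```

A tree fits one such model per leaf, so two leaves with different win tables can put different methods first even when the pooled table has a clear leader.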

Recommendation

On batch integration, beam re-derives the per-metric normalization from the cards and runs MCDA with sensitivity, which can place a different method first than the platform’s mean-of-scores leaderboard. On spatially variable genes, the Bradley-Terry tree shows there is no single method that ranks first across assays. A benchmark recommendation depends on how the criteria are combined and on which datasets it is read over.

--- title: "OpenProblems: MCDA and heterogeneity on a real single-cell benchmark" author: "Izaskun Mallona" date: today format: html: theme: cosmo toc: true toc-location: left embed-resources: true code-tools: true fig-width: 6 fig-height: 3.5 --- ## Provenance of the bundled tables beam does not ship the raw single-cell data. It ships small derived long-format tables (`src/beam/data/openproblems_batch_integration.csv`, `openproblems_svg.csv`, `openproblems_svg_features.csv`), reduced once from the CC-BY-4.0 results JSON in the `openproblems-bio/website` repository at the pinned commit `76ce7f288da591b1b19c32cbfe8ce50bc3706ece`. The control and baseline methods were dropped at vendoring time. Provenance and license are recorded in `src/beam/data/README.md`. Cite the OpenProblems consortium (Nature Biotechnology 2025, DOI [10.1038/s41587-025-02694-w](https://doi.org/10.1038/s41587-025-02694-w)). ```bash SHA=76ce7f288da591b1b19c32cbfe8ce50bc3706ece BASE=https://raw.githubusercontent.com/openproblems-bio/website/$SHA/results for task in batch_integration spatially_variable_genes; do for f in results metric_info dataset_info method_info; do curl -sf -o "$task.$f.json" "$BASE/$task/data/$f.json" done done ``` ```python import csv, json # beam loads these exact file names (see beam.datasets._OP_TASKS). 
OUTPUT = { "batch_integration": "openproblems_batch_integration.csv", "spatially_variable_genes": "openproblems_svg.csv", } def reduce_task(task): results = json.load(open(f"{task}.results.json")) method_info = json.load(open(f"{task}.method_info.json")) baselines = {m["method_id"] for m in method_info if m["is_baseline"]} rows = [] for r in results: if r["method_id"] in baselines: # drop controls and baselines continue for metric_id, value in r["metric_values"].items(): score = "" if value in ("NA", None) else value # the source "NA" becomes empty rows.append((r["method_id"], r["dataset_id"], metric_id, score)) rows.sort() with open(OUTPUT[task], "w", newline="") as fh: w = csv.writer(fh) w.writerow(["method_id", "dataset_id", "metric_id", "score"]) w.writerows(rows) # The spatial dataset features are parsed from the dataset id <source>/<technology>/<name>. def svg_features(): datasets = sorted({r["dataset_id"] for r in json.load(open("spatially_variable_genes.results.json"))}) with open("openproblems_svg_features.csv", "w", newline="") as fh: w = csv.writer(fh) w.writerow(["dataset_id", "technology", "organism", "condition"]) for d in datasets: _, technology, name = d.split("/") organism = next((o for o in ("human", "mouse", "drosophila") if name.startswith(o)), "other") condition = "cancer" if ("cancer" in name or "melanoma" in name) else "noncancer" w.writerow([d, technology, organism, condition]) ``` The metric directions and the reference DOIs on the metric cards come from each task's `metric_info.json` (every metric here has `maximize` true, so all are higher is better). ```{python} %matplotlib inline import numpy as np import matplotlib.pyplot as plt from IPython.display import display import beam from beam.datasets import load_openproblems, load_openproblems_svg_features ``` ## Batch integration: many metrics, the MCDA layer The batch integration task has 19 methods, 6 datasets and 13 scIB metrics, all read from the registry. 
[Coverage is uneven](../../docs/explanations/missing-data.md): the `hvg_overlap` column and some method-by-dataset cells are sparse, surfaced as NaN. We drop the sparse `hvg_overlap` metric and keep the methods that are observed on at least one dataset for every remaining metric, so the cross-dataset pooling is defined. ```{python} op = load_openproblems("batch_integration") metrics = [m for m in op.metric_ids if m != "hvg_overlap"] tensor = op.tensor(tuple(metrics)) keep = (~np.isnan(tensor).all(axis=1)).all(axis=1) methods = [m for m, k in zip(op.method_names, keep) if k] tensor = tensor[keep] print(f"{len(methods)} methods, {len(metrics)} metrics, {len(op.dataset_names)} datasets") ``` [`card_data_consistency`](../../docs/explanations/card-data-consistency.md) checks the scores against what the 12 [cards](../../docs/explanations/cards-and-pipeline.qmd) declare, before any [normalization](../../docs/explanations/normalization-and-scales.md). The scIB metrics are scaled to [0, 1] on the platform, so the audit is clean here. ```{python} from beam.mcda import card_data_consistency, registry_context ctx = registry_context(metrics, "saw") pooled_native = np.nanmean(tensor, axis=1) audit = card_data_consistency( pooled_native, ctx.polarity, ctx.bounds, baselines=ctx.baselines, targets=ctx.targets, noise_floors=ctx.noise_floors, metric_ids=metrics, ) print("scores consistent with the cards:", audit.ok) for finding in audit.findings: print(" ", finding.severity, finding.message) ``` We use [equal weights](../../docs/explanations/weighting-schemes.md) and [TOPSIS](../../docs/explanations/aggregation-methods.md). 
```{python} scores = beam.Scores( values=tensor, tool_names=tuple(methods), metric_ids=tuple(metrics), dataset_names=op.dataset_names, layout="long", ) result = beam.rank(scores, weights="equal", method="topsis") beam_order = np.argsort(result.result.ranks) print("beam (equal weights, TOPSIS), top 5:") for i in beam_order[:5]: print(f" {result.tool_names[i]}") ``` OpenProblems builds its own leaderboard from the mean of the per-metric scaled scores. Pooling the raw scores the same way gives a different order, so the aggregation rule decides the top method, not the data alone. ```{python} mean_score = np.nanmean(np.nanmean(tensor, axis=1), axis=1) mean_order = np.argsort(-mean_score) print("mean of scores (OpenProblems-style), top 5:") for i in mean_order[:5]: print(f" {methods[i]}") print() print("beam top method:", result.tool_names[beam_order[0]]) print("mean-of-scores top method:", methods[mean_order[0]]) ``` The two rules disagree on the top-ranked method: a benchmark recommendation is partly a decision about how the criteria are combined. ```{python} from beam import plot # The outlined bar is the method the platform's mean-of-scores rule ranks first. plot.ranking(result, ground_truth_tool=methods[mean_order[0]]) ``` ### Funky heatmap with rank robustness beam adds three panels to the [funky heatmap](../../docs/explanations/funky-heatmaps-and-robustness.md): a worth panel with the ARI marginal mean and its 95 percent mixed-effects interval (overlapping intervals mean two methods are not separable), the leave-one-dataset-out rank span, and the SMAA rank-acceptability bar. The row order and circle sizes depend on the normalization, which beam takes from the cards rather than min-max. 
```{python}
from beam.heterogeneity import mixed_effects_from_matrix, r_available
from beam.reporting import funky_heatmap_from_run

bio = {
    "ari", "nmi", "asw_label", "isolated_label_f1", "isolated_label_asw",
    "cell_cycle_conservation", "hvg_overlap", "clisi",
}
groups = ["bio" if m in bio else "batch" for m in result.metric_ids]

worth = worth_ci = None
if r_available():
    ari_slice = op.tensor(("ari",))[keep, :, 0]
    me = mixed_effects_from_matrix(ari_slice, methods, op.dataset_names)
    by_name = {name: i for i, name in enumerate(me.method_names)}
    worth = np.array(
        [me.method_effects[by_name[m]] if m in by_name else np.nan for m in result.tool_names]
    )
    worth_ci = np.array(
        [1.96 * me.method_effect_se[by_name[m]] if m in by_name else np.nan for m in result.tool_names]
    )

funky = funky_heatmap_from_run(
    result,
    metric_groups=groups,
    worth=worth,
    worth_ci=worth_ci,
    worth_label="ARI marginal mean\n(mixed-effects, 95% CI)",
    show_aggregation_consensus=False,
    title="OpenProblems batch integration: scores and rank robustness",
)
funky
```

With only six datasets the leave-one-dataset-out spans are wide: the top methods shift by several ranks when a single dataset is dropped, while the bottom methods stay put. The worth panel agrees: the ARI marginal-mean intervals of the top methods overlap, so they are not separable. The SMAA bar puts the top ranks on several methods rather than one.

### Mixed-effects variance decomposition on ARI

A [mixed-effects model](../../docs/explanations/method-by-dataset-heterogeneity.md) on one metric splits the score variation into a stable method effect and the method-by-dataset interaction. We run it on ARI. The fit needs R's lme4, so the chunk runs only when it is available.
```{python}
from beam.heterogeneity import mixed_effects_from_matrix, r_available

ari = op.tensor(("ari",))[:, :, 0]
ari_keep = ~np.isnan(ari).all(axis=1)
ari_methods = [m for m, k in zip(op.method_names, ari_keep) if k]
ari = ari[ari_keep]

if r_available():
    me = mixed_effects_from_matrix(ari, ari_methods, op.dataset_names)
    print(f"dataset shift (ICC): {me.icc_dataset:.2f} of the ARI variance")
    print(f"residual share: {me.residual_share:.2f}")
    top = np.argsort(-me.method_effects)[:3]
    print("highest ARI marginal means:")
    for i in top:
        print(f" {me.method_names[i]:14s} {me.method_effects[i]:.3f} +/- {me.method_effect_se[i]:.3f}")
else:
    print("R with lme4 not available; skipping the mixed-effects fit.")
    print("Provision it with envs/heterogeneity.yml.")
```

### Metric validity

The scIB score weights two metric groups, biological conservation at 0.6 and batch correction at 0.4, on the premise that they measure different things. [`beam.mcda.metric_validity`](../../docs/reference/metric_validity.qmd) checks that premise against the scores (Campbell and Fiske 1959). It treats each method-by-dataset cell as one observation, orients every metric to higher-is-better from the cards, and correlates the metrics with Spearman rank correlation. Metrics in the same group should agree (convergent); metrics in different groups should agree less (discriminant). See [the validity explanation](../../docs/explanations/metric-set-diagnostics.md).
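
The convergent and discriminant averages can be sketched without beam. This is a minimal illustration, assuming complete data (no NaN) and metrics already oriented to higher-is-better; the helper names and the toy values are hypothetical, and `beam.mcda.metric_validity` may differ in its details.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def convergent_discriminant(columns, groups):
    """Mean Spearman correlation within groups and between groups."""
    within, between = [], []
    for a, b in combinations(columns, 2):
        rho = spearman(columns[a], columns[b])
        (within if groups[a] == groups[b] else between).append(rho)
    return mean(within), mean(between)


# Hypothetical oriented scores: one value per method-by-dataset cell.
toy = {
    "bio_a": [0.1, 0.4, 0.5, 0.8, 0.9],
    "bio_b": [0.2, 0.3, 0.6, 0.7, 0.95],
    "batch_a": [0.9, 0.2, 0.6, 0.4, 0.5],
    "batch_b": [0.8, 0.1, 0.7, 0.5, 0.3],
}
toy_groups = {"bio_a": "bio", "bio_b": "bio", "batch_a": "batch", "batch_b": "batch"}
convergent, discriminant = convergent_discriminant(toy, toy_groups)
print(f"convergent {convergent:.2f}, discriminant {discriminant:.2f}")
```

The grouping is supported when the convergent average clearly exceeds the discriminant one.
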
```{python}
from beam.cards import polarities_for
from beam.mcda import metric_validity

validity = metric_validity(
    tensor, polarities_for(metrics), groups, metric_ids=list(metrics),
)
print("within-group agreement (convergent):")
for g, r in validity.convergent_by_group.items():
    print(f" {g:6s} {r:.2f}")
print(f"between-group agreement (discriminant): {validity.mean_discriminant:.2f}")
print(f"convergent {validity.mean_convergent:.2f} > discriminant {validity.mean_discriminant:.2f}: {validity.discriminant_ok}")
print()
print("metrics that lean toward the other group:")
for name, group, within, between, nearest in validity.crossloading_metrics:
    print(f" {name:20s} {group:6s} within {within:.2f} vs between {between:.2f}, leans {nearest}")
```

The grouping holds, but barely. Metrics in the same group agree a little more (0.38 on average) than metrics in different groups (0.30). The biological metrics agree more with each other (0.45) than the batch metrics do (0.24). Three batch metrics, `graph_connectivity` most of all, agree more with the biological group than with their own. So the two groups overlap on this data, and the 0.6 / 0.4 weighting treats them as more separate than the scores show.

### Metric reliability

Validity asks whether the bio/batch split is the right axis. Reliability asks whether each side of that split reads as one scale. [`beam.mcda.metric_reliability`](../../docs/reference/metric_reliability.qmd) reports standardized Cronbach's alpha per group from the same oriented Spearman correlations (Cronbach 1951). A group above the conventional 0.7 cutoff reads as one reliable scale; below it the group is a looser collection. See [the reliability explanation](../../docs/explanations/metric-set-diagnostics.md#reliability).
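
Standardized alpha depends only on the number of metrics in a group, k, and their mean inter-item correlation, r: alpha = k r / (1 + (k - 1) r). A minimal sketch of that formula follows. The function name is hypothetical, and feeding it the rounded within-group correlations quoted in the validity section (0.45 for bio, 0.24 for batch), with seven biological and five batch metrics, is an assumption about how the two diagnostics relate.

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha from k items and their mean inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Rounded inputs taken from the validity section; group sizes after dropping hvg_overlap.
alpha_bio = standardized_alpha(7, 0.45)
alpha_batch = standardized_alpha(5, 0.24)
print(f"bio   alpha ~ {alpha_bio:.2f}")
print(f"batch alpha ~ {alpha_batch:.2f}")

# At a fixed mean correlation, alpha rises with the number of items alone.
for k in (5, 7, 12):
    print(k, round(standardized_alpha(k, 0.24), 2))
```

Because the inputs are rounded, the results should land near, not exactly on, the alphas beam reports. The loop shows the size effect: a longer group reaches a higher alpha at the same mean correlation.
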
```{python}
from beam.mcda import metric_reliability

reliability = metric_reliability(
    tensor, polarities_for(metrics), groups, metric_ids=list(metrics),
)
for g, alpha in reliability.alpha_by_group.items():
    k = reliability.k_by_group[g]
    r_bar = reliability.mean_inter_item_by_group[g]
    print(f" {g:6s} alpha {alpha:.2f} over {k} metrics, mean inter-item r {r_bar:.2f}")
print()
print("groups below the 0.7 cutoff:", [g for g, _ in reliability.low_reliability_groups])
print("alpha if a batch metric is dropped:")
batch_alpha = reliability.alpha_by_group["batch"]
for name, group, alpha_without in reliability.alpha_if_dropped:
    if group == "batch":
        gain = "raises" if alpha_without > batch_alpha else "lowers"
        print(f" drop {name:18s} alpha -> {alpha_without:.2f} ({gain})")
```

The biological group holds together (alpha 0.85 over seven metrics); the batch group falls short of the cutoff (alpha 0.62 over five). Dropping `pcr` is the only batch removal that raises the batch alpha, so it is the batch metric least consistent with the rest of its group. This is consistent with the validity result: the bio/batch split holds only weakly, with bio a more coherent scale than batch.

### Metric dimensionality

A high alpha can mean one factor measured consistently, or several factors in a group long enough to average out to a high alpha. [`beam.mcda.metric_dimensionality`](../../docs/reference/metric_dimensionality.qmd) separates the two by counting the factors in each group, with parallel analysis on the same oriented Spearman correlations (Horn 1965, Glorfeld 1995). See [the dimensionality explanation](../../docs/explanations/metric-set-diagnostics.md#dimensionality).
```{python}
from beam.mcda import metric_dimensionality

dimensionality = metric_dimensionality(
    tensor, polarities_for(metrics), groups, metric_ids=list(metrics),
)
for g in dimensionality.k_by_group:
    k = dimensionality.k_by_group[g]
    pc1 = dimensionality.pc1_explained_by_group[g]
    n_factors = dimensionality.parallel_components_by_group[g]
    print(f" {g:6s} {n_factors} factor(s) over {k} metrics, first component explains {pc1:.2f}")
print()
print("read as one factor:", list(dimensionality.unidimensional_groups))
```

Dimensionality and reliability point in opposite directions here. The biological group has the higher alpha but carries two factors, so its 0.85 partly reflects the size of the group rather than one consistently measured quantity. The batch group has the lower alpha but is one factor, weakly tracked.

## Batch-integration rank sensitivity

The OpenProblems leaderboard pools by the mean of the scaled scores; beam can pool in any of several ways. [`rank_sensitivity`](../../docs/explanations/rank-sensitivity.md) asks how much that choice matters relative to the data. It runs every combination of weighting, aggregation and dataset, and splits the rank variance. [COMET](../../docs/explanations/aggregation-methods.md#comet) is left out of the aggregations here: it builds characteristic objects whose count grows exponentially with the number of criteria, so it is slow on a 12-metric task.
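
In the classic COMET formulation the characteristic objects are the Cartesian product of the characteristic values chosen for each criterion, so their count is the product of the per-criterion counts. A minimal sketch of that growth follows; how many characteristic values beam's COMET uses per criterion is not stated here, so the 2 and 3 below are assumptions.

```python
from math import prod


def n_characteristic_objects(values_per_criterion):
    """Number of characteristic objects: one per combination of characteristic values."""
    return prod(values_per_criterion)


n_criteria = 12  # the batch-integration task after dropping hvg_overlap
for n_values in (2, 3):  # assumed characteristic values per criterion
    count = n_characteristic_objects([n_values] * n_criteria)
    print(f"{n_values} values per criterion -> {count} characteristic objects")
# 2 values -> 4096, 3 values -> 531441
```

On the two-metric M4 task the same counts are 4 and 9, which is why COMET is practical there and not here.
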
```{python}
from beam.mcda import rank_sensitivity

rs = rank_sensitivity(
    tensor,
    ctx.polarity,
    methods=["saw", "topsis", "vikor", "promethee_ii"],
    normalization=list(ctx.normalization),
    bounds=list(ctx.bounds),
    baselines=list(ctx.baselines),
    targets=list(ctx.targets),
    missing="worst",
    tool_names=methods,
    dataset_names=op.dataset_names,
)
print(f"{rs.n_combinations} combinations")
print(f" dataset: {rs.dataset_share:.3f} of the rank variance")
print(f" weighting: {rs.weighting_share:.3f}")
print(f" aggregation: {rs.aggregation_share:.3f}")
print(f" interactions: {rs.interaction_share:.3f}")
print(f" most influential factor: {rs.most_influential_factor}")
```

The dataset is still the largest single factor, but unlike on M4 it is not the only one. The aggregation choice and the dataset-by-choice interaction carry a share too. With 12 metrics that can disagree, how the metrics are combined changes the order more than it does on the two-metric M4 task.

Splitting the shares one method at a time shows which methods owe their rank movement to the dataset and which to the weighting or aggregation; the span beside each bar is that method's best-to-worst rank range.

```{python}
plot.rank_sensitivity_by_tool(rs)
```

[`specification_curve`](../../docs/explanations/rank-sensitivity.md#the-specification-curve) reads the same grid as a list of rankings. Because the aggregation carries a real share here, the choice-only multiverse is less unanimous than on M4: the top method does not hold in every weighting-by-aggregation combination.
```{python}
from beam.mcda import specification_curve

curve = specification_curve(rs)
dom = curve.tool_names[curve.most_frequent_top_tool]
print(f"choices plus dataset ({curve.n_specifications} combinations): "
      f"{dom} first in {curve.most_frequent_top_fraction * 100:.0f}%, "
      f"{curve.n_distinct_top_tools} methods reach the top")

pooled = specification_curve(
    rank_sensitivity(
        result.matrix,
        ctx.polarity,
        methods=["saw", "topsis", "vikor", "promethee_ii"],
        normalization=list(ctx.normalization),
        bounds=list(ctx.bounds),
        baselines=list(ctx.baselines),
        targets=list(ctx.targets),
        missing="worst",
        tool_names=methods,
    )
)
pdom = pooled.tool_names[pooled.most_frequent_top_tool]
print(f"choices only ({pooled.n_specifications} combinations): "
      f"{pdom} first in {pooled.most_frequent_top_fraction * 100:.0f}%")
plot.specification_curve(curve)
```

## Dataset concordance

Pooling averages over the six batch-integration datasets. [`dataset_concordance`](../../docs/explanations/dataset-concordance-and-discrimination.md) re-ranks the methods within each dataset on its own and correlates every pair of those orderings with Kendall tau-b. A high mean means the pooled order represents the datasets; a low mean means it papers over their disagreement. No replicates are needed and the datasets need not be exchangeable.
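
The mean pairwise agreement can be sketched in a few lines. This is a minimal pure-Python Kendall tau-b under stated assumptions: complete rankings with no missing methods, hypothetical helper names, and made-up per-dataset ranks; it is not beam's implementation.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def kendall_tau_b(x, y):
    """Kendall tau-b between two rankings of the same items (handles ties)."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")


def mean_pairwise_tau(rankings):
    """Average tau-b over every pair of datasets."""
    return mean(kendall_tau_b(a, b) for a, b in combinations(rankings.values(), 2))


# Hypothetical ranks of four methods within three datasets (1 = best).
toy_ranks = {
    "dataset_1": [1, 2, 3, 4],
    "dataset_2": [1, 3, 2, 4],
    "dataset_3": [4, 3, 2, 1],
}
toy_mean_tau = mean_pairwise_tau(toy_ranks)
print(f"mean pairwise tau-b: {toy_mean_tau:.2f}")
```

In the toy data two datasets nearly agree and the third reverses the order, so the mean is negative: a pooled ranking over these three would represent none of them well.
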
```{python}
conc = result.dataset_concordance
names = conc.dataset_names
print(f"mean agreement across datasets (Kendall tau-b): {conc.mean_pairwise_tau:.2f}")
print("least typical dataset:", names[conc.most_idiosyncratic_dataset])
print("mutually consistent groups:",
      [tuple(names[d] for d in g) for g in conc.concordant_groups])
print("where methods depart most from their own average rank:")
for cell in conc.notable_cells[:5]:
    side = "lower" if cell.deviation > 0 else "higher"
    print(f" {conc.tool_names[cell.tool]} on {names[cell.dataset]}: "
          f"rank {cell.rank}, {side} than its mean {cell.mean_rank:.1f}")
```

```{python}
plot.dataset_concordance(result)
```

```{python}
plot.dataset_struggle(result)
```

## Blind analysis

A [blind analysis](../../docs/explanations/analysis-blinding.md) fixes the weighting, the aggregation and the metric set before the method names are revealed. `beam.blind` masks the names and shuffles the rows; `beam.unblind` restores them. The order does not change, and the seal fingerprint goes into the manifest.

```{python}
from beam import blind, unblind

blinded, seal = blind(scores, seed=0)
blind_run = beam.rank(blinded, weights="equal", method="topsis", seed=0, sensitivity=False)
restored = unblind(blind_run, seal)
named = beam.rank(scores, weights="equal", method="topsis", seed=0, sensitivity=False)
print("ranking identical after unblinding:",
      dict(zip(named.tool_names, named.result.ranks))
      == dict(zip(restored.tool_names, restored.result.ranks)))
print("top method after unblinding:", restored.top_tool)
print("blinding fingerprint:", blind_run.manifest["blinding"]["seal_sha256"][:12])
```

## Pairwise superiority on ARI

[`pairwise_superiority`](../../docs/reference/pairwise_superiority.qmd) [compares the methods two at a time](../../docs/explanations/pairwise-method-comparison.md) on ARI across the six datasets, with the ARI noise floor as the equivalence band.
It reports how often one method outperforms another: an effect size to set beside the significance that the mixed-effects intervals give.

```{python}
from beam.mcda import pairwise_superiority

ari_kept = op.tensor(("ari",))[keep, :, 0]
ari_floor = registry_context(["ari"], "saw").noise_floors[0] or 0.0
sup = pairwise_superiority(ari_kept, "higher_is_better", rope=ari_floor, method_names=methods)
print(f"highest standing: {methods[sup.order[0]]} ({sup.standing[sup.order[0]]:.2f})")
print(f"pairs the sign test cannot separate: {len(sup.equivalent_pairs)} of {len(sup.per_pair)}")
```

With six datasets the comparison is coarse: most pairs do not reach significance, matching the wide mixed-effects intervals and the leave-one-dataset-out spans above. [`pairwise_transitivity`](../../docs/reference/pairwise_transitivity.qmd) reads the same pairwise relation and reports whether one order is [consistent with it](../../docs/explanations/pairwise-method-comparison.md#transitivity).

```{python}
from beam.mcda import pairwise_transitivity

trans = pairwise_transitivity(sup)
print(f"transitive: {trans.is_transitive}; circular triads: {trans.n_circular_triads} of {trans.n_triads}")
```

The matrix below orders the methods by how many others they outperform. A transitive relation fills the upper triangle; a red cell below the diagonal marks a method that outperforms one ranked above it, which only happens inside a cycle.

```{python}
plot.pairwise_majority(trans)
```

On the probability scale, [`bayesian_sign_comparison`](../../docs/explanations/pairwise-method-comparison.md#bayesian-sign-comparison) gives for each pair the posterior probability that one method is practically better. With six datasets most pairs stay inconclusive at the 0.95 threshold.
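
For a frequentist point of comparison, the exact two-sided sign test shows why six datasets are coarse. The sketch below is a minimal version that ignores datasets falling inside the equivalence band; the function name is hypothetical and beam's own test may treat ties differently.

```python
from math import comb


def sign_test_p(wins, losses):
    """Exact two-sided sign test p-value under a fair-coin null."""
    n = wins + losses
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Every possible split of six datasets between two methods.
for wins in range(6, 2, -1):
    print(f"{wins}-{6 - wins}: p = {sign_test_p(wins, 6 - wins):.5f}")
# 6-0 gives p = 0.03125 and 5-1 gives p = 0.21875
```

Only a clean sweep of all six datasets falls below 0.05, so a single reversal is enough to leave a pair unseparated.
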
```{python}
from beam.mcda import bayesian_sign_comparison

bayes = bayesian_sign_comparison(sup)
decisive = sum(1 for p in bayes.per_pair if p.decision != "inconclusive")
print(f"pairs with a decisive posterior at 0.95: {decisive} of {len(bayes.per_pair)}")
plot.bayesian_comparison(bayes)
```

## Spatially variable genes: many datasets, the Bradley-Terry tree

The spatially variable genes task has 14 methods, 50 spatial datasets and one correlation metric (higher is better). The 50 datasets carry real feature variation: the spatial assay technology, the organism, and a cancer or non-cancer condition, all parsed from the dataset id. This is the input the Bradley-Terry tree needs.

```{python}
from collections import Counter

svg = load_openproblems("spatially_variable_genes")
svg_features = load_openproblems_svg_features()
correlation = svg.tensor(("correlation",))[:, :, 0]
_, categorical = svg_features.aligned_to(svg.dataset_names)
print(f"{len(svg.method_names)} methods, {len(svg.dataset_names)} datasets")
print("technologies:", dict(Counter(categorical["technology"])))
```

The mean correlation per method per assay technology already shows the structure: no method ranks highest on every assay.
```{python}
import warnings

technologies = sorted(set(categorical["technology"]))
tech_array = np.array(categorical["technology"])
heat = np.full((len(svg.method_names), len(technologies)), np.nan)
for ti, tech in enumerate(technologies):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        heat[:, ti] = np.nanmean(correlation[:, tech_array == tech], axis=1)
method_order = np.argsort(-np.nanmean(heat, axis=1))
plot.score_heatmap(
    heat[method_order],
    row_names=[svg.method_names[i] for i in method_order],
    col_names=technologies,
    row_label="spatially-variable-genes method (ordered by mean correlation)",
    col_label="spatial assay technology",
    value_label="mean correlation (higher is better)",
    title="Mean correlation per method per assay (the column maximum moves between methods)",
)
```

The tree turns each dataset into pairwise method comparisons on the correlation score, then splits the datasets by their features so each leaf has its own ranking, putting a statistical test behind the pattern the heatmap shows. The fit needs R's psychotree.

```{python}
from beam.heterogeneity import bradley_terry_tree, bttree_available

if bttree_available():
    bt = bradley_terry_tree(
        correlation,
        svg.method_names,
        svg.dataset_names,
        categorical_features=categorical,
        polarity="higher_is_better",
        minsize=6,
    )
    print(f"split found: {bt.did_split}")
    print(f"pooled ranking, top 3: {', '.join(bt.global_ranking()[:3])}")
    print(bt.summary())
else:
    bt = None
    print("R with psychotree not available; skipping the Bradley-Terry tree.")
    print("Provision it with envs/heterogeneity.yml.")
```

The plot shows the number of datasets in each leaf and the method that ranks first there.

```{python}
if bt is not None:
    display(plot.bradley_terry_leaves(bt))
```

The pooled ranking has one method at the top, but the leaves disagree: which method ranks first depends on the spatial assay technology. A single ranking does not show this heterogeneity.
See [the Bradley-Terry explanation](../../docs/explanations/method-by-dataset-heterogeneity.md#bradley-terry-trees).

## Recommendation

On batch integration, beam re-derives the per-metric normalization from the cards and runs MCDA with sensitivity, which can place a different method first than the platform's mean-of-scores leaderboard. On spatially variable genes, the Bradley-Terry tree shows there is no single method that ranks first across assays. A benchmark recommendation depends on how the criteria are combined and on which datasets it is read over.