Transportation: a cross-domain MCDA example

Author

Izaskun Mallona

Published

July 9, 2026

Goal

The MCDA core in beam is not specific to bioinformatics. This vignette runs it on a made-up transportation example. The modes of transport play the role of the methods, and the terrains play the role of the datasets. Each mode is scored on a terrain by speed (km/h, higher is better), cost (per km, lower is better), and CO2 (g per km, lower is better). The numbers are illustrative, kept in a plausible range. The three metrics are bundled metric cards (speed, cost, co2), so this cross-domain example reads polarity and normalization from the same registry as the bioinformatics examples.

Set-up

The five aggregation methods used below (SAW, TOPSIS, VIKOR, PROMETHEE II, COMET) are wrapped from pymcdm. beam normalizes the scores, then calls pymcdm on the normalized matrix and keeps the higher-is-better convention. The weighting schemes are beam’s own.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from beam.mcda import (
    critical_difference,
    leave_one_dataset_out,
    run,
    run_from_registry,
    smaa,
    smallest_weight_perturbation,
)
from beam.scenarios import transportation_benchmark

tb = transportation_benchmark()
print("modes:   ", tb.mode_names)
print("terrains:", tb.terrain_names)
print("metrics: ", tb.metric_names)
print("polarity:", tb.polarity)
print("normalization:", tb.normalization)

modes:    ('foot', 'running', 'trail_running', 'bicycle', 'e_bike', 'motorcycle', 'train', 'kayak', 'boat', 'plane')
terrains: ('flat_road', 'mud', 'uphill', 'open_water', 'long_distance', 'urban_hop')
metrics:  ('speed', 'cost', 'co2')
polarity: ('higher_is_better', 'lower_is_better', 'lower_is_better')
normalization: ('log_min_max', 'log_min_max', 'rank')

The data, and the partial-coverage problem

The speed heatmap below shows the full mode-by-terrain table. Infeasible cells, where a mode cannot run on a terrain, are left blank and marked with a cross. A red box marks the fastest mode on each terrain: that is the documented ground truth. Reading down any column, the fastest mode changes from terrain to terrain. Reading across any row, no mode covers every terrain.

from beam import plot

speed = tb.metric("speed")
n_modes = len(tb.mode_names)
n_terrains = len(tb.terrain_names)

plot.score_heatmap(
    speed,
    row_names=tb.mode_names,
    col_names=tb.terrain_names,
    row_label="transport mode (method axis)",
    col_label="terrain (dataset axis)",
    value_label="speed (km/h, log scale)",
    log=True,
    highlight_best_per_col=True,
    higher_is_better=True,
    title="Speed by mode and terrain; red box marks the fastest mode",
)

The grey cells with a cross are the mode-terrain pairs that do not run. No row is free of them, so no mode is measured on every terrain. Any ranking that pools across terrains compares modes that were measured on different subsets of conditions, which is why a single pooled ranking over all modes is not well defined here.

feasible = tb.feasible()
runs_everywhere = [
    tb.mode_names[i] for i in range(n_modes) if feasible[i].all()
]
print("modes that run on every terrain:", runs_everywhere or "none")
print()
print("fastest mode per terrain, by speed, over the feasible modes (the ground truth):")
for t, terrain in enumerate(tb.terrain_names):
    best = tb.mode_names[int(np.nanargmax(speed[:, t]))]
    print(f"  {terrain:15s}  {best}")

modes that run on every terrain: none

fastest mode per terrain, by speed, over the feasible modes (the ground truth):
  flat_road        train
  mud              motorcycle
  uphill           motorcycle
  open_water       boat
  long_distance    plane
  urban_hop        train

Per-terrain MCDA with example-level NaN handling

Because no mode runs on every terrain, the analysis is done one terrain at a time. On each terrain the infeasible modes are dropped (their cells are NaN) and the MCDA pipeline runs over the modes that actually run there. This is the example-level NaN handling: no imputation and no pooling across terrains, just the modes that were measured on the terrain in question. The helper feasible_submatrix drops the infeasible rows and returns a dense, NaN-free score matrix. run_from_registry then reads polarity and normalization from the speed, cost and co2 cards.

The cards declare the normalization: speed and cost span orders of magnitude across modes, so they use log_min_max; CO2 has true zeros (walking, cycling), so it uses rank, which is defined when a column carries hard zeros.
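The two normalizations named above can be sketched in a few lines. This is a minimal illustration, not beam's implementation: the function names `log_min_max` and `rank_normalize` are hypothetical here, the input values are made up, and ties are not handled.

```python
import numpy as np

def log_min_max(x, higher_is_better=True):
    """Min-max scale log10(x) to [0, 1]; flip so that 1 is always best."""
    logx = np.log10(np.asarray(x, dtype=float))  # needs strictly positive values
    u = (logx - logx.min()) / (logx.max() - logx.min())
    return u if higher_is_better else 1.0 - u

def rank_normalize(x, higher_is_better=True):
    """Map values to evenly spaced ranks in [0, 1]; zeros are fine (ties not handled)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(np.argsort(x))  # 0 = smallest value
    u = order / (len(x) - 1)
    return u if higher_is_better else 1.0 - u

speed = [5.0, 50.0, 500.0]  # hypothetical km/h, spanning two orders of magnitude
co2 = [0.0, 20.0, 150.0]    # hypothetical g/km, with a true zero

print(log_min_max(speed))                           # [0.  0.5 1. ]
print(rank_normalize(co2, higher_is_better=False))  # [1.  0.5 0. ]
```

The log step would fail on the CO2 column because log10(0) is undefined, which is why a rank transform is the safer choice for a column with hard zeros.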

A first pass uses equal weights and SAW.

print("per-terrain top-ranked mode, equal weights, SAW")
print(f"{'terrain':15s}  {'top-ranked':12s}  {'fastest by speed':16s}  same?")
for terrain in tb.terrain_names:
    names, sub = tb.feasible_submatrix(terrain)
    out = run_from_registry(sub, list(tb.metric_names), method="saw")
    top = names[int(np.argmin(out.ranks))]
    fastest = names[int(np.argmax(sub[:, 0]))]
    print(f"  {terrain:15s}  {top:12s}  {fastest:16s}  {top == fastest}")

per-terrain top-ranked mode, equal weights, SAW
terrain          top-ranked    fastest by speed  same?
  flat_road        running       train             False
  mud              trail_running  motorcycle        False
  uphill           trail_running  motorcycle        False
  open_water       boat          boat              True
  long_distance    running       plane             False
  urban_hop        running       train             False

Under equal weights the cheap, low-CO2 modes rank near the top on the land terrains even when they are slow. Speed is only one of three equally weighted metrics, so it does not decide the order on its own.

Now weight speed heavily (0.8 on speed, 0.1 on cost, 0.1 on CO2) and rerun. With speed in charge, the per-terrain top-ranked mode tracks the ground truth, and a different mode ranks first on different terrains.

speed_heavy = np.array([0.8, 0.1, 0.1])

print("per-terrain top-ranked mode, speed-heavy weights (0.8, 0.1, 0.1), SAW")
print(f"{'terrain':15s}  {'top-ranked':12s}  {'fastest by speed':16s}  same?")
for terrain in tb.terrain_names:
    names, sub = tb.feasible_submatrix(terrain)
    out = run_from_registry(sub, list(tb.metric_names), weights=speed_heavy, method="saw")
    top = names[int(np.argmin(out.ranks))]
    fastest = names[int(np.argmax(sub[:, 0]))]
    print(f"  {terrain:15s}  {top:12s}  {fastest:16s}  {top == fastest}")

per-terrain top-ranked mode, speed-heavy weights (0.8, 0.1, 0.1), SAW
terrain          top-ranked    fastest by speed  same?
  flat_road        train         train             True
  mud              motorcycle    motorcycle        True
  uphill           motorcycle    motorcycle        True
  open_water       boat          boat              True
  long_distance    plane         plane             True
  urban_hop        train         train             True

The mode that ranks first now changes across terrains: train on the flat road and the urban hop, motorcycle on mud and uphill, boat on open water, plane on the long distance. That is a method-by-dataset interaction, with the modes as the methods and the terrains as the datasets. A single global ranking cannot report all four winners at once.
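The effect of the weighting on a SAW order can be seen on a two-mode toy case. The sketch below assumes utilities already normalized to [0, 1] with higher as better; the numbers and the `saw` helper are hypothetical and do not come from the benchmark.

```python
import numpy as np

# Hypothetical utilities in [0, 1], oriented so that higher is better.
# Columns: speed, cost, co2.
modes = ["runner", "train"]
utility = np.array([
    [0.2, 1.0, 1.0],  # runner: slow, cheap, no emissions
    [0.9, 0.4, 0.6],  # train: fast, dearer, some emissions
])

def saw(utility, weights):
    """Simple additive weighting: weighted sum of normalized utilities."""
    w = np.asarray(weights, dtype=float)
    return utility @ (w / w.sum())

for label, w in [("equal", [1, 1, 1]), ("speed-heavy", [0.8, 0.1, 0.1])]:
    scores = saw(utility, w)
    print(f"{label:12s} top={modes[int(np.argmax(scores))]}  scores={np.round(scores, 3)}")
```

Under equal weights the runner's two perfect columns outweigh its low speed; under the speed-heavy weights the train's speed decides the order.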

A crossover among the slower modes

The fastest modes rank first on their own terrains, but the interaction is sharper among the slower land modes, where two modes change places across terrains. Trail running is slower than road running on the flat road, where road shoes and a hard surface help, but faster on mud and uphill, where off-road traction matters. An e-bike is faster than a bicycle on the flat road and the urban hop, but its weight slows it down uphill. These are crossovers, not feasibility gaps: both modes run on the terrains in question, and which one is faster flips from terrain to terrain.

speed = tb.metric("speed")
m = tb.mode_names.index
t = tb.terrain_names.index

print("trail running vs road running, speed (km/h):")
print(f"{'terrain':12s}  {'road':>6s}  {'trail':>6s}  faster")
for terrain in ["flat_road", "mud", "uphill"]:
    road = speed[m("running"), t(terrain)]
    trail = speed[m("trail_running"), t(terrain)]
    faster = "trail" if trail > road else "road"
    print(f"{terrain:12s}  {road:6.1f}  {trail:6.1f}  {faster}")

print()
print("e-bike vs bicycle, speed (km/h):")
print(f"{'terrain':12s}  {'bike':>6s}  {'ebike':>6s}  faster")
for terrain in ["flat_road", "urban_hop", "uphill"]:
    bike = speed[m("bicycle"), t(terrain)]
    ebike = speed[m("e_bike"), t(terrain)]
    faster = "ebike" if ebike > bike else "bike"
    print(f"{terrain:12s}  {bike:6.1f}  {ebike:6.1f}  {faster}")

trail running vs road running, speed (km/h):
terrain         road   trail  faster
flat_road       12.0    10.0  road
mud              6.0     8.0  trail
uphill           4.0     6.0  trail

e-bike vs bicycle, speed (km/h):
terrain         bike   ebike  faster
flat_road       20.0    28.0  ebike
urban_hop       15.0    20.0  ebike
uphill           6.0     4.0  bike

Trail running is slower than road running on the flat road and faster on mud and uphill, so neither mode is faster everywhere they both run. A pooled speed order over the two modes has to pick one ranking, which is wrong on at least one terrain. This is the clearest illustration of method-by-dataset interaction in the example: a clustering method that ranks first on one tissue and last on another behaves the same way, and a single pooled ranking would not show that reversal.
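The claim that a pooled order must be wrong somewhere can be checked directly on the speeds printed above. The sketch pools by mean speed, which is one possible pooling rule chosen here for illustration.

```python
# Speeds (km/h) copied from the printed table above.
terrains = ["flat_road", "mud", "uphill"]
road = [12.0, 6.0, 4.0]
trail = [10.0, 8.0, 6.0]

# Faster mode on each terrain, and faster mode after pooling by mean speed.
per_terrain = ["road" if r > t else "trail" for r, t in zip(road, trail)]
pooled = "road" if sum(road) > sum(trail) else "trail"
wrong_on = [name for name, winner in zip(terrains, per_terrain) if winner != pooled]

print("per-terrain faster mode:", dict(zip(terrains, per_terrain)))
print("pooled faster mode:     ", pooled)
print("pooled order is wrong on:", wrong_on)
```

Whichever mode the pooled order favours, the list of contradicting terrains cannot be empty, because each mode is faster on at least one terrain.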

Five aggregations on one terrain

On the long-distance terrain every mode is feasible, so it is a good place to compare aggregation methods on a full mode set. The five methods are compared under the speed-heavy weighting. They agree on the top three places and differ below them.

names, sub = tb.feasible_submatrix("long_distance")
methods = ["saw", "topsis", "vikor", "promethee_ii", "comet"]

ranks_by_method = {}
print(f"{'method':14s}  ranks by mode (1 = best)")
for m in methods:
    out = run_from_registry(sub, list(tb.metric_names), weights=speed_heavy, method=m)
    ranks_by_method[m] = out.ranks
    by_mode = {n: int(r) for n, r in zip(names, out.ranks)}
    print(f"  {m:14s}  {by_mode}")

print()
print("top-ranked mode per method:")
for m in methods:
    print(f"  {m:14s}  {names[int(np.argmin(ranks_by_method[m]))]}")

method          ranks by mode (1 = best)
  saw             {'foot': 10, 'running': 7, 'trail_running': 8, 'bicycle': 4, 'e_bike': 5, 'motorcycle': 3, 'train': 2, 'kayak': 9, 'boat': 6, 'plane': 1}
  topsis          {'foot': 10, 'running': 7, 'trail_running': 8, 'bicycle': 6, 'e_bike': 5, 'motorcycle': 3, 'train': 2, 'kayak': 9, 'boat': 4, 'plane': 1}
  vikor           {'foot': 10, 'running': 7, 'trail_running': 8, 'bicycle': 6, 'e_bike': 5, 'motorcycle': 3, 'train': 2, 'kayak': 9, 'boat': 4, 'plane': 1}
  promethee_ii    {'foot': 10, 'running': 7, 'trail_running': 8, 'bicycle': 6, 'e_bike': 4, 'motorcycle': 3, 'train': 2, 'kayak': 9, 'boat': 5, 'plane': 1}
  comet           {'foot': 8, 'running': 5, 'trail_running': 6, 'bicycle': 4, 'e_bike': 7, 'motorcycle': 3, 'train': 2, 'kayak': 9, 'boat': 10, 'plane': 1}

top-ranked mode per method:
  saw             plane
  topsis          plane
  vikor           plane
  promethee_ii    plane
  comet           plane

All five methods put the plane, the fastest mode there, first on the long distance under speed-heavy weights. They also agree on the next two places (train, then motorcycle) and on the kayak in ninth. SAW, TOPSIS, VIKOR and PROMETHEE II differ only in how they order the boat, the bicycle and the e-bike across places four to six. COMET reads the table differently: it builds its ranking from characteristic objects rather than from a direct distance to the ideal, so it ranks the boat last and pushes the e-bike down to seventh, while still keeping the plane first. The top of the order is stable across methods; the rest is method-dependent.
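One way to read the table above is the rank span of each mode, the gap between its best and worst rank across the five methods. The sketch below recomputes it from the printed ranks; nothing here calls beam.

```python
# Ranks copied from the printed table above (1 = best).
ranks = {
    "saw":          {"foot": 10, "running": 7, "trail_running": 8, "bicycle": 4, "e_bike": 5, "motorcycle": 3, "train": 2, "kayak": 9, "boat": 6, "plane": 1},
    "topsis":       {"foot": 10, "running": 7, "trail_running": 8, "bicycle": 6, "e_bike": 5, "motorcycle": 3, "train": 2, "kayak": 9, "boat": 4, "plane": 1},
    "vikor":        {"foot": 10, "running": 7, "trail_running": 8, "bicycle": 6, "e_bike": 5, "motorcycle": 3, "train": 2, "kayak": 9, "boat": 4, "plane": 1},
    "promethee_ii": {"foot": 10, "running": 7, "trail_running": 8, "bicycle": 6, "e_bike": 4, "motorcycle": 3, "train": 2, "kayak": 9, "boat": 5, "plane": 1},
    "comet":        {"foot": 8, "running": 5, "trail_running": 6, "bicycle": 4, "e_bike": 7, "motorcycle": 3, "train": 2, "kayak": 9, "boat": 10, "plane": 1},
}

modes = list(ranks["saw"])
# Span = worst rank minus best rank across the five methods.
span = {
    m: max(r[m] for r in ranks.values()) - min(r[m] for r in ranks.values())
    for m in modes
}
stable = [m for m in modes if span[m] == 0]
most_variable = max(span, key=span.get)

print("same rank under every method:", stable)
print("largest span:", most_variable, span[most_variable])
```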

The same ranks as a heatmap. Each column is one mode, each row one aggregation method. A column of one colour means that mode keeps its rank across methods.

grid = np.array([ranks_by_method[m] for m in methods])

plot.rank_heatmap(
    grid,
    row_names=methods,
    col_names=names,
    row_label="aggregation method",
    col_label="transport mode",
    title="Long-distance rank by mode and method (speed-heavy weights)",
)

Four weightings on the same terrain

Holding the method at SAW and the terrain at long distance, the data-driven weightings are compared against equal weights. Entropy and standard deviation read the spread of each column, and CRITIC also reads the correlation between columns; none of them encodes a preference for speed. Standard deviation keeps the three metrics near a third each. Entropy and CRITIC shift weight towards speed (about 0.47 and 0.44), still far below the 0.8 of the speed-heavy scheme. Under all four the top-ranked mode is running or train rather than the fastest mode.

print(f"{'weighting':10s}  {'weights (speed, cost, co2)':28s}  top-ranked")
for wname in ["equal", "entropy", "std", "critic"]:
    out = run_from_registry(sub, list(tb.metric_names), weights=wname, method="saw")
    w_str = np.array2string(np.round(out.weights, 3), separator=", ")
    top = names[int(np.argmin(out.ranks))]
    print(f"  {wname:10s}  {w_str:28s}  {top}")

weighting   weights (speed, cost, co2)    top-ranked
  equal       [0.333, 0.333, 0.333]         running
  entropy     [0.465, 0.231, 0.304]         train
  std         [0.327, 0.35 , 0.322]         running
  critic      [0.438, 0.242, 0.319]         train
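For orientation, the textbook forms of two of these weightings are short. The sketch below is an assumption about the general technique, not a copy of beam's code: `entropy_weights` and `std_weights` are hypothetical names, and the utility matrix is made up, with a constant middle column to show that a column without spread gets no weight.

```python
import numpy as np

def entropy_weights(u):
    """Shannon-entropy weights: columns with less uniform values get more weight."""
    u = np.asarray(u, dtype=float)
    p = u / u.sum(axis=0)  # column-wise shares
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)  # 0 * log(0) is taken as 0
    e = -plogp.sum(axis=0) / np.log(u.shape[0])  # entropy scaled to [0, 1]
    d = 1.0 - e                                  # degree of diversification
    return d / d.sum()

def std_weights(u):
    """Standard-deviation weights: proportional to each column's spread."""
    s = np.asarray(u, dtype=float).std(axis=0)
    return s / s.sum()

# Hypothetical normalized utilities; the middle column is constant.
u = np.array([
    [0.1, 0.5, 0.4],
    [0.5, 0.5, 0.5],
    [0.9, 0.5, 0.6],
])
w_entropy = entropy_weights(u)
w_std = std_weights(u)
print("entropy:", np.round(w_entropy, 3))
print("std:    ", np.round(w_std, 3))
```

Both schemes look only at how the values are distributed within each column, so neither can know that a user cares about speed.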

MEREC is left out of the table above. It takes the logarithm of the normalized scores, so it cannot accept a column that carries a hard zero, and the card normalization here (rank on CO2) produces such a zero. This is the one place the example calls run directly to override the card default with a zero-free normalization. Under z-score normalization MEREC reads the removal effect of each metric and weights cost and CO2 well above speed, so a cheap mode ends up at the top: running, as under the equal and standard-deviation weightings.

merec = run(
    sub, tb.polarity,
    weights="merec",
    normalization="zscore",
    method="saw",
    metric_ids=list(tb.metric_names),
)
w_str = np.array2string(np.round(merec.weights, 3), separator=", ")
print(f"merec (zscore)  weights={w_str}  top-ranked={names[int(np.argmin(merec.ranks))]}")

merec (zscore)  weights=[0.051, 0.54 , 0.41 ]  top-ranked=running

Across both comparisons, the choice of weighting changes which mode ranks first far more than the choice of aggregation method does. The data-driven weightings recover a cheap mode; only a deliberate emphasis on speed recovers the per-terrain fastest mode.

A critical-difference diagram on complete cases

A Demšar critical-difference diagram needs a complete tool-by-dataset table with no missing cells. Because no mode runs on every terrain, the diagram is restricted to a block of modes and the terrains where all of them run. The four ground modes (foot, running, bicycle, motorcycle) are feasible on five common terrains (flat road, mud, uphill, long distance, urban hop); open water is excluded because no ground mode runs there. The helper common_feasible_block returns that block and its common terrains. The diagram is built on speed, the metric whose per-terrain ground truth is marked above.

block_modes = ("foot", "running", "bicycle", "motorcycle")
common_terrains, block = tb.common_feasible_block(block_modes)
print("block modes:     ", block_modes)
print("common terrains: ", common_terrains)

cd = critical_difference(
    block, higher_is_better=True, tool_names=block_modes
)
print(f"Friedman statistic = {cd.friedman_statistic:.3f}, p = {cd.friedman_pvalue:.4f}")
print(f"critical difference (alpha={cd.alpha}) = {cd.critical_difference:.3f}")
print("average ranks (1 = best, fastest):")
for i in cd.order:
    print(f"  {block_modes[i]:12s}  {cd.average_ranks[i]:.2f}")
groups = [tuple(block_modes[i] for i in c) for c in cd.cliques]
print("cliques (not separable at alpha):", groups or "none")

block modes:      ('foot', 'running', 'bicycle', 'motorcycle')
common terrains:  ('flat_road', 'mud', 'uphill', 'long_distance', 'urban_hop')
Friedman statistic = 15.000, p = 0.0018
critical difference (alpha=0.05) = 2.098
average ranks (1 = best, fastest):
  motorcycle    1.00
  bicycle       2.00
  running       3.00
  foot          4.00
cliques (not separable at alpha): [('motorcycle', 'bicycle', 'running'), ('bicycle', 'running', 'foot')]

plot.critical_difference_band(cd)

On this restricted block the Friedman test rejects the null hypothesis of equal average ranks (p around 0.002): the four ground modes are separable by speed across the five common terrains, in the fixed order motorcycle, bicycle, running, foot. The critical difference (about 2.1 ranks) is wide enough that modes up to two ranks apart fall in the same clique, so only the extremes (motorcycle and foot, three ranks apart) are separable. This is a different question from the per-terrain MCDA. It asks whether the ground modes have a stable speed ordering across the terrains they share, and here they do, because the order is identical on all five terrains.
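The printed statistics can be reproduced by hand from the average ranks. The sketch below uses the standard Friedman chi-square formula and the Nemenyi critical difference; the value 2.569 is the tabulated q for four tools at alpha 0.05, and the p-value uses the closed form of the chi-square survival function that holds for three degrees of freedom only.

```python
import math

avg_ranks = [1.0, 2.0, 3.0, 4.0]  # motorcycle, bicycle, running, foot (printed above)
names = ["motorcycle", "bicycle", "running", "foot"]
n, k = 5, 4                       # 5 common terrains, 4 modes

# Friedman chi-square statistic computed from the average ranks.
chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)

# Chi-square survival function, closed form valid for df = k - 1 = 3.
p = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

# Nemenyi critical difference on the average ranks.
q_alpha = 2.569
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))

# Pairs whose average ranks differ by more than the critical difference.
separable = [
    (names[i], names[j])
    for i in range(k) for j in range(i + 1, k)
    if abs(avg_ranks[i] - avg_ranks[j]) > cd
]

print(f"Friedman statistic = {chi2:.3f}, p = {p:.4f}")   # 15.000, 0.0018
print(f"critical difference = {cd:.3f}")                 # 2.098
print("separable pairs:", separable)
```

Only the motorcycle and foot pair clears the critical difference, which matches the two overlapping cliques printed above.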

SMAA on one terrain

SMAA samples weight vectors from a Dirichlet over the three-metric simplex, runs the pipeline once per draw, and reports the share of draws in which each mode is top-ranked (its confidence factor). On the long-distance terrain with TOPSIS, no single mode owns the top rank across the weight space.

smaa_report = smaa(
    sub, tb.polarity,
    n_samples=500,
    method="topsis",
    seed=0,
)
print("confidence factor (share of sampled weightings ranking the mode first):")
for n, c in zip(names, smaa_report.confidence_factor):
    print(f"  {n:12s}  {c:.3f}")
most = names[int(np.argmax(smaa_report.confidence_factor))]
print(f"\nmost confident top-ranked mode: {most} "
      f"({smaa_report.confidence_factor.max():.2f} of samples)")

confidence factor (share of sampled weightings ranking the mode first):
  foot          0.000
  running       0.178
  trail_running  0.000
  bicycle       0.102
  e_bike        0.000
  motorcycle    0.000
  train         0.366
  kayak         0.000
  boat          0.000
  plane         0.354

most confident top-ranked mode: train (0.37 of samples)

The confidence is split between the train (about 0.37) and the plane (about 0.35), with running and bicycle taking the rest. No mode crosses half, so the long-distance top rank depends on the weighting: depending on how a user trades speed against cost and CO2, either the train or the plane comes out first. The SMAA confidence shows how the top rank is distributed over the weight space rather than a single composite score.
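The sampling loop behind such a report is small. The sketch below is a simplified stand-in, assuming a weighted sum in place of TOPSIS and a made-up three-mode utility matrix; it is not how beam's smaa is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utilities in [0, 1], higher is better. Columns: speed, cost, co2.
modes = ["runner", "train", "plane"]
utility = np.array([
    [0.1, 1.0, 1.0],
    [0.7, 0.6, 0.7],
    [1.0, 0.0, 0.0],
])

# Dirichlet(1, 1, 1) is uniform over the three-metric weight simplex.
weights = rng.dirichlet(np.ones(3), size=2000)
scores = weights @ utility.T          # one row of scores per sampled weighting
first = scores.argmax(axis=1)         # index of the top-ranked mode per draw
share = np.bincount(first, minlength=len(modes)) / len(first)

for name, s in zip(modes, share):
    print(f"{name:8s} first in {s:.3f} of sampled weightings")
```

Each share is the fraction of the weight space in which that mode wins, so a split result means the recommendation depends on the user's preferences rather than on the data alone.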

Weight perturbation on one terrain

smallest_weight_perturbation reports, for every pair of modes where one is ranked above the other under SAW, the smallest single-weight change that swaps them. It runs in closed form for SAW. Here it runs on the long-distance terrain with equal weights.

ts = smallest_weight_perturbation(
    sub, tb.polarity,
    weights="equal",
    method="saw",
)
print("base ranks under equal weights, SAW:")
print("  ", {n: int(r) for n, r in zip(names, ts.base.ranks)})

if ts.most_fragile_pair is not None:
    p = ts.most_fragile_pair
    print(
        f"\nmost fragile pair: {names[p.higher]} ranked above {names[p.lower]}, "
        f"flips by changing the weight on {tb.metric_names[p.criterion]!r} "
        f"by {p.delta:+.4f}"
    )
print(f"top rank is fragile: {ts.top_rank_is_fragile}")

base ranks under equal weights, SAW:
   {'foot': 4, 'running': 2, 'trail_running': 3, 'bicycle': 5, 'e_bike': 6, 'motorcycle': 9, 'train': 1, 'kayak': 7, 'boat': 10, 'plane': 8}

most fragile pair: kayak ranked above plane, flips by changing the weight on 'speed' by +0.0053
top rank is fragile: False

The most fragile pair is the one that the smallest single-weight nudge can reorder. Here a change of about five thousandths to the speed weight is enough to swap the kayak and the plane, while the equal-weight top rank itself is not fragile under any single-weight change of that size. The pairing tells a user which two modes sit closest together under the current weighting, so a small revision to preferences would reorder them first.
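The closed form for SAW follows from the score gap being linear in each weight. The sketch below assumes that a single weight is changed while the others are held fixed and that the weights are not renormalized afterwards; beam's smallest_weight_perturbation may define the change differently. The helper name and the utilities are hypothetical.

```python
import numpy as np

def smallest_single_weight_flip(u_hi, u_lo, weights):
    """For a pair ranked hi above lo under SAW, return (criterion, delta):
    the smallest change to one weight, others held fixed, that closes the gap."""
    u_hi, u_lo, w = (np.asarray(a, dtype=float) for a in (u_hi, u_lo, weights))
    diff = u_hi - u_lo
    gap = float(w @ diff)            # positive, since hi is ranked above lo
    best = None
    for j, d in enumerate(diff):
        if d == 0:
            continue                 # this criterion cannot move the gap
        delta = -gap / d             # solves gap + delta * d = 0
        if best is None or abs(delta) < abs(best[1]):
            best = (j, delta)
    return best

# Hypothetical pair: a slow, clean mode narrowly ahead of a fast, costly one.
metric_names = ["speed", "cost", "co2"]
slow_clean = [0.30, 0.80, 0.90]
fast_costly = [0.95, 0.45, 0.55]
criterion, delta = smallest_single_weight_flip(slow_clean, fast_costly, [1 / 3] * 3)
print(f"flip by changing the weight on {metric_names[criterion]!r} by {delta:+.4f}")
```

The criterion on which the two modes differ most needs the smallest nudge, which is why speed is the lever in this toy pair.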

Leave one dataset out

The sections above keep the analysis per terrain because no mode runs on every terrain. On complete cases, where a set of modes runs on every terrain, a pooled ranking is well defined, and the next question is how much that pooled ranking leans on any single terrain. Leave-one-dataset-out answers it: pool the block, rank it, then drop one terrain, pool the rest, and re-rank. The water modes show this sharply. The kayak, the motorboat and the small plane all run on two terrains, open water and the long distance, so they form complete cases over two terrains.

water_modes = ("kayak", "boat", "plane")
water_terrains = ("open_water", "long_distance")
mode_idx = [tb.mode_names.index(m) for m in water_modes]
terrain_idx = [tb.terrain_names.index(t) for t in water_terrains]
water_block = tb.scores[np.ix_(mode_idx, terrain_idx, range(len(tb.metric_names)))]
print("speed (km/h) by mode and terrain:")
print(f"{'mode':8s}  {'open_water':>11s}  {'long_distance':>13s}")
for i, mode in enumerate(water_modes):
    print(f"{mode:8s}  {water_block[i, 0, 0]:11.0f}  {water_block[i, 1, 0]:13.0f}")

speed (km/h) by mode and terrain:
mode       open_water  long_distance
kayak               6              8
boat               30             40
plane              18            750

The next cell pools the two terrains with speed-heavy weights and SAW, then drops each terrain in turn. The reduction across terrains is the arithmetic mean of the raw scores, one rule per metric.

rules = ("arithmetic_mean",) * len(tb.metric_names)
lodo = leave_one_dataset_out(
    water_block, tb.polarity, rules,
    dataset_names=water_terrains,
    metric_ids=tb.metric_names,
    weights=speed_heavy,
    method="saw",
    normalization=list(tb.normalization),
)
print("pooled over both terrains:", {m: int(r) for m, r in zip(water_modes, lodo.base.ranks)})
for d, r in lodo.leave_one_out.items():
    print(f"drop {water_terrains[d]:14s}:", {m: int(rk) for m, rk in zip(water_modes, r.ranks)})
print()
print("rank held across the leave-one-dataset-out runs:")
for m, s in zip(water_modes, lodo.rank_stability):
    print(f"  {m:8s}  {s * 100:5.0f}%")
print(f"most influential terrain: {water_terrains[lodo.most_influential_dataset]} "
      f"(largest rank shift {lodo.max_rank_shift})")

pooled over both terrains: {'kayak': 3, 'boat': 2, 'plane': 1}
drop open_water    : {'kayak': 3, 'boat': 2, 'plane': 1}
drop long_distance : {'kayak': 3, 'boat': 1, 'plane': 2}

rank held across the leave-one-dataset-out runs:
  kayak       100%
  boat         50%
  plane        50%
most influential terrain: long_distance (largest rank shift 1)

order = np.argsort(lodo.base.ranks)
fig, ax = plt.subplots(figsize=(5.8, 2.6))
ax.barh(range(len(water_modes)), lodo.rank_stability[order] * 100, color="#3a7ca5")
ax.set_yticks(range(len(water_modes)))
ax.set_yticklabels([water_modes[i] for i in order])
ax.invert_yaxis()
ax.set_xlim(0, 100)
ax.set_xlabel("rank held across the 2 leave-one-dataset-out runs (percent)")
ax.set_ylabel("transport mode (ordered by pooled rank)")
ax.set_title("Water modes: leave-one-terrain-out rank stability (speed-heavy weights)")
fig.tight_layout()
plt.show()

Pooled over both terrains the plane ranks first, because its 750 km/h on the long leg pulls its mean speed far above the others. Drop the long distance and rank on open water alone, and the motorboat ranks first instead: it is the fastest mode on the water, where the plane is slow. Drop open water and the long-distance order stands, plane first. The plane’s first place is not a property of water travel; it is an artifact of including the long leg in the pool. The kayak ranks last in every run under speed-heavy weights. Leave-one-dataset-out turns that into a number: the plane and the motorboat each hold their rank in only half the runs, so the pooled recommendation between them depends entirely on whether the long-distance terrain is in the pool.
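The same reversal appears when the procedure is run by hand on the speed column alone. The sketch below uses the speeds printed above and pools by arithmetic mean; it ignores cost and CO2, so it is a simplification of the three-metric run, and `ranks_by_mean_speed` is a hypothetical helper.

```python
# Speeds (km/h) copied from the table above; this sketch pools speed only.
modes = ["kayak", "boat", "plane"]
speed = {"open_water": [6, 30, 18], "long_distance": [8, 40, 750]}

def ranks_by_mean_speed(terrains):
    """Rank modes (1 = fastest) by mean speed over the given terrains."""
    means = [sum(speed[t][i] for t in terrains) / len(terrains) for i in range(len(modes))]
    order = sorted(range(len(modes)), key=lambda i: -means[i])
    ranks = [0] * len(modes)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks

base = ranks_by_mean_speed(list(speed))
runs = {drop: ranks_by_mean_speed([t for t in speed if t != drop]) for drop in speed}

# Share of leave-one-out runs in which each mode keeps its pooled rank.
stability = [
    sum(r[i] == base[i] for r in runs.values()) / len(runs)
    for i in range(len(modes))
]

print("pooled:", dict(zip(modes, base)))
for drop, r in runs.items():
    print(f"drop {drop}:", dict(zip(modes, r)))
print("rank held:", dict(zip(modes, stability)))
```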

pairwise_transitivity reads both sets of complete cases from the pairwise side, on speed. The four land modes have a fixed speed order, so their pairwise majorities are transitive and give one consistent order, motorcycle down to foot. The water modes have no cycle either, but the plane and the boat tie. The plane is faster on the long distance and the boat on open water, so over the two terrains each outperforms the other once, and neither is the pairwise-majority first choice. The kayak is outperformed by both. The tie is the pairwise reading of the half-and-half stability above.

from beam.mcda import pairwise_superiority, pairwise_transitivity
from matplotlib.colors import ListedColormap

def majority_grid(scores, names):
    sup = pairwise_superiority(scores, "higher_is_better", method_names=names)
    trans = pairwise_transitivity(sup)
    dom = trans.dominance
    tied = {p for a, b in trans.tied_pairs for p in ((a, b), (b, a))}
    order = list(np.argsort(-dom.sum(axis=1), kind="stable"))
    grid = np.zeros((trans.n_methods, trans.n_methods))
    for r, i in enumerate(order):
        for c, j in enumerate(order):
            if i == j:
                grid[r, c] = 3
            elif (i, j) in tied:
                grid[r, c] = 2
            elif dom[i, j] == 1:
                grid[r, c] = 1
    return grid, [names[i] for i in order], trans

land_grid, land_labels, land_trans = majority_grid(block, block_modes)
water_grid, water_labels, water_trans = majority_grid(water_block[:, :, 0], water_modes)
print(f"land:  transitive {land_trans.is_transitive}, tied pairs {len(land_trans.tied_pairs)}")
print(f"water: transitive {water_trans.is_transitive}, tied pairs {len(water_trans.tied_pairs)}")

cmap = ListedColormap(["#f4f4f4", "#3b6ea5", "#cfcfcf", "#777777"])
fig, axes = plt.subplots(1, 2, figsize=(9.0, 4.5))
for ax, grid, labels, title in [
    (axes[0], land_grid, land_labels, "Land modes: one consistent order"),
    (axes[1], water_grid, water_labels, "Water modes: plane and boat tie"),
]:
    ax.imshow(grid, cmap=cmap, vmin=-0.5, vmax=3.5)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right", fontsize=8)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels, fontsize=8)
    ax.set_xlabel("mode outperformed (column)")
    ax.set_title(title)
axes[0].set_ylabel("mode (row), ordered by modes outperformed")
fig.suptitle("Pairwise majorities by speed (blue = outperforms, grey = tie)")
fig.tight_layout()
plt.show()

land:  transitive True, tied pairs 0
water: transitive True, tied pairs 1

bayesian_sign_comparison reads the water block on the probability scale. With two terrains the counts are tiny, so the posterior stays close to the prior and no pair reaches a decisive label. The plane and the boat each score higher on one terrain, so the posterior splits between them.

from beam.mcda import bayesian_sign_comparison

water_sup = pairwise_superiority(water_block[:, :, 0], "higher_is_better", method_names=water_modes)
water_bayes = bayesian_sign_comparison(water_sup)
decisive = sum(1 for p in water_bayes.per_pair if p.decision != "inconclusive")
print(f"pairs with a decisive posterior at 0.95: {decisive} of {len(water_bayes.per_pair)}")
plot.bayesian_comparison(water_bayes)

pairs with a decisive posterior at 0.95: 0 of 3
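A back-of-the-envelope version shows why two terrains cannot produce a decisive label. The sketch below assumes a plain Beta-binomial model with a uniform prior and no region of practical equivalence; beam's bayesian_sign_comparison may use a different prior or model, so only the qualitative conclusion carries over.

```python
from math import comb

def prob_first_better(wins, losses):
    """P(theta > 0.5) under a Beta(1 + wins, 1 + losses) posterior (uniform prior)."""
    a, b = 1 + wins, 1 + losses
    n = a + b - 1
    # For integer a and b, the Beta CDF at 0.5 equals a binomial tail sum.
    cdf_half = sum(comb(n, j) for j in range(a, n + 1)) / 2 ** n
    return 1.0 - cdf_half

# Win counts on speed over the two water terrains, from the speed table above.
pairs = {
    ("plane", "boat"): (1, 1),   # each is faster on one terrain
    ("boat", "kayak"): (2, 0),
    ("plane", "kayak"): (2, 0),
}
posterior = {pair: prob_first_better(w, l) for pair, (w, l) in pairs.items()}
for (first, second), prob in posterior.items():
    print(f"P({first} better than {second}) = {prob:.3f}")
print("decisive at 0.95:", sum(prob >= 0.95 for prob in posterior.values()))
```

Even two wins out of two only reach 0.875 under this prior, so no pair can cross 0.95 with two terrains.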

The funky heatmap shows the complete water block as a glyph table, with the leave-one-dataset-out span and the aggregation span next to the order. The block is small (three modes, three metrics, two terrains), so the panels are short, but the dataset span panel records what the text above says: the plane and the motorboat each move by a rank when a terrain is dropped.

import beam
from beam.reporting import funky_heatmap_from_run

water = beam.Scores(
    values=water_block,
    tool_names=water_modes,
    metric_ids=tb.metric_names,
    dataset_names=water_terrains,
    layout="long",
)
water_run = beam.rank(water, weights="equal", method="saw")
funky_heatmap_from_run(water_run, title="Transport water block: scores and rank robustness")

specification_curve lists the rankings the water block produces under every weighting, aggregation and terrain. The two terrains order the modes differently, so the top mode does not hold across the grid: it is the same instability the leave-one-dataset-out section reports, read as a list of rankings.

from beam.mcda import rank_sensitivity, specification_curve

ctx = water_run.context
rs = rank_sensitivity(
    water.values,
    ctx.polarity,
    normalization=list(ctx.normalization),
    bounds=list(ctx.bounds),
    baselines=list(ctx.baselines),
    targets=list(ctx.targets),
    missing="worst",
    tool_names=water_modes,
    dataset_names=water_terrains,
)
curve = specification_curve(rs)
dom = curve.tool_names[curve.most_frequent_top_tool]
print(f"{curve.n_specifications} specifications; {dom} first in "
      f"{curve.most_frequent_top_fraction * 100:.0f}%, "
      f"{curve.n_distinct_top_tools} modes reach the top")

40 specifications; boat first in 42%, 3 modes reach the top

plot.specification_curve(curve)

The per-mode view of the rank-sensitivity split shows the same instability one mode at a time. It separates a mode whose rank depends on the terrain from one that depends on the weighting or the aggregation. The span next to each bar is the difference between the mode’s best and worst rank.

plot.rank_sensitivity_by_tool(rs)

A blind analysis masks the mode names before the pipeline is fixed. beam.blind relabels and shuffles the rows; beam.unblind restores them. The order does not change.

from beam import blind, unblind

blinded, seal = blind(water, seed=0)
blind_run = beam.rank(blinded, weights="equal", method="saw", seed=0, sensitivity=False)
restored = unblind(blind_run, seal)
named = beam.rank(water, weights="equal", method="saw", seed=0, sensitivity=False)
print("ranking identical after unblinding:",
      dict(zip(named.tool_names, named.result.ranks))
      == dict(zip(restored.tool_names, restored.result.ranks)))
print("blinding fingerprint:", blind_run.manifest["blinding"]["seal_sha256"][:12])

ranking identical after unblinding: True
blinding fingerprint: 8f8559f8d7ed

Dataset concordance

dataset_concordance ranks the modes within each terrain and reports the Kendall tau-b agreement between every pair of terrains. It applies to any set of complete cases. The four ground modes share five land terrains, so they form one set of complete cases; the water modes share two. The two blocks differ sharply in agreement.

land_mode_idx = [tb.mode_names.index(m) for m in block_modes]
land_terrain_idx = [tb.terrain_names.index(t) for t in common_terrains]
land_block = tb.scores[np.ix_(land_mode_idx, land_terrain_idx, range(len(tb.metric_names)))]
land = beam.Scores(
    values=land_block,
    tool_names=block_modes,
    metric_ids=tb.metric_names,
    dataset_names=common_terrains,
    layout="long",
)
land_run = beam.rank(land, weights="equal", method="saw")

land_conc = land_run.dataset_concordance
water_conc = water_run.dataset_concordance
print(f"land block, {len(common_terrains)} terrains: mean agreement (tau-b) {land_conc.mean_pairwise_tau:.2f}")
print(f"water block, 2 terrains:  mean agreement (tau-b) {water_conc.mean_pairwise_tau:.2f}")

land block, 5 terrains: mean agreement (tau-b) 1.00
water block, 2 terrains:  mean agreement (tau-b) -0.33

On the five land terrains the motorcycle is fastest throughout and the foot the slowest, and the mean tau-b of 1.00 says that every terrain produces the same order. On the two water terrains the fast mode changes: the small plane leads on the long-distance leg, the motorboat on open water. The difference in mean tau-b (1.00 against -0.33) measures the gap between the two blocks.
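For rankings without ties, tau-b reduces to the share of concordant pairs minus the share of discordant pairs. The sketch below uses hypothetical rank vectors, not the per-terrain rankings beam computes, to show how identical orders give 1 and how a three-item order with two of its three pairs reversed gives about -0.33.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau for two rankings without ties: (concordant - discordant) / pairs."""
    pairs = list(combinations(range(len(rank_a)), 2))
    s = 0
    for i, j in pairs:
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        s += (product > 0) - (product < 0)  # +1 if concordant, -1 if discordant
    return s / len(pairs)

# Hypothetical ranks of the same items on two datasets (1 = best).
same_order = kendall_tau([1, 2, 3, 4], [1, 2, 3, 4])
mostly_reversed = kendall_tau([1, 2, 3], [3, 1, 2])

print(f"identical orders:      {same_order:.2f}")
print(f"two of three reversed: {mostly_reversed:.2f}")
```

With only three items the statistic moves in steps of two thirds, so a single reordered pair is enough to shift it a long way.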

plot.dataset_concordance(land_run)

The struggle map locates the reversal. On the water block the plane and the motorboat swap places between the two terrains, which is the method-by-dataset interaction the mixed-effects model puts a number on next.

plot.dataset_struggle(water_run)

A single concordance over every mode and terrain at once is not well defined, for the same reason the pooled ranking is not: no mode runs on every terrain, so the diagnostic is read within each feasible block.

Mixed-effects interaction

The mixed-effects model puts a number on what the per-terrain sections show. Fit on speed, with the mode as a fixed effect and the terrain as a random intercept, it splits the speed variance into a between-terrain shift and the rest. The fit needs R’s lme4, so the chunk runs only when it is available.

from beam.heterogeneity import mixed_effects_from_matrix, r_available

speed = tb.scores[:, :, 0]
keep = ~np.isnan(speed).all(axis=1)
speed_modes = [m for m, k in zip(tb.mode_names, keep) if k]

if r_available():
    me = mixed_effects_from_matrix(speed[keep], speed_modes, tb.terrain_names)
    print(f"terrain shift (ICC): {me.icc_dataset:.2f} of the speed variance")
    print(f"residual share:      {me.residual_share:.2f} (mode-by-terrain interaction plus noise)")
else:
    print("R with lme4 not available; skipping the mixed-effects fit.")

terrain shift (ICC): 0.10 of the speed variance
residual share:      0.90 (mode-by-terrain interaction plus noise)

The contrast with the other examples is clear. On M4 the band intercept takes most of the variance, so the bands differ mostly in difficulty. Here the terrain intercept takes very little (0.10). The remaining 0.90 is residual, which bundles the mode-by-terrain interaction with noise; because the fast mode on one terrain is the slow mode on another, the interaction is expected to account for most of it. That is why a single pooled speed ranking is misleading even where it can be computed. A Bradley-Terry tree on the six terrains has too few datasets to find a stable split, the same small-sample limit the Duo and M4 examples hit; the OpenProblems spatial task is where a split appears.

Per-terrain output

The per-terrain pattern matches method-by-dataset interaction in bioinformatics benchmarks. A clustering method that ranks first on one tissue can rank last on another, and a method that needs a feature one dataset lacks is simply not applicable there, which is the same role the NaN cells play here. The informative output is per dataset: which method ranks first on each. The robustness checks (the critical-difference diagram, SMAA, weight perturbation) then say how much of the apparent ordering survives a change of conditions or of preferences.