M4 forecasting competition: a large real benchmark

Author

Izaskun Mallona

Published

July 9, 2026

Provenance of the bundled table

beam does not ship the 100,000 series. It ships a small derived table, src/beam/data/M4_2018_by_frequency.csv, computed once from the GPL-3 M4comp2018 data, which carries the realized future values and the point forecasts of the top 25 methods. The reduction is recorded in src/beam/data/reduce_m4.R and was run as follows:

git clone https://github.com/carlanetto/M4comp2018.git
cd M4comp2018 && git lfs pull          # the data is stored via git-lfs
# commit 3c75dcd25c72c631f04bff1a017d9917d0e7251c, R 4.3.3
Rscript reduce_m4.R                    # writes M4_2018_by_frequency.csv

reduce_m4.R computes the mean sMAPE and mean MASE per method per frequency band, reproducing the published figures (Smyl’s sMAPE 11.374, MASE 1.536). The table is GPL-3, derived from GPL-3 data; cite Makridakis, Spiliotis and Assimakopoulos (2020, doi:10.1016/j.ijforecast.2019.04.014) when using it.
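
For readers who want the two error measures spelled out, the sketch below follows the usual M4 definitions: sMAPE as a percentage in [0, 200], and MASE as the forecast error scaled by the in-sample seasonal-naive error. It is an illustration only, not the code in reduce_m4.R; the function names and the toy series are assumptions made for this example.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE in percent, bounded in [0, 200]."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    ratio = np.abs(actual - forecast) / (np.abs(actual) + np.abs(forecast))
    return float(np.mean(200.0 * ratio))

def mase(insample, actual, forecast, period=1):
    """Mean absolute error divided by the in-sample seasonal-naive error."""
    insample = np.asarray(insample, dtype=float)
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    scale = np.mean(np.abs(insample[period:] - insample[:-period]))
    return float(np.mean(np.abs(actual - forecast)) / scale)

# Toy series (made up): a linear trend forecast by repeating the last value.
history = [10.0, 12.0, 14.0, 16.0]
future = [18.0, 20.0]
forecast = [16.0, 16.0]

smape_value = smape(future, forecast)
mase_value = mase(history, future, forecast, period=1)
print(f"sMAPE {smape_value:.3f}, MASE {mase_value:.3f}")
```

A MASE above 1 means the forecast did worse out of sample than the naive method did in sample, which is the case for this toy forecast.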

Load the table

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import beam
from beam.datasets import load_m4
from beam.cards import properties_for

m4 = load_m4()
print("methods:", len(m4.method_names), "(rank order, top first:", m4.method_names[0] + ")")
print("frequency bands:", m4.frequency_names)
print("series per band:", dict(zip(m4.frequency_names, m4.n_series.tolist())))
print("metrics:", m4.metric_ids)

methods: 25 (rank order, top first: Smyl)
frequency bands: ('Yearly', 'Quarterly', 'Monthly', 'Weekly', 'Daily', 'Hourly')
series per band: {'Yearly': 23000, 'Quarterly': 24000, 'Monthly': 48000, 'Weekly': 359, 'Daily': 4227, 'Hourly': 414}
metrics: ('smape', 'mase')

Read the metric semantics from the cards

properties_for pulls the smape and mase cards. Both are lower-is-better, ratio-scale metrics. sMAPE is bounded in [0, 200], so its card recommends min-max normalization; MASE is an unbounded scaled error and is also normalized by min-max here. Both are pooled across datasets by arithmetic mean.

for p in properties_for(list(m4.metric_ids)):
    print(
        f"{p.id:6s}  polarity={p.polarity:15s}  scale={p.scale_type:6s}  "
        f"norm={p.recommended_normalization:8s}  across_datasets={p.recommended_aggregation_across_datasets}"
    )

smape   polarity=lower_is_better  scale=ratio   norm=min_max   across_datasets=arithmetic_mean
mase    polarity=lower_is_better  scale=ratio   norm=min_max   across_datasets=arithmetic_mean
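
What min-max normalization with a lower-is-better polarity does can be sketched as follows. This assumes the minimum and maximum are taken over the methods being compared; beam may use the bounds declared on the card instead. The function name and the scores are hypothetical.

```python
import numpy as np

def min_max_utility(values, polarity="lower_is_better"):
    """Rescale scores to [0, 1] so that 1 is always the best method."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # All methods tie; assign the same utility (an arbitrary convention here).
        return np.ones_like(values)
    scaled = (values - lo) / (hi - lo)
    return 1.0 - scaled if polarity == "lower_is_better" else scaled

toy_smape = np.array([11.4, 11.7, 12.9])   # made-up sMAPE values
utility = min_max_utility(toy_smape)
print(utility)
```

After the flip, the lowest error maps to 1 and the highest to 0, so both metrics can be averaged on a common higher-is-better scale.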

Before ranking, card_data_consistency checks the scores against what those cards declare: every value inside the declared range, the baselines and targets in range, the noise floors positive. It reads the raw scores against the cards, so a unit mismatch (a metric on a percent scale against a fraction card, say) is caught here.

from beam.mcda import card_data_consistency, registry_context

ctx = registry_context(list(m4.metric_ids), "saw")
pooled_native = np.nanmean(m4.tensor(), axis=1)
audit = card_data_consistency(
    pooled_native, ctx.polarity, ctx.bounds,
    baselines=ctx.baselines, targets=ctx.targets, noise_floors=ctx.noise_floors,
    metric_ids=list(m4.metric_ids),
)
print("scores consistent with the cards:", audit.ok)
for finding in audit.findings:
    print(" ", finding.severity, finding.message)

scores consistent with the cards: True
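
The range part of such a check is simple to picture. The sketch below is a hypothetical stand-in, not card_data_consistency itself: it flags values outside a declared range and shows how percent-scale sMAPE would fail against a card that declares a fraction scale.

```python
import numpy as np

def out_of_range(scores, lower, upper):
    """Return (index, value) pairs that fall outside the declared [lower, upper] range."""
    scores = np.asarray(scores, dtype=float)
    bad = (scores < lower) | (scores > upper)
    return [(int(i), float(scores[i])) for i in np.flatnonzero(bad)]

percent_scores = [11.4, 11.7, 12.9]   # made-up sMAPE values on the percent scale
fraction_card = (0.0, 2.0)            # hypothetical card that expects a fraction scale

findings = out_of_range(percent_scores, *fraction_card)
print("violations against the fraction card:", findings)
print("violations against a [0, 200] card:", out_of_range(percent_scores, 0.0, 200.0))
```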

One MCDA run through the registry

The tensor is dense (every top-25 method has a score on every band), so it goes straight into beam.rank. The headline run uses equal weights and SAW.

scores = beam.Scores(
    values=m4.tensor(),
    tool_names=m4.method_names,
    metric_ids=m4.metric_ids,
    dataset_names=m4.frequency_names,
    layout="long",
)
run = beam.rank(scores, weights="equal", method="saw", seed=0)
order = np.argsort(run.result.ranks)
print("top five under equal weights / SAW (each band weighted equally):")
for i in order[:5]:
    print(f"  {run.result.ranks[i]:>2d}  {m4.method_names[i]}")

top five under equal weights / SAW (each band weighted equally):
   1  Pawlikowski
   2  Montero-Manso
   3  Smyl
   4  Doornik
   5  Jaganathan

The method that ranks first here need not be the one that ranked first in the official competition. The official M4 ranking pools by OWA over all 100,000 series, so the three large bands (monthly, quarterly and yearly, with 48,000, 24,000 and 23,000 series) dominate. beam treats each band as one dataset and weights the six equally, which lifts methods that do well on the small high-frequency bands.
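
The effect of the pooling rule can be seen on a toy case. The sketch below uses the real series counts printed above but two made-up methods with made-up per-band sMAPE values. It weights sMAPE alone and does not compute OWA, so it shows the mechanism rather than the official ranking.

```python
import numpy as np

# Series counts per band, in the order Yearly, Quarterly, Monthly, Weekly, Daily, Hourly.
n_series = np.array([23000, 24000, 48000, 359, 4227, 414])

# Hypothetical per-band sMAPE for two made-up methods.
toy = np.array([
    [13.0, 10.0, 12.0, 9.0, 3.5, 14.0],   # "A": slightly better on the large bands
    [13.5, 10.4, 12.6, 7.0, 3.0, 9.0],    # "B": clearly better on the small bands
])

equal_band = toy.mean(axis=1)                  # each band counts once
by_count = toy @ (n_series / n_series.sum())   # each band counts by its series

print("equal band weights:", equal_band.round(2), "-> best:", "AB"[equal_band.argmin()])
print("series-count weights:", by_count.round(2), "-> best:", "AB"[by_count.argmin()])
```

Under equal band weights "B" comes first; weighted by series count, "A" does, because the weekly, daily and hourly bands together hold only 5,000 of the 100,000 series.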

Ranking by frequency band

The per-band sMAPE shows why a single pooled order is incomplete. The same method can rank first on one band and last on another.

smape = m4.tensor(("smape",))[:, :, 0]
n_methods = len(m4.method_names)
n_bands = len(m4.frequency_names)

# Rank methods within each band by sMAPE (1 = lowest sMAPE on that band).
band_ranks = np.empty((n_methods, n_bands), dtype=int)
for b in range(n_bands):
    band_ranks[:, b] = np.argsort(np.argsort(smape[:, b])) + 1

from beam import plot

plot.rank_heatmap(
    band_ranks,
    row_names=m4.method_names,
    col_names=m4.frequency_names,
    row_label="forecasting method",
    col_label="frequency band",
    title="Per-band sMAPE rank (1 = lowest sMAPE on that band)",
)

Leave one frequency band out

beam.rank also ran a leave-one-dataset-out check across the six bands. Each band is dropped in turn, the remaining five are pooled, the methods are re-ranked, and the result is compared with the all-band ranking. A method with low stability owes its position to one band.

lodo = run.leave_one_dataset_out
top_idx = int(np.argmin(run.result.ranks))
print(f"bands evaluated: {len(lodo.evaluated_datasets)} of {n_bands}")
print(f"most influential band: {lodo.dataset_names[lodo.most_influential_dataset]} "
      f"(largest rank shift {lodo.max_rank_shift})")
print()
order = np.argsort(run.result.ranks)
print(f"{'method':16s}  pooled rank  rank held across leave-one-band-out runs")
for i in order[:8]:
    print(f"{m4.method_names[i]:16s}  {run.result.ranks[i]:>4d}        {lodo.rank_stability[i] * 100:5.0f}%")

bands evaluated: 6 of 6
most influential band: Hourly (largest rank shift 13)

method            pooled rank  rank held across leave-one-band-out runs
Pawlikowski          1           83%
Montero-Manso        2           67%
Smyl                 3           67%
Doornik              4           50%
Jaganathan           5           83%
Tartu M4 seminar     6           50%
Fiorucci             7           33%
Petropoulos          8            0%

plot.dataset_stability(run)

The hourly band is the most influential. It has only a few hundred series and very different dynamics from the long yearly and quarterly series, so a method tuned for it moves a long way in the ranking when that band is dropped.
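
The leave-one-band-out idea can be sketched on a single metric. The version below pools by a plain mean and defines stability as the fraction of leave-one-out runs in which a method keeps its all-band rank. Both are simplifying assumptions; beam's pipeline normalizes and aggregates over metrics first. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def ranks_from_scores(pooled):
    """Ordinal ranks, 1 = lowest (best) pooled error."""
    return np.argsort(np.argsort(pooled)) + 1

def leave_one_dataset_out(matrix):
    """matrix: methods x datasets, lower is better."""
    full = ranks_from_scores(matrix.mean(axis=1))
    held = np.zeros(matrix.shape[0])
    for d in range(matrix.shape[1]):
        reduced = np.delete(matrix, d, axis=1)          # drop one dataset
        held += ranks_from_scores(reduced.mean(axis=1)) == full
    return full, held / matrix.shape[1]

# Made-up errors for three methods on three datasets.
toy = np.array([
    [1.0, 1.5, 9.0],
    [2.0, 2.5, 2.2],
    [3.0, 3.5, 1.0],
])
full_ranks, stability = leave_one_dataset_out(toy)
print("all-dataset ranks:", full_ranks)
print("share of runs in which the rank held:", stability.round(2))
```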

Dataset concordance

The pooled ranking averages over the six frequency bands. dataset_concordance ranks the methods within each band and compares every pair of per-band orderings with Kendall tau-b. A high mean says the pooled ranking stands in for the individual bands; a low mean says it does not.

conc = run.dataset_concordance
names = conc.dataset_names
print(f"mean agreement across bands (Kendall tau-b): {conc.mean_pairwise_tau:.2f}")
print("least typical band:", names[conc.most_idiosyncratic_dataset])
print("mutually consistent groups:",
      [tuple(names[d] for d in g) for g in conc.concordant_groups])
print("where methods depart most from their own average rank:")
for cell in conc.notable_cells[:5]:
    side = "lower" if cell.deviation > 0 else "higher"
    print(f"  {conc.tool_names[cell.tool]} on {names[cell.dataset]}: "
          f"rank {cell.rank}, {side} than its mean {cell.mean_rank:.1f}")

mean agreement across bands (Kendall tau-b): 0.21
least typical band: Daily
mutually consistent groups: [('Yearly',), ('Quarterly', 'Monthly'), ('Weekly',), ('Daily',), ('Hourly',)]
where methods depart most from their own average rank:
  Nikzad on Hourly: rank 2, higher than its mean 15.8
  Legaki on Yearly: rank 2, higher than its mean 15.5
  Ibrahim on Yearly: rank 6, higher than its mean 18.7
  Shaub on Yearly: rank 3, higher than its mean 15.5
  Darin on Daily: rank 25, lower than its mean 12.5

M4 is the contrast to Duo: the bands disagree more (mean tau-b of 0.21), which reflects the strong method-by-band interaction that the rank-sensitivity section also finds.
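
The mean pairwise agreement can be reproduced in spirit with a few lines. The sketch below implements Kendall tau-b directly and averages it over every pair of columns of a rank matrix. The helper names and the toy ranks are assumptions for illustration, not beam's implementation.

```python
import itertools
import numpy as np

def kendall_tau_b(x, y):
    """Kendall tau-b: concordant minus discordant pairs, corrected for ties."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    balance, untied_x, untied_y = 0.0, 0, 0
    for i, j in itertools.combinations(range(len(x)), 2):
        sx, sy = np.sign(x[i] - x[j]), np.sign(y[i] - y[j])
        balance += sx * sy            # +1 concordant, -1 discordant, 0 if tied
        untied_x += int(sx != 0)
        untied_y += int(sy != 0)
    return float(balance / np.sqrt(untied_x * untied_y))

def mean_pairwise_tau(rank_matrix):
    """Average tau-b over all pairs of columns (datasets)."""
    cols = range(rank_matrix.shape[1])
    taus = [kendall_tau_b(rank_matrix[:, a], rank_matrix[:, b])
            for a, b in itertools.combinations(cols, 2)]
    return float(np.mean(taus))

# Made-up ranks of four methods (rows) on three datasets (columns).
toy_ranks = np.array([[1, 1, 4], [2, 3, 3], [3, 2, 2], [4, 4, 1]])
mean_tau = mean_pairwise_tau(toy_ranks)
print(f"mean pairwise tau-b: {mean_tau:.2f}")
```

In this toy case the third dataset reverses the order of the first, so the mean is negative even though the first two datasets largely agree.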

plot.dataset_concordance(run)

A second plot marks where each method places higher or lower than its own typical rank, showing which methods carry the band disagreement.

plot.dataset_struggle(run)

Funky heatmap with rank robustness

The funky heatmap shows the same run as a glyph table over the two error metrics, with three robustness panels: the rank span across the six leave-one-band-out runs, the rank span across the five aggregations, and the SMAA rank-acceptability bar.

from beam.reporting import funky_heatmap_from_run

funky_heatmap_from_run(run, title="M4: scores and rank robustness")

There are only two metrics, so the glyph grid is small. The robustness panels carry more: many of the 25 methods change rank when a band is dropped or the aggregation is changed, and the SMAA bar spreads the top ranks across several methods rather than one. The methods score close together, so the pooled order is not firm.

Mixed-effects on sMAPE

The leave-one-band-out check asks whether the ranking leans on one band. A mixed-effects model asks the complementary question: how much of the sMAPE variation is a stable method effect and how much is the method-by-band interaction. It needs R’s lme4, so the chunk runs only when it is available.

from beam.heterogeneity import mixed_effects_from_matrix, r_available

smape_matrix = m4.tensor(("smape",))[:, :, 0]

if r_available():
    me = mixed_effects_from_matrix(smape_matrix, m4.method_names, m4.frequency_names)
    print(f"band shift (ICC):  {me.icc_dataset:.2f} of the sMAPE variance")
    print(f"residual share:    {me.residual_share:.2f}")
else:
    print("R with lme4 not available; skipping the mixed-effects fit.")

band shift (ICC):  0.90 of the sMAPE variance
residual share:    0.10

On M4 the band intercept takes most of the sMAPE variance: the bands differ mostly in how hard they are to forecast for every method alike, which is the opposite of the transportation example, where the terrain decides which mode ranks first. A Bradley-Terry tree on the six bands, with the seasonal period as the splitting feature, has too few datasets to find a stable split; this is the same small-sample limit the Duo benchmark hits, and why the OpenProblems spatial task (50 datasets) is where a split appears.

from beam.heterogeneity import bradley_terry_tree, bttree_available

seasonal_period = {
    "Yearly": 1.0, "Quarterly": 4.0, "Monthly": 12.0,
    "Weekly": 1.0, "Daily": 1.0, "Hourly": 24.0,
}
if bttree_available():
    bt = bradley_terry_tree(
        smape_matrix,
        m4.method_names,
        m4.frequency_names,
        numeric_features={"seasonal_period": [seasonal_period[b] for b in m4.frequency_names]},
        polarity="lower_is_better",
        minsize=2,
    )
    print(f"split found: {bt.did_split}")
    print(bt.summary())
else:
    print("R with psychotree not available; skipping the Bradley-Terry tree.")

split found: False
The Bradley-Terry tree found no dataset feature that splits the method ranking at alpha 0.05 over 6 datasets, so the ranking is reported as one Bradley-Terry model over all of them, led by Pawlikowski. With this many datasets the split test has few observations to work with, the same small-sample limit the critical-difference diagram shows; a benchmark with more datasets is where a split can appear.
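
When no split is found, the fallback is a single Bradley-Terry model. The sketch below fits one by the standard minorization-maximization update from a matrix of pairwise wins. It leaves out the tree and the split test, the win counts are made up, and the function name is hypothetical.

```python
import numpy as np

def bradley_terry_strengths(wins, n_iter=200):
    """wins[i, j] = number of datasets on which method i beats method j."""
    wins = np.asarray(wins, dtype=float)
    n_ij = wins + wins.T                  # comparisons per pair (diagonal is 0)
    total_wins = wins.sum(axis=1)
    p = np.full(len(wins), 1.0 / len(wins))
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j), then renormalize.
        denom = (n_ij / (p[:, None] + p[None, :])).sum(axis=1)
        p = total_wins / denom
        p /= p.sum()
    return p

# Made-up win counts for three methods compared on six datasets per pair.
wins = np.array([
    [0, 4, 5],
    [2, 0, 4],
    [1, 2, 0],
])
strengths = bradley_terry_strengths(wins)
print("estimated strengths:", strengths.round(3))
```

The update needs every method to win at least once; a method that never wins has no finite maximum-likelihood strength.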

Rank sensitivity: band or analyst choice

The leave-one-band-out check and the mixed-effects ICC both say the band matters. rank_sensitivity puts that on the same scale as the two modeling choices and says how much each one moves the ranking. It runs every combination of weighting scheme, aggregation rule and band, then splits each method’s rank variance into a share for each factor by analysis of variance. The design is a balanced full factorial, so the shares are exact, not sampled.

from beam.mcda import rank_sensitivity

rs = rank_sensitivity(
    m4.tensor(),
    ctx.polarity,
    normalization=list(ctx.normalization),
    bounds=list(ctx.bounds),
    baselines=list(ctx.baselines),
    targets=list(ctx.targets),
    tool_names=m4.method_names,
    dataset_names=m4.frequency_names,
)
print(f"{rs.n_combinations} combinations of {len(rs.weightings)} weightings, "
      f"{len(rs.methods)} aggregations and {len(rs.dataset_names)} bands")
print(f"  band (dataset): {rs.dataset_share:.3f} of the rank variance")
print(f"  weighting:      {rs.weighting_share:.3f}")
print(f"  aggregation:    {rs.aggregation_share:.3f}")
print(f"  interactions:   {rs.interaction_share:.3f}")
print(f"  most influential factor: {rs.most_influential_factor}")

120 combinations of 4 weightings, 5 aggregations and 6 bands
  band (dataset): 0.963 of the rank variance
  weighting:      0.002
  aggregation:    0.003
  interactions:   0.032
  most influential factor: dataset

plot.rank_sensitivity(rs)

The band accounts for almost all the rank variance and the two analyst choices almost none. The M4 order is a question of which frequency you score on, not how you weight or aggregate. The mixed-effects ICC above (0.90) points the same way by a different route: it comes from a random-effects model of the scores, whereas the shares here are an exact factorial decomposition of the ranks.
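
The decomposition behind those shares is an ordinary balanced analysis of variance. The sketch below works on one method's ranks laid out as a weighting x aggregation x band array and reports main-effect shares, with everything else counted as interactions. It illustrates the arithmetic under that assumed layout and is not beam's rank_sensitivity; the ranks are made up.

```python
import numpy as np

def main_effect_shares(ranks):
    """ranks: array indexed (weighting, aggregation, band) for one method."""
    ranks = np.asarray(ranks, dtype=float)
    grand = ranks.mean()
    total = ((ranks - grand) ** 2).sum()
    shares = {}
    for axis, name in enumerate(("weighting", "aggregation", "band")):
        other = tuple(a for a in range(ranks.ndim) if a != axis)
        level_means = ranks.mean(axis=other)
        cells_per_level = ranks.size / ranks.shape[axis]
        shares[name] = float(cells_per_level * ((level_means - grand) ** 2).sum() / total)
    shares["interactions"] = 1.0 - sum(shares.values())
    return shares

# Made-up ranks that depend on the band only: 4 weightings x 5 aggregations x 6 bands.
band_effect = np.array([1.0, 3.0, 1.0, 10.0, 18.0, 5.0])
pure = np.zeros((4, 5, 6)) + band_effect
shares_pure = main_effect_shares(pure)

# Let one weighting shift the rank on one band.
perturbed = pure.copy()
perturbed[0, :, 3] += 2.0
shares_perturbed = main_effect_shares(perturbed)
print({k: round(v, 3) for k, v in shares_perturbed.items()})
```

Because every combination appears exactly once, the main-effect sums of squares are orthogonal and the shares add up to one without any sampling.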

The shares above are pooled over the methods. The per-method version splits the same variance one method at a time. It separates a method whose rank depends on the band from one that depends on the weighting or the aggregation. The span next to each bar is the difference between the method’s best and worst rank.

plot.rank_sensitivity_by_tool(rs)

headline = m4.method_names[rs.headline_tool]
print(f"top method overall: {headline}, rank 1 in {rs.headline_top_fraction * 100:.0f}% of combinations")
print(f"{'band':12s}  mean rank of {headline}")
for band, mean_rank in zip(rs.dataset_names, rs.headline_rank_by_dataset):
    print(f"{band:12s}  {mean_rank:.1f}")

top method overall: Smyl, rank 1 in 33% of combinations
band          mean rank of Smyl
Yearly        1.0
Quarterly     2.6
Monthly       1.0
Weekly        9.9
Daily         17.8
Hourly        5.4

specification_curve lists the rankings the same grid produces and reports how often the top method holds. The full grid mixes the choices with the band; running it on the pooled matrix instead isolates the choices, so the gap between the two fractions is the band’s doing.

from beam.mcda import specification_curve

curve = specification_curve(rs)
dom = curve.tool_names[curve.most_frequent_top_tool]
print(f"choices plus band ({curve.n_specifications} combinations): "
      f"{dom} first in {curve.most_frequent_top_fraction * 100:.0f}%, "
      f"{curve.n_distinct_top_tools} methods reach the top")

pooled = specification_curve(
    rank_sensitivity(
        run.matrix, ctx.polarity,
        normalization=list(ctx.normalization), bounds=list(ctx.bounds),
        baselines=list(ctx.baselines), targets=list(ctx.targets),
        tool_names=m4.method_names,
    )
)
pdom = pooled.tool_names[pooled.most_frequent_top_tool]
print(f"choices only ({pooled.n_specifications} combinations): "
      f"{pdom} first in {pooled.most_frequent_top_fraction * 100:.0f}%")

plot.specification_curve(curve)

choices plus band (120 combinations): Smyl first in 33%, 5 methods reach the top
choices only (20 combinations): Pawlikowski first in 90%

Blind analysis

A blind analysis fixes the pipeline before the method names are known, so the weighting and the metric set cannot be chosen to favor a method expected to rank first. beam.blind masks the names and shuffles the rows; beam.unblind restores them. The ranking is unchanged, and the seal fingerprint is recorded in the manifest.

from beam import blind, unblind

blinded, seal = blind(scores, seed=0)
blind_run = beam.rank(blinded, weights="equal", method="saw", seed=0, sensitivity=False)
restored = unblind(blind_run, seal)
named = beam.rank(scores, weights="equal", method="saw", seed=0, sensitivity=False)
print("ranking identical after unblinding:",
      dict(zip(named.tool_names, named.result.ranks))
      == dict(zip(restored.tool_names, restored.result.ranks)))
print("top method after unblinding:", restored.top_tool)
print("blinding fingerprint:", blind_run.manifest["blinding"]["seal_sha256"][:12])

ranking identical after unblinding: True
top method after unblinding: Pawlikowski
blinding fingerprint: 81d83ff65bdf
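
The mechanics of blinding are small enough to sketch: shuffle the rows, replace the names with placeholders, keep the mapping as a seal, and record a hash of the seal. The code below is a hypothetical illustration using only numpy and the standard library; the names, the scores and the seal format are assumptions and do not reflect what beam.blind stores.

```python
import hashlib
import json
import numpy as np

def blind_rows(values, names, seed=0):
    """Shuffle rows and mask names; return the seal needed to undo it."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(names))
    masked = [f"tool_{k:02d}" for k in range(len(names))]
    seal = dict(zip(masked, (names[i] for i in perm)))   # masked name -> real name
    digest = hashlib.sha256(json.dumps(seal, sort_keys=True).encode("utf-8")).hexdigest()
    return np.asarray(values)[perm], masked, seal, digest

# Made-up scores for three made-up methods.
values = np.array([[11.4], [11.7], [12.9]])
names = ["alpha", "beta", "gamma"]

blinded, masked, seal, digest = blind_rows(values, names, seed=0)
ranks = np.argsort(np.argsort(blinded[:, 0])) + 1        # the ranking is done blind
restored = {seal[m]: int(r) for m, r in zip(masked, ranks)}
print("restored ranking:", restored)
print("seal fingerprint:", digest[:12])
```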

Pairwise superiority across the bands

rank_sensitivity showed the band carries the ranking. pairwise_superiority reads the same fact pair by pair: how often one method outperforms another across the six bands on sMAPE. sMAPE declares no noise floor, so the equivalence band is zero here; any difference counts.

from beam.mcda import pairwise_superiority

sup = pairwise_superiority(smape, "lower_is_better", method_names=m4.method_names)
top_method = m4.method_names[sup.order[0]]
print(f"highest standing: {top_method} ({sup.standing[sup.order[0]]:.2f})")
print(f"method pairs the sign test cannot separate: {len(sup.equivalent_pairs)} of {len(sup.per_pair)}")
top, runner = sup.order[0], sup.order[1]
pair = next(p for p in sup.per_pair if {p.a, p.b} == {int(top), int(runner)})
n_top = pair.a_outperforms if pair.a == top else pair.b_outperforms
print(f"{top_method} outperforms {m4.method_names[runner]} on {n_top} of {pair.n_compared} bands")

highest standing: Pawlikowski (0.84)
method pairs the sign test cannot separate: 238 of 300
Pawlikowski outperforms Smyl on 2 of 6 bands

With only six bands the sign test has little power, so few pairs reach significance, and the method with the highest standing outperforms the others on some bands and not on others. No method outperforms the field across every frequency.
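
How little power six bands leave can be computed exactly. The sketch below evaluates a two-sided exact sign test for every possible split of six bands. That beam applies this particular test at the 0.05 level is an assumption; the function name is hypothetical.

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign-test p-value; ties are assumed to have been dropped."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_values = {f"{a}-{6 - a}": sign_test_p(a, 6 - a) for a in range(3, 7)}
for split, p in p_values.items():
    print(f"{split}: p = {p:.5f}")
```

Under that assumption only a 6-0 sweep falls below 0.05 (p = 0.03125); a 5-1 split already gives p = 0.21875.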

pairwise_transitivity asks whether those pairwise majorities agree with a single order. With the ranking moving across the bands, they need not.

from beam.mcda import pairwise_transitivity

trans = pairwise_transitivity(sup)
print(f"transitive: {trans.is_transitive}; circular triads: {trans.n_circular_triads} of {trans.n_triads}")

transitive: False; circular triads: 2 of 2300

The matrix below orders the methods by how many others they outperform. A transitive relation fills the upper triangle; a red cell below the diagonal marks a method that outperforms one ranked above it, which can only happen inside a cycle.

plot.pairwise_majority(trans)
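
Counting circular triads is a brute-force check over every group of three methods. The sketch below assumes a boolean majority matrix and made-up relations; it is not pairwise_transitivity itself.

```python
from itertools import combinations
import numpy as np

def count_circular_triads(beats):
    """beats[i, j] is True when method i beats method j on a majority of bands."""
    n = len(beats)
    circular = 0
    for i, j, k in combinations(range(n), 3):
        forward = beats[i, j] and beats[j, k] and beats[k, i]
        backward = beats[j, i] and beats[k, j] and beats[i, k]
        if forward or backward:
            circular += 1
    return circular, n * (n - 1) * (n - 2) // 6

# Made-up relation: 0 beats 1, 1 beats 2, 2 beats 0, and all three beat 3.
beats = np.zeros((4, 4), dtype=bool)
for winner, loser in [(0, 1), (1, 2), (2, 0), (0, 3), (1, 3), (2, 3)]:
    beats[winner, loser] = True

n_circular, n_triads = count_circular_triads(beats)
print(f"circular triads: {n_circular} of {n_triads}")
```

With 25 methods the same formula gives 2,300 triads, the denominator in the output above.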

bayesian_sign_comparison puts the same comparison on the probability scale: for each pair, the posterior probability that one method is practically better. With six bands the posterior is coarse, so most pairs stay inconclusive at the 0.95 threshold.

from beam.mcda import bayesian_sign_comparison

bayes = bayesian_sign_comparison(sup)
decisive = sum(1 for p in bayes.per_pair if p.decision != "inconclusive")
print(f"pairs with a decisive posterior at 0.95: {decisive} of {len(bayes.per_pair)}")
plot.bayesian_comparison(bayes)

pairs with a decisive posterior at 0.95: 62 of 300
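
Why six bands give a coarse posterior can be seen with a simplified model. The sketch below assumes a uniform Beta prior on the probability that one method beats the other and a zero equivalence region. beam's bayesian_sign_comparison may use a different prior or a three-way Dirichlet, so this is an analogue, not its implementation.

```python
from math import comb

def prob_a_better(wins_a, wins_b, prior=1):
    """P(theta > 0.5) under a Beta(prior + wins_a, prior + wins_b) posterior.

    For integer parameters the Beta tail equals a binomial sum:
    P(theta > x) = P(Binomial(a + b - 1, x) <= a - 1), evaluated here at x = 0.5.
    """
    a, b = prior + wins_a, prior + wins_b
    n = a + b - 1
    return sum(comb(n, j) for j in range(a)) / 2 ** n

posterior = {f"{w}-{6 - w}": prob_a_better(w, 6 - w) for w in range(3, 7)}
for split, prob in posterior.items():
    print(f"{split}: P(A better) = {prob:.4f}")
```

Under these assumptions a 6-0 sweep reaches 0.992 while a 5-1 split stops at 0.9375, just short of the 0.95 threshold.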

Recommendation

Pooled with equal weight per frequency band and SAW over sMAPE and MASE, the ranking favours methods that do well across all six bands, not only on the high-volume monthly, quarterly and yearly series. The official competition order differs: it pools by OWA over all series, which in effect weights each band by its number of series, so the three large bands dominate and the ES-RNN of Smyl ranks first. The leave-one-band-out analysis points at the hourly band as the one the ranking leans on most, and with only six bands the critical-difference diagram has little power to separate the top methods. The top methods are close together, and which one comes first turns on whether the bands are weighted equally or by series count. The choice is recorded in the manifest.