beam

Benchmark Evaluation and Metrics

beam is a metric formalization layer for method comparisons and benchmarks in bioinformatics. It records each performance metric in a metric card with metadata grounded in measurement theory, ranks methods with multi-criteria decision analysis (MCDA), and decomposes method-dataset heterogeneity.

The metric cards are mapped to STATO, UO, OBI and HuggingFace evaluate where available. The OWL file is available at docs/beam.owl.ttl and is regenerated from the cards on each release.

Where to start

Quick start. From a CSV to a ranking and an HTML report.
How to run from beam.yaml. The declarative beam specification.
Install and run in Python, R, or the CLI.
Cards and pipeline. What a metric card encodes and how it is processed.
Critical difference. Whether the methods are separable across datasets (Friedman, Nemenyi, Skillings-Mack).
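As a rough illustration of the critical difference idea listed above, the sketch below runs a Friedman test and computes a Nemenyi critical difference on a small made-up score table. It uses SciPy directly, not beam's own API. The scores, the method names, and the variable names are all hypothetical, and the constant 2.343 is the Nemenyi critical value for three methods at alpha = 0.05 as tabulated in the critical difference literature.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical scores: rows are datasets, columns are methods (higher is better).
scores = np.array([
    [0.90, 0.80, 0.70],
    [0.85, 0.75, 0.65],
    [0.92, 0.81, 0.60],
    [0.88, 0.79, 0.72],
    [0.91, 0.83, 0.69],
    [0.87, 0.78, 0.66],
])
methods = ["A", "B", "C"]  # hypothetical method names
n_datasets, n_methods = scores.shape

# Friedman test: one sample per method, paired across datasets.
stat, p_value = friedmanchisquare(*scores.T)

# Rank methods within each dataset (rank 1 = best), then average over datasets.
ranks = rankdata(-scores, axis=1)
mean_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: two methods are separable when their mean
# ranks differ by more than cd. q_alpha is assumed for k = 3, alpha = 0.05.
q_alpha = 2.343
cd = q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6 * n_datasets))

print(f"Friedman statistic = {stat:.2f}, p = {p_value:.4f}")
for i in range(n_methods):
    for j in range(i + 1, n_methods):
        diff = abs(mean_ranks[i] - mean_ranks[j])
        verdict = "separable" if diff > cd else "not separable"
        print(f"{methods[i]} vs {methods[j]}: rank difference {diff:.2f}, {verdict}")
```

With complete data like this, every method is scored on every dataset. When coverage is partial, the Friedman test does not apply as is, which is where a test such as Skillings-Mack comes in.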

Worked vignettes

Duo 2018 clustering: scRNA-seq clustering with 14 methods on 12 datasets.
Simulated scenarios: consistency checks against documented ground truth.
Transportation modes: cross-domain example with partial coverage.
M4 forecasting: 25 methods on six frequency bands.
OpenProblems batch integration: MCDA contrasted with the platform’s own mean-of-scores rule, plus a Bradley-Terry tree on 50 spatial datasets.
Cross-benchmark meta-analysis: consistency across independently published scRNA-seq integration benchmarks.
LLM cell type annotation: GPT-4, GPT-3.5, GPTCelltype, etc. versus classical annotation methods.

Standard reports are also provided for some of these comparisons.

Source

github.com/imallona/beam. GPL-3 code, CC-BY-4.0 metric cards.