Metric cards
Every metric in beam’s registry is described by a metric card: a small YAML file that declares what the metric measures, its measurement scale, its polarity, its declared range, how to normalize it, and how to pool it across datasets. The pipeline reads these fields so that ranking decisions follow from the metric definition rather than from a hidden default. The cards below are generated from the registry, so this page always matches the shipped cards.
For how the pipeline consumes these fields, see Cards and pipeline.
Clustering
Corrected-for-chance similarity between two partitions of the same set of elements. Defined as the Rand Index adjusted by its expected value under a random model of partition pairs (the Adjusted Rand Index, ARI). Equals 1 for identical partitions and has expected value 0 for random partitions; can be negative for worse-than-random agreement.
- Task
- clustering
- Scale
- interval
- Range
- [-1, 1]
- Normalization
- baseline_relative
- Across datasets
- arithmetic_mean
- Ground truth
- required
Hubert L, Arabie P. Comparing partitions. Journal of Classification, 1985. 10.1007/BF01908075
STATO: STATO_0000593
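The chance correction described in this card can be sketched from the contingency table of the two partitions. This is a minimal illustration in plain Python, not the registry's implementation; the function name `adjusted_rand_index` and the handling of degenerate inputs are assumptions made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index from pair counts (hypothetical helper)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same elements")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two elements to form a pair")

    # Pairs placed together in both partitions, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_rows * sum_cols / comb(n, 2)  # expected index under the random model
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Assumption: both partitions are trivial (e.g. one cluster each); treat as identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # label names do not matter
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # worse than random
```

In this toy case the relabelled partition scores 1 and the crossed partition scores below 0, which matches the behaviour the card describes.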
Mutual information between two partitions, normalized to [0, 1] (normalized mutual information, NMI) using one of several conventions (arithmetic mean of entropies, geometric mean, or max). Equals 1 for identical partitions and 0 for independent partitions. Unlike ARI, NMI is not chance-corrected.
- Task
- clustering
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max (default)
- Across datasets
- arithmetic_mean
- Ground truth
- required
Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR, 2002.
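The three normalization conventions named in this card differ only in the denominator. The sketch below shows that in plain Python; the function name, the `average` parameter, and the return values for zero-entropy inputs are assumptions for illustration, not the registry's API.

```python
from collections import Counter
from math import log, sqrt


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b, average="arithmetic"):
    """NMI with a selectable denominator (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both partitions must label the same non-empty set")
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal cluster frequencies.
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = _entropy(labels_a), _entropy(labels_b)

    if average == "arithmetic":
        denom = (h_a + h_b) / 2
    elif average == "geometric":
        denom = sqrt(h_a * h_b)
    elif average == "max":
        denom = max(h_a, h_b)
    else:
        raise ValueError(f"unknown convention: {average}")

    if h_a == 0 and h_b == 0:
        return 1.0  # assumption: two single-cluster partitions count as identical
    if denom == 0:
        return 0.0  # assumption: one trivial partition carries no shared information
    return mi / denom
```

The conventions agree whenever both partitions have the same entropy and diverge otherwise, so a comparison across benchmarks should record which one was used.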
Mean silhouette coefficient over all clustered instances. For each instance, s = (b - a) / max(a, b), where a is the mean distance to other instances in the same cluster and b is the mean distance to instances in the nearest other cluster. Ranges over [-1, 1].
- Task
- clustering
- Scale
- interval
- Range
- [-1, 1]
- Normalization
- min_max (default)
- Across datasets
- arithmetic_mean
- Ground truth
- not required
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7
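The per-instance formula s = (b - a) / max(a, b) can be written out directly. The sketch below assumes Euclidean distance and scores singleton clusters as 0; both choices, and the function name, are assumptions for the example rather than something the card specifies.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all instances (hypothetical helper)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
well_separated = mean_silhouette(points, [0, 0, 1, 1])
mixed_up = mean_silhouette(points, [0, 1, 0, 1])
```

No reference labels appear anywhere in the computation, which is why the card lists ground truth as not required.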
Difference in the normalized Shannon entropy of the cluster size distribution between an estimated partition and the true partition. The normalized Shannon entropy of a partition measures how evenly elements are spread across clusters, scaled to [0, 1] by the log of the number of clusters so that a perfectly balanced partition reaches 1 and a fully concentrated one reaches 0. This metric is the absolute difference between the estimated partition's normalized entropy and the true partition's normalized entropy, as used in the Duo 2018 clustering benchmark (the s.norm.vs.true column). A value of 0 means the estimated clustering matches the truth in how evenly it distributes elements across clusters; larger values mean the estimated cluster size profile departs further from the truth.
- Task
- clustering
- Scale
- ratio
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
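The two steps in this card, normalizing the entropy of the cluster sizes and then taking the absolute difference, can be sketched as follows. The function names are hypothetical, and returning 0 for a single-cluster partition is an assumption that follows the card's statement that a fully concentrated partition reaches 0.

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Shannon entropy of cluster sizes divided by log(number of clusters)."""
    counts = Counter(labels)
    k = len(counts)
    if k < 2:
        return 0.0  # assumption: a single cluster is fully concentrated
    n = len(labels)
    entropy = -sum((c / n) * log(c / n) for c in counts.values())
    return entropy / log(k)


def entropy_difference(estimated, truth):
    """Absolute difference in normalized entropy between two partitions."""
    return abs(normalized_entropy(estimated) - normalized_entropy(truth))


balanced_truth = [0, 0, 1, 1]
skewed_estimate = [0, 0, 0, 1]
gap = entropy_difference(skewed_estimate, balanced_truth)
```

The metric only compares size profiles, so an estimate that assigns every element to the wrong cluster can still score 0 as long as its cluster sizes are as evenly spread as the truth.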
Deviation of the estimated number of clusters from the true number of clusters, as used in the Duo 2018 clustering benchmark (the nclust.vs.true column). It is the absolute difference between the number of clusters a method reports and the number of clusters in the reference partition. A value of 0 means the method recovered the correct cluster count; larger values mean it over- or under-clustered. This metric is sparsely populated in Duo 2018: 101 of the 168 method-by-data-set cells are missing, because several methods do not report a fixed cluster count for every data set. Callers should expect many NaN entries and account for partial coverage when aggregating.
- Task
- clustering
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- rank
- Across datasets
- arithmetic_mean
- Ground truth
- required
Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
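Because many cells are missing, an aggregate over this metric is easier to interpret when it is reported together with its coverage. The sketch below computes the deviation for one pair of partitions and a NaN-aware mean; the helper names are hypothetical and the values in the usage lines are made up for illustration.

```python
from math import isnan, nan


def cluster_count_deviation(estimated_labels, true_labels):
    """Absolute difference between the two cluster counts."""
    return abs(len(set(estimated_labels)) - len(set(true_labels)))


def mean_with_coverage(values):
    """Arithmetic mean over non-NaN entries, plus the fraction that were present."""
    present = [v for v in values if not isnan(v)]
    coverage = len(present) / len(values) if values else 0.0
    mean = sum(present) / len(present) if present else nan
    return mean, coverage


# Illustrative per-dataset deviations for one method; nan marks a missing cell.
per_dataset = [0.0, nan, 2.0, nan]
mean, coverage = mean_with_coverage(per_dataset)
```

Reporting the pair makes it visible when a method's favourable mean rests on only a few data sets.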
Classification
Fraction of correctly classified instances. Defined as the number of matches between predicted and true labels, divided by the total number of instances.
- Task
- classification
- Scale
- ratio
- Range
- [0, 1]
- Normalization
- min_max (default)
- Across datasets
- arithmetic_mean
- Ground truth
- required
STATO: STATO_0000415 · HUGGINGFACE_EVALUATE: accuracy
Harmonic mean of precision and recall. Used for binary classification or per-class evaluation in multiclass settings. Equals 1 only when both precision and recall are 1.
- Task
- classification, retrieval
- Scale
- ratio
- Range
- [0, 1]
- Normalization
- min_max (default)
- Across datasets
- arithmetic_mean
- Ground truth
- required
van Rijsbergen CJ. Information Retrieval, 2nd edition. Butterworth-Heinemann, 1979.
STATO: STATO_0000628 · HUGGINGFACE_EVALUATE: f1
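For the binary case, the harmonic mean in this card reduces to a few counts. The following is a minimal sketch; the function name, the `positive` parameter, and returning 0 when a denominator is empty are assumptions for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one positive class (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))

    # Assumption: an undefined ratio (no predicted or no actual positives) scores 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Calling the helper once per class with a different `positive` value gives the per-class scores mentioned in the card for multiclass settings.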
Slope of a logistic recalibration model that regresses the observed binary outcome on the linear predictor (the log-odds) of a risk model. A value of 1 means the predicted risks are correctly scaled. A slope below 1 means the predictions are too extreme, the usual signature of an overfit model; a slope above 1 means they are too moderate. The slope is therefore a target_value metric: neither the highest nor the lowest value is best, and the ideal is exactly 1. It is one of the standard calibration summaries for clinical risk-prediction models, alongside the calibration intercept and the calibration plot.
- Task
- classification
- Scale
- interval
- Range
- unbounded
- Normalization
- target_relative
- Across datasets
- arithmetic_mean
- Ground truth
- required
Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Medicine 2019, 17:230. 10.1186/s12916-019-1466-7
STATO: STATO_0000687
Forecasting
Symmetric mean absolute percentage error between a point forecast and the realized future values, in percent. Using the M4 competition definition, each horizon step contributes 2 * |F - A| / (|F| + |A|), averaged over the horizon and expressed as a percentage. The symmetric denominator bounds the value in [0, 200], unlike the classic MAPE, which is unbounded and is undefined when an actual value is zero.
- Task
- forecasting
- Scale
- ratio
- Range
- [0, 200]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Makridakis S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 1993. 10.1016/0169-2070(93)90079-3
HUGGINGFACE_EVALUATE: smape
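The per-step term and the [0, 200] bound can be checked with a few lines. This sketch follows the formula quoted in the card; the function name and the choice to score a step as 0 when forecast and actual are both zero are assumptions for the example.

```python
def smape(forecast, actual):
    """Symmetric MAPE in percent, averaged over the forecast horizon."""
    if len(forecast) != len(actual) or not forecast:
        raise ValueError("forecast and actual must be non-empty and equal in length")
    terms = []
    for f, a in zip(forecast, actual):
        denom = abs(f) + abs(a)
        # Assumption: a step where both values are zero counts as a perfect forecast.
        terms.append(0.0 if denom == 0 else 2 * abs(f - a) / denom)
    return 100 * sum(terms) / len(terms)


close = smape([110.0], [100.0])   # 2 * 10 / 210, as a percentage
worst = smape([5.0], [0.0])       # hits the upper bound of the declared range
```

The second call shows that a zero actual value produces the maximum of 200 rather than an undefined result.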
Mean absolute forecast error scaled by the in-sample mean absolute error of the seasonal naive method on the same series (the mean absolute scaled error, MASE). A value below 1 means the forecast beats the in-sample seasonal naive baseline; a value above 1 means it does not. Because the scaling uses the series' own history, MASE is scale-free and comparable across series of different magnitudes, and it is defined even when actual values are zero.
- Task
- forecasting
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting, 2006. 10.1016/j.ijforecast.2006.03.001
HUGGINGFACE_EVALUATE: mase
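The scaling step can be sketched as two means: the out-of-sample absolute error, and the in-sample absolute error of the seasonal naive forecast. The function name and the `season` parameter are assumptions for the example; a constant history, whose naive error is zero, is rejected here because the ratio is then undefined.

```python
def mase(insample, actual, forecast, season=1):
    """Mean absolute scaled error against the in-sample seasonal naive method."""
    if len(insample) <= season:
        raise ValueError("history must be longer than one seasonal period")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")

    # In-sample errors of predicting each value with the one a season earlier.
    naive_errors = [
        abs(insample[t] - insample[t - season]) for t in range(season, len(insample))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")

    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale


better_than_naive = mase([1, 2, 3, 4, 5], actual=[6, 7], forecast=[6.5, 6.5])
same_as_naive = mase([10, 20, 12, 22], actual=[14, 24], forecast=[12, 22], season=2)
```

The first forecast scores below 1 and the second, which simply repeats the last season, scores exactly 1.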
Efficiency
Wall-clock seconds elapsed during the execution of a method on a given input. A measured metric: the value comes from a timer surrounding the method invocation, not from a comparison of outputs.
- Task
- efficiency
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- log_min_max
- Across datasets
- geometric_mean
- Ground truth
- not required
UO: UO_0000010
Maximum resident set size of the process executing a method, in bytes. Measured by the operating system or a monitoring tool wrapping the process.
- Task
- efficiency
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- log_min_max
- Across datasets
- geometric_mean
- Ground truth
- not required
UO: UO_0000233
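Both efficiency cards pair a log-based normalization with a geometric mean across datasets, which suits quantities that span orders of magnitude. The sketch below is one plausible reading of those two field values, not the pipeline's definition: `log_min_max` is assumed to mean a min-max rescale of log-transformed values, and any polarity flip (lower runtime being better) is left out.

```python
from math import exp, log


def log_min_max(values):
    """Min-max rescale of log-transformed values (assumed meaning of log_min_max)."""
    if any(v <= 0 for v in values):
        raise ValueError("log rescale needs strictly positive values")
    logs = [log(v) for v in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0] * len(values)  # assumption: identical values all map to 0
    return [(x - lo) / (hi - lo) for x in logs]


def geometric_mean(values):
    """Geometric mean, computed as the exponential of the mean log."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs non-empty, strictly positive values")
    return exp(sum(log(v) for v in values) / len(values))


# Illustrative runtimes in seconds for three methods on one dataset.
scaled = log_min_max([1.0, 10.0, 100.0])
# Illustrative runtimes of one method across two datasets.
pooled = geometric_mean([1.0, 100.0])
```

On the log scale the middle method sits halfway between the extremes, and the pooled value is 10 rather than the arithmetic mean of 50.5, so one slow dataset does not dominate the summary.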
Single-cell integration
Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the degree to which batches mix within a cell type, using the batch silhouette width inverted so that well-mixed batches score high. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures how compact and well separated the cell-type labels are as clusters in the integrated embedding, using the silhouette width. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the silhouette width restricted to the most batch-isolated cell labels, which captures how well rare or isolated cell types separate. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the best F1 obtained when clustering the isolated labels against the rest, taken over a sweep of clustering resolutions. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8
STATO: STATO_0000628 · OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures local batch mixing, using the acceptance rate of the kNN batch-effect test relative to the global batch composition. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures batch mixing, using the inverse Simpson index of batch labels in local neighbourhoods. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures cell-type label purity, from the inverse Simpson index of cell-type labels in local neighbourhoods, rescaled so pure neighbourhoods score high. scIB rescales it to the unit interval where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the fraction of cells of a given label that remain in a single connected component of the integrated kNN graph, averaged over labels. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the mean per-batch overlap of the highly variable genes computed before and after integration. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- not required
Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures conservation of the cell-cycle signal, using the variance explained by the S and G2M cell-cycle scores before and after integration. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- not required
Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 2016. 10.1126/science.aad0501
OBI: OBI_0002631
Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the change in the variance attributable to batch, regressed on principal components, before and after integration. scIB rescales it to the unit interval, where higher is better.
- Task
- integration
- Scale
- interval
- Range
- [0, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- not required
Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1
OBI: OBI_0200104
Spatial
Spatial transcriptomics metric from the OpenProblems spatially variable genes task. Correlation between a method's ranking of spatially variable genes and a reference ranking; higher means closer agreement with the reference.
- Task
- spatial
- Scale
- interval
- Range
- [-1, 1]
- Normalization
- min_max
- Across datasets
- arithmetic_mean
- Ground truth
- required
Spearman C. The proof and measurement of association between two things. American Journal of Psychology 1904, 15:72-101. 10.2307/1412159
The Spearman rank correlation coefficient is what the OpenProblems spatially variable genes task uses to score a method's ranking against the reference ranking.
STATO: STATO_0000201 · HUGGINGFACE_EVALUATE: spearmanr
Transportation (toy set)
Average travel speed in kilometres per hour. A measured quantity: the value is distance covered divided by time taken, not a comparison of outputs. The transportation example uses it, along with cost and co2, to run the registry and the MCDA pipeline on a problem that has nothing to do with bioinformatics.
- Task
- transportation, efficiency
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- log_min_max
- Across datasets
- arithmetic_mean
- Ground truth
- not required
UO: UO_0010008
Monetary cost of travel per kilometre, in a generic currency unit. A measured quantity, lower is better. One of the three metrics in the transportation example, which exercises the registry and the MCDA pipeline outside bioinformatics.
- Task
- transportation, efficiency
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- log_min_max
- Across datasets
- arithmetic_mean
- Ground truth
- not required
Carbon dioxide emitted per kilometre travelled, in grams. A measured quantity, lower is better, with a true zero for zero-emission modes such as walking or cycling. Used as a domain-neutral performance metric in the transportation example. Because several modes emit exactly zero, the card recommends a rank normalization, which stays defined when a column carries hard zeros, rather than a log rescale.
- Task
- transportation, sustainability
- Scale
- ratio
- Range
- [0, inf]
- Normalization
- rank
- Across datasets
- arithmetic_mean
- Ground truth
- not required
UO: UO_0000021
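The reason this card prefers rank over a log rescale can be seen on a column with hard zeros. The sketch below is an assumed form of rank normalization (average ranks for ties, scaled to [0, 1], no polarity flip), not the pipeline's definition, and the emission values are made up for illustration.

```python
from math import log


def rank_normalize(values):
    """Average ranks scaled to [0, 1]; tied values share the same score."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2  # zero-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


# Illustrative grams of CO2 per km: walking, cycling, car, bus.
co2 = [0.0, 0.0, 120.0, 40.0]
scores = rank_normalize(co2)

# A log rescale is not defined for the same column.
try:
    log(co2[0])
    log_defined = True
except ValueError:
    log_defined = False
```

The two zero-emission modes tie at the bottom of the ranking, while the logarithm of the first entry raises an error before any rescale can happen.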