Metric cards

Every metric in beam’s registry is described by a metric card: a small YAML file that declares what the metric measures, its measurement scale, its polarity, its declared range, how to normalize it, and how to pool it across datasets. The pipeline reads these fields so that ranking decisions follow from the metric definition rather than from a hidden default. The cards below are generated from the registry, so this page always matches the shipped cards.

For how the pipeline consumes these fields, see Cards and pipeline.
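As a rough illustration of the fields a card declares, the sketch below models one card as a Python dataclass. The class name, attribute names, and the `in_declared_range` helper are hypothetical; they mirror the labels shown on this page, not the actual YAML keys or any beam API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MetricCard:
    # Hypothetical attribute names that mirror the labels on this page.
    metric_id: str
    polarity: str  # "higher_is_better", "lower_is_better", or "target_value"
    task: Tuple[str, ...]
    scale: str  # "interval" or "ratio"
    value_range: Tuple[Optional[float], Optional[float]]  # None marks an open end
    normalization: str
    across_datasets: str
    ground_truth_required: bool


# The ari card from this page, transcribed by hand into the sketch.
ari = MetricCard(
    metric_id="ari",
    polarity="higher_is_better",
    task=("clustering",),
    scale="interval",
    value_range=(-1.0, 1.0),
    normalization="baseline_relative",
    across_datasets="arithmetic_mean",
    ground_truth_required=True,
)


def in_declared_range(card: MetricCard, value: float) -> bool:
    """Check a value against the card's declared range (hypothetical helper)."""
    lo, hi = card.value_range
    return (lo is None or value >= lo) and (hi is None or value <= hi)
```

A consumer could use a check like `in_declared_range(ari, 0.5)` to reject values that fall outside what the card declares.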

Clustering

ari · higher is better
Adjusted Rand Index (Hubert-Arabie)

Corrected-for-chance similarity between two partitions of the same set of elements. Defined as the Rand Index adjusted by its expected value under a random model of partition pairs. Equals 1 for identical partitions and has an expected value of 0 for random partitions; it can be negative when agreement is worse than chance.

Task
clustering
Scale
interval
Range
[-1, 1]
Normalization
baseline_relative
Across datasets
arithmetic_mean
Ground truth
required

Hubert L, Arabie P. Comparing partitions. Journal of Classification, 1985. 10.1007/BF01908075

STATO: STATO_0000593
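The ARI definition above can be sketched from the pair counts of the contingency table. This is a minimal stdlib-only illustration, not the implementation beam ships; the function name and the handling of degenerate inputs are assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index of two labelings of the same elements."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two elements to form a pair")

    # Pairs of elements that share a cluster in both, in A only, and in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # expected index under the random model
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Assumed convention: two trivial partitions count as identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])` is negative, because the two partitions agree less often than chance would predict.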

nmi · higher is better
Normalized mutual information

Mutual information between two partitions, normalized to [0, 1] using one of several conventions (arithmetic mean of entropies, geometric mean, or max). Equals 1 for identical partitions and 0 for independent partitions. Unlike ARI, NMI is not chance-corrected.

Task
clustering
Scale
interval
Range
[0, 1]
Normalization
min_max (default)
Across datasets
arithmetic_mean
Ground truth
required

Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR, 2002.
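The three normalization conventions named in the NMI description differ only in the denominator. The sketch below makes that explicit; the function name, the `average` parameter, and the treatment of zero-entropy partitions are assumptions for illustration.

```python
from collections import Counter
from math import log, sqrt


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b, average="arithmetic"):
    """NMI with a selectable normalizer: 'arithmetic', 'geometric', or 'max'."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal cluster counts.
    mi = sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

    h_a, h_b = _entropy(labels_a), _entropy(labels_b)
    if h_a == 0 and h_b == 0:
        return 1.0  # assumed convention: two single-cluster partitions agree
    denominators = {
        "arithmetic": (h_a + h_b) / 2,
        "geometric": sqrt(h_a * h_b),
        "max": max(h_a, h_b),
    }
    denom = denominators[average]
    return mi / denom if denom > 0 else 0.0
```

The three conventions coincide when both partitions have the same entropy and diverge otherwise, so the choice of normalizer should be reported alongside any NMI value.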

silhouette · higher is better
Silhouette coefficient

Mean silhouette coefficient over all clustered instances. For each instance, s = (b - a) / max(a, b), where a is the mean distance to other instances in the same cluster and b is the mean distance to instances in the nearest other cluster. Ranges over [-1, 1].

Task
clustering
Scale
interval
Range
[-1, 1]
Normalization
min_max (default)
Across datasets
arithmetic_mean
Ground truth
not required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7
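The per-instance formula s = (b - a) / max(a, b) from the silhouette description can be sketched as follows, using Euclidean distance on coordinate tuples. The function name and the choice to score singleton clusters as 0 are assumptions for this illustration.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

With points `[(0,), (1,), (10,), (11,)]`, the labeling `[0, 0, 1, 1]` scores close to 1, while the interleaved labeling `[0, 1, 0, 1]` scores below 0.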

shannon_entropy_diff · lower is better
Shannon entropy difference of cluster sizes

Difference in the normalized Shannon entropy of the cluster size distribution between an estimated partition and the true partition. The normalized Shannon entropy of a partition measures how evenly elements are spread across clusters, scaled to [0, 1] by the log of the number of clusters so that a perfectly balanced partition reaches 1 and a fully concentrated one reaches 0. This metric is the absolute difference between the estimated partition's normalized entropy and the true partition's normalized entropy, as used in the Duo 2018 clustering benchmark (the s.norm.vs.true column). A value of 0 means the estimated clustering matches the truth in how evenly it distributes elements across clusters; larger values mean the estimated cluster size profile departs further from the truth.

Task
clustering
Scale
ratio
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
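The two-step definition above (normalized entropy of each partition, then the absolute difference) can be sketched directly. The function names are illustrative, and returning 0 for a single-cluster partition is an assumption that follows the "fully concentrated" case in the description.

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Shannon entropy of the cluster-size distribution, divided by log(k)."""
    counts = list(Counter(labels).values())
    k = len(counts)
    if k < 2:
        return 0.0  # a single cluster is fully concentrated
    n = sum(counts)
    entropy = -sum((c / n) * log(c / n) for c in counts)
    return entropy / log(k)


def shannon_entropy_diff(estimated, truth):
    """Absolute difference in normalized entropy between two partitions."""
    return abs(normalized_entropy(estimated) - normalized_entropy(truth))
```

Note that the metric compares only cluster size profiles: two partitions with the same size distribution score 0 even if they assign different elements to each cluster.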

nclust_deviation · lower is better
Cluster-count deviation from the truth

Deviation of the estimated number of clusters from the true number of clusters, as used in the Duo 2018 clustering benchmark (the nclust.vs.true column). It is the absolute difference between the number of clusters a method reports and the number of clusters in the reference partition. A value of 0 means the method recovered the correct cluster count; larger values mean it over- or under-clustered. This metric is sparsely populated in Duo 2018: 101 of the 168 method-by-dataset cells are missing, because several methods do not report a fixed cluster count for every dataset. Callers should therefore expect many NaN entries and account for partial coverage when aggregating.

Task
clustering
Scale
ratio
Range
[0, inf]
Normalization
rank
Across datasets
arithmetic_mean
Ground truth
required

Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
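Because 101 of the 168 cells are missing, only 67 cells (about 40%) carry a value, so a pooled mean should be reported together with its coverage. The sketch below shows one way to do that; the function name and return shape are assumptions, not part of the pipeline.

```python
import math


def pooled_mean_with_coverage(values):
    """Arithmetic mean over non-missing entries, plus the fraction present."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    coverage = len(present) / len(values) if values else 0.0
    mean = sum(present) / len(present) if present else math.nan
    return mean, coverage
```

For one method's row such as `[0.0, nan, 2.0, nan]`, this returns a mean of 1.0 at a coverage of 0.5, which makes it visible that the mean rests on half the datasets.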

Classification

accuracy · higher is better
Classification accuracy

Fraction of correctly classified instances. Defined as the number of matches between predicted and true labels, divided by the total number of instances.

Task
classification
Scale
ratio
Range
[0, 1]
Normalization
min_max (default)
Across datasets
arithmetic_mean
Ground truth
required

STATO: STATO_0000415 · HUGGINGFACE_EVALUATE: accuracy

f1_score · higher is better
F1 score

Harmonic mean of precision and recall. Used for binary classification or per-class evaluation in multiclass settings. Equals 1 only when both precision and recall are 1.

Task
classification, retrieval
Scale
ratio
Range
[0, 1]
Normalization
min_max (default)
Across datasets
arithmetic_mean
Ground truth
required

van Rijsbergen CJ. Information Retrieval, 2nd edition. Butterworth-Heinemann, 1979.

STATO: STATO_0000628 · HUGGINGFACE_EVALUATE: f1
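The harmonic mean of precision and recall reduces to a single expression in the confusion counts. A minimal sketch, with an assumed convention of returning 0 when there are no true positives, false positives, or false negatives:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0  # assumed convention when nothing is predicted or present
    return 2 * true_positives / denominator
```

With 8 true positives, 2 false positives, and 2 false negatives, precision and recall are both 0.8 and so is F1.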

calibration_slope · target value
Calibration slope

Slope of a logistic recalibration model that regresses the observed binary outcome on the linear predictor (the log-odds) of a risk model. A value of 1 means the predicted risks are correctly scaled. A slope below 1 means the predictions are too extreme, the usual signature of an overfit model; a slope above 1 means they are too moderate. The slope is therefore a target_value metric: neither the highest nor the lowest value is best; the ideal is exactly 1. It is one of the standard calibration summaries for clinical risk-prediction models, alongside the calibration intercept and the calibration plot.

Task
classification
Scale
interval
Range
unbounded
Normalization
target_relative
Across datasets
arithmetic_mean
Ground truth
required

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Medicine 2019, 17:230. 10.1186/s12916-019-1466-7

STATO: STATO_0000687

Forecasting

smape · lower is better
Symmetric Mean Absolute Percentage Error

Symmetric mean absolute percentage error between a point forecast and the realized future values, in percent. Using the M4 competition definition, each horizon step contributes 2 * |F - A| / (|F| + |A|), averaged over the horizon and expressed as a percentage. The symmetric denominator bounds the value in [0, 200], unlike the classic MAPE which is unbounded and undefined when an actual value is zero.

Task
forecasting
Scale
ratio
Range
[0, 200]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Makridakis S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 1993. 10.1016/0169-2070(93)90079-3

HUGGINGFACE_EVALUATE: smape
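The per-step term 2 * |F - A| / (|F| + |A|) from the sMAPE description translates directly into code. In this sketch the function name is illustrative, and treating a step where both forecast and actual are zero as contributing 0 is an assumption, since the formula is undefined there.

```python
def smape(forecast, actual):
    """Symmetric MAPE in percent, following the per-step term in the card."""
    if len(forecast) != len(actual) or not forecast:
        raise ValueError("forecast and actual must be non-empty and equal length")
    terms = []
    for f, a in zip(forecast, actual):
        denominator = abs(f) + abs(a)
        # Assumed convention: a 0/0 step counts as a perfect forecast.
        terms.append(0.0 if denominator == 0 else 2 * abs(f - a) / denominator)
    return 100 * sum(terms) / len(terms)
```

A forecast of 0 against a nonzero actual gives the upper bound of 200, which is where the declared range [0, 200] comes from.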

mase · lower is better
Mean Absolute Scaled Error

Mean absolute forecast error scaled by the in-sample mean absolute error of the seasonal naive method on the same series. A value below 1 means the forecast beats the in-sample seasonal naive baseline; a value above 1 means it does not. Because the scaling uses the series' own history, MASE is scale-free and comparable across series of different magnitudes, and it is defined even when actual values are zero.

Task
forecasting
Scale
ratio
Range
[0, inf]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting, 2006. 10.1016/j.ijforecast.2006.03.001

HUGGINGFACE_EVALUATE: mase
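The scaling described above divides the forecast's mean absolute error by the in-sample mean absolute error of the seasonal naive method. A minimal sketch, where the function name and the `season` parameter (the seasonal period, 1 for non-seasonal data) are assumptions:

```python
def mase(history, actual, forecast, season=1):
    """Mean absolute scaled error against the in-sample seasonal naive method."""
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal period")
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal length")

    # In-sample MAE of predicting each value by the one a season earlier.
    naive_errors = [
        abs(history[t] - history[t - season]) for t in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")

    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale
```

The one case where the metric is undefined is a history on which the seasonal naive method is perfect, for example a constant series, which the sketch surfaces as an error rather than returning infinity.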

Efficiency

runtime · lower is better
Wall-clock runtime

Wall-clock seconds elapsed during the execution of a method on a given input. A measured metric: the value comes from a timer surrounding the method invocation, not from a comparison of outputs.

Task
efficiency
Scale
ratio
Range
[0, inf]
Normalization
log_min_max
Across datasets
geometric_mean
Ground truth
not required

UO: UO_0000010
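A timer surrounding the method invocation, as the runtime card describes, can be as small as the sketch below. The `timed` helper is hypothetical; `time.perf_counter` is used because it is a monotonic, high-resolution wall-clock timer in the Python standard library.

```python
import time


def timed(method, *args, **kwargs):
    """Run method(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = method(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Calling `timed(sorted, [3, 1, 2])` returns the sorted list together with the elapsed seconds, so the measurement never depends on comparing outputs.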

peak_memory · lower is better
Peak resident set size

Maximum resident set size of the process executing a method, in bytes. Measured by the operating system or a monitoring tool wrapping the process.

Task
efficiency
Scale
ratio
Range
[0, inf]
Normalization
log_min_max
Across datasets
geometric_mean
Ground truth
not required

UO: UO_0000233

Single-cell integration

asw_batch · higher is better
Batch ASW (silhouette over batches)

Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the degree to which batches mix within a cell type, using the batch silhouette width, inverted so that well-mixed batches score high. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7

OBI: OBI_0002631

asw_label · higher is better
Cell-type ASW (silhouette over labels)

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures how compactly and separately the cell-type labels cluster in the integrated embedding, using the silhouette width. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7

OBI: OBI_0002631

isolated_label_asw · higher is better
Isolated-label ASW

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the silhouette width restricted to the most batch-isolated cell labels, which captures how well rare or isolated types separate. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631

isolated_label_f1 · higher is better
Isolated-label F1

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the best F1 score obtained when clustering the isolated labels against the rest, taken over a sweep of clustering resolutions. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

STATO: STATO_0000628 · OBI: OBI_0002631

kbet · higher is better
kBET (k-nearest-neighbour batch effect test)

Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures local batch mixing, using the acceptance rate of the kNN batch-effect test relative to the global batch composition. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Büttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1

OBI: OBI_0002631

ilisi · higher is better
Integration LISI (iLISI)

Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures batch mixing, using the inverse Simpson index of batch labels in local neighbourhoods. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0

OBI: OBI_0002631

clisi · higher is better
Cell-type LISI (cLISI)

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures cell-type label purity, using the inverse Simpson index of cell-type labels in local neighbourhoods. scIB rescales it to the unit interval so that pure neighbourhoods score high and higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0

OBI: OBI_0002631

graph_connectivity · higher is better
Graph connectivity

Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the fraction of cells of a given label that remain in a single connected component of the integrated kNN graph, averaged over labels. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631

hvg_overlap · higher is better
Highly variable gene overlap

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the mean per-batch overlap between the highly variable genes computed before and after integration. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
not required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631

cell_cycle_conservation · higher is better
Cell-cycle conservation

Single-cell data-integration metric from the scIB benchmark, a biological-conservation score. Measures the conservation of the cell-cycle signal, using the variance explained by the S and G2M cell-cycle scores before versus after integration. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
not required

Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 2016. 10.1126/science.aad0501

OBI: OBI_0002631

pcr · higher is better
Principal component regression comparison

Single-cell data-integration metric from the scIB benchmark, a batch-removal score. Measures the change in the variance attributable to batch, estimated by regression on principal components, before versus after integration. scIB rescales it to the unit interval, where higher is better.

Task
integration
Scale
interval
Range
[0, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
not required

Büttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1

OBI: OBI_0200104

Spatial

correlation · higher is better
Spatial-variability rank correlation

Spatial transcriptomics metric from the OpenProblems spatially variable genes task. Correlation between a method's ranking of spatially variable genes and a reference ranking; higher means closer agreement with the reference.

Task
spatial
Scale
interval
Range
[-1, 1]
Normalization
min_max
Across datasets
arithmetic_mean
Ground truth
required

Spearman C. The proof and measurement of association between two things. American Journal of Psychology 1904, 15:72-101. 10.2307/1412159

The Spearman rank correlation coefficient is what the OpenProblems spatially variable genes task uses to score a method's ranking against the reference ranking.

STATO: STATO_0000201 · HUGGINGFACE_EVALUATE: spearmanr

Transportation (toy set)

speed · higher is better
Travel speed

Average travel speed in kilometres per hour. A measured quantity: the value is distance covered divided by time taken, not a comparison of outputs. The transportation example uses it, along with cost and co2, to run the registry and the MCDA pipeline on a problem that has nothing to do with bioinformatics.

Task
transportation, efficiency
Scale
ratio
Range
[0, inf]
Normalization
log_min_max
Across datasets
arithmetic_mean
Ground truth
not required

UO: UO_0010008

cost · lower is better
Monetary cost per distance

Monetary cost of travel per kilometre, in a generic currency unit. A measured quantity, lower is better. One of the three metrics in the transportation example, which exercises the registry and the MCDA pipeline outside bioinformatics.

Task
transportation, efficiency
Scale
ratio
Range
[0, inf]
Normalization
log_min_max
Across datasets
arithmetic_mean
Ground truth
not required

co2 · lower is better
Carbon dioxide emissions per distance

Carbon dioxide emitted per kilometre travelled, in grams. A measured quantity, lower is better, with a true zero for zero-emission modes such as walking or cycling. Used as a domain-neutral performance metric in the transportation example. Because several modes emit exactly zero, the card recommends a rank normalization, which stays defined when a column carries hard zeros, rather than a log rescale.

Task
transportation, sustainability
Scale
ratio
Range
[0, inf]
Normalization
rank
Across datasets
arithmetic_mean
Ground truth
not required

UO: UO_0000021
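The reason the co2 card prefers rank over a log rescale can be seen in a small sketch: the logarithm is undefined at zero, while ranks are not. The `rank_scores` function below is one plausible rank normalization for a lower-is-better metric (ties share their average rank, best maps to 1, worst to 0); it is an assumption for illustration, not the formula the pipeline uses, and the emission values are made up.

```python
import math


def rank_scores(values):
    """Rank-normalize a lower-is-better column to [0, 1]; ties share a rank."""
    n = len(values)
    if n == 1:
        return [1.0]
    ordered = sorted(values)

    def average_rank(v):
        first = ordered.index(v)  # zero-based position of the first tie member
        return first + (ordered.count(v) - 1) / 2

    return [1.0 - average_rank(v) / (n - 1) for v in values]


# Made-up grams of CO2 per kilometre: walking, cycling, car, bus.
co2 = [0.0, 0.0, 120.0, 40.0]
scores = rank_scores(co2)

# A log rescale fails on the hard zeros in the same column.
try:
    math.log(co2[0])
    log_is_defined = True
except ValueError:
    log_is_defined = False
```

Here the two zero-emission modes tie for the best score and the car scores 0, while `log_is_defined` ends up `False` because `math.log(0.0)` raises a `ValueError`.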