Metric cards

Every metric in beam’s registry is described by a metric card: a small YAML file that declares what the metric measures, its measurement scale, its polarity, its declared range, how to normalize it, and how to pool it across datasets. The pipeline reads these fields so that ranking decisions follow from the metric definition rather than from a hidden default. The cards below are generated from the registry, so this page always matches the shipped cards.

For how the pipeline consumes these fields, see Cards and pipeline.
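
As a rough picture of the fields listed above, a card can be thought of as a small record. The class name and field names below are assumptions made for this sketch, not beam's actual YAML keys; the values are copied from the ari card on this page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MetricCard:
    """Hypothetical in-memory view of one card; real key names may differ."""
    name: str
    polarity: str              # "higher_is_better", "lower_is_better" or "target_value"
    scale: str                 # measurement scale, e.g. "interval" or "ratio"
    value_range: Tuple[Optional[float], Optional[float]]  # None marks an unbounded side
    normalization: str
    across_datasets: str
    ground_truth_required: bool


# Values taken from the ari card shown on this page.
ari_card = MetricCard(
    name="ari",
    polarity="higher_is_better",
    scale="interval",
    value_range=(-1.0, 1.0),
    normalization="baseline_relative",
    across_datasets="arithmetic_mean",
    ground_truth_required=True,
)
```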

Clustering

ari · higher is better

Adjusted Rand Index (Hubert-Arabie)

Corrected-for-chance similarity between two partitions of the same set of elements. Defined as the Rand Index adjusted by its expected value under a random model of partition pairs. Equals 1 for identical partitions and has expected value 0 for random partitions; can be negative for worse-than-random agreement.

Task: clustering
Scale: interval
Range: [-1, 1]
Normalization: baseline_relative
Across datasets: arithmetic_mean
Ground truth: required

Hubert L, Arabie P. Comparing partitions. Journal of Classification, 1985. 10.1007/BF01908075

STATO: STATO_0000593
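
A minimal sketch of the Hubert-Arabie adjustment described in the ari card, computed from the contingency table of two labelings. The function name is illustrative, and the return value for the degenerate case is a convention assumed here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI from pair counts in the contingency table of two labelings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two elements")
    cells = Counter(zip(labels_a, labels_b))   # contingency cells n_ij
    rows = Counter(labels_a)                   # row sums
    cols = Counter(labels_b)                   # column sums
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)   # expected index under chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        # Returning 1.0 is an assumed convention.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

Relabelling clusters does not change the result, and a partition that cuts across the reference scores below zero.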

nmi · higher is better

Normalized mutual information

Mutual information between two partitions, normalized to [0, 1] using one of several conventions (arithmetic mean of entropies, geometric mean, or max). Equals 1 for identical partitions and 0 for independent partitions. Unlike ARI, NMI is not chance-corrected.

Task: clustering
Scale: interval
Range: [0, 1]
Normalization: min_max (default)
Across datasets: arithmetic_mean
Ground truth: required

Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR, 2002.
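
The three normalization conventions named in the nmi card can be compared side by side. This is a sketch with illustrative function names; which convention beam's card assumes is not stated on this page, and the handling of zero-entropy partitions is an assumed convention.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    # p_xy * log(p_xy / (p_x * p_y)), written in terms of counts
    return sum((c / n) * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())


def nmi(a, b, average="arithmetic"):
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0   # both partitions are a single cluster (assumed convention)
    if average == "arithmetic":
        denom = (h_a + h_b) / 2
    elif average == "geometric":
        denom = sqrt(h_a * h_b)
    elif average == "max":
        denom = max(h_a, h_b)
    else:
        raise ValueError(f"unknown average: {average}")
    if denom == 0:
        return 0.0   # one partition carries no information (assumed convention)
    return mutual_information(a, b) / denom
```

When the two partitions have different entropies the conventions disagree, so scores are only comparable if the same one is used throughout.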

silhouette · higher is better

Silhouette coefficient

Mean silhouette coefficient over all clustered instances. For each instance, s = (b - a) / max(a, b), where a is the mean distance to other instances in the same cluster and b is the mean distance to instances in the nearest other cluster. Ranges over [-1, 1].

Task: clustering
Scale: interval
Range: [-1, 1]
Normalization: min_max (default)
Across datasets: arithmetic_mean
Ground truth: not required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7
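
The per-instance formula in the silhouette card can be sketched directly. Euclidean distance, the function name, and the score of 0 for singleton clusters are assumptions of this sketch.

```python
from math import dist


def mean_silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points, Euclidean distance."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)   # singleton cluster: assumed convention
            continue
        # dist(p, p) is 0, so summing over the whole cluster and dividing
        # by len(own) - 1 gives the mean distance to the *other* members.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Two tight, well-separated groups score close to 1; assigning the same points to clusters that cut across the groups gives a negative value.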

shannon_entropy_diff · lower is better

Shannon entropy difference of cluster sizes

Absolute difference in the normalized Shannon entropy of the cluster-size distribution between an estimated and the true partition, as used in the Duo 2018 benchmark (the s.norm.vs.true column). The normalized entropy measures how evenly elements are spread across clusters, scaled to [0, 1] by the log of the cluster count, so a perfectly balanced partition reaches 1 and a fully concentrated one reaches 0. A value of 0 means the estimated clustering matches the truth in how evenly it spreads elements; larger values mean it departs further.

Task: clustering
Scale: ratio
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
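
A small sketch of the quantity described in the shannon_entropy_diff card, assuming the normalized entropy is the Shannon entropy of cluster sizes divided by the log of the cluster count. Function names are illustrative, and the value 0 for a single-cluster partition follows the card's "fully concentrated" case.

```python
from collections import Counter
from math import log


def normalized_entropy(labels):
    """Entropy of the cluster-size distribution, scaled to [0, 1]."""
    counts = list(Counter(labels).values())
    k = len(counts)
    if k < 2:
        return 0.0   # one cluster: fully concentrated
    n = len(labels)
    h = -sum((c / n) * log(c / n) for c in counts)
    return h / log(k)


def shannon_entropy_diff(estimated, truth):
    return abs(normalized_entropy(estimated) - normalized_entropy(truth))
```

Note that the metric compares only how evenly elements are spread, so two different but equally balanced partitions score 0.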

nclust_deviation · lower is better

Cluster-count deviation from the truth

Absolute difference between the number of clusters a method reports and the number in the reference partition, as used in the Duo 2018 benchmark (the nclust.vs.true column). A value of 0 means the method recovered the correct cluster count; larger values mean it over- or under-clustered. The metric is sparse in Duo 2018, where 101 of 168 method-by-dataset cells are missing because several methods do not report a fixed cluster count for every dataset, so expect many NaN entries when aggregating.

Task: clustering
Scale: ratio
Range: [0, inf]
Normalization: rank
Across datasets: arithmetic_mean
Ground truth: required

Duo A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research 2018, 7:1141. 10.12688/f1000research.15666.3
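
Because this metric is sparse, any aggregate has to decide what to do with NaN. The sketch below computes the deviation and a mean that skips missing cells; skipping is an assumption made for illustration, not a statement of how beam aggregates, and both function names are hypothetical.

```python
from math import isnan, nan


def nclust_deviation(estimated_labels, true_labels):
    """Absolute difference in the number of distinct clusters."""
    return abs(len(set(estimated_labels)) - len(set(true_labels)))


def nanmean(values):
    """Arithmetic mean over the non-NaN entries; NaN if none remain."""
    kept = [v for v in values if not isnan(v)]
    return sum(kept) / len(kept) if kept else nan
```

With many missing cells, a skipped-NaN mean for one method may rest on far fewer datasets than for another, which is worth reporting next to the score.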

Classification

accuracy · higher is better

Classification accuracy

Fraction of correctly classified instances. Defined as the number of matches between predicted and true labels, divided by the total number of instances.

Task: classification
Scale: ratio
Range: [0, 1]
Normalization: min_max (default)
Across datasets: arithmetic_mean
Ground truth: required

STATO: STATO_0000415 · HUGGINGFACE_EVALUATE: accuracy

f1_score · higher is better

F1 score

Harmonic mean of precision and recall. Used for binary classification or per-class evaluation in multiclass settings. Equals 1 only when both precision and recall are 1.

Task: classification, retrieval
Scale: ratio
Range: [0, 1]
Normalization: min_max (default)
Across datasets: arithmetic_mean
Ground truth: required

van Rijsbergen CJ. Information Retrieval, 2nd edition. Butterworth-Heinemann, 1979.

STATO: STATO_0000628 · HUGGINGFACE_EVALUATE: f1
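
The accuracy and f1_score definitions above can be sketched in a few lines for the binary case. Function names and the `positive` argument are illustrative; returning 0 when there are no true positives is an assumed convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0   # precision or recall is 0 or undefined (assumed convention)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```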

calibration_slope · target value

Calibration slope

Slope of a logistic recalibration model that regresses the observed binary outcome on a risk model's linear predictor (the log-odds). A slope of 1 means the predicted risks are correctly scaled. Below 1 the predictions are too extreme, the usual sign of overfitting; above 1 they are too moderate. The ideal is exactly 1, so this is a target_value metric, where neither the highest nor the lowest value is best.

Task: classification
Scale: interval
Range: unbounded
Normalization: target_relative
Across datasets: arithmetic_mean
Ground truth: required

Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW. Calibration: the Achilles heel of predictive analytics. BMC Medicine 2019, 17:230. 10.1186/s12916-019-1466-7

STATO: STATO_0000687
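
A minimal sketch of the slope fit described in the calibration_slope card, assuming an unpenalized logistic regression with a free intercept, solved by Newton-Raphson. The function name and arguments are illustrative and not part of beam.

```python
from math import exp


def calibration_slope(linear_predictor, outcomes, iterations=50, tol=1e-10):
    """Slope b in logit P(y = 1) = a + b * linear_predictor."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(linear_predictor, outcomes):
            p = 1.0 / (1.0 + exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p              # score for the intercept
            g_b += (y - p) * x        # score for the slope
            h_aa += w                 # entries of the information matrix
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        if det == 0:
            raise ValueError("information matrix is singular")
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    return b
```

In the toy data used to check this sketch, predicted log-odds of ±2 meet observed event rates of only 75% and 25%, so the fitted slope falls below 1: the predictions were too extreme.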

Forecasting

smape · lower is better

Symmetric Mean Absolute Percentage Error

Symmetric mean absolute percentage error between a point forecast and the realized future values, in percent. Using the M4 competition definition, each horizon step contributes 2 * |F - A| / (|F| + |A|), averaged over the horizon and expressed as a percentage. The symmetric denominator bounds the value in [0, 200], unlike the classic MAPE which is unbounded and undefined when an actual value is zero.

Task: forecasting
Scale: ratio
Range: [0, 200]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Makridakis S. Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 1993. 10.1016/0169-2070(93)90079-3

HUGGINGFACE_EVALUATE: smape
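
The per-step term in the smape card translates directly into code. This is a sketch with an illustrative function name; treating a step where forecast and actual are both zero as contributing 0 is an assumed convention.

```python
def smape(forecast, actual):
    """Symmetric MAPE in percent: mean of 2|F - A| / (|F| + |A|) times 100."""
    terms = []
    for f, a in zip(forecast, actual):
        denom = abs(f) + abs(a)
        # 0/0 step: counted as zero error (assumed convention)
        terms.append(0.0 if denom == 0 else 2.0 * abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)
```

The upper bound of 200 is reached whenever exactly one of forecast and actual is zero, regardless of how large the other value is.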

mase · lower is better

Mean Absolute Scaled Error

Mean absolute forecast error scaled by the in-sample mean absolute error of the seasonal naive method on the same series. A value below 1 means the forecast beats the in-sample seasonal naive baseline; a value above 1 means it does not. Because the scaling uses the series' own history, MASE is scale-free and comparable across series of different magnitudes, and it is defined even when actual values are zero.

Task: forecasting
Scale: ratio
Range: [0, inf]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Hyndman RJ, Koehler AB. Another look at measures of forecast accuracy. International Journal of Forecasting, 2006. 10.1016/j.ijforecast.2006.03.001

HUGGINGFACE_EVALUATE: mase
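
A sketch of the scaling described in the mase card: the out-of-sample mean absolute error divided by the in-sample mean absolute error of the seasonal naive forecast. The function name and the `season` argument (the seasonal period, 1 for non-seasonal data) are illustrative.

```python
def mase(forecast, actual, history, season=1):
    """Mean absolute scaled error against the in-sample seasonal naive method."""
    if len(history) <= season:
        raise ValueError("history must be longer than the seasonal period")
    naive_errors = [
        abs(history[t] - history[t - season]) for t in range(season, len(history))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("seasonal naive error is zero; MASE is undefined")
    mae = sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)
    return mae / scale
```

The one case the scaling cannot handle is a history on which the naive method is exact, for example a constant series, so the sketch raises rather than dividing by zero.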

Efficiency

runtime · lower is better

Wall-clock runtime

Wall-clock seconds elapsed during the execution of a method on a given input. A measured metric: the value comes from a timer surrounding the method invocation, not from a comparison of outputs.

Task: efficiency
Scale: ratio
Range: [0, inf]
Normalization: log_min_max
Across datasets: geometric_mean
Ground truth: not required

UO: UO_0000010

peak_memory · lower is better

Peak resident set size

Maximum resident set size of the process executing a method, in bytes. Measured by the operating system or a monitoring tool wrapping the process.

Task: efficiency
Scale: ratio
Range: [0, inf]
Normalization: log_min_max
Across datasets: geometric_mean
Ground truth: not required

UO: UO_0000233
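
The two efficiency cards pair a log-scale normalization with geometric-mean pooling. The sketch below shows a timer around a call, a geometric mean, and one plausible reading of log_min_max as min-max scaling of log-transformed values. That reading, the function names, and the omission of any polarity flip are assumptions; the formula beam uses is not stated on this page.

```python
import time
from math import exp, log


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def geometric_mean(values):
    """exp of the mean log; requires strictly positive values."""
    return exp(sum(log(v) for v in values) / len(values))


def log_min_max(values):
    """Min-max scale log(values) to [0, 1] (assumed meaning of log_min_max)."""
    logs = [log(v) for v in values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0 for _ in logs]   # no spread to scale (assumed convention)
    return [(x - lo) / (hi - lo) for x in logs]
```

On a log scale, a runtime of 10 s sits halfway between 1 s and 100 s, which matches the intuition that runtimes differ by factors rather than by offsets.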

Single-cell integration

asw_batch · higher is better

Batch ASW (silhouette over batches)

Batch-removal score from the scIB single-cell integration benchmark. It inverts the batch silhouette width so that well-mixed batches within a cell type score high, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7

OBI: OBI_0002631

asw_label · higher is better

Cell-type ASW (silhouette over labels)

Biological-conservation score from the scIB single-cell integration benchmark. It measures the cell-type silhouette width, so labels that form compact, separated clusters score high, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987. 10.1016/0377-0427(87)90125-7

OBI: OBI_0002631

isolated_label_asw · higher is better

Isolated-label ASW

Biological-conservation score from the scIB single-cell integration benchmark. It measures the silhouette width of the most batch-isolated labels, that is, how well rare types separate, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631

isolated_label_f1 · higher is better

Isolated-label F1

Biological-conservation score from the scIB single-cell integration benchmark. It measures the best F1 of clustering the isolated labels against the rest over a resolution sweep, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

STATO: STATO_0000628 · OBI: OBI_0002631

kbet · higher is better

kBET (k-nearest-neighbour batch effect test)

Batch-removal score from the scIB single-cell integration benchmark. It measures local batch mixing from the kBET acceptance rate against the global batch composition, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1

OBI: OBI_0002631

ilisi · higher is better

Integration LISI (iLISI)

Batch-removal score from the scIB single-cell integration benchmark. It measures batch mixing from the inverse Simpson index of batch labels in local neighbourhoods, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0

OBI: OBI_0002631

clisi · higher is better

Cell-type LISI (cLISI)

Biological-conservation score from the scIB single-cell integration benchmark. It measures cell-type label purity from the inverse Simpson index in local neighbourhoods, so pure neighbourhoods score high, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Korsunsky I, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature Methods, 2019. 10.1038/s41592-019-0619-0

OBI: OBI_0002631
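
Both LISI cards rest on the inverse Simpson index of the labels found in a cell's neighbourhood. The sketch below weights every neighbour equally, which is a simplification: LISI as published weights neighbours by distance, and the rescaling to [0, 1] is not shown. The function name is illustrative.

```python
from collections import Counter


def inverse_simpson(neighbour_labels):
    """1 / sum(p_i^2) over label proportions: the effective number of labels."""
    n = len(neighbour_labels)
    return 1.0 / sum((c / n) ** 2 for c in Counter(neighbour_labels).values())
```

A neighbourhood with a single label gives 1 and an even mix of two labels gives 2. Applied to batch labels (iLISI) a larger value is better; applied to cell-type labels (cLISI) a value near 1 is better, which is why the two cards rescale in opposite directions.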

graph_connectivity · higher is better

Graph connectivity

Batch-removal score from the scIB single-cell integration benchmark. It measures the fraction of each label's cells that stay in one connected component of the integrated kNN graph, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631
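
One reading of the graph_connectivity description is: for each label, restrict the graph to that label's cells, take the largest connected component's share of those cells, and average over labels. The sketch below implements that reading with a breadth-first search; the function names and the edge-list input format are assumptions.

```python
from collections import deque


def largest_component_fraction(nodes, edges):
    """Share of `nodes` in the largest connected component of the induced subgraph."""
    nodes = set(nodes)
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:   # keep only edges inside the subgraph
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / len(nodes)


def graph_connectivity(labels, edges):
    """Mean over labels of the largest-component share; cells are indexed 0..n-1."""
    by_label = {}
    for cell, label in enumerate(labels):
        by_label.setdefault(label, []).append(cell)
    fractions = [largest_component_fraction(cells, edges) for cells in by_label.values()]
    return sum(fractions) / len(fractions)
```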

hvg_overlap · higher is better

Highly variable gene overlap

Biological-conservation score from the scIB single-cell integration benchmark. It measures the per-batch overlap of the highly variable genes before and after integration, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: not required

Luecken MD, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature Methods, 2022. 10.1038/s41592-021-01336-8

OBI: OBI_0002631

cell_cycle_conservation · higher is better

Cell-cycle conservation

Biological-conservation score from the scIB single-cell integration benchmark. It measures how much of the cell-cycle signal, the variance explained by S and G2M scores, survives integration, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: not required

Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science, 2016. 10.1126/science.aad0501

OBI: OBI_0002631

pcr · higher is better

Principal component regression comparison

Batch-removal score from the scIB single-cell integration benchmark. It measures the change in batch-associated variance (principal component regression) before and after integration, rescaled to [0, 1] where higher is better.

Task: integration
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: not required

Buttner M, Miao Z, Wolf FA, Teichmann SA, Theis FJ. A test metric for assessing single-cell RNA-seq batch correction. Nature Methods, 2019. 10.1038/s41592-018-0254-1

OBI: OBI_0200104

Spatial

correlation · higher is better

Spatial-variability rank correlation

Spatial transcriptomics metric from the OpenProblems spatially variable genes task. Correlation between a method's ranking of spatially variable genes and a reference ranking; higher means closer agreement with the reference.

Task: spatial
Scale: interval
Range: [-1, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Spearman C. The proof and measurement of association between two things. American Journal of Psychology 1904, 15:72-101. 10.2307/1412159. The Spearman rank correlation coefficient is what the OpenProblems spatially variable genes task uses to score a method's ranking against the reference ranking.

STATO: STATO_0000201 · HUGGINGFACE_EVALUATE: spearmanr
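
The Spearman coefficient referenced in this card is the Pearson correlation of ranks. A self-contained sketch, using average ranks for ties; the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)
```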

Transportation (toy set)

speed · higher is better

Travel speed

Average travel speed in kilometres per hour. A measured quantity: the value is distance divided by time, not a comparison of outputs. The transportation example uses it, with cost and co2, to run the registry and the MCDA pipeline outside bioinformatics.

Task: transportation, efficiency
Scale: ratio
Range: [0, inf]
Normalization: log_min_max
Across datasets: arithmetic_mean
Ground truth: not required

UO: UO_0010008

cost · lower is better

Monetary cost per distance

Monetary cost of travel per kilometre, in a generic currency unit. A measured quantity, lower is better. One of the three metrics in the transportation example, which runs the registry and the MCDA pipeline outside bioinformatics.

Task: transportation, efficiency
Scale: ratio
Range: [0, inf]
Normalization: log_min_max
Across datasets: arithmetic_mean
Ground truth: not required

co2 · lower is better

Carbon dioxide emissions per distance

Carbon dioxide emitted per kilometre travelled, in grams. A measured quantity, lower is better, with a true zero for zero-emission modes such as walking or cycling. Used as a domain-neutral performance metric in the transportation example. Because several modes emit exactly zero, the card recommends a rank normalization, which stays defined when a column carries hard zeros, rather than a log rescale.

Task: transportation, sustainability
Scale: ratio
Range: [0, inf]
Normalization: rank
Across datasets: arithmetic_mean
Ground truth: not required

UO: UO_0000021
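
To illustrate why the co2 card prefers ranks over a log rescale, the sketch below rank-normalizes a column that contains hard zeros, where a log transform would be undefined. The exact rank normalization beam applies is not given on this page; mapping average ranks to [0, 1] and flipping for lower-is-better metrics is an assumption, as is the function name.

```python
def rank_normalize(values, lower_is_better=True):
    """Map average ranks to [0, 1]; with lower_is_better the smallest value scores highest."""
    n = len(values)
    if n < 2:
        return [1.0] * n
    scores = []
    for v in values:
        less = sum(w < v for w in values)
        equal = sum(w == v for w in values)
        rank = less + (equal + 1) / 2     # 1-based average rank, ties share a rank
        score = (rank - 1) / (n - 1)
        scores.append(1 - score if lower_is_better else score)
    return scores
```

In a column such as `[0.0, 0.0, 120.0, 60.0]`, the two zero-emission modes tie for the best score and the largest emitter scores 0, with no special handling of the zeros.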

Other

cell_type_annotation_agreement · higher is better

Cell type annotation agreement

Ontology-aware agreement between a method's cell type annotation and the manual expert annotation, averaged over the annotated cell types of a dataset. Each cell type scores 1 for a full match, 0.5 for a partial match (correct broad lineage, wrong fine type), and 0 for a mismatch, following the scoring of Hou and Ji (2024). The reported value is the mean of these per-cell-type scores, in the unit interval, higher meaning closer agreement with the manual annotation.

Task: classification
Scale: interval
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Hou W, Ji Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nature Methods, 2024. 10.1038/s41592-024-02235-4

cell_type_annotation_full_match_rate · higher is better

Cell type annotation full match rate

Fraction of a dataset's annotated cell types for which a method's annotation is a full match to the manual expert annotation, using the ontology-aware scoring of Hou and Ji (2024). It is the strict companion to cell_type_annotation_agreement: partial matches count toward the agreement mean but not toward this rate, so a method that often lands the broad lineage without the fine type scores well on agreement and poorly here.

Task: classification
Scale: ratio
Range: [0, 1]
Normalization: min_max
Across datasets: arithmetic_mean
Ground truth: required

Hou W, Ji Z. Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nature Methods, 2024. 10.1038/s41592-024-02235-4
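
The relationship between the two annotation cards is easiest to see on a small example. The sketch assumes each cell type has already been judged as a full match, partial match, or mismatch; the string labels and function names are illustrative.

```python
# Per-cell-type scores as described in the agreement card.
MATCH_SCORES = {"full": 1.0, "partial": 0.5, "mismatch": 0.0}


def annotation_agreement(matches):
    """Mean of the per-cell-type scores (1, 0.5 or 0)."""
    return sum(MATCH_SCORES[m] for m in matches) / len(matches)


def full_match_rate(matches):
    """Fraction of cell types judged a full match; partial matches do not count."""
    return sum(m == "full" for m in matches) / len(matches)
```

For the judgments `["full", "partial", "partial", "mismatch"]` the agreement is 0.5 while the full match rate is 0.25, showing how partial matches lift the first metric but not the second.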