Overview

Implemented metrics:

General API reference for metrics: API Reference. For implementation details and lifecycle hooks, see base metrics API.

Intuition

Exact vs. approximate distribution metrics

On small, enumerable environments where enumeration is easy to implement, we recommend the Exact Distribution metric: it compares the policy-induced terminal distribution with the ground truth without any sampling error. We also support the Approximate Distribution metric, which stores samples in a first-in, first-out buffer and estimates the empirical distribution over that buffer. Because the buffer mixes samples drawn from past policies, this estimate implicitly averages over them and lags behind the latest policy updates.
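The buffer-based estimate can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes terminal states are encoded as integer indices, the names `empirical_distribution` and `total_variation` are hypothetical, and total variation is only one possible way to compare the estimate with the ground truth.

```python
from collections import deque

import numpy as np


def empirical_distribution(buffer, num_states):
    """Empirical distribution of integer-encoded terminal states in a buffer."""
    counts = np.bincount(np.asarray(list(buffer), dtype=np.int64), minlength=num_states)
    return counts / counts.sum()


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())


# Hypothetical 4-state environment with a known target distribution.
target = np.array([0.1, 0.2, 0.3, 0.4])

# FIFO buffer: once full, the oldest samples are dropped first.
buffer = deque(maxlen=8)
# Early samples (state 0) come from an older policy, later ones from newer policies.
for state in [0, 0, 0, 0, 3, 3, 3, 3, 3, 3, 2, 2]:
    buffer.append(state)

approx = empirical_distribution(buffer, num_states=4)
distance = total_variation(approx, target)
```

In this sketch the buffer holds only the last eight samples, so the estimate reflects a mixture of the most recent policies rather than the latest one alone; a larger `maxlen` reduces sampling noise but increases the lag.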

Evidence bounds

For general environments, we recommend tracking the ELBO. All baselines compute a train ELBO (also called RL reward) over the training samples to track progress. The ELBO metric itself is meant to be computed in evaluation mode, so that it assesses the current quality of the policy without any exploration tricks. The ideal value of the ELBO is \(0\) when the environment exposes \(\log Z\) and \(\log Z\) otherwise; in both cases the ELBO bounds its ideal value from below. Because the ELBO can peak even if the policy concentrates on a single mode, treat it as a measure of within-mode quality rather than global coverage.

For coverage across modes, we recommend using the EUBO whenever sampling from the ground-truth distribution is available.
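To make the difference between the two bounds concrete, here is a minimal sketch on a toy one-step environment, where every trajectory is a single terminal state and the backward policy is trivial. It uses the standard definitions: the ELBO averages \(\log R(x) - \log q(x)\) over samples from the policy \(q\), and the EUBO averages the same quantity over samples from the target. The function name `evidence_bounds` is hypothetical, the expectations are computed exactly rather than by sampling, and the ELBO here is unnormalized (ideal value \(\log Z\)).

```python
import numpy as np


def evidence_bounds(log_reward, log_q):
    """Exact ELBO, log Z and EUBO for a one-step environment.

    log_reward: log R(x) for every terminal state x.
    log_q: log-probability of every terminal state under the policy.
    """
    log_reward = np.asarray(log_reward, dtype=float)
    log_q = np.asarray(log_q, dtype=float)
    log_z = float(np.log(np.sum(np.exp(log_reward))))
    target = np.exp(log_reward - log_z)  # ground-truth distribution R(x) / Z
    log_w = log_reward - log_q           # log importance weights
    elbo = float(np.sum(np.exp(log_q) * log_w))  # expectation under the policy
    eubo = float(np.sum(target * log_w))         # expectation under the target
    return elbo, log_z, eubo


# Two equally rewarded modes; the policy has almost collapsed onto the first.
log_reward = np.log([1.0, 1.0])
collapsed_policy = np.log([0.999, 0.001])

elbo, log_z, eubo = evidence_bounds(log_reward, collapsed_policy)
elbo_gap = log_z - elbo  # KL(policy || target), stays below log 2 in this toy case
eubo_gap = eubo - log_z  # KL(target || policy), grows as the second mode is dropped
```

In this toy case the ELBO gap can never exceed \(\log 2\), however strongly the policy collapses, while the EUBO gap grows without bound as the probability of the missed mode goes to zero. That asymmetry is why the EUBO is the more informative signal for mode coverage.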

Correlation-based metrics

To quantify sampling quality without access to a ground-truth distribution, we recommend correlation metrics: they stochastically estimate the marginal distribution of the current sampling policy over a test set, which can be either fixed (recommended) or generated by the current policy.
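One common way to build such a metric is sketched below. It is an assumption about the general technique, not a description of this library's implementation: the marginal probability of each test state is estimated by importance sampling over backward trajectories, and the resulting log-probabilities are correlated (Pearson) with the log-rewards. The names `log_marginal_estimate` and `pearson`, and the array layout, are hypothetical.

```python
import numpy as np


def log_marginal_estimate(log_pf, log_pb):
    """Estimate log P(x) for each test state from backward-sampled trajectories.

    log_pf, log_pb: arrays of shape (num_states, num_trajectories) holding the
    forward and backward log-probabilities of each sampled trajectory.
    Uses P(x) ~= mean over trajectories of P_F(tau) / P_B(tau | x).
    """
    log_ratio = np.asarray(log_pf, dtype=float) - np.asarray(log_pb, dtype=float)
    # Numerically stable log-mean-exp over the trajectory axis.
    shift = log_ratio.max(axis=1, keepdims=True)
    log_mean = shift + np.log(np.mean(np.exp(log_ratio - shift), axis=1, keepdims=True))
    return log_mean.squeeze(axis=1)


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical test set of three states with two backward trajectories each.
log_pf = np.log([[0.1, 0.3], [0.2, 0.4], [0.5, 0.5]])
log_pb = np.zeros_like(log_pf)  # deterministic backward policy for simplicity
log_reward = np.log([2.0, 3.0, 5.0])

log_p = log_marginal_estimate(log_pf, log_pb)
correlation = pearson(log_p, log_reward)
```

A correlation close to 1 indicates that the policy assigns probability roughly in proportion to reward on the test set. Using a fixed test set keeps the value comparable across training steps.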

Coverage and Top-K metrics

Additionally, we implemented mode coverage and Top-K reward and diversity metrics to track the exploration and exploitation properties of the sampler. These metrics do not assess sampling quality directly, but they can be helpful indicators of under- and over-exploration.
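As a rough illustration of what these metrics measure, the sketch below uses simplified definitions that are assumptions rather than the library's exact ones: mode coverage is the fraction of known modes that appear among the samples, Top-K reward is the mean reward of the K best unique samples, and Top-K diversity is their mean pairwise Hamming distance. The function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def mode_coverage(samples, modes):
    """Fraction of known modes that appear at least once among the samples."""
    found = set(map(tuple, samples)) & set(map(tuple, modes))
    return len(found) / len(modes)


def top_k_reward_and_diversity(samples, rewards, k):
    """Mean reward and mean pairwise Hamming distance of the K best unique samples."""
    unique = {tuple(s): float(r) for s, r in zip(samples, rewards)}  # deduplicate
    ranked = sorted(unique.items(), key=lambda item: item[1], reverse=True)[:k]
    mean_reward = float(np.mean([reward for _, reward in ranked]))
    states = [np.array(state) for state, _ in ranked]
    distances = [int(np.sum(a != b)) for a, b in combinations(states, 2)]
    diversity = float(np.mean(distances)) if distances else 0.0
    return mean_reward, diversity


# Hypothetical samples from a 2-bit environment with two known modes.
samples = [(0, 0), (0, 0), (1, 1), (0, 1)]
rewards = [5.0, 5.0, 3.0, 1.0]
modes = [(0, 0), (1, 0)]

coverage = mode_coverage(samples, modes)
mean_reward, diversity = top_k_reward_and_diversity(samples, rewards, k=2)
```

Read together, a high Top-K reward with low diversity and low coverage hints at over-exploitation, while high diversity with low Top-K reward hints at over-exploration.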