Background
Generative Flow Networks (GFlowNets) are probabilistic models designed to sample complex combinatorial objects with probability proportional to a user-defined reward. Concretely, we work with a finite discrete space \(\mathcal{X}\) and an unnormalized reward function \({\mathcal{R} : \mathcal{X} \rightarrow \mathbb{R}_{\ge 0}}\). Our goal is to learn a sampler whose marginal distribution over objects \(x \in \mathcal{X}\) matches \(\mathcal{R}(x)/Z\), where the normalizing constant \({Z = \sum_{x \in \mathcal{X}} \mathcal{R}(x)}\) is typically unknown.
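As a minimal sketch of the target distribution, the snippet below normalizes a small reward table by brute force. The four-element space and its reward values are hypothetical and chosen only for illustration; in realistic settings \(\mathcal{X}\) is far too large to enumerate, which is why \(Z\) is treated as unknown.

```python
# Hypothetical toy reward over a four-element space (illustrative values only).
reward = {"00": 1.0, "01": 2.0, "10": 3.0, "11": 4.0}


def target_distribution(reward_table):
    """Normalize a non-negative reward table into probabilities R(x) / Z."""
    z = sum(reward_table.values())
    if z <= 0:
        raise ValueError("at least one object must have positive reward")
    return {x: r / z for x, r in reward_table.items()}


# Only feasible here because the space is tiny enough to sum over.
target = target_distribution(reward)
```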
To describe the generative process, GFlowNets build a directed acyclic graph (DAG) \(\mathcal{G} = (\mathcal{S}, \mathcal{E})\). Every node in \(\mathcal{S}\) represents a partially constructed object, edges in \(\mathcal{E}\) correspond to incremental edits, and there is a unique initial state \(s_0\) with no parents. Terminal states coincide with the target space \(\mathcal{X}\), so any complete trajectory \(\tau = (s_0, s_1, \ldots, s_{n_\tau})\) records the sequence of construction steps that leads to a finished object \(s_{n_\tau} \in \mathcal{X}\).
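The following sketch builds one such DAG for a hypothetical toy environment: two-bit strings whose positions may be filled in any order. The state encoding (tuples with `None` for unfilled positions) and the helper names are assumptions made for this example, not part of any library. It shows that a single finished object can be reached through several trajectories.

```python
# Initial state s_0: both positions are still empty.
S0 = (None, None)


def children(state):
    """All states reachable by filling one empty position with 0 or 1."""
    return [
        state[:i] + (bit,) + state[i + 1:]
        for i, value in enumerate(state)
        if value is None
        for bit in (0, 1)
    ]


def is_terminal(state):
    """A state is terminal once every position has been filled."""
    return None not in state


def trajectories(state=S0):
    """Enumerate every complete trajectory that starts at `state`."""
    if is_terminal(state):
        return [[state]]
    return [[state] + rest for child in children(state) for rest in trajectories(child)]


all_trajs = trajectories()
# Group trajectories by the finished object they produce.
by_object = {}
for traj in all_trajs:
    by_object.setdefault(traj[-1], []).append(traj)
```

In this toy graph each of the four objects is reached by two distinct trajectories, one per filling order.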
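GFlowNet training revolves around two stochastic policies. The forward policy \(\mathcal{P}_F(s' \mid s)\) chooses how to expand a state \(s\) by selecting one of its children \(s'\), while the backward policy \(\mathcal{P}_B(s' \mid s)\) undoes a step by selecting one of the parents \(s'\) of \(s\). A pair of forward/backward policies is considered proper if the distributions they induce over complete trajectories match:
\[
\prod_{i=1}^{n_\tau} \mathcal{P}_F(s_i \mid s_{i-1}) = \frac{\mathcal{R}(s_{n_\tau})}{Z} \prod_{i=1}^{n_\tau} \mathcal{P}_B(s_{i-1} \mid s_i).
\]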
When this trajectory balance condition holds, sampling in the forward direction yields terminal states with the desired reward-proportional probabilities.
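To see why, sum both sides of the condition over all complete trajectories that end in a given object \(x\). Because \(\mathcal{P}_B\) is a probability distribution over parents at every state, and every backward walk in the DAG ends at \(s_0\), the backward trajectory probabilities starting from \(x\) sum to one. Writing \(P^\top(x)\) for the forward marginal over terminal states (a symbol introduced here for illustration):

\[
P^\top(x)
= \sum_{\tau \,:\, s_{n_\tau} = x} \; \prod_{i=1}^{n_\tau} \mathcal{P}_F(s_i \mid s_{i-1})
= \frac{\mathcal{R}(x)}{Z} \sum_{\tau \,:\, s_{n_\tau} = x} \; \prod_{i=1}^{n_\tau} \mathcal{P}_B(s_{i-1} \mid s_i)
= \frac{\mathcal{R}(x)}{Z}.
\]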
In practice we parameterize \(\mathcal{P}_F\) (and sometimes supporting quantities such as state flows or \(\mathcal{P}_B\)) with neural networks, then optimize an objective that encourages trajectory balance. The main families of objectives are:
- Detailed Balance (DB) enforces local consistency between pairs of neighbouring states by matching forward and backward flows.
- Trajectory Balance (TB) lifts the constraint to entire trajectories, introducing a learnable estimate \(Z_\theta\) for the normalizing constant.
- Subtrajectory Balance (SubTB) interpolates between DB and TB by applying weighted balance constraints to every subtrajectory of a sampled trajectory.
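The three objectives can be sketched as squared log-space residuals for a single trajectory. The version below follows the forms commonly given in the GFlowNet literature and is not taken from any particular library; the function names, the list-based inputs, and the geometric weighting parameter `lam` are assumptions made for this example. Here `log_pf[i]` is the forward log-probability of the step from state `i` to state `i + 1`, `log_pb[i]` is the backward log-probability of the reverse step, and `log_flow` holds one log state flow per state, with the last entry equal to the log reward.

```python
import math


def tb_loss(log_z, log_pf, log_pb, log_reward):
    """Trajectory Balance: one residual for the whole trajectory."""
    return (log_z + sum(log_pf) - log_reward - sum(log_pb)) ** 2


def db_loss(log_flow, log_pf, log_pb):
    """Detailed Balance: mean squared residual over individual transitions."""
    residuals = [
        log_flow[i] + log_pf[i] - log_flow[i + 1] - log_pb[i]
        for i in range(len(log_pf))
    ]
    return sum(r * r for r in residuals) / len(residuals)


def subtb_loss(log_flow, log_pf, log_pb, lam=0.9):
    """Subtrajectory Balance: weighted residuals over all subtrajectories j..k."""
    n = len(log_pf)
    total, norm = 0.0, 0.0
    for j in range(n):
        for k in range(j + 1, n + 1):
            residual = (
                log_flow[j] + sum(log_pf[j:k]) - log_flow[k] - sum(log_pb[j:k])
            )
            weight = lam ** (k - j)  # assumed geometric weighting by length
            total += weight * residual * residual
            norm += weight
    return total / norm


# Hypothetical two-step trajectory s0 -> s1 -> x whose quantities are exactly
# balanced: Z = F(s0) = 10, F(s1) = 5, R(x) = 2, and each state has one parent.
log_flow = [math.log(10.0), math.log(5.0), math.log(2.0)]
log_pf = [math.log(0.5), math.log(0.4)]
log_pb = [0.0, 0.0]

tb_value = tb_loss(log_flow[0], log_pf, log_pb, log_flow[-1])
db_value = db_loss(log_flow, log_pf, log_pb)
subtb_value = subtb_loss(log_flow, log_pf, log_pb)
```

All three values vanish (up to floating-point error) on this balanced example. In this sketch, the length-one subtrajectories in `subtb_loss` reproduce the DB residuals, and the full-length subtrajectory reproduces the TB residual when the flow at \(s_0\) plays the role of \(Z_\theta\).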
These objectives can be trained on-policy with trajectories sampled from the current forward policy, or off-policy using replay buffers and exploration strategies. Depending on the environment, we may jointly learn \(\mathcal{P}_B\) or keep it fixed (e.g., uniform over parents); in either case, there always exists a unique forward policy that satisfies the trajectory balance constraint for a given backward policy.
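The correspondence between a fixed backward policy and its matching forward policy can be illustrated by brute force on a small graph. The sketch below assumes a hypothetical toy environment (two-bit strings filled in any order), strictly positive illustrative rewards, and a uniform backward policy; none of the names come from a library. It computes state flows by propagating rewards backward through \(\mathcal{P}_B\), derives the forward policy from those flows, and collects the terminal marginals that result.

```python
from functools import lru_cache

S0 = (None, None)
# Hypothetical strictly positive rewards on the four terminal states.
REWARD = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}


def children(state):
    """States obtained by filling one empty position with 0 or 1."""
    return [
        state[:i] + (bit,) + state[i + 1:]
        for i, value in enumerate(state)
        if value is None
        for bit in (0, 1)
    ]


def parents(state):
    """States obtained by clearing one filled position."""
    return [
        state[:i] + (None,) + state[i + 1:]
        for i, value in enumerate(state)
        if value is not None
    ]


def backward_policy(state):
    """Fixed backward policy: uniform over the parents of `state`."""
    ps = parents(state)
    return {p: 1.0 / len(ps) for p in ps}


@lru_cache(maxsize=None)
def flow(state):
    """State flow: reward at terminals, otherwise flow pulled back from children."""
    if None not in state:
        return REWARD[state]
    return sum(flow(c) * backward_policy(c)[state] for c in children(state))


def forward_policy(state):
    """Forward policy implied by the flows and the fixed backward policy."""
    return {
        c: flow(c) * backward_policy(c)[state] / flow(state)
        for c in children(state)
    }


def terminal_marginals(state=S0, prob=1.0, out=None):
    """Push probability mass forward through the DAG and collect it at terminals."""
    out = {} if out is None else out
    if None not in state:
        out[state] = out.get(state, 0.0) + prob
        return out
    for child, p in forward_policy(state).items():
        terminal_marginals(child, prob * p, out)
    return out


marginals = terminal_marginals()
```

In this example the flow at \(s_0\) recovers the normalizing constant, and the terminal marginals coincide with \(\mathcal{R}(x)/Z\). Exact enumeration like this only works for tiny graphs; the training objectives exist precisely because it does not scale.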
The gfnx library provides ready-to-use implementations of these ideas, making it straightforward to prototype new objectives, environments, and inference procedures.