Autoregressive LLMs generate one token at a time. Discrete-diffusion LLMs only partly escape this: they still make hard token commitments one by one, just not left-to-right. Flow-based language models are the first practical, truly parallel alternative, refining every position continuously, committing nothing until the end. And surprisingly, they fit naturally with the same softmax-and-cross-entropy geometry LLMs already use, and are naturally distillable to one-step generators.
The problem with autoregression, and why discrete diffusion doesn't really fix it
Autoregressive LLMs have been the undisputed frontier of language modeling, one of the most consequential technologies of the last decade. The recipe is almost as simple as it gets: predict the next token, append it, repeat. Though it works remarkably well, it also carries a structural constraint in its very definition: token $n$ cannot be computed until tokens $1, \dots, n-1$ already exist. Generation is a chain, and chains are sequential. More GPUs can make training faster, but at inference time you are still waiting for token $k$ before token $k+1$ can even begin.
This leads to a natural question: do we need to generate text left-to-right at all? When people write, they usually do not. We sketch an idea, fill in a sentence from the middle, revise the opening, rewrite the ending. Language is often produced non-chronologically. Left-to-right generation is not a property of language itself; it is a modeling choice.
This is exactly the appeal of diffusion and flow models, the same families that now power state-of-the-art image and video generation. Instead of producing one token at a time, they begin from noise and refine the entire sequence in parallel. Every position is updated at every step.
Do standard text diffusion models solve this problem? The ones you have probably heard of, discrete diffusion LLMs (dLLMs), are typically trained with a masking or corruption schedule. At time $t$, roughly a fraction $t$ of the tokens are masked or randomly corrupted, while the rest remain visible, and the model is trained to predict the corrupted ones from context. At generation time, this process is run in reverse: you start from a fully masked or random sequence, reveal some tokens, feed the partially revealed sequence back in, reveal more, and continue until the sequence is complete. By the time you are 75% of the way through the process, roughly 75% of the tokens have already been fixed.
This does remove one limitation of autoregression: the model is no longer forced to generate strictly left to right, and it gets to choose which tokens to commit to first. But the deeper structure has not really changed. Generation is still a sequence of hard, irreversible commitments. Once a token is unmasked, it is fixed; the model does not revisit it. In that sense, a dLLM is still fundamentally autoregressive: just an autoregressive model that has learned its own generation order.
What we actually want is something stronger: true parallel generation. Every position should remain continuously refinable, with nothing finalized until the end, so that changing beliefs about one token keep influencing every other token throughout generation. This is what flow-based language models offer. And importantly, they sit in a genuine sweet spot: they keep the discrete training geometry of language modeling (softmax and cross-entropy) while remaining continuous as generative processes. Because the model is continuous, we can distill it into one or two steps with standard tools. Because it still speaks the native probabilistic language of token prediction, it remains well suited to discrete data. And because generation is refinement rather than commitment, the same objective gives us a natural handle for fine-grained test-time steering.
Zooming out for a second: two branches of research have been developing mostly in parallel. Continuous flows for discrete data (Variational Flow Matching (VFM)[1], and in a similar spirit CDCD[3]) on one side, and distillation of continuous flows into one or two steps (distillation through flow map matching[6]) on the other. What recently became clear is that combining them is the real recipe: predict a simplex-valued endpoint, train with KL / cross-entropy, then apply standard distillation to collapse the continuous flow into a few-step (or even single-step) generator. Moreover, as shown in VFM, these tools are in no way limited to text: the same machinery applies to any categorical data, including discrete structures like molecular graphs.
This post is my attempt at a pedagogical entry point into this area: flow matching, the variational reframing, and categorical flow maps for distillation. The story is still being written: nobody has done a full flow-based LLM pretraining run that matches a well-tuned AR baseline at scale, and due to the different nature of these models there's a lot we're still figuring out; I'll outline some directions at the end of the post. If you've been looking for a research area where the foundations are just settling and there's real room to push, now is a good moment to get in.
There are really two parts to the story. In Phase 1, we learn a flow: a continuous trajectory from noise to text. This is the conceptual core: it is what makes truly parallel generation possible. In Phase 2, we distill that flow into a model that can jump through the trajectory in one or two steps. This is the practical payoff: once the process is continuous, it becomes possible to compress it, accelerate it, and later even steer it at test time.
Learning to transform noise into text
Flow matching, intuitively
At heart, flows are simple. We want to move points that look like Gaussian noise at time $t=0$ to points that look like samples from our data distribution at time $t=1$. In other words, we want to learn how points should move. At each location $x$ and time $t$, we assign a direction (a velocity) so that if we start from a noise sample $x_0$ and follow those directions, we eventually arrive in the data.
There are many possible ways to move from noise to data. The key idea behind flow matching[7][8][9] is that we do not actually care which route we take, as long as we end up in the right place. So we might as well choose a route that is simple from the start.
This is done by defining a simple stochastic bridge between $p_0$ and $p_1$: a way of interpolating between the noise distribution and the data distribution as time runs from $t = 0$ to $t = 1$. The resulting family of intermediate distributions, written $(p_t)_{t \in [0, 1]}$, is called the probability path. It is simply a continuous path in distribution space from noise to data.
This is the crucial simplification. Instead of asking the model to discover both where the path should go and how to move along it, we choose the path ourselves. Once that path is fixed, the training problem becomes much easier, because the target dynamics are already built into the construction.
Concretely, sample a noise point $x_0$, sample a data point $x_1$, draw the straight line between them, and define
$$x_t = (1-t) \, x_0 + t \, x_1.$$Because $x_0$ and $x_1$ are random, $x_t$ is random too: every new pair $(x_0, x_1)$ gives a different point along a different line. So $x_t$ is not a single deterministic location, but a random variable whose distribution depends on $t$. We denote that distribution by $p_t$. In other words,
$$(1-t)\,x_0 + t\,x_1 \sim p_t.$$Notice what we have done here. We have defined $p_t$ not by writing down a density, but by specifying how to sample from it in terms of $p_0$ and $p_1$. That is already enough. By construction, $p_t$ looks like noise as $t \to 0$ and like data as $t \to 1$, so this gives us a continuous path of distributions connecting the two.
(We'll be a little loose about whether $x_t$ denotes the random variable or a sample from it; context makes it clear.)
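To make "defining $p_t$ by how to sample from it" concrete, here is a minimal sketch in plain Python. The two-point dataset $\{-1, +1\}$ and the helper name `sample_x_t` are assumptions chosen for illustration only; nothing here is specific to language.

```python
import random
import statistics

def sample_x_t(t, rng):
    """Draw one sample from p_t by interpolating a (noise, data) pair."""
    x_0 = rng.gauss(0.0, 1.0)        # noise sample, x_0 ~ N(0, 1)
    x_1 = rng.choice([-1.0, 1.0])    # toy data sample (assumed two-point dataset)
    return (1.0 - t) * x_0 + t * x_1

rng = random.Random(0)
for t in (0.0, 0.5, 1.0):
    xs = [sample_x_t(t, rng) for _ in range(20_000)]
    mean = statistics.fmean(xs)
    var = statistics.pvariance(xs)
    print(f"t={t:.1f}  mean={mean:+.3f}  var={var:.3f}")
```

For this toy setup the variance of $x_t$ should come out close to $(1-t)^2 + t^2$: pure noise at $t=0$, pure data at $t=1$, and a blend in between.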
Now here is the subtlety. Defining $p_t$ tells us where mass is at time $t$, but not yet how an individual point should move. And we cannot just tell a point to follow "its" straight line, because at a given location $x_t$, many different pairs $(x_0, x_1)$ could have produced it. If you stand at $x_t$ and ask, which way should I go?, there is no single answer: many noise-data pairs are compatible with that same point, each pointing in a slightly different direction.
So what's the trajectory of a single point? The natural answer is to go in the average direction of all the straight lines passing through $x_t$.
This gives us the ideal velocity field. At any point $x_t$, the velocity that transports mass from noise to data along the probability path $p_t$ is the expected direction
$$v_t(x_t) = \mathbb{E}[\, x_1 - x_0 \mid x_t \,].$$In words: given that I am currently at $x_t$, what is the average direction from noise to data among all couplings that could have passed through here?
The individual couplings we used to define $p_t$ are straight lines, but the trajectory obtained by following the velocity field is generally curved. The field doesn't know which particular $(x_0, x_1)$ pair generated the current point; it returns the average direction over all of them. As we move, the set of couplings consistent with our location changes, so the average direction changes too. The couplings are straight; the trajectory built from averaging them is not.
It is also worth noting that $x_1 - x_0$ is exactly the time derivative of the straight-line interpolation $\,x_t = (1-t)\,x_0 + t\,x_1$. So the target in the equation above is not arbitrary: it is literally the instantaneous velocity of the bridge. You can read $v_t(x_t) = \mathbb{E}[\, x_1 - x_0 \mid x_t \,]$ as saying: given that I am at $x_t$, what is the expected local direction of travel?
The same quantity can be written in a slightly different but equivalent way:
$$v_t(x_t) = \mathbb{E}\!\left[\dfrac{x_1 - x_t}{1 - t} \,\Big|\, x_t \right].$$This says the same thing from the point of view of the current location $x_t$: on average, how fast and in what direction do I need to move from where I am now in order to reach data by time $t=1$?
The term $\dfrac{x_1 - x_t}{1-t}$ is often called the conditional velocity. It is the velocity we would follow if we knew that our trajectory was heading toward a particular endpoint $x_1$. But for the straight bridge we are using, this quantity is exactly equal to $x_1 - x_0$, since
$$x_t = t\,x_1 + (1-t)\,x_0 \quad\Leftrightarrow\quad \dfrac{x_1 - x_t}{1-t} = x_1 - x_0.$$So the two expectation formulas are exactly the same object, just viewed from two different angles.
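For completeness, the algebra behind that equivalence uses nothing beyond the definition of the straight bridge, and is valid for any $t < 1$:

$$x_1 - x_t \;=\; x_1 - (1-t)\,x_0 - t\,x_1 \;=\; (1-t)\,(x_1 - x_0) \quad\Longrightarrow\quad \dfrac{x_1 - x_t}{1-t} \;=\; x_1 - x_0.$$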
So how do we actually learn this velocity field? Learning expectations is straightforward. The expected value $\mathbb{E}[Y]$ is the single prediction that is least wrong on average under mean-squared error. A conditional expectation $\mathbb{E}[Y \mid X]$ is the same idea, but now the best prediction is allowed to depend on $X$: for each value of $X$, what value of $Y$ minimises MSE?
That gives us an immediate recipe. To learn $\mathbb{E}[Y \mid X]$, we feed $X$ into a network, ask it to predict $Y$, and train it with an MSE loss. In our case, we want to learn $\mathbb{E}[\, x_1 - x_0 \mid x_t \,]$, so we simply regress onto the target $x_1 - x_0$:
$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\, \| v_t^{\theta}(x_t) - (x_1 - x_0) \|^2 \,\right].$$When this loss is minimised, by definition the network sits at the conditional expectation.
The key point is that we never need to know the "correct" velocity field explicitly at each location $x_t$. We only need samples: draw noisy points $x_t$, pair them with their corresponding targets $x_1 - x_0$, and let the MSE objective do the rest. By standard regression logic, the network is pushed toward the conditional expectation, which is exactly the velocity field we want.
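The regression logic can be checked on the smallest possible case: fitting a single constant to noisy targets with MSE. The sketch below is plain Python with an assumed toy target distribution; it only illustrates that the MSE minimiser is the mean, and the conditional version is the same fit carried out for each value of the input.

```python
import random

def mse(c, ys):
    """Mean-squared error of the constant prediction c."""
    return sum((y - c) ** 2 for y in ys) / len(ys)

rng = random.Random(0)
ys = [rng.gauss(2.0, 1.0) for _ in range(1_000)]  # toy targets (assumed distribution)
mean = sum(ys) / len(ys)

# Fit the constant by plain gradient descent on the MSE.
c = 0.0
for _ in range(200):
    grad = 2.0 * (c - mean)   # derivative of mean((y - c)^2) with respect to c
    c -= 0.1 * grad

print(f"fitted constant: {c:.4f}   sample mean: {mean:.4f}")
```

The fitted constant and the sample mean agree, which is the one-dimensional version of "the network sits at the conditional expectation".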
In practice, the training loop is almost embarrassingly short:
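import torch

# `data`, `model` and `optimizer` are assumed to be defined elsewhere
for x_1 in data:                                    # sample a batch of data points
    x_0 = torch.randn_like(x_1)                     # sample noise
    t = torch.rand(x_1.shape[0], device=x_1.device) # sample one time per example
    t_b = t.view(-1, *[1] * (x_1.dim() - 1))        # reshape so t broadcasts over data dims
    x_t = t_b * x_1 + (1 - t_b) * x_0               # interpolate

    v_pred = model(x_t, t)                          # predict the velocity
    loss = ((v_pred - (x_1 - x_0)) ** 2).mean()
    optimizer.zero_grad()                           # clear gradients from the previous step
    loss.backward()
    optimizer.step()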
That is really it: interpolate, predict the velocity, and regress onto the straight-line target.
Once the model is trained, sampling is just as simple. We draw $x_0 \sim \mathcal{N}(0, I)$, and then integrate the learned velocity field forward from $t=0$ to $t=1$. For example, with plain Euler steps of size $\Delta t$, we update
$$x_{t + \Delta t} = x_t + \Delta t \cdot v_t^{\theta}(x_t).$$In words: at each time $t$, ask the model which direction to move, take a small step in that direction, and repeat until you reach $t=1$.
The sampling loop is therefore just as short as the training loop:
x = torch.randn(batch_size, *data_shape)        # start from noise
dt = 1.0 / num_steps
with torch.no_grad():                           # no gradients needed at sampling time
    for i in range(num_steps):
        t = torch.full((batch_size,), i * dt)
        v = model(x, t)                         # velocity at x_t
        x = x + dt * v                          # take an Euler step
# x is now (approximately) a sample from the data distribution
Start from noise, follow the arrows, arrive at data.
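To see the sampler work without training anything, here is a sketch on a toy problem where the ideal velocity field has a closed form. The setup is an assumption for illustration: one-dimensional data uniform on $\{-1, +1\}$ and standard Gaussian noise. Conditioning on $x_t$ turns the expected conditional velocity into $(\mathbb{E}[x_1 \mid x_t] - x_t)/(1-t)$, and for this toy data $\mathbb{E}[x_1 \mid x_t] = \tanh\!\bigl(t\,x_t/(1-t)^2\bigr)$.

```python
import math
import random

def velocity(x, t):
    """Exact v_t(x) for the toy problem: x_0 ~ N(0, 1), x_1 uniform on {-1, +1}."""
    denoised = math.tanh(t * x / (1.0 - t) ** 2)  # E[x_1 | x_t = x] in closed form
    return (denoised - x) / (1.0 - t)

def sample(rng, num_steps=100):
    x = rng.gauss(0.0, 1.0)              # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt                       # t never exceeds 1 - dt, so 1 - t stays positive
        x = x + dt * velocity(x, t)      # Euler step
    return x

rng = random.Random(0)
samples = [sample(rng) for _ in range(2_000)]
frac_pos = sum(x > 0 for x in samples) / len(samples)
print(f"fraction of samples near +1: {frac_pos:.3f}")
```

Every sample should end up essentially on one of the two data points, split roughly evenly between them. A learned `model(x, t)` plays the role of `velocity` in the real setting.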
The variational perspective
Now let us look again at the key equation,
$$v_t(x_t) = \mathbb{E}\!\left[\dfrac{x_1 - x_t}{1 - t} \,\Big|\, x_t \right].$$Once we condition on $x_t$, both $x_t$ and $1 - t$ are fixed: they are the things we are given, not the things being averaged over. So we can pull them outside the expectation:
$$v_t(x_t) = \dfrac{\mathbb{E}[x_1 \mid x_t] - x_t}{1 - t}.$$This is the denoiser perspective on flow matching. Instead of asking the network to predict a velocity directly, we can equivalently ask it to predict $\mathbb{E}[x_1 \mid x_t]$, the conditional mean of the clean endpoint given the noisy point $x_t$. In other words, given where we are now, what is our best guess for where this point will end up at time $t=1$?
Let us write this prediction as $\hat{x}_{1,t}^{\theta}(x_t)$ and train it by simple regression onto the true endpoint $x_1$:
$$\mathcal{L}(\theta) = \mathbb{E}\!\left[\, \| \hat{x}_{1,t}^{\theta}(x_t) - x_1 \|^2 \,\right].$$So there are really two equivalent views of the same learning problem. In the velocity view, we train the model to predict the local direction of motion. In the denoiser view, we train it to predict the clean endpoint, and recover the velocity from that prediction through
$$v_t^{\theta}(x_t) = \dfrac{\hat{x}_{1,t}^{\theta}(x_t) - x_t}{1 - t}.$$The denoiser does not say "which way should I move?" directly. It says "where do I think this point is trying to end up?", and the velocity is just the arrow from the current point to that prediction, rescaled by $1/(1-t)$.
The two objectives are linearly related: once you have a denoiser, you can recover a velocity, and once you have a velocity, you can recover a denoiser. In that sense, choosing between them is mostly a matter of taste, or of numerical convenience.
Except, for language, it is not just that.
Suppose $x_1$ is a token: a discrete object drawn from a finite vocabulary. What would it even mean to predict $\mathbb{E}[x_1 \mid x_t]$ and train it with MSE? For discrete data, we do not want to predict a point at all. We want to predict a distribution over possible tokens, trained with cross-entropy: exactly the probabilistic setup every LLM already uses.
Here is the key insight of variational flow matching. Minimising $\|\hat{x}_1 - x_1\|^2$ is exactly maximum-likelihood estimation under a Gaussian observation model with fixed variance:
$$q_t^{\theta}(x_1 \mid x_t) \;=\; \mathcal{N}\!\left(\hat{x}_{1,t}^{\theta}(x_t),\, \sigma^2 I\right).$$So flow matching was already probabilistic all along; it just happened to be using a Gaussian conditional. The "prediction" $\hat{x}_1$ is the mean of that conditional, and the MSE loss is its negative log-likelihood up to a positive scale factor and an additive constant. In this sense, MSE was never the fundamental object; it came from a probabilistic assumption we usually leave implicit.
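The link between MSE and a fixed-variance Gaussian likelihood is easy to verify numerically. In the plain-Python sketch below (the vectors and helper names are made up for illustration), the difference in negative log-likelihood between two candidate predictions equals the difference in squared error divided by $2\sigma^2$, so both objectives rank predictions identically.

```python
import math

def sq_err(x_1, mu):
    """Squared Euclidean distance between endpoint x_1 and prediction mu."""
    return sum((a - b) ** 2 for a, b in zip(x_1, mu))

def gaussian_nll(x_1, mu, sigma=1.0):
    """-log N(x_1; mu, sigma^2 I) for vectors given as lists."""
    d = len(x_1)
    return sq_err(x_1, mu) / (2.0 * sigma ** 2) + 0.5 * d * math.log(2.0 * math.pi * sigma ** 2)

x_1 = [0.3, -1.2, 2.0]     # hypothetical clean endpoint
mu_a = [0.0, -1.0, 1.5]    # two hypothetical predictions
mu_b = [1.0, 1.0, 1.0]

nll_gap = gaussian_nll(x_1, mu_a) - gaussian_nll(x_1, mu_b)
mse_gap = (sq_err(x_1, mu_a) - sq_err(x_1, mu_b)) / 2.0   # sigma = 1
print(nll_gap, mse_gap)
```

The log-normaliser cancels in the gap because it does not depend on the prediction, which is why it can be dropped from the training loss.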
Once we make that assumption explicit, we are no longer forced to use a Gaussian. For continuous data, Gaussian conditionals are natural, and MSE is exactly right. But for language, where $x_1$ is discrete, we should use a different conditional family, one whose likelihood is measured by cross-entropy rather than squared error. That is where the variational perspective becomes much more than a rephrasing: it gives us the freedom to keep the flow, while swapping in the right probabilistic geometry for discrete data.
Once you see this, the generalisation is almost unavoidable: flow matching should be treated as a probabilistic objective. Instead of thinking of the model as predicting a single endpoint, we let it model the conditional distribution $p_t(x_1 \mid x_t)$, the posterior over clean endpoints given the current noisy point.
A small notational switch is worth making here. Up to now we wrote the network's output as $\hat{x}_{1,t}^{\theta}(x_t)$, the "predicted endpoint." From now on we write it as $\mu_t^{\theta}(x_t)$, the mean of the predictive distribution $q_t^{\theta}(x_1 \mid x_t)$. For a Gaussian head, $\mu_t^{\theta}$ literally is the predictive mean and we're back to the denoiser. For a categorical head, $\mu_t^{\theta}$ is a vector of class probabilities, which is also the mean of a one-hot draw.
Let $q_t^{\theta}(x_1 \mid x_t)$ be our model's approximation to the true posterior. The natural way to fit it is with a distributional objective: minimise the KL divergence from the true posterior to the model, that is, optimise for the distribution $q$ that, on average, loses the least information about $p$. Up to constants that do not depend on $\theta$, this is exactly maximum likelihood, giving
$$\mathcal{L}(\theta) \;=\; \mathbb{E}\!\left[\, -\log q_t^{\theta}(x_1 \mid x_t) \,\right].$$This is Variational Flow Matching (VFM)[1]. Nothing more exotic than maximum likelihood on the endpoint distribution, with $q_t^{\theta}$ parameterised however we like. For continuous data, a Gaussian recovers the familiar MSE formulation. For discrete data, we choose a categorical and recover cross-entropy.
Note that to flow to discrete data (tokens, for example) we first need to place categorical objects into a space where flow makes sense. The standard move is to represent each token as a one-hot vector: if there are $K$ possible tokens, each token becomes a vector in $\mathbb{R}^K$ with a single $1$ and the rest $0$.
These points lie on the corners of the so-called probability simplex
$$\Delta^{K-1} \;=\; \left\{\, x \in \mathbb{R}_{\ge 0}^{K} : \sum_{k=1}^{K} x_k = 1 \,\right\},$$and any distribution over them (i.e. any convex combination) fills up the simplex exactly. So our $\mu_t^{\theta}(x_t)$ lives precisely inside this simplex (a triangle when $K = 3$): each corner is a single token, and every interior point is a probability distribution over tokens. For example, a point $\mu$ with $\mu_k = 0.7$ and $\mu_j = 0.3$ represents a 70/30 mixture between tokens $k$ and $j$.
Once we view tokens as simplex vertices, the probabilistic choice becomes natural. We let $q_t^{\theta}(x_1 \mid x_t)$ be a categorical distribution, with parameters $\mu_t^{\theta}(x_t) \in \Delta^{K-1}$ produced by a softmax head. The VFM objective then becomes
$$\mathcal{L}(\theta) \;=\; \mathbb{E}\!\left[\, -\log q_t^{\theta}(x_1 \mid x_t) \,\right] \;=\; -\mathbb{E}\!\left[\, \sum_{k=1}^{K} \mathbb{1}[x_1 = k]\, \log \mu_{t,k}^{\theta}(x_t) \,\right].$$But this is just cross-entropy against the clean token: exactly the loss every language model already uses, now applied at every noise level $t$.
So in the discrete case, VFM does not force us into an unnatural regression objective. It gives us precisely the probabilistic geometry language modelling already wants: softmax outputs, cross-entropy training, and predictions that live on the simplex.
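As a small sanity check of that identity, the plain-Python sketch below computes the categorical negative log-likelihood and the cross-entropy against a one-hot target for the same logits. The logits and function names are illustrative assumptions, not part of any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax: maps logits to a point on the simplex."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_nll(logits, token):
    """-log q(x_1 = token | x_t) for a categorical with softmax parameters."""
    return -math.log(softmax(logits)[token])

def cross_entropy_one_hot(logits, token):
    """Cross-entropy between the one-hot vertex for `token` and softmax(logits)."""
    mu = softmax(logits)
    one_hot = [1.0 if k == token else 0.0 for k in range(len(logits))]
    return -sum(p * math.log(q) for p, q in zip(one_hot, mu))

logits = [1.5, -0.3, 0.2, 0.0]   # hypothetical model output for K = 4 tokens
token = 2                         # hypothetical clean token
print(categorical_nll(logits, token), cross_entropy_one_hot(logits, token))
```

Both numbers agree because the one-hot target zeroes out every term of the sum except the one for the clean token.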
Training therefore looks almost exactly like before:
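import torch
import torch.nn.functional as F

# K is the vocabulary size; `data`, `model` and `optimizer` are assumed to be defined elsewhere
for x_1 in data:                                    # clean token ids, shape (batch, seq_len)
    x_1_hot = F.one_hot(x_1, num_classes=K).float() # simplex vertices, (batch, seq_len, K)
    x_0 = sample_noise_distribution(...)            # sample a noise "sequence", same shape as x_1_hot
    t = torch.rand(x_1.shape[0])                    # sample one time per sequence
    t_b = t.view(-1, 1, 1)                          # broadcast over positions and vocab
    x_t = t_b * x_1_hot + (1 - t_b) * x_0           # interpolate
    logits = model(x_t, t)                          # predict categorical over vocab, (batch, seq_len, K)
    loss = F.cross_entropy(logits.flatten(0, 1), x_1.flatten())  # MLE = minimize KL to posterior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()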
At sampling time, we still need a velocity field, which means we need the conditional mean $\mathbb{E}_{q_t^{\theta}}[x_1 \mid x_t]$. The simplex structure gives a clean answer: if $x_1$ is one-hot and $q_t^{\theta}$ is categorical with parameters $\mu_t^{\theta}(x_t)$, the expectation is just the probability vector,
$$\mathbb{E}_{q_t^{\theta}}[x_1 \mid x_t] \;=\; \mu_t^{\theta}(x_t) \;=\; \mathrm{softmax}\!\left(\text{model}(x_t, t)\right).$$So the induced velocity is
$$v_t^{\theta}(x_t) \;=\; \dfrac{\mu_t^{\theta}(x_t) - x_t}{1 - t},$$just the vector pointing from the current point $x_t$ toward the model's predicted point on the simplex.
The recipe is simple: start from noise, follow these arrows, and end up at categorical predictions over tokens.
One small numerical note: the factor $1/(1-t)$ blows up as $t \to 1$. In practice, one stops integration slightly before $t=1$ and reads off the final token predictions directly from $\mu_t^{\theta}(x_t)$, which remains well behaved even when the velocity does not.
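Here is a sketch of the full categorical sampler on a toy problem where the exact posterior is available in closed form, so no network is needed. The assumptions are mine: a single position, $K = 3$ tokens with a fixed prior `PRIOR`, and standard Gaussian noise. Under those assumptions the posterior logits are $\log \pi_k + t\,x_{t,k}/(1-t)^2$. The loop stops one step before $t = 1$ and reads the token off $\mu$, as suggested above.

```python
import math
import random

PRIOR = [0.6, 0.3, 0.1]   # assumed toy data distribution over K = 3 tokens

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def posterior(x, t):
    """Exact mu_t(x) = P(token | x_t = x) for one-hot data and N(0, I) noise."""
    scale = t / (1.0 - t) ** 2
    return softmax([math.log(p) + scale * x_k for p, x_k in zip(PRIOR, x)])

def sample_token(rng, num_steps=50):
    x = [rng.gauss(0.0, 1.0) for _ in PRIOR]     # x_0 ~ N(0, I)
    dt = 1.0 / num_steps
    for i in range(num_steps - 1):               # integrate from t = 0 up to t = 1 - dt
        t = i * dt
        mu = posterior(x, t)
        # Euler step along v_t(x) = (mu - x) / (1 - t)
        x = [x_k + dt * (m_k - x_k) / (1.0 - t) for x_k, m_k in zip(x, mu)]
    mu = posterior(x, 1.0 - dt)                  # stop before t = 1 and read off mu
    return max(range(len(mu)), key=mu.__getitem__)

rng = random.Random(0)
counts = [0] * len(PRIOR)
for _ in range(2_000):
    counts[sample_token(rng)] += 1
print([c / sum(counts) for c in counts])
```

The empirical token frequencies should be close to `PRIOR`, up to Monte Carlo noise and Euler discretisation error. In the real setting `posterior` is replaced by `softmax(model(x_t, t))` applied at every position.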
What this means for language
It is worth pausing on what we have actually built. At every time $t$ and every state $x_t$, the model predicts a point
$$\mu_t^{\theta}(x_t) \in \Delta^{K-1}$$on the simplex. By construction, this point is a posterior over the vocabulary: coordinate $k$ is the probability that the clean token is token $k$, given the current state. Generation is therefore a continuous process of steering $x_t$ toward its predicted posterior. Early on, that posterior is diffuse (many tokens are plausible), and as $t \to 1$, it sharpens toward a single vertex of the simplex, that is, a single token.
The softmax output of a language model is not just a convenient way to turn logits into probabilities. In the discrete setting, it is exactly the natural object the denoiser should predict. Likewise, cross-entropy is not just a useful loss that happens to work: it is precisely the maximum-likelihood, KL-to-the-posterior objective that VFM prescribes when the endpoint is categorical. The softmax-and-cross-entropy machinery of language modelling was already geometrically correct all along. VFM explains why.
One more wrinkle. A language sample is a whole sequence, $x_1 = (x_1^{1}, \dots, x_1^{n})$, so in principle the posterior $p_t(x_1 \mid x_t)$ lives over full sequences, with $K^n$ possible outcomes: far too big to parameterise directly. The useful fact is that the sampler only needs the posterior mean $\mathbb{E}[x_1 \mid x_t]$, and the expectation of a vector is taken coordinate by coordinate, so it decomposes position-wise:
$$\mathbb{E}[x_1 \mid x_t] \;=\; \bigl(\, \mathbb{E}[x_1^{1} \mid x_t],\; \ldots,\; \mathbb{E}[x_1^{n} \mid x_t] \,\bigr).$$Each term is itself a point on $\Delta^{K-1}$: a posterior over the vocabulary at position $i$. So the output head is a product of per-position simplices: $n$ softmaxes, one per position.
This parallels the Gaussian case, where the learned $\mu_t^\theta$ doesn't depend on $\Sigma_t$ at convergence: the VFM objective pins down only the conditional mean, so the sampler runs off $\mu_t^\theta$ alone and cross-coordinate correlations in $q_t^\theta$ drop out. That doesn't mean $\Sigma_t$ is useless: modelling it can reshape training dynamics (it rescales gradients and can stabilise learning), and it gives you a per-step uncertainty estimate, useful for adaptive step sizes, calibration, or scoring trajectory confidence. It is an interesting open direction, just not one that changes the marginals you'd sample. Same story here: cross-token structure isn't modelled in the head, it lives in the conditioning. Each $\mu_t^{\theta,\,i}(x_t)$ depends on the full $x_t$, so the transformer carries the joint structure while the softmax head (or $n$ separate heads, all conditioning on the same $x_t$) reads off the marginals.
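In code, "$n$ softmaxes, one per position" is just a softmax applied row by row to an $n \times K$ table of logits. The sketch below uses plain Python and made-up logits; in a real model every row would be computed from the full $x_t$.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    total = sum(exps)
    return [e / total for e in exps]

def per_position_posteriors(logits):
    """logits: n rows (positions) by K columns (vocab). Returns n points on the simplex."""
    return [softmax(row) for row in logits]

# Hypothetical model output for a sequence of n = 2 positions and K = 3 tokens.
logits = [[2.0, 0.0, -1.0],
          [0.1, 0.2, 0.3]]
mu = per_position_posteriors(logits)
tokens = [max(range(len(row)), key=row.__getitem__) for row in mu]  # per-position argmax
print(mu, tokens)
```

The head itself never couples positions: any dependence between tokens has to enter through the shared input that produced the logits.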
And that is the basic picture of flow-based language generation: a continuous refinement process in which the model repeatedly predicts, at every position, a posterior over what the clean token should be.
Speeding up the flow
Flow maps: jumping along trajectories
Now let us zoom back out. At the end of Phase 1, we have learned a flow: a velocity field $v_t^{\theta}$ that tells any point which way to move at any time. To sample from it, we start at $t=0$ and integrate this velocity field forward to $t=1$ with an ODE solver.
The catch is that this usually takes many small steps, often dozens or hundreds. And each step requires a full forward pass through the network. For language, that means attention over the full sequence every time. This is where flow models lose their elegance in practice: not in training, but in the cost of repeatedly evaluating the model during sampling.
This is where distillation enters: treating the trained flow as a teacher and training a faster student to reproduce the same end-to-end behaviour in fewer steps. The trajectories traced out by the trained flow become supervision, and we can ask a new question: instead of following the flow in many tiny steps, can we learn to make much larger jumps along it?
This leads to the idea of a flow map[6]. The learned flow implicitly defines a mapping from any state $(x_s, s)$ to the state it reaches at any later time $t$. In other words, it defines a function $X_{s,t}(x_s)$: if I start at point $x_s$ at time $s$, where do I end up by time $t$ if I follow the flow? That is the object we now want to learn directly.
A convenient way to parameterise it is as a secant step. Rather than predicting an instantaneous velocity, we directly predict the average velocity of the chord from $x_s$ to $x_t$: the single direction that, followed for the interval $t - s$, lands you at the future state. Writing that direction as $v_{s,t}(x_s)$, the flow map becomes
$$X_{s,t}(x_s) \;=\; x_s \,+\, (t-s)\, v_{s,t}(x_s).$$Now, to learn the flow map, one could, in principle, simulate trajectories from the flow and regress onto their jumps. But that's slow: you need many simulation steps to get accurate training targets.
The beautiful insight of flow maps is that you don't have to: the flow map and the flow are related in a very simple way. As $s$ and $t$ get closer to each other, the flow map becomes a smaller and smaller jump, and in the limit $s \to t$, the change in the flow map is exactly the velocity of the flow,
$$\lim_{s \to t}\; \partial_t\, X_{s,t}(x) \;=\; v_t(x).$$More generally, for any $s \le t$, we have a necessary and sufficient condition: given the boundary condition $X_{s,s}(x_s) = x_s$ (which the secant parameterisation satisfies automatically), $X_{s,t}$ is the flow map of $v_t$ if and only if its own time-derivative agrees with the velocity field evaluated at the current point along the trajectory it traces:
$$\partial_t\, X_{s,t}(x_s) \;=\; v_t\!\bigl(X_{s,t}(x_s)\bigr).$$This is key. It's not a suggestive analogy; it's an identification: the equation above defines what it means to be a flow map of $v_t$, and any $X$ satisfying it pointwise is one.
Crucially, this is a pointwise relation, evaluated at a single $(s, t, x_s)$, with no simulation. That's what makes it a training target.
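The condition can be checked by hand on a toy flow whose flow map is known exactly. Assume, purely for illustration, that the data distribution is a single point `C`, so that $\mathbb{E}[x_1 \mid x_t] = C$ and $v_t(x) = (C - x)/(1 - t)$. The flow map is then $X_{s,t}(x_s) = x_s + (t - s)\,(C - x_s)/(1 - s)$. The sketch below compares both sides of the condition at one point, and compares one jump against many Euler steps.

```python
C = 2.0   # the single "data point" of the toy problem (an assumption for illustration)

def velocity(x, t):
    """v_t(x) = (E[x_1 | x_t] - x) / (1 - t); with one data point, E[x_1 | x_t] = C."""
    return (C - x) / (1.0 - t)

def flow_map(x_s, s, t):
    """Closed-form X_{s,t}(x_s): a secant step with average velocity (C - x_s) / (1 - s)."""
    return x_s + (t - s) * (C - x_s) / (1.0 - s)

def dt_flow_map(x_s, s, t, h=1e-6):
    """Central finite-difference estimate of d/dt X_{s,t}(x_s)."""
    return (flow_map(x_s, s, t + h) - flow_map(x_s, s, t - h)) / (2.0 * h)

def euler(x_s, s, t, num_steps=100):
    """Integrate the velocity field from s to t with small Euler steps."""
    x, dt = x_s, (t - s) / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, s + i * dt)
    return x

x_s, s, t = -0.7, 0.2, 0.6
lhs = dt_flow_map(x_s, s, t)              # time derivative of the flow map
rhs = velocity(flow_map(x_s, s, t), t)    # velocity at the landing point
print(lhs, rhs, flow_map(x_s, s, t), euler(x_s, s, t))
```

Both sides of the condition match, and a single flow-map evaluation lands where 100 Euler steps do. No trajectory had to be simulated to evaluate the condition itself.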
One nice consequence: since $\partial_t X_{s,t}$ coincides with the instantaneous velocity in the limit $s \to t$, the diagonal $v_{t,t}$ exactly recovers $v_t$. So you can train the flow and its flow map in the same network: standard flow matching on $v_{t,t}$, plus a loss built on the Lagrangian condition above to extend it to general $(s,t)$. This is self-distillation: one network, both objectives, no separate teacher. In practice, since both sides of the identity share weights, optimization is stabilised by putting a stop-gradient on the velocity $v_t$ that the flow map is being pulled towards, so the flow-map head chases a detached target rather than a moving one. Whether self-distillation is fundamentally better or worse than the two-stage approach (train the flow first, distill second) is still an open question.
The Lagrangian condition is an identity the true flow map obeys. Turning it into a loss is the obvious move: pick a source time $s$, a target time $t \ge s$, sample a noise-data pair $(x_0, x_1)$ to get a source point $x_s$, and measure how far the two sides of the identity are apart. Averaging that squared gap over all $(s, t, x_0, x_1)$ gives the Lagrangian self-distillation (LSD) objective:
$$\mathcal{L}_{\text{distill}}(\varphi) \;=\; \mathbb{E}\!\left[\,\bigl\|\,\partial_t\, X^{\varphi}_{s,t}(x_s) \;-\; v_t\!\bigl(X^{\varphi}_{s,t}(x_s)\bigr)\,\bigr\|^2\,\right].$$Here $X^{\varphi}_{s,t}$ is parameterised as a secant step from the source point:
$$X^{\varphi}_{s,t}(x_s) \;=\; x_s \,+\, (t-s)\, v^{\varphi}_{s,t}(x_s),$$so a single network head outputs the average direction $v^{\varphi}_{s,t}$, and both the velocity field $v_t$ and the flow map $X^{\varphi}_{s,t}$ are read off from it. Equivalently, if $v_t$ is parameterised as the diagonal $v_{t,t}$, the flow and its flow map share a single network.
A few things are worth unpacking here. The network outputs both a velocity $v_t$ and a flow map $X_{s,t}$ (they share weights and are deduced from the same head); LSD pushes these two readings into consistency with each other. The inner term is evaluated at a single source point $x_s$, so there is no trajectory integration anywhere: just one forward pass to evaluate $X_{s,t}$, one to evaluate $v_t$, and a squared distance. And because the identity is exact at $s = t$ (where $X_{t,t}(x) = x$ and $\partial_t X_{t,t}(x) = v_t(x)$ by construction), the loss naturally bootstraps from the short-jump regime out to longer jumps. Hence self-distillation: the short-$|t - s|$ predictions teach the long ones, with no external teacher.
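To get a feel for what the loss measures, here is a Monte Carlo sketch on a toy problem with a one-parameter family of candidate flow maps. The assumptions are mine: the data is a single point `C`, so $v_t(x) = (C - x)/(1 - t)$ is known exactly, and the candidate average velocity is $v_{s,t}(x) = k\,(C - x)/(1 - s)$, which is the true one only for $k = 1$. Because this candidate does not depend on $t$, the time derivative of the secant step is simply $v_{s,t}$, so no automatic differentiation is needed here.

```python
import random

C = 2.0   # single toy data point (an assumption for illustration)

def velocity(x, t):
    """Exact v_t(x) for the toy problem."""
    return (C - x) / (1.0 - t)

def lsd_loss(k, rng, num_samples=1_000):
    """Monte Carlo LSD loss for the candidate v_{s,t}(x) = k * (C - x) / (1 - s)."""
    total = 0.0
    for _ in range(num_samples):
        s, t = sorted(rng.uniform(0.0, 0.9) for _ in range(2))  # s <= t < 1
        x_0 = rng.gauss(0.0, 1.0)
        x_s = (1.0 - s) * x_0 + s * C          # source point on the bridge
        v_st = k * (C - x_s) / (1.0 - s)       # candidate average velocity
        X_st = x_s + (t - s) * v_st            # secant-step flow map
        dt_X = v_st                            # d/dt X_{s,t}, since v_st has no t-dependence
        total += (dt_X - velocity(X_st, t)) ** 2
    return total / num_samples

for k in (0.5, 1.0, 1.5):
    print(k, lsd_loss(k, random.Random(0)))
```

The loss vanishes (up to floating-point error) only at $k = 1$, the true flow map, and is strictly positive for the mismatched candidates. With a neural network the same residual is computed with automatic differentiation in $t$ instead of a hand-derived derivative.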
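The full training loop, when training the flow and its flow map simultaneously, fits in a few lines:
import torch
import torch.nn.functional as F
from torch.func import jvp

# K is the vocabulary size; `data`, `model` and `optimizer` are assumed to be defined elsewhere
for x_1 in data:                                    # clean token ids (batch, seq_len)
    x_1 = F.one_hot(x_1, num_classes=K).float()     # one-hot vertices (batch, seq_len, K)
    x_0 = sample_noise_distribution(...)            # noise "sequence", same shape as x_1
    u, w = torch.rand(2, x_1.shape[0])
    s, t = torch.minimum(u, w), torch.maximum(u, w) # source and target times, s <= t <= 1
    bc = lambda a: a.view(-1, 1, 1)                 # broadcast times over (seq_len, K)
    x_s = (1 - bc(s)) * x_0 + bc(s) * x_1           # interpolant at s
    x_t = (1 - bc(t)) * x_0 + bc(t) * x_1           # interpolant at t

    # flow: standard flow matching on v_{t,t} (skip if v_t already trained)
    v_tt = model(x_t, t, t)                         # velocity at (t, t)
    loss_fm = ((v_tt - (x_1 - x_0)) ** 2).mean()

    # flow map (secant step) and its time derivative d/dt X_{s,t}(x_s), via a JVP in t;
    # this assumes the model treats batch elements independently
    flow_map = lambda tau: x_s + bc(tau - s) * model(x_s, s, tau)
    X_st, dt_X = jvp(flow_map, (t,), (torch.ones_like(t),))

    # flow map: Lagrangian self-distillation against a detached (stop-gradient) velocity
    v_tt_at_X = model(X_st, t, t).detach()          # velocity at landing point
    loss_lsd = ((dt_X - v_tt_at_X) ** 2).mean()

    loss = loss_fm + loss_lsd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()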
And sampling collapses to a single flow-map evaluation:
B = x_0.shape[0]
zero, half, one = torch.zeros(B), torch.full((B,), 0.5), torch.ones(B)
# one-step generation: a single jump over [0, 1], so the step size t - s is 1
x_1 = x_0 + model(x_0, zero, one)
# two-step generation (alternative): two jumps of length 0.5
x_half = x_0 + 0.5 * model(x_0, zero, half)
x_1 = x_half + 0.5 * model(x_half, half, one)
If the flow map is accurate enough, we can replace many forward passes with one or a handful. The empirical results here have been striking: flow-map-based samplers can hit the quality of many-step flow integrators with 10 to 20 times fewer function evaluations, and sometimes in a single step.
What flow maps buy you beyond speed
Flow maps aren't just an efficiency trick grafted onto a flow. They're a shift in what object we're learning β not the instantaneous dynamics, but the solution operator itself. And once you have that, you get something a step-by-step integrator structurally can't: the ability to see across time, to infer where a point will end up if it keeps moving the way it's moving now.
For language, this might matter more than the speedup. Take test-time guidance: you want to nudge the sample toward some property β a constraint, a reward, a classifier score. With a flow map, you can look ahead: evaluate where the current point is going to land, check whether that landing spot satisfies your constraint, and correct accordingly. A plain velocity-field sampler only sees the local tangent; it can't plan around where the trajectory is going. This opens up a much richer space of controllable generation.
A second structural point worth flagging: this kind of distillation is something flows have and discrete diffusion fundamentally does not. A dLLM generates by unmasking tokens one (batch) at a time, and each unmasking is a hard, irreversible commitment, so there's no continuous trajectory to shortcut. You can't "jump" from 90% masked to fully unmasked in one step because the intermediate commitments aren't smoothly related. A flow's trajectory is smooth, and that smoothness is exactly what makes it compressible into a flow map. The inability to distill dLLMs isn't an engineering detail; it's structural.
The natural next question: can we do this for categorical flows? Can we learn a flow map over the simplex, and with it one- or few-step language generation with all the goodies of VFM? That's where Categorical Flow Maps come in.
Categorical Flow Maps
We have all the pieces. Let's put them together.
From Phase 1, we learned that in the language setting the right thing to predict isn't a velocity in $\mathbb{R}^d$; it's a posterior $q_t^{\theta}(x_1 \mid x_t)$ over tokens, which lives on the simplex $\Delta^{K-1}$. The velocity that falls out of this is
$$v_t^{\theta}(x_t) \;=\; \dfrac{\mathbb{E}_{q_t^{\theta}}[x_1 \mid x_t] - x_t}{1 - t}.$$
Notice where it points. The numerator is (predicted endpoint on the simplex) minus (current state), so at every point the velocity pulls $x_t$ toward a simplex-valued prediction. The dynamics are a continuous flow of probability mass toward the simplex.
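A small worked example may help here. The sketch below evaluates the velocity formula for a single position with $K = 3$ tokens; the function name and the numbers are made up for illustration, and the posterior `q` is simply assumed rather than produced by a network.

```python
def velocity_from_posterior(x_t, q, t):
    """v_t(x_t) = (E_q[x_1 | x_t] - x_t) / (1 - t).

    For one-hot x_1, the expectation E_q[x_1 | x_t] is just the vector of
    posterior token probabilities q.
    """
    return [(q_k - x_k) / (1.0 - t) for x_k, q_k in zip(x_t, q)]


x_t = [0.9, -0.4, 0.7]  # current noisy state for one position (K = 3)
q = [0.1, 0.8, 0.1]     # assumed posterior over the 3 tokens
t = 0.5

v = velocity_from_posterior(x_t, q, t)

# Following v for the remaining time (1 - t) lands on the predicted endpoint.
landing = [x_k + (1.0 - t) * v_k for x_k, v_k in zip(x_t, v)]
```

Up to floating-point rounding, `v` is `[-1.6, 2.4, -1.2]` and `landing` equals `q`: moving along the velocity for the remaining time $1 - t$ arrives at the simplex-valued prediction, even though `x_t` itself is not on the simplex.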
Now bring in Phase 2. Instead of learning the instantaneous velocity and integrating it, we learn a flow map: a function $X_{s,t}(x_s)$ that directly outputs where $x_s$ lands at time $t$. For categorical data there's a nice observation: if every infinitesimal velocity along the trajectory points toward the simplex, the whole trajectory, endpoint included, stays on the simplex too. So the flow map's output must be simplex-valued.
Rather than let the network emit an arbitrary vector in $\mathbb{R}^K$ and hope it lands in the right geometry, we bake the simplex constraint into the parametrisation. The network predicts an endpoint $\mu_{s,t} \in \Delta^{K-1}$ (the same $\mu$ as in Phase 1, now with two time arguments) via a softmax head, and the flow map takes a fraction $(t-s)/(1-s)$ of the way from $x_s$ toward $\mu_{s,t}$, which rewrites as a convex combination with weights scaled to the interval $[s,t]$:
$$\hat{X}_{s,t}(x_s) \;=\; x_s \;+\; (t-s)\,\dfrac{\mu_{s,t} - x_s}{1-s} \;=\; \dfrac{1-t}{1-s}\, x_s \;+\; \dfrac{t-s}{1-s}\, \mu_{s,t}.$$
Two things fall out. As $t \to 1$, the first coefficient vanishes and $\hat{X}_{s,1}(x_s) = \mu_{s,1}$: the simplex-valued prediction is the final sample. And because both coefficients are non-negative and sum to one, if $x_s$ and $\mu_{s,t}$ both lie on the simplex, so does $\hat{X}_{s,t}(x_s)$; the simplex is preserved by construction. The network's only job is to pick the right endpoint; the geometry does the rest. This is the central design choice of Categorical Flow Maps (CFM).
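Both properties are easy to check numerically. The sketch below uses a hypothetical helper `flow_map_from_endpoint` and made-up numbers; `mu` stands in for the softmax output of the network.

```python
def flow_map_from_endpoint(x_s, mu_st, s, t):
    """X_hat_{s,t}(x_s) = (1-t)/(1-s) * x_s + (t-s)/(1-s) * mu_st,
    valid for s <= t <= 1 and s < 1."""
    w = (t - s) / (1.0 - s)  # weight on the predicted endpoint
    return [(1.0 - w) * x_k + w * m_k for x_k, m_k in zip(x_s, mu_st)]


x_s = [0.5, 0.3, 0.2]    # a point on the simplex
mu = [0.05, 0.90, 0.05]  # assumed softmax output, also on the simplex
s, t = 0.25, 0.75

x_hat = flow_map_from_endpoint(x_s, mu, s, t)

# The same point written as a secant step x_s + (t - s) * v_st.
secant = [x_k + (t - s) * (m_k - x_k) / (1.0 - s) for x_k, m_k in zip(x_s, mu)]

# Jumping all the way to t = 1 returns the endpoint prediction itself.
final = flow_map_from_endpoint(x_s, mu, s, 1.0)
```

Here `x_hat` has non-negative entries that sum to one, it matches the secant form, and `final` equals `mu`. None of this depends on how good the prediction is; it follows from the convex-combination form alone.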
With this parametrisation in hand, the rest follows. Because trajectories are continuous (we're moving continuously toward the simplex, not hopping between discrete states), all the self-distillation machinery from the last section applies directly. The endpoints are simplex-valued by construction, and the model learns a solution operator that can jump across time in one or a few steps. The training loop looks like this:
for x_1 in data:                                      # clean tokens, one-hot on Δ^{K-1}
    x_0 = sample_noise_distribution(...)              # noise "sequence"
    shape = (x_1.shape[0],) + (1,) * (x_1.dim() - 1)  # one time per sample, broadcastable
    u, w = torch.rand(2, *shape, device=x_1.device)
    s, t = torch.minimum(u, w), torch.maximum(u, w)   # source and target time, s <= t <= 1
    x_s = (1 - s) * x_0 + s * x_1                     # interpolant at s

    # endpoint parametrization: model outputs μ ∈ Δ^{K-1} (softmax head)
    def flow_map(t_):
        mu_st = model(x_s, s, t_)                     # μ_{s,t}(x_s)
        v_st = (mu_st - x_s) / (1 - s)                # velocity from endpoint
        return x_s + (t_ - s) * v_st                  # flow map (secant step)

    # flow: cross-entropy on endpoint at (t, t)
    x_t = (1 - t) * x_0 + t * x_1                     # interpolant at t
    mu_tt = model(x_t, t, t)                          # endpoint prediction
    loss_fm = cross_entropy(mu_tt, x_1)               # CE instead of MSE

    # flow map: Lagrangian self-distillation
    # jvp returns X_{s,t}(x_s) and its per-coordinate time derivative (PyTorch 2.x)
    X_st, dt_X = torch.func.jvp(flow_map, (t,), (torch.ones_like(t),))
    mu_tt_at_X = model(X_st, t, t)                    # endpoint at landing point
    v_tt_at_X = (mu_tt_at_X - X_st) / (1 - t)         # velocity from endpoint
    loss_lsd = ((dt_X - v_tt_at_X) ** 2).mean()

    loss = loss_fm + loss_lsd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
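The `loss_lsd` term in the loop above penalises any mismatch between the time derivative of the flow map and the velocity at the landing point. As a sanity check on what that condition means, the sketch below verifies it on a toy case where the exact answer is known: the data is a single one-hot token, so the true endpoint is that token for every input. The helper names are hypothetical, and a central finite difference stands in for the automatic differentiation used in training.

```python
def flow_map(x_s, mu, s, t):
    """Endpoint-parametrised flow map with a fixed endpoint mu."""
    w = (t - s) / (1.0 - s)
    return [(1.0 - w) * x_k + w * m_k for x_k, m_k in zip(x_s, mu)]


def velocity(x_t, mu, t):
    """Instantaneous velocity (mu - x_t) / (1 - t) at time t."""
    return [(m_k - x_k) / (1.0 - t) for x_k, m_k in zip(x_t, mu)]


mu = [0.0, 1.0, 0.0]    # toy data: a single one-hot token
x_s = [0.7, -0.2, 0.4]  # arbitrary starting state at time s
s, t, h = 0.2, 0.6, 1e-5

# Central finite difference for d/dt X_{s,t}(x_s).
plus = flow_map(x_s, mu, s, t + h)
minus = flow_map(x_s, mu, s, t - h)
dt_X = [(p - m) / (2 * h) for p, m in zip(plus, minus)]

# Velocity evaluated at the landing point X_{s,t}(x_s).
v_at_X = velocity(flow_map(x_s, mu, s, t), mu, t)

# Lagrangian residual: zero (up to rounding) for an exact flow map.
residual = max(abs(a - b) for a, b in zip(dt_X, v_at_X))
```

For this exact flow map the residual is zero up to rounding, and both sides equal $(\mu - x_s)/(1 - s)$. A learned map starts with a non-zero residual, and the self-distillation loss is what pushes it toward zero.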
Two things fall out for free:
Sampling becomes a handful of flow-map evaluations instead of many velocity integrations: empirically, one- or two-step generation is competitive with many-step flows, the same 10–20× speedup consistency models brought to images, now available for discrete data. And because trajectories live in a continuous space, the full continuous-domain toolkit (classifier guidance, reward tilting, SMC-style reweighting) transfers directly.
Distillation is the deeper point here: collapsing many steps into one is something flows can do and discrete diffusion fundamentally can't. A flow map learns to answer "where will this trajectory land?", a question that only makes sense when the trajectory is a smooth function of time. dLLMs don't have that: their trajectory is a sequence of discrete commitments, with no continuous parameter to shortcut. You can't distil a staircase. That's why distilling dLLMs is so hard: every approach is fighting a structural property of the model, not an engineering detail.
So where does that leave us? We started with a problem: autoregressive LLMs are sequential, and the diffusion patch didn't fix it. Flow matching pointed at a genuine alternative, parallel and globally refinable, but naive flow matching breaks on discrete data because MSE isn't the right loss for tokens. The variational view fixed that by reframing the target as a posterior on the simplex; the flow-map view fixed the remaining speed problem by teaching the model to jump across time instead of integrating. Categorical Flow Maps are what you get when you put both fixes together: a principled, fast, controllable way to generate language that isn't just autoregression in disguise.
Where this goes next, and related work
Where this goes next
We've laid out the conceptual skeleton, but turning it into a working language-modelling recipe at scale is very much a live research program. The biggest open thread is full pretraining: everything above assumes we can train a flow-based language model to competitive quality, and nobody has yet demonstrated a pretraining run that matches a well-tuned AR baseline at the same scale. Whoever gets this to work first will have done something real, and the path is not obvious: it will involve careful choices about tokenisation, loss weighting across $t$, the noise schedule, and the interpolant itself. Post-training is equally wide open. AR models have a mature stack (SFT, DPO, PPO, GRPO); the flow-based analogue isn't settled. Our system has two learnable objects, a velocity and a flow map, and naively updating one without the other breaks their consistency. Does RLHF act on the velocity, with the flow map re-distilled afterward? On the flow map directly? Jointly? Each choice has different stability and credit-assignment tradeoffs.
Underneath the training story sits a pile of architectural choices. The interpolant is a design knob: straight-line couplings are the default, but other paths from noise to data change the flow itself, and may condition the learning problem better. The prior at $t=0$ is another: we default to Gaussian noise in the one-hot space, and so far moving the flow into an embedding space hasn't shown a clear benefit, though this needs more investigation. Transformers have been shaped by a decade of AR optimisation; how they need to be adapted for text diffusion (what a Text-DiT actually looks like) is a wide-open question. And there's the two-stage-versus-joint question: train a velocity field and then distil it into a flow map, or train both jointly. We used self-distillation in Categorical Flow Maps and it worked well, but more recent work suggests the two-stage route (train the flow, then distil) may be cleaner and give better final quality.
Conditional generation, the setting most real language tasks live in, extends from VFM for free: conditioning on any variable $y$ just enters the posterior,
$$v_t(x_t \mid y) \;=\; \dfrac{\mathbb{E}[\, x_1 \mid x_t, y \,] - x_t}{1 - t},$$
and the whole framework (velocity, denoiser posterior, flow map) goes through unchanged. Making this work at LLM scale, with long prompts, RAG-style context, and tool use, is its own research program, but the theoretical scaffolding is already there.
It's also worth saying that language isn't the only payoff, or even the nearest-term one. The same machinery applies anywhere data is discrete and structured, and the cleanest example is graph and molecule generation: nodes and edges are categorical, and the whole object has to be generated jointly rather than one atom at a time. VFM handles this naturally, since each coordinate of the denoiser is a posterior over atom and bond types, and Categorical Flow Maps let you generate molecules in one or a few steps, with controllable guidance toward properties like binding affinity, solubility, or synthesisability. If anything, molecular design is where flow-based generation has a clearer short-term payoff than language: one-step generation with test-time control is both feasible and scientifically useful there today.
Final words
If there's one thing to take away from this post: flow-based language generation becomes plausible the moment you predict the right endpoint object in the right geometry. The right endpoint is a categorical distribution on the simplex, not a vector in $\mathbb{R}^K$. The right loss is cross-entropy, which is what KL-to-the-posterior collapses to. And the right inference-time object is a flow map, one that jumps across time in one or a few steps, with the simplex geometry baked in by construction. Everything else (variational flow matching, CatFlow, Categorical Flow Maps) is, in some sense, just paying careful attention to that one principle in the three places it matters: how you define the target, how you train, and how you sample.
And we're only at the beginning. The recipe is in place (parallel, controllable, distillable language generation that isn't secretly autoregressive), but the engineering isn't. We still don't have a frontier-scale flow-based LLM pretrain. We don't have a clean post-training story. We don't know the right architectural priors for parallel refinement. Each of those is an open question where a handful of good ideas could move things, and if you want to be nerd-sniped by any of them, now is the time.
If you want to see where this goes (the full VFM write-up, new applications, and the open questions above turning, hopefully, into answers), follow along on X. A lot is coming.
– Floor
References
- Variational Flow Matching for Graph Generation. NeurIPS 2024.
- Categorical Flow Maps. arXiv, 2026.
- Continuous Diffusion for Categorical Data. arXiv, 2022.
- Flow Map Language Models: One-step Language Modeling via Continuous Denoising. arXiv, 2026.
- Discrete Flow Maps. arXiv, 2026.
- Flow Map Matching. arXiv, 2024.
- Flow Matching for Generative Modeling. ICLR 2023.
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. ICLR 2023.
- Building Normalizing Flows with Stochastic Interpolants. ICLR 2023.