On Feb 27, @InceptionAILabs introduced Mercury to the world, describing it as a diffusion large language model.
"We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation." - @InceptionAILabs
What is Diffusion?
Similar to image generation in Stable Diffusion, the model uses a parallel approach: instead of emitting tokens one at a time, it progressively refines random noise (a fully masked sequence) into a coherent, unmasked sequence.
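To make the coarse-to-fine idea concrete, here is a minimal toy sketch in Python (my own illustration, not Mercury's or LLaDA's actual code; `toy_predictor`, the vocabulary, and the confidence scores are all made up): generation starts from a fully masked sequence, the model predicts every masked position in parallel, and only the most confident predictions are committed at each step.

```python
import random

MASK = "<M>"
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_predictor(tokens):
    """Stand-in for the dLLM: for every masked position, return a
    (predicted_token, confidence) pair. A real model would produce this
    with one parallel forward pass over the whole sequence."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_generate(length=8, steps=4):
    # Reverse (denoising) process: start from pure "noise", i.e. all masks.
    tokens = [MASK] * length
    per_step = length // steps
    for _ in range(steps):
        preds = toy_predictor(tokens)
        # Coarse-to-fine: commit only the most confident predictions this step;
        # the remaining positions stay masked and are refined in later steps.
        keep = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:per_step]
        for i, (tok, _conf) in keep:
            tokens[i] = tok
    return tokens

print(diffusion_generate())
```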
What's the difference?
- LLaDA (Large Language Diffusion with mAsking) starts generation from noise and refines it over multiple steps, denoising the entire sequence at once, while ARMs (Autoregressive Models) generate a sequence token by token, from left to right
- LLaDA is trained with a vanilla Transformer instead of a causal Transformer (like GPT); see the attention-mask sketch after this list
- The vanilla Transformer has two parts, an encoder and a decoder, which map onto the two diffusion processes:
- Encoder, or forward data-masking process: the original sequence (clean data) is progressively corrupted by masking tokens over a series of steps
- Decoder, or reverse denoising process: the model iteratively predicts the original sequence, going from noisy (masked) content back to unmasked tokens
- The objective of this whole training process is to minimize the difference between the predicted denoised sequence and the original sequence (see the training-loss sketch after this list)
- LLaDA has stronger reversal reasoning: its forward and reversal reasoning scores are approximately equal[^1]. In absolute numbers, though, the reversal score is only somewhat better than the ARMs', and the forward score is quite a bit lower than the ARMs'. Hopefully this improves in the future
- LLaDA seems difficult to fine-tune or apply reinforcement learning to because of its sensitivity to hyperparameters
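Regarding the vanilla vs. causal Transformer point above, the practical difference is the attention mask. Here is a short PyTorch sketch (my own illustration, not code from either paper):

```python
import torch

seq_len = 5

# Causal (GPT-style) attention mask: position i may only attend to j <= i,
# which is what forces strict left-to-right, token-by-token generation.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# "Vanilla" (bidirectional) attention, as described for LLaDA's mask predictor:
# every position may attend to every other position, so all masked tokens
# can be predicted in parallel from full-sequence context.
full_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(full_mask.int())
```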
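And for the forward-masking / reverse-denoising training objective, here is a toy sketch of the idea (again my own simplification: `forward_mask`, `predict_probs`, and the 1/t weighting are illustrative of a LLaDA-style masked-diffusion objective, not the paper's exact formulation): corrupt the clean sequence with a random masking ratio, have the model predict all positions in parallel, and compute cross-entropy only on the masked tokens.

```python
import math
import random

MASK = "<M>"

def forward_mask(x0):
    """Forward process: sample a masking ratio t ~ U(0, 1] and independently
    replace each token of the clean sequence x0 with <M> with probability t."""
    t = random.uniform(0.001, 1.0)
    xt = [MASK if random.random() < t else tok for tok in x0]
    return xt, t

def training_loss(x0, predict_probs):
    """Reverse-process training objective (sketch): cross-entropy on the masked
    positions only, weighted by 1/t. `predict_probs` is a stand-in for the model:
    given the corrupted sequence, it returns the probability assigned to the
    true token at each position."""
    xt, t = forward_mask(x0)
    probs = predict_probs(xt)                       # parallel prediction of all positions
    masked = [i for i, tok in enumerate(xt) if tok == MASK]
    if not masked:
        return 0.0
    ce = -sum(math.log(probs[i]) for i in masked)   # cross-entropy on masked tokens only
    return ce / t                                    # 1/t weighting of the masking ratio

# Toy usage with a fake model that is 80% confident in the right answer everywhere.
x0 = ["the", "cat", "sat", "on", "the", "mat"]
print(training_loss(x0, lambda xt: [0.8] * len(xt)))
```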
References
- Andrej Karpathy tweet
- Large Language Diffusion Models project page
- Large Language Diffusion Models paper (arXiv)
- Large Language Diffusion Models paper (Hugging Face)
Footnotes
[^1]: See "Table 3. Comparison in the Poem Completion Task," https://arxiv.org/pdf/2502.09992