On Feb 27, @InceptionAILabs introduced Mercury to the world, describing it as a diffusion large language model.
"We are excited to introduce Mercury, the first commercial-grade diffusion large language model (dLLM)! dLLMs push the frontier of intelligence and speed with parallel, coarse-to-fine text generation." - @InceptionAILabs
What is Diffusion?
Similar to image generation in Stable Diffusion, the model uses a parallel approach: instead of emitting tokens one at a time, it progressively refines random noise (a fully masked sequence) into a coherent, unmasked sequence.
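To make the coarse-to-fine idea concrete, here is a minimal toy sketch in Python (my own illustration, not Mercury's or LLaDA's actual code; `toy_predictor`, the vocabulary, and the confidence scores are all made up): generation starts from a fully masked sequence, the model predicts every masked position in parallel, and only the most confident predictions are committed at each step.

```python
import random

MASK = "<M>"
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_predictor(tokens):
    """Stand-in for the dLLM: for every masked position, return a
    (predicted_token, confidence) pair. A real model would produce this
    with one parallel forward pass over the whole sequence."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_generate(length=8, steps=4):
    # Reverse (denoising) process: start from pure "noise", i.e. all masks.
    tokens = [MASK] * length
    per_step = length // steps
    for _ in range(steps):
        preds = toy_predictor(tokens)
        # Coarse-to-fine: commit only the most confident predictions this step;
        # the remaining positions stay masked and are refined in later steps.
        keep = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:per_step]
        for i, (tok, _conf) in keep:
            tokens[i] = tok
    return tokens

print(diffusion_generate())
```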
What's the difference?
- LLaDA (Large Language Diffusion with mAsking) starts generation from noise and refines it over multiple steps, denoising the entire sequence at once, while ARMs (Autoregressive Models) generate a sequence token by token, from left to right
- LLaDA is trained with a vanilla Transformer instead of a causal Transformer (like GPT); see the attention-mask sketch after this list
- The vanilla Transformer has two parts, an encoder and a decoder, which map onto the two diffusion processes:
- Encoder, or forward data-masking process: the original sequence (clean data) is progressively corrupted by masking tokens over a series of steps
- Decoder, or reverse denoising process: the model iteratively predicts the original sequence, going from noisy (masked) content back to unmasked tokens
- The objective of this whole training process is to minimize the difference between the predicted denoised sequence and the original sequence (see the training-loss sketch after this list)
- LLaDA has stronger reversal reasoning: its forward and reversal reasoning scores are approximately equal[^1]. In absolute numbers, though, the reversal score is only somewhat better than the ARMs', and the forward score is quite a bit lower than the ARMs'. Hopefully this improves in the future
- LLaDA seems difficult to fine-tune or apply reinforcement learning to because of its sensitivity to hyperparameters
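Regarding the vanilla vs. causal Transformer point above, the practical difference is the attention mask. Here is a short PyTorch sketch (my own illustration, not code from either paper):

```python
import torch

seq_len = 5

# Causal (GPT-style) attention mask: position i may only attend to j <= i,
# which is what forces strict left-to-right, token-by-token generation.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# "Vanilla" (bidirectional) attention, as described for LLaDA's mask predictor:
# every position may attend to every other position, so all masked tokens
# can be predicted in parallel from full-sequence context.
full_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())
print(full_mask.int())
```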
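And for the forward-masking / reverse-denoising training objective, here is a toy sketch of the idea (again my own simplification: `forward_mask`, `predict_probs`, and the 1/t weighting are illustrative of a LLaDA-style masked-diffusion objective, not the paper's exact formulation): corrupt the clean sequence with a random masking ratio, have the model predict all positions in parallel, and compute cross-entropy only on the masked tokens.

```python
import math
import random

MASK = "<M>"

def forward_mask(x0):
    """Forward process: sample a masking ratio t ~ U(0, 1] and independently
    replace each token of the clean sequence x0 with <M> with probability t."""
    t = random.uniform(0.001, 1.0)
    xt = [MASK if random.random() < t else tok for tok in x0]
    return xt, t

def training_loss(x0, predict_probs):
    """Reverse-process training objective (sketch): cross-entropy on the masked
    positions only, weighted by 1/t. `predict_probs` is a stand-in for the model:
    given the corrupted sequence, it returns the probability assigned to the
    true token at each position."""
    xt, t = forward_mask(x0)
    probs = predict_probs(xt)                       # parallel prediction of all positions
    masked = [i for i, tok in enumerate(xt) if tok == MASK]
    if not masked:
        return 0.0
    ce = -sum(math.log(probs[i]) for i in masked)   # cross-entropy on masked tokens only
    return ce / t                                    # 1/t weighting of the masking ratio

# Toy usage with a fake model that is 80% confident in the right answer everywhere.
x0 = ["the", "cat", "sat", "on", "the", "mat"]
print(training_loss(x0, lambda xt: [0.8] * len(xt)))
```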
References
- Andrej Karpathy tweet
- Large Language Diffusion Models project page
- Large Language Diffusion Models paper (arXiv)
- Large Language Diffusion Models paper (Hugging Face)
Footnotes
[^1]: See "Table 3. Comparison in the Poem Completion Task," https://arxiv.org/pdf/2502.09992