By DeepSeek, ChatGPT, Gemini, Claude with W.H.L.
Embedded Language Flows (ELF)
Embedded Language Flows (ELF) is a family of continuous diffusion language models introduced in May 2026 by researchers at the Massachusetts Institute of Technology (MIT), including Keya Hu and Kaiming He. [Hu et al., arXiv:2605.10938]
ELF performs text generation within a fully continuous embedding space, converting representations back to discrete tokens only at the final sampling step. The architecture was designed to preserve continuity throughout the denoising trajectory, in contrast to earlier continuous diffusion language models that repeatedly alternated between continuous embeddings and discrete token representations during generation.
The ELF architecture was released in three model variants — ELF-B (105 million parameters), ELF-M (342 million parameters), and ELF-L (652 million parameters) — and was implemented using JAX. According to the authors, ELF achieved competitive benchmark results on unconditional text generation, machine translation, and summarization tasks while using fewer sampling steps and less training data than several previously reported diffusion language models. [Hu et al., arXiv:2605.10938]
Background
Language modeling has historically been dominated by autoregressive architectures that generate text sequentially, one token at a time. Diffusion language models (DLMs) provide an alternative framework in which a model progressively denoises a corrupted representation into coherent text. Because denoising can operate over entire sequences simultaneously, diffusion-based approaches have been investigated for their potential advantages in parallel generation and infilling tasks.
Early diffusion language models generally fell into two categories:
- Discrete DLMs, such as MDLM and D3PM, which define the diffusion process directly over discrete token spaces.
- Continuous DLMs, such as Diffusion-LM and Plaid, which map tokens into continuous vector representations and perform denoising in embedding space.
Although continuous DLMs shared conceptual similarities with image diffusion systems, they often underperformed discrete approaches empirically because repeated discretization interrupted the continuity of the learned embedding-space trajectory during generation. [Hu et al., arXiv:2605.10938]
The ELF project also represented one of the first major language-modeling efforts involving computer vision researcher Kaiming He, whose earlier work included the development of ResNet architectures and masked image-modeling systems.
Method
ELF differs from earlier continuous diffusion language models by maintaining continuous embedding representations throughout nearly the entire denoising process. The method consists of three primary stages.
Encoding
Input text is converted into high-dimensional continuous vectors using a pretrained T5 encoder. These embeddings encode lexical and contextual information within a shared semantic space. [Hu et al., arXiv:2605.10938]
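The released implementation is not reproduced here, but obtaining frozen encoder embeddings of this kind can be sketched with the JAX/Flax classes in the Hugging Face `transformers` library. The `t5-base` checkpoint and the sequence length below are illustrative assumptions, not settings taken from the ELF paper.

```python
# Sketch: embedding text with a pretrained T5 encoder in JAX/Flax.
# The checkpoint and max_length are illustrative, not values from the ELF paper.
from transformers import AutoTokenizer, FlaxT5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = FlaxT5EncoderModel.from_pretrained("t5-base")

texts = ["Diffusion models denoise entire sequences in parallel."]
batch = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=64, return_tensors="np")

# Continuous embeddings of shape (batch, seq_len, d_model); these vectors,
# not discrete tokens, are what the denoising model operates on.
embeddings = encoder(input_ids=batch["input_ids"],
                     attention_mask=batch["attention_mask"]).last_hidden_state
print(embeddings.shape)  # (1, 64, 768) for t5-base
```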
Flow-matching denoising
Starting from Gaussian noise, ELF applies a continuous-time flow-matching process that learns a velocity field guiding noisy embeddings toward coherent text representations.
Unlike autoregressive systems that predict tokens sequentially, ELF treats text generation as a continuous trajectory through embedding space. Flow matching replaces stochastic diffusion trajectories with directly learned continuous vector fields, simplifying training and sampling relative to some earlier diffusion formulations.
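A minimal flow-matching training step over such embeddings can be sketched as follows. The linear interpolation path and the single-layer stand-in for the velocity network are assumptions made for illustration; the paper's exact parameterization is not reproduced here.

```python
# Sketch of one flow-matching training step over embeddings (illustrative only).
import jax
import jax.numpy as jnp

def toy_velocity_net(params, x_t, t):
    # Stand-in for ELF's Transformer: a single linear layer over embeddings,
    # with the timestep broadcast in as an extra feature.
    t_feat = jnp.broadcast_to(t[:, None, None], x_t.shape[:2] + (1,))
    h = jnp.concatenate([x_t, t_feat], axis=-1)
    return h @ params["w"] + params["b"]

def flow_matching_loss(params, key, x1):
    """x1: clean text embeddings of shape (batch, seq_len, dim)."""
    k_noise, k_time = jax.random.split(key)
    x0 = jax.random.normal(k_noise, x1.shape)        # Gaussian noise endpoint
    t = jax.random.uniform(k_time, (x1.shape[0],))   # random times in [0, 1]
    # Interpolate between noise and data; the target velocity is x1 - x0.
    x_t = (1.0 - t)[:, None, None] * x0 + t[:, None, None] * x1
    v_pred = toy_velocity_net(params, x_t, t)
    return jnp.mean((v_pred - (x1 - x0)) ** 2)       # mean-squared-error loss

key = jax.random.PRNGKey(0)
dim = 16
params = {"w": jnp.zeros((dim + 1, dim)), "b": jnp.zeros((dim,))}
x1 = jax.random.normal(key, (2, 8, dim))             # placeholder "embeddings"
loss, grads = jax.value_and_grad(flow_matching_loss)(params, key, x1)
print(loss)
```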
The model uses a shared-weight Transformer trained in two modes:
- Denoising mode (approximately 80% of training), optimized using mean-squared-error objectives to learn global embedding-space trajectories.
- Decoding mode (approximately 20% of training), optimized using cross-entropy objectives to improve mapping from continuous vectors back to discrete vocabulary tokens. [Hu et al., arXiv:2605.10938]
This dual-objective strategy was intended to balance continuous semantic modeling with accurate token reconstruction.
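The reported 80/20 split can be pictured as a single network trained under two interchangeable loss heads. The sketch below illustrates that idea only; the layer shapes, the per-step mode selection, and the toy trunk are assumptions rather than the paper's recipe.

```python
# Sketch of the dual objective: one shared trunk, an MSE denoising head used
# on most steps and a cross-entropy decoding head used on the rest.
import jax
import jax.numpy as jnp

def shared_trunk(params, x):
    # Stand-in for ELF's shared-weight Transformer.
    return jnp.tanh(x @ params["w_trunk"])

def denoising_loss(params, x_t, v_target):
    # Denoising mode (~80% of training): regress the velocity field with MSE.
    v_pred = shared_trunk(params, x_t) @ params["w_vel"]
    return jnp.mean((v_pred - v_target) ** 2)

def decoding_loss(params, x_clean, token_ids):
    # Decoding mode (~20% of training): map embeddings to vocabulary logits.
    logits = shared_trunk(params, x_clean) @ params["w_vocab"]
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    nll = -jnp.take_along_axis(log_probs, token_ids[..., None], axis=-1)
    return nll.mean()

dim, hidden, vocab = 16, 32, 100
params = {
    "w_trunk": jnp.zeros((dim, hidden)),
    "w_vel": jnp.zeros((hidden, dim)),
    "w_vocab": jnp.zeros((hidden, vocab)),
}
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (2, 8, dim))
token_ids = jnp.zeros((2, 8), dtype=jnp.int32)

# Draw the mode per step with probability 0.2 for decoding, mirroring the split.
if jax.random.uniform(key, ()) < 0.2:
    loss = decoding_loss(params, x, token_ids)
else:
    loss = denoising_loss(params, x, jnp.zeros_like(x))
print(loss)
```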
Final discretization
Only at the final sampling step does ELF project denoised embeddings into token logits for discrete token sampling.
Because the denoising process remains fully continuous until the final stage, ELF can adopt techniques originally associated with image-domain diffusion systems with relatively few architectural modifications. These include classifier-free guidance, which combines conditional and unconditional velocity estimates during denoising, and self-conditioning, in which earlier model predictions are reused as auxiliary conditioning signals during later denoising steps. [Hu et al., arXiv:2605.10938]
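A simplified sampler combining Euler integration of the velocity field, classifier-free guidance, and a single final discretization step might look like the sketch below. The guidance scale, step count, and toy velocity function are illustrative assumptions; self-conditioning is omitted for brevity.

```python
# Sketch: Euler sampling with classifier-free guidance, discretizing to tokens
# only after the final step (illustrative only; not the released implementation).
import jax
import jax.numpy as jnp

def velocity(x_t, t, cond):
    # Stand-in for the trained velocity network; cond is None on the
    # unconditional branch of classifier-free guidance.
    shift = 0.0 if cond is None else jnp.mean(cond)
    return -x_t + shift   # toy field that contracts toward a conditioning point

def sample(key, cond, w_vocab, num_steps=32, guidance_scale=2.0,
           seq_len=8, dim=16):
    x = jax.random.normal(key, (seq_len, dim))   # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = velocity(x, t, cond)
        v_uncond = velocity(x, t, None)
        # Classifier-free guidance: extrapolate the conditional estimate
        # away from the unconditional one.
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v                           # Euler update in embedding space
    # Final discretization: project embeddings to vocabulary logits once,
    # at the very end, and pick the most likely token per position.
    logits = x @ w_vocab
    return jnp.argmax(logits, axis=-1)

key = jax.random.PRNGKey(0)
cond = jax.random.normal(key, (4, 16))       # e.g. encoded source-text embeddings
w_vocab = jax.random.normal(key, (16, 100))  # embedding-to-vocabulary projection
print(sample(key, cond, w_vocab))            # token ids, one per position
```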
Training
The ELF models were trained on approximately 45 billion tokens. [Hu et al., arXiv:2605.10938] According to the authors, this represented substantially fewer training tokens than those reported for several contemporary discrete diffusion language models.
Training and inference were implemented in JAX, and the architecture was designed to reuse optimization strategies and training techniques previously developed for continuous image diffusion systems.
Experiments
The ELF paper evaluated the architecture on both unconditional and conditional language-generation tasks.
Unconditional generation
For unconditional text generation, the models were trained on the OpenWebText corpus. Generation quality was evaluated using generative perplexity (Gen. PPL) computed with GPT-2 Large, while output diversity was measured using unigram entropy metrics. [Hu et al., arXiv:2605.10938]
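Generative perplexity of this kind is usually obtained by scoring the sampled text with the separate pretrained evaluator; a sketch using the Flax GPT-2 Large checkpoint from Hugging Face `transformers` is shown below. It illustrates the metric only and does not reproduce the paper's exact evaluation protocol.

```python
# Sketch: scoring generated text with GPT-2 Large to estimate generative
# perplexity (illustrative only; not the paper's evaluation code).
import jax
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxGPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
scorer = FlaxGPT2LMHeadModel.from_pretrained("gpt2-large")

def generative_perplexity(text):
    ids = tokenizer(text, return_tensors="np")["input_ids"]  # (1, seq_len)
    logits = scorer(input_ids=ids).logits                    # (1, seq_len, vocab)
    # Logits at position i predict token i + 1, so shift by one.
    log_probs = jax.nn.log_softmax(logits[0, :-1], axis=-1)
    targets = ids[0, 1:]
    token_ll = jnp.take_along_axis(log_probs, targets[:, None], axis=-1)
    return float(jnp.exp(-token_ll.mean()))

print(generative_perplexity("A sample generated by the diffusion model."))
```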
With 32 sampling steps, ELF-B achieved a reported generative perplexity of 24.1 on OpenWebText. The authors reported that this result matched or exceeded several previously published diffusion-language-model baselines under comparable evaluation settings. [Hu et al., arXiv:2605.10938]
Conditional generation
ELF-B was also fine-tuned for conditional generation tasks, including machine translation on the WMT14 German-to-English dataset and summarization on the XSum dataset.
Reported benchmark results
| Task | Dataset | Metric | Score |
|---|---|---|---|
| Unconditional generation | OpenWebText | Generative perplexity (lower is better) | 24.1 |
| Machine translation | WMT14 De→En | BLEU (higher is better) | 26.4 |
| Summarization | XSum | ROUGE-1 / ROUGE-2 / ROUGE-L (higher is better) | 36.0 / 12.2 / 27.8 |
Reception
Because ELF was initially released as an arXiv preprint, independent coverage and peer-reviewed evaluation remained limited shortly after publication. Early technical discussion focused primarily on the architecture’s continuous embedding-space formulation and its adaptation of flow-matching techniques previously associated with image diffusion systems.
Researchers also discussed ELF as part of a broader trend toward applying continuous diffusion methodologies to natural language generation.
See also
- Diffusion model
- Diffusion-LM
- Flow matching
- Language model
- ResNet
- T5
References
Hu, K.; Qiu, L.; Lu, Y.; Zhao, H.; Li, T.; Kim, Y.; Andreas, J.; He, K. (2026). “ELF: Embedded Language Flows.” arXiv:2605.10938.
External links
Official JAX implementation: https://github.com/lillian039/ELF
ELF paper on arXiv: https://arxiv.org/abs/2605.10938
Date of Current Version: 05.13.2026
Initial Draft: DeepSeek-V4
Peer Reviews: GPT-5.4, Gemini 3 Thinking, Claude Sonnet 4.6 Adaptive Thinking
Revisions: GPT-5.4, Claude Sonnet 4.6 Adaptive Thinking
Final Version for Publication: GPT-5.4
