---
title: "ELF: Embedded Language Flows"
created: 2026-05-13
updated: 2026-05-13
type: raw-paper
source: https://arxiv.org/abs/2605.10938
tags: [diffusion-language-model, flow-matching, continuous-embeddings, language-generation]
---
# ELF: Embedded Language Flows
**arXiv:** 2605.10938
**Authors:** Keya Hu*, Linlu Qiu*, Yiyang Lu, Hanhong Zhao, Tianhong Li, Yoon Kim, Jacob Andreas, Kaiming He (MIT; *equal contribution)
**Date:** 2026-05-11
**Categories:** cs.CL, cs.AI, cs.LG
**Code:** https://github.com/lillian039/ELF
## Abstract
Diffusion and flow-based models have become the de facto approaches for generating continuous data. Their success has attracted growing interest in applying them to language modeling. Unlike their image-domain counterparts, today's leading diffusion language models (DLMs) primarily operate over discrete tokens. In this paper, we show that continuous DLMs can be made effective with minimal adaptation to the discrete domain. We propose Embedded Language Flows (ELF), a class of diffusion models in continuous embedding space based on continuous-time Flow Matching. Unlike existing DLMs, ELF predominantly stays within the continuous embedding space until the final time step, where it maps to discrete tokens using a shared-weight network. This formulation makes it straightforward to adapt established techniques from image-domain diffusion models, e.g., classifier-free guidance (CFG). Experiments show that ELF substantially outperforms leading discrete and continuous DLMs, achieving better generation quality with fewer sampling steps.
## Key Claims
1. Continuous DLMs can match or exceed discrete DLMs when designed properly; the performance gap stems from algorithmic choices, not from the inherent discreteness of language.
2. **Shared-weight discretization**: A single network handles both denoising (MSE loss, t<1) and decoding (CE loss, t=1) via a binary mode token, eliminating the need for a separate decoder.
3. **x-prediction** parameterization aligns denoising and decoding objectives, enabling effective weight sharing that v-prediction cannot support.
4. **CFG is naturally applicable** to continuous DLMs and significantly improves generation quality; training-time CFG avoids inference overhead.
5. ELF-B (105M) outperforms 170M baselines (MDLM, Duo, FLM, LangFlow) with **10× fewer training tokens** and **fewer sampling steps** (32 vs 1024), without distillation.
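
Claims 3 and 4 can be illustrated with a small sampling sketch. It is not the authors' implementation. It assumes a linear path `z_t = t * x + (1 - t) * noise` with data at `t = 1`, which the source does not spell out, and a hypothetical `predict_x(z, t, conditioned)` callable standing in for the network. Under that assumed path, an x-prediction converts to a velocity as `(x_pred - z_t) / (1 - t)`, and CFG is the usual extrapolation from the unconditional prediction toward the conditional one.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (z_t, t, conditioned) -> predicted clean embedding.
PredictX = Callable[[Sequence[float], float, bool], Vector]


def cfg_combine(x_cond: Sequence[float], x_uncond: Sequence[float], scale: float) -> Vector:
    """Classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(x_cond, x_uncond)]


def x_to_velocity(x_pred: Sequence[float], z_t: Sequence[float], t: float) -> Vector:
    """Velocity implied by an x-prediction under the ASSUMED path
    z_t = t * x + (1 - t) * noise. Requires t < 1."""
    return [(x - z) / (1.0 - t) for x, z in zip(x_pred, z_t)]


def euler_sample(predict_x: PredictX, z0: Sequence[float], num_steps: int, scale: float) -> Vector:
    """Integrate from noise (t = 0) toward data (t = 1) with Euler steps."""
    z = list(z0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # largest t used is 1 - dt, so 1 - t never reaches zero
        x_hat = cfg_combine(
            predict_x(z, t, True),   # conditional prediction
            predict_x(z, t, False),  # unconditional prediction
            scale,
        )
        v = x_to_velocity(x_hat, z, t)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z
```

For a fixed `z_t` and `t`, the velocity is an affine function of the x-prediction, so guiding the x-predictions is equivalent to guiding the velocities. This version calls the network twice per step, which is the inference overhead that training-time CFG (claim 4) is meant to avoid.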
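
One plausible reading of claim 2 is sketched below for a single token position. Only the split between MSE for `t < 1` and cross-entropy at `t = 1`, selected by a binary mode token, comes from the summary above. Everything else is an assumption made for illustration: the names `DENOISE`, `DECODE`, `net`, and `embed_table`, the linear path `z_t = t * x + (1 - t) * noise`, and the choice to obtain logits as dot products against an embedding table. The paper's actual decoding head may differ.

```python
import math
from typing import Callable, List, Sequence

DENOISE, DECODE = 0, 1  # hypothetical values for the binary mode token

# Hypothetical network signature: (z_t, t, mode) -> output vector.
Net = Callable[[Sequence[float], float, int], List[float]]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-softmax of the target logit, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]


def training_loss(
    net: Net,
    x_clean: Sequence[float],
    noise: Sequence[float],
    t: float,
    token_id: int,
    embed_table: Sequence[Sequence[float]],
) -> float:
    """Pick the objective from t: MSE on the x-prediction for t < 1,
    cross-entropy over the vocabulary at t = 1. The same `net` is
    called in both branches; only the mode token changes."""
    z_t = [t * x + (1.0 - t) * e for x, e in zip(x_clean, noise)]  # assumed path
    if t < 1.0:
        return mse(net(z_t, t, DENOISE), x_clean)
    out = net(z_t, t, DECODE)
    # ASSUMPTION: logits are dot products with each row of the embedding table.
    logits = [sum(o * w for o, w in zip(out, row)) for row in embed_table]
    return cross_entropy(logits, token_id)
```

Because both branches produce an output in embedding space, the denoising target (the clean embedding) and the decoding target (the token whose embedding it is) point the shared weights in compatible directions, which is the alignment that claim 3 attributes to x-prediction.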