---
title: "LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels"
authors: ["Lucas Maes", "Quentin Le Lidec", "Damien Scieur", "Yann LeCun", "Randall Balestriero"]
arxiv: "2603.19312v3"
published: "2026-03-13 (updated 2026-06-03)"
categories: [cs.LG, cs.AI]
affiliations: ["Mila & Université de Montréal", "New York University", "Samsung SAIL", "Brown University"]
source: https://arxiv.org/abs/2603.19312
code: linked in paper
---
# LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
> Lucas Maes*, Quentin Le Lidec*, Damien Scieur, Yann LeCun, Randall Balestriero (* equal contribution)
## Abstract (original text)
Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term losses, exponential moving averages, pre-trained encoders, or auxiliary supervision to avoid representation collapse. In this work, we introduce LeWorldModel (LeWM), the first JEPA that trains stably end-to-end from raw pixels using only two loss terms: a next-embedding prediction loss and a regularizer enforcing Gaussian-distributed latent embeddings. This reduces tunable loss hyperparameters from six to one compared to the only existing end-to-end alternative. With ~15M parameters trainable on a single GPU in a few hours, LeWM plans up to 48× faster than foundation-model-based world models while remaining competitive across diverse 2D and 3D control tasks. Beyond control, we show that LeWM's latent space encodes meaningful physical structure through probing of physical quantities. Surprise evaluation confirms that the model reliably detects physically implausible events.
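## Key contributions
1. **The first end-to-end JEPA world model that needs no training heuristics (stop-gradient, EMA, or a pre-trained encoder)**
2. Only **2 loss terms + 1 tunable hyperparameter** λ (versus 6 hyperparameters for PLDM)
3. ~15M parameters; trains on a single GPU in a few hours
4. Plans up to **48×** faster than DINO-WM (~200× fewer tokens)
5. Push-T success rate of **96%** (an 18% improvement over PLDM)
6. The latent space encodes meaningful physical structure; physical quantities can be extracted by probing
7. Surprise evaluation confirms that the model reliably detects physically implausible events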
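## Architecture
### Encoder
- ViT-Tiny (~5M parameters): patch size 14×14, 12 layers, 3 attention heads, hidden dimension 192
- Key design choice: a **BatchNorm** projection head (not LayerNorm, because LN constrains the variance distribution and hinders SIGReg)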
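One way to read the BatchNorm-versus-LayerNorm point: LayerNorm normalizes each sample across its features, so every embedding ends up with (almost) the same squared norm, whereas samples from N(0, I) in 192 dimensions have squared norms that follow a chi-square distribution with standard deviation √(2·192) ≈ 19.6. The sketch below illustrates this on toy data in dependency-free Python. The batch size, the per-sample scales, and the function names are assumptions made for illustration, not details from the paper.

```python
import math
import random

def layer_norm(batch, eps=1e-5):
    """Normalize each sample across its own features (no learned affine)."""
    out = []
    for x in batch:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        out.append([(v - mean) / math.sqrt(var + eps) for v in x])
    return out

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the batch (no learned affine)."""
    n, dim = len(batch), len(batch[0])
    means = [sum(x[j] for x in batch) / n for j in range(dim)]
    variances = [sum((x[j] - means[j]) ** 2 for x in batch) / n for j in range(dim)]
    return [[(x[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
            for x in batch]

def squared_norm_spread(batch):
    """Standard deviation of the per-sample squared norms."""
    norms = [sum(v * v for v in x) for x in batch]
    mean = sum(norms) / len(norms)
    return math.sqrt(sum((v - mean) ** 2 for v in norms) / len(norms))

# Toy batch: 64 samples of dimension 192, each with its own (hypothetical) scale.
rng = random.Random(0)
batch = []
for _ in range(64):
    scale = rng.uniform(0.5, 2.0)
    batch.append([scale * rng.gauss(0.0, 1.0) for _ in range(192)])

ln_spread = squared_norm_spread(layer_norm(batch))  # near 0: norms pinned close to 192
bn_spread = squared_norm_spread(batch_norm(batch))  # clearly positive: norms still vary
print(f"LayerNorm spread: {ln_spread:.4f}, BatchNorm spread: {bn_spread:.4f}")
```

Because LayerNorm pins every sample to nearly the same norm, its output cannot match an isotropic Gaussian exactly, which is consistent with the note that LN makes SIGReg harder to optimize.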
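### Predictor
- Transformer (~10M parameters): 6 layers, 16 attention heads, 10% dropout
- Action conditioning is injected through **AdaLN** (adaptive layer normalization), zero-initialized so that the action's influence grows gradually during training
- A temporal causal mask is used to autoregressively predict the next-frame representation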
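The zero-initialized AdaLN conditioning can be sketched as follows. This assumes the common DiT-style form `(1 + scale(a)) * LN(x) + shift(a)`, where `scale` and `shift` are linear maps of the action `a`; the note does not spell out the exact parameterization, so the `AdaLN` class and its attribute names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over one feature vector (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, a):
    """Compute weight @ a + bias for a list-of-rows weight matrix."""
    return [sum(w * ai for w, ai in zip(row, a)) + b for row, b in zip(weight, bias)]

class AdaLN:
    """Action-conditioned LayerNorm with zero-initialized modulation (assumed form)."""

    def __init__(self, dim, action_dim):
        # Zero weights and biases: scale(a) = 0 and shift(a) = 0 for every action.
        self.w_scale = [[0.0] * action_dim for _ in range(dim)]
        self.b_scale = [0.0] * dim
        self.w_shift = [[0.0] * action_dim for _ in range(dim)]
        self.b_shift = [0.0] * dim

    def __call__(self, x, action):
        scale = linear(self.w_scale, self.b_scale, action)
        shift = linear(self.w_shift, self.b_shift, action)
        return [(1.0 + s) * h + t for h, s, t in zip(layer_norm(x), scale, shift)]

x = [0.3, -1.2, 0.8, 2.0]
action = [1.0, -0.5]
block = AdaLN(dim=4, action_dim=2)

at_init = block(x, action)      # identical to plain LayerNorm at initialization
block.w_shift[0][0] = 0.5       # stand-in for a weight changed by a gradient step
after_update = block(x, action) # the action now shifts the first feature
```

At initialization the block ignores the action entirely, so the predictor starts out as an unconditioned Transformer and the action's effect appears only as the modulation weights move away from zero.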
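### Training objective
$$\mathcal{L} = \|\hat{Z}_{t+1} - Z_{t+1}\|^2 + \lambda \cdot \mathrm{SIGReg}(Z)$$
- No stop-gradient (unlike I-JEPA/V-JEPA)
- No EMA (unlike BYOL/DINO)
- No pre-trained encoder (unlike DINO-WM)
- SIGReg relies on the Cramér-Wold theorem to force the embeddings to match an isotropic Gaussian distribution N(0, I)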
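The two-term objective can be illustrated with a minimal, dependency-free sketch. The Cramér-Wold theorem says a distribution is determined by its one-dimensional projections, so the regularizer below projects embeddings onto random unit directions and penalizes the gap between each projection's empirical characteristic function and that of N(0, 1). This is a simplified stand-in, not the paper's SIGReg implementation: the test statistic, the grid `T_GRID`, the number of slices, and the value `lam=0.1` are all assumptions.

```python
import math
import random

T_GRID = [0.5, 1.0, 1.5, 2.0]  # hypothetical evaluation points for the characteristic function

def prediction_loss(z_pred, z_true):
    """Batch mean of the squared L2 distance between predicted and target embeddings."""
    return sum(sum((p - t) ** 2 for p, t in zip(zp, zt))
               for zp, zt in zip(z_pred, z_true)) / len(z_pred)

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def gaussianity_penalty_1d(samples):
    """Squared gap between the empirical characteristic function and exp(-t^2 / 2)."""
    n = len(samples)
    total = 0.0
    for t in T_GRID:
        real = sum(math.cos(t * s) for s in samples) / n
        imag = sum(math.sin(t * s) for s in samples) / n
        total += (real - math.exp(-0.5 * t * t)) ** 2 + imag ** 2
    return total / len(T_GRID)

def sliced_gaussian_regularizer(z, num_slices=16, seed=0):
    """Average the 1D penalty over random projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(z[0])
    total = 0.0
    for _ in range(num_slices):
        u = random_unit_vector(dim, rng)
        projections = [sum(a * b for a, b in zip(row, u)) for row in z]
        total += gaussianity_penalty_1d(projections)
    return total / num_slices

def lewm_style_loss(z_pred, z_true, z_all, lam=0.1):
    """Prediction loss plus lam times the Gaussianity regularizer."""
    return prediction_loss(z_pred, z_true) + lam * sliced_gaussian_regularizer(z_all)

rng = random.Random(1)
gaussian_batch = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]
collapsed_batch = [[0.0] * 8 for _ in range(256)]  # every embedding identical

reg_gaussian = sliced_gaussian_regularizer(gaussian_batch)    # small
reg_collapsed = sliced_gaussian_regularizer(collapsed_batch)  # much larger
```

A collapsed encoder makes the prediction term trivially zero, but it scores badly on the regularizer, which is the mechanism that lets a single weight λ replace stop-gradient and EMA tricks.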
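## Key ablations
| Ablation | Push-T success rate |
|------|-------------|
| LeWM (full) | **96.0%** |
| No SIGReg regularization | Collapses (~30%) |
| No AdaLN (plain action concatenation) | Decreases |
| BatchNorm → LayerNorm | Decreases (SIGReg becomes hard to optimize) |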
## Positioning relative to existing methods
| Method | End-to-end | Task-agnostic | Pixel input | Reconstruction-free | Reward-free | Anti-collapse guarantee |
|------|--------|---------|---------|--------|--------|----------|
| PLDM | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ (6 hyperparameters) |
| DINO-WM | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ (frozen encoder) |
| Dreamer | ✅ | ❌ | ✅ | ❌ | ❌ | N/A |
| TD-MPC | ✅ | ❌ | ❌ | ✅ | ❌ | N/A |
| **LeWM** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ (1 hyperparameter) |
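## Limitations
1. Planning with current latent world models is still limited to **short horizons**, because autoregressive error accumulates with planning length
2. Depends on offline datasets with sufficient interaction coverage
3. In simple scenes, the high-dimensional Gaussian prior enforced by SIGReg may make representation learning difficult
4. Requires explicit action labels (this could be mitigated with inverse dynamics modeling)
5. Experiments are limited to **low-dimensional controlled tasks** such as Push-T, Reacher, TwoRoom, and OGBench-Cube
6. Slightly behind SOTA on OGBench-Cube (DINO-WM benefits from DINOv2 pre-training)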
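The first limitation, error accumulating over an autoregressive rollout, can be seen even in a toy setting. The sketch below assumes a scalar linear latent dynamics `z_{t+1} = c * z_t` and a learned model whose coefficient is off by 2%; both the dynamics and the numbers are hypothetical and are not taken from the paper's experiments.

```python
def rollout(z0, coeff, horizon):
    """Apply z_{t+1} = coeff * z_t repeatedly, feeding each output back in."""
    z = z0
    trajectory = []
    for _ in range(horizon):
        z = coeff * z
        trajectory.append(z)
    return trajectory

def rollout_errors(z0, true_coeff, model_coeff, horizon):
    """Absolute gap between the model rollout and the true rollout at each step."""
    true_traj = rollout(z0, true_coeff, horizon)
    model_traj = rollout(z0, model_coeff, horizon)
    return [abs(m - t) for m, t in zip(model_traj, true_traj)]

# One-step error is 2%, but the model consumes its own predictions, so the gap compounds.
errors = rollout_errors(z0=1.0, true_coeff=1.0, model_coeff=1.02, horizon=50)
print(f"step 1: {errors[0]:.3f}, step 10: {errors[9]:.3f}, step 50: {errors[49]:.3f}")
```

A model that looks accurate on one-step prediction can therefore drift far from the true trajectory at long horizons, which is why planning quality degrades as the planning length grows.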
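## Significance
**An important milestone for the JEPA line of work, not the final answer to the world-model problem.** It demonstrates the engineering feasibility of end-to-end JEPA world models, and it is the only specific world-model paper that LeCun has recommended in interviews.
## Related concepts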
- [[leworldmodel]]
- [[jepa]]
- [[sigreg]]
- [[pldm]]
- [[world-model-lecun]]
- [[representation-collapse]]