20260625: lots of new content
81
papers/vla-jepa-2026.md
Normal file
@@ -0,0 +1,81 @@
---
title: "VLA-JEPA (Sun et al., 2026)"
created: 2026-06-24
updated: 2026-06-24
type: paper
tags: ["vla", "jepa", "world-model", "robot-learning", "pretraining", "latent-action"]
sources:
- "https://arxiv.org/abs/2602.10098"
code: "https://github.com/ginwind/VLA-JEPA/"
---

# VLA-JEPA

> Sun*, Zhang*, Qi, Ren, Liu, Zhu, Sun, Jin†, Chen† | arXiv:2602.10098 | cs.RO / cs.CV | Feb 2026

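## Problem

[[latent-action-pretraining|Latent-action pretraining]] for [[vla-vision-language-action|VLA]] models, which learns robot policies from internet video, is an attractive direction. Current latent-action objectives, however, share a systematic flaw: they are anchored to **pixel changes** rather than **action-relevant state transitions**.

Four failure modes:
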
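| Mode | Description |
|------|------|
| [[appearance-bias-vla\|Appearance bias]] | Pixel-level objectives favor texture, lighting, and background over controllable degrees of freedom |
| Nuisance-motion amplification | Camera motion and irrelevant background changes dominate the signal |
| [[information-leakage-vla\|Information leakage]] | Future frames are fed as input, so the latent action collapses into encoding the future rather than the transition dynamics |
| Multi-stage complexity | Pipelines of three or more stages are brittle to engineer |
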
## Core approach: [[leakage-free-state-prediction|Leakage-free State Prediction]]

VLA-JEPA brings the [[jepa|JEPA]] paradigm to VLA: **predict in latent space rather than pixel space**.

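A toy sketch can illustrate why a latent-space target is less sensitive to appearance changes than a pixel-space one. The `encode` function below is a hypothetical stand-in for a learned encoder (it only mean-centers the frame, which makes it invariant to a global brightness shift). It is not V-JEPA2, and the frame values are made up.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def encode(frame):
    """Hypothetical encoder: removes global brightness by mean-centering.

    An assumption for illustration only; a real encoder is learned.
    """
    mean = sum(frame) / len(frame)
    return [p - mean for p in frame]


# A made-up 4-pixel frame and the same scene under brighter lighting.
frame = [0.10, 0.40, 0.70, 0.20]
brighter = [p + 0.5 for p in frame]

pixel_loss = mse(frame, brighter)                    # penalizes the lighting change
latent_loss = mse(encode(frame), encode(brighter))   # lighting change cancels out

print(f"pixel-space loss:  {pixel_loss:.4f}")
print(f"latent-space loss: {latent_loss:.4f}")
```

The pixel-space loss is driven entirely by the lighting shift even though nothing controllable changed, while the latent-space loss is essentially zero. A learned encoder would only approximate this kind of invariance.
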
### Architecture

- **VLM Backbone**: Qwen3-VL-2B, which outputs latent action tokens
- **[[latent-world-model|Latent World Model]]**: V-JEPA2 encoder (frozen target) + autoregressive Transformer (predictor)
- **Action Head**: [[flow-matching|Conditional Flow-Matching]]

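The note names conditional flow matching as the action head but gives no formula. A common formulation, assumed here and not confirmed as the paper's exact choice, draws a straight line between a noise sample and the target action and regresses a velocity field onto `action - noise`. The names `velocity_fn`, `cond`, and `steps` are hypothetical, and `cond` stands in for whatever conditioning the backbone provides.

```python
def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; constant in t."""
    return [a - n for n, a in zip(noise, action)]


def cfm_loss(velocity_fn, noise, action, t, cond):
    """Squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, action, t)
    predicted = velocity_fn(x_t, t, cond)
    target = target_velocity(noise, action)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def sample_action(velocity_fn, noise, cond, steps=10):
    """Euler integration of the velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up 2-D action and noise; the oracle returns the true velocity.
noise = [0.5, -1.0]
action = [1.0, 2.0]
oracle = lambda x_t, t, cond: target_velocity(noise, action)

print(cfm_loss(oracle, noise, action, t=0.3, cond=None))
print(sample_action(oracle, noise, cond=None))
```

With the oracle velocity the loss is zero and Euler integration lands on the target action up to floating-point error. In training, `velocity_fn` would be a network, and `t` and `noise` would be sampled per example.
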
### Key design

```
Target Encoder (frozen, no grad)     Student (VLM backbone)
               ↓                               ↓
Future frames → latent targets       Current observation only
               ↓                               ↓
                     JEPA alignment loss
                  (predict in latent space)
```

**Future frames serve only as supervision targets and are never fed as input**, which removes the information-leakage shortcut.

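The data flow in the diagram can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the repository's implementation: the linear layers, the weight values, and the choice of mean squared error as the alignment loss are all hypothetical. The point is structural: the student branch has no `future_frame` parameter at all.

```python
def linear(weights, x):
    """Dense layer without bias: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode_target(target_weights, future_frame):
    """Frozen target encoder: future frame -> latent target.

    In training, no gradient would flow through this branch.
    """
    return linear(target_weights, future_frame)


def predict_future_latent(student_weights, predictor_weights, current_obs):
    """Student branch: takes the current observation only.

    There is no future_frame argument, so the future cannot leak in.
    """
    latent_action = linear(student_weights, current_obs)
    return linear(predictor_weights, latent_action)


def jepa_alignment_loss(predicted, target):
    """Mean squared error in latent space (loss type is an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Toy 2-D example with made-up weights and frames.
target_w = [[1.0, 0.0], [0.0, 1.0]]
student_w = [[0.5, 0.0], [0.0, 0.5]]
predictor_w = [[2.0, 0.0], [0.0, 2.0]]

current_obs = [0.2, -0.4]
future_frame = [0.3, -0.1]

predicted = predict_future_latent(student_w, predictor_w, current_obs)
target = encode_target(target_w, future_frame)
loss = jepa_alignment_loss(predicted, target)
print(f"alignment loss: {loss:.4f}")
```

Because the future frame enters only through `encode_target`, the latent action has to summarize what can be inferred from the present rather than copy the future.
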
### Training

- Pretraining: Something-Something-v2 (220K human videos) + Droid (76K robot trajectories)
- Fine-tuning: LIBERO (~2K expert demonstrations) / Fractal + BridgeV2 / 100 real-world demonstrations
- 8×A100, Qwen3-VL-2B backbone

## Key results

### LIBERO

| Method | Spatial | Object | Goal | Long | Avg |
|--------|---------|--------|------|------|-----|
| VLA-JEPA | 96.2 | 99.6 | 99.6 | 97.2 | **98.2** |
| π0.5 | 97.5 | 91.5 | 74.5 | 90.1 | 88.9 |
| OpenVLA-OFT | 97.6 | 97.9 | 94.5 | 96.8 | 96.7 |

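The Avg column can be recomputed from the four suite scores as a quick consistency check. The sketch below uses only the numbers in the table; the 0.06 tolerance is an assumption chosen to absorb one-decimal rounding.

```python
# (suite scores, listed average) copied from the LIBERO table above.
rows = {
    "VLA-JEPA": ([96.2, 99.6, 99.6, 97.2], 98.2),
    "π0.5": ([97.5, 91.5, 74.5, 90.1], 88.9),
    "OpenVLA-OFT": ([97.6, 97.9, 94.5, 96.8], 96.7),
}


def suite_mean(scores):
    """Unweighted mean over the four LIBERO suites."""
    return sum(scores) / len(scores)


def mismatched_rows(table, tolerance=0.06):
    """Names of rows whose listed average differs from the recomputed mean."""
    return [
        name
        for name, (scores, listed) in table.items()
        if abs(suite_mean(scores) - listed) > tolerance
    ]


for name, (scores, listed) in rows.items():
    print(f"{name}: recomputed {suite_mean(scores):.2f}, listed {listed}")
print("mismatched:", mismatched_rows(rows))
```

The VLA-JEPA and OpenVLA-OFT rows agree with their listed averages after rounding. The π0.5 row does not: its four scores average about 88.4 against a listed 88.9, so one of the values in that row is probably mistranscribed and should be checked against the paper.
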
### SimplerEnv
Highest average on Google Robot; second-highest average on WidowX. Uses less than 1% of villa-X's training data.

### Robustness (LIBERO-Plus)
Maintains strong performance across 7 perturbation dimensions (lighting, texture, color, camera, …).

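## Key insights

1. **JEPA for embodied settings**: extends JEPA from video representation learning to robot action policies, showing that latent-space prediction is broadly useful for embodied AI
2. **Information leakage is the root problem**: current latent-action methods fail not for lack of model capacity but because of an architectural flaw (future-information leakage). Fixing the architecture is more effective than adding data
3. **Data efficiency**: outperforms baselines trained on more data while using less, indicating that learning the right objective matters more than learning from more data
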
## Sources

[Raw archive](raw/papers/vla-jepa-2026.md) | [arXiv](https://arxiv.org/abs/2602.10098) | [GitHub](https://github.com/ginwind/VLA-JEPA/)