20260625: lots of new content
41
raw/papers/vla-jepa-2026.md
Normal file
@@ -0,0 +1,41 @@
---
title: "VLA-JEPA: Enhancing VLA with Latent World Model"
author: "Jingwen Sun*, Wenyao Zhang*, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin†, Zhibo Chen†"
source: "arXiv 2602.10098v2"
date: "2026-02-10 (updated 2026-02-14)"
type: paper
venue: "arXiv (cs.RO, cs.CV)"
tags: ["vla", "jepa", "world-model", "robot-learning", "pretraining", "latent-action"]
code: "https://github.com/ginwind/VLA-JEPA/"
---

# VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

> Sun*, Zhang*, Qi, Ren, Liu, Zhu, Sun, Jin†, Chen†
> USTC / SJTU / Tsinghua / EIT / UCAS / Nankai | arXiv:2602.10098v2 | cs.RO / cs.CV

## Core Problem

Current latent-action pretraining objectives for VLAs learn the wrong thing: they are anchored to pixel changes rather than action-relevant state transitions, which leads to four failure modes:
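1. Pixel-level objectives are biased toward appearance rather than action semantics.
2. In real-world video, camera motion and background changes dominate the signal.
3. Information leakage makes the latent action collapse into a shortcut (encoding the future instead of the transition dynamics).
4. Multi-stage training pipelines are complex and brittle.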

## Core Approach: Leakage-free State Prediction

VLA-JEPA brings the JEPA paradigm to VLA pretraining:
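- The target encoder produces latent targets from future frames (used only as supervision, never as input).
- The student sees only the current observation.
- Prediction happens in latent space rather than pixel space, which makes it naturally robust to camera motion and background changes.
- A simple two-stage recipe: JEPA pretraining → action-head fine-tuning.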
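
The leakage-free setup in the bullets above can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the single linear maps standing in for the target encoder and the student, the toy sizes, the synthetic linear transition, and every name (`target_latent`, `student_predict`, `gradient_step`) are assumptions made for illustration. The only point it tries to show is that the future frame enters through the target alone, while the student's input is restricted to the current observation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_OBS, D_LATENT = 64, 32, 8  # hypothetical toy sizes

# Hypothetical stand-ins for the real networks: both are single linear maps.
W_target = rng.normal(size=(D_OBS, D_LATENT)) / np.sqrt(D_OBS)  # frozen target encoder
W_student = np.zeros((D_OBS, D_LATENT))                         # trainable student

# Toy data: the future observation is a fixed linear function of the current one.
transition = rng.normal(size=(D_OBS, D_OBS)) / np.sqrt(D_OBS)
current_obs = rng.normal(size=(N, D_OBS))
future_obs = current_obs @ transition


def target_latent(future):
    """Encode future frames. The result is used as a regression target only."""
    return future @ W_target


def student_predict(current, weights):
    """Predict the future latent from the current observation alone."""
    return current @ weights


def latent_loss(current, future, weights):
    """Mean squared error between predicted and target latents."""
    diff = student_predict(current, weights) - target_latent(future)
    return float(np.mean(diff ** 2))


def gradient_step(current, future, weights, lr=0.5):
    """One full-batch gradient step on the student; the target is a constant."""
    diff = student_predict(current, weights) - target_latent(future)
    grad = 2.0 * current.T @ diff / diff.size
    return weights - lr * grad


loss_before = latent_loss(current_obs, future_obs, W_student)
for _ in range(200):
    W_student = gradient_step(current_obs, future_obs, W_student)
loss_after = latent_loss(current_obs, future_obs, W_student)
print(f"latent loss: {loss_before:.4f} -> {loss_after:.4f}")
```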

Architecture: Qwen3-VL-2B (VLM backbone) + V-JEPA2 encoder (world model) + Flow-Matching action head
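
For readers unfamiliar with flow-matching action heads, the sketch below shows one common formulation: a straight-line path from Gaussian noise to the action, with a constant velocity target. These notes do not say which variant the paper uses, so the path choice, the 7-dimensional action, the oracle velocity field, and the function names are all assumptions. In a real head, a network conditioned on the backbone features would replace `oracle_velocity`.

```python
import numpy as np


def flow_matching_pair(action, rng):
    """Build one training example for a linear-path flow-matching objective."""
    noise = rng.normal(size=action.shape)   # x_0 sampled from N(0, I)
    t = rng.uniform()                       # time in [0, 1)
    x_t = (1.0 - t) * noise + t * action    # point on the straight path
    target_velocity = action - noise        # derivative of x_t with respect to t
    return x_t, t, target_velocity


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
action = np.array([0.1, -0.2, 0.3, 0.0, 0.5, -0.4, 1.0])  # hypothetical 7-DoF action

# Training side: a network would regress target_velocity from (x_t, t, context).
x_t, t, target_velocity = flow_matching_pair(action, rng)


def oracle_velocity(x, t):
    """Ideal velocity field when the data distribution is this single action."""
    return (action - x) / (1.0 - t)


# Sampling side: start from noise and integrate the velocity field.
sampled_action = euler_sample(oracle_velocity, rng.normal(size=action.shape))
```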
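
## Key Results

- **LIBERO**: State-of-the-art average success rate; best on 2 of the 4 task suites.
- **SimplerEnv**: Highest average success rate on Google Robot; second on WidowX.
- **LIBERO-Plus**: Strong robustness across 7 perturbation dimensions.
- **Data efficiency**: Achieves better performance with far less training data than the compared methods.
- **Real-world Franka**: Successfully validated on a real robot.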