---
title: "VLA-JEPA: Enhancing VLA with Latent World Model"
author: "Jingwen Sun*, Wenyao Zhang*, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin†, Zhibo Chen†"
source: "arXiv 2602.10098v2"
date: "2026-02-10 (updated 2026-02-14)"
type: paper
venue: "arXiv (cs.RO, cs.CV)"
tags: ["vla", "jepa", "world-model", "robot-learning", "pretraining", "latent-action"]
code: "https://github.com/ginwind/VLA-JEPA/"
---
# VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
> Sun*, Zhang*, Qi, Ren, Liu, Zhu, Sun, Jin†, Chen†
> USTC / SJTU / Tsinghua / EIT / UCAS / Nankai | arXiv:2602.10098v2 | cs.RO / cs.CV
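## Core Problem
Current latent-action pretraining objectives for VLAs learn the wrong thing: they anchor on pixel changes rather than on action-relevant state transitions, which leads to four failure modes:
1. Pixel-level objectives are biased toward appearance rather than action semantics
2. In real-world video, camera motion and background changes dominate the signal
3. Information leakage lets the latent action collapse into a shortcut (encoding the future rather than the transition dynamics)
4. Multi-stage training pipelines are complex and fragile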
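## Core Approach: Leakage-free State Prediction
VLA-JEPA brings the JEPA paradigm into VLA pretraining:
- The target encoder produces latent targets from future frames; these serve only as supervision and are never fed in as input
- The student sees only the current observation
- Prediction happens in latent space rather than pixel space, which makes it inherently robust to camera motion and background changes
- A simple two-stage pipeline: JEPA pretraining → action-head fine-tuning
Architecture: Qwen3-VL-2B (VLM backbone) + V-JEPA2 encoder (world model) + Flow-Matching action head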
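
The leakage-free setup described in this section can be illustrated with a toy sketch. This is not the paper's implementation: the linear "encoders", the function names (`predict_future_latent`, `encode_target`, `latent_loss`), the identity weights, and the mean-squared-error loss are all assumptions chosen for illustration. The only point it shows is the data flow: the student branch never receives the future frame, and the loss is computed between latents rather than pixels.

```python
def linear(weights, x):
    """Apply a bias-free linear map (given as a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def predict_future_latent(student_w, predictor_w, obs_now):
    """Student branch (hypothetical): sees only the current observation."""
    z_now = linear(student_w, obs_now)
    return linear(predictor_w, z_now)


def encode_target(target_w, obs_future):
    """Target branch (hypothetical): encodes the future frame for supervision only."""
    return linear(target_w, obs_future)


def latent_loss(z_pred, z_target):
    """Mean squared error in latent space (assumed loss; no pixel reconstruction)."""
    return sum((p - t) ** 2 for p, t in zip(z_pred, z_target)) / len(z_target)


# Toy 2-D observations and identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
obs_now, obs_future = [1.0, 2.0], [3.0, 2.0]

# The prediction is computed without any access to obs_future.
z_pred = predict_future_latent(identity, identity, obs_now)
# The future frame enters only through the target encoder.
z_target = encode_target(identity, obs_future)
loss = latent_loss(z_pred, z_target)
```

Because `obs_future` is an argument only to `encode_target`, changing the future frame changes the loss but cannot change `z_pred`, which is the property that rules out the shortcut of encoding the future directly. In a real training loop the target branch would also be excluded from gradient updates; the sketch omits optimization entirely.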
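## Key Results
- **LIBERO**: SOTA average success rate; best on 2 of the 4 task suites
- **SimplerEnv**: highest average success rate on Google Robot; second on WidowX
- **LIBERO-Plus**: strong robustness across 7 perturbation dimensions
- **Data efficiency**: achieves better performance with far less training data than the compared methods
- **Real-world Franka**: validated successfully on a real robot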