---
title: "VLA-JEPA (Model)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["vla", "jepa", "world-model", "robot-learning"]
sources:
- "[[vla-jepa-2026]]"
---

# VLA-JEPA

VLA-JEPA is a pretraining framework that brings the JEPA paradigm to Vision-Language-Action models. Core idea: use leakage-free state prediction to learn action-relevant abstractions of the dynamics in latent space.

## Architecture

- VLM Backbone: Qwen3-VL-2B
- Latent World Model: V-JEPA2 encoder (frozen target) + autoregressive Transformer predictor
- Action Head: Conditional Flow-Matching
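
To make the Action Head entry above more concrete, here is a minimal sketch of conditional flow matching in plain Python. It assumes the common straight-line path from Gaussian noise to the action, which the source does not specify. The names `cfm_training_pair`, `cfm_loss`, `sample_action` and `oracle_velocity` are hypothetical, and the toy "condition" is simply the goal action itself rather than VLM features.

```python
import random


def cfm_training_pair(action, rng):
    """Build one flow-matching regression example.

    Assumption: straight-line path x_t = (1 - t) * noise + t * action,
    whose target velocity is (action - noise).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # t in [0, 1)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, velocity


def cfm_loss(velocity_fn, condition, action, rng):
    """Mean squared error between predicted and target velocity."""
    x_t, t, target = cfm_training_pair(action, rng)
    pred = velocity_fn(x_t, t, condition)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def sample_action(velocity_fn, condition, dim, rng, steps=10):
    """Euler-integrate dx/dt = v(x, t, condition) from noise (t=0) toward t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(steps):
        t = k / steps
        v = velocity_fn(x, t, condition)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, condition):
    """Exact velocity field when the condition fully determines the action.

    Stands in for a learned network; here the condition IS the goal action.
    """
    return [(c - xi) / (1.0 - t) for c, xi in zip(condition, x)]


rng = random.Random(0)
goal = [0.2, -0.1, 0.4]
demo_loss = cfm_loss(oracle_velocity, goal, goal, rng)
demo_action = sample_action(oracle_velocity, goal, len(goal), rng)
print(demo_loss, demo_action)
```

In a real model the oracle would be replaced by a network trained to minimize `cfm_loss`, conditioned on whatever features the backbone provides; that wiring is not described in this note.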
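
## Key Design Principles

1. The target encoder produces latent targets from future frames → used only as supervision targets
2. The student sees only the current observation → eliminates information leakage
3. Prediction happens in latent space (not pixel space) → robust to appearance changes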
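
The three principles above can be sketched as a toy training step. This is only an illustration under strong assumptions: both branches are single linear maps, the gradient is written out by hand, and all names (`jepa_step`, `w_student`, `w_target`) are hypothetical rather than taken from the VLA-JEPA code.

```python
import random


def matvec(weights, x):
    """Apply a linear map stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def jepa_step(w_student, w_target, frame_now, frame_future, lr=0.05):
    """One latent prediction step; returns the loss measured before the update."""
    # Principle 1: the frozen target encoder sees the future frame, and its
    # output is used only as a regression target (w_target is never modified).
    z_target = matvec(w_target, frame_future)
    # Principle 2: the student sees only the current frame.
    z_pred = matvec(w_student, frame_now)
    # Principle 3: the error is measured in latent space, not on pixels.
    loss = mse(z_pred, z_target)
    # Hand-written gradient of the MSE with respect to the student weights.
    dim_z = len(z_pred)
    for i in range(dim_z):
        grad_i = 2.0 * (z_pred[i] - z_target[i]) / dim_z
        for j, x_j in enumerate(frame_now):
            w_student[i][j] -= lr * grad_i * x_j
    return loss


rng = random.Random(0)
dim_in, dim_z = 4, 3
w_target = [[rng.uniform(-1.0, 1.0) for _ in range(dim_in)] for _ in range(dim_z)]
w_target_before = [row[:] for row in w_target]  # copy, to check it stays frozen
w_student = [[0.0] * dim_in for _ in range(dim_z)]
frame_now = [1.0, 0.5, -0.5, 0.25]      # toy "current observation"
frame_future = [0.9, 0.6, -0.4, 0.3]    # toy "future frame"
losses = [jepa_step(w_student, w_target, frame_now, frame_future) for _ in range(200)]
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
```

Because `w_target` is only read, the target branch acts as a fixed supervision signal, and `frame_future` never reaches the student's input, which is the leakage-free property in miniature.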

## Training Pipeline

Training is simplified to two stages: JEPA Pretraining → Action-Head Fine-tuning
(vs. the multi-stage pipelines of traditional latent-action methods)

## Performance

98.2% average on LIBERO (SOTA), leading results on SimplerEnv, and far better data efficiency than the compared methods.

## References

- [[vla-jepa-2026]]
- [[jepa]]
- [[vla-vision-language-action]]
- [[leakage-free-state-prediction]]
- [[latent-world-model]]