---
title: "VLA-JEPA (Model)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["vla", "jepa", "world-model", "robot-learning"]
sources:
- "[[vla-jepa-2026]]"
---
# VLA-JEPA
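VLA-JEPA is a pretraining framework that brings the JEPA paradigm to Vision-Language-Action (VLA) models. Its core idea is to learn action-relevant abstractions of dynamics in latent space through leakage-free state prediction.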
## Architecture
- VLM Backbone: Qwen3-VL-2B
- Latent World Model: V-JEPA2 encoder (frozen target) + autoregressive Transformer predictor
- Action Head: Conditional Flow-Matching
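
To make the action-head entry more concrete, here is a minimal sketch of a flow-matching training target and sampler in plain Python. It assumes the common linear path between Gaussian noise and the ground-truth action; this note does not say which path or solver VLA-JEPA uses. All function names are hypothetical, and the conditioning on backbone and world-model features is omitted (a real velocity network would take that context as an extra input).

```python
import random


def cfm_training_pair(action, t, rng):
    """Build one flow-matching training example for a single action vector.

    Assumption: linear path x_t = (1 - t) * noise + t * action.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]  # x_0 drawn from N(0, I)
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]  # d(x_t)/dt
    return noise, x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    action = [0.5, -1.0, 0.25]  # hypothetical 3-DoF action
    noise, x_t, v = cfm_training_pair(action, t=0.3, rng=rng)
    # An oracle that always returns the true velocity recovers the action.
    recovered = euler_sample(lambda x, t: v, noise)
    print(cfm_loss(v, v), recovered)
```

At training time the head would regress `target_velocity` from `x_t`, `t`, and the conditioning features. At inference it would start from noise and integrate the learned velocity field, as `euler_sample` does here with an oracle.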
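## Key Design Principles
1. The target encoder produces latent targets from future frames → these serve only as supervision targets
2. The student sees only the current observation → eliminates information leakage
3. Prediction happens in latent space (not pixel space) → robust to appearance changes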
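
The three principles can be sketched as a toy loss in plain Python. This is an illustration under stated assumptions, not the framework's implementation: the linear maps stand in for real networks, every function and parameter name is hypothetical, and the stop-gradient on the target branch appears only as a comment because no autograd library is used.

```python
def linear(vector, weights):
    """Toy linear map standing in for an encoder or predictor network."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def predict_future_latent(current_obs, student_w, predictor_w):
    """Student branch: its signature accepts only the current observation,
    so future frames cannot leak into the prediction."""
    return linear(linear(current_obs, student_w), predictor_w)


def target_latent(future_obs, target_w):
    """Target branch: frozen encoder applied to the future frame.

    In an autograd framework this output would be detached (stop-gradient),
    so it acts purely as a supervision target.
    """
    return linear(future_obs, target_w)


def latent_prediction_loss(current_obs, future_obs, student_w, predictor_w, target_w):
    """Mean squared error measured in latent space, not pixel space."""
    pred = predict_future_latent(current_obs, student_w, predictor_w)
    target = target_latent(future_obs, target_w)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-D toy weights
    current, future = [1.0, 2.0], [1.0, 4.0]
    print(latent_prediction_loss(current, future, identity, identity, identity))
```

The future frame enters only through `target_latent`, so the loss compares latents and never asks the student to reconstruct pixels.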
## Training Pipeline
Simplified to two stages: JEPA Pretraining → Action-Head Fine-tuning
(vs. the multi-stage pipelines of traditional latent-action methods)
## Performance
LIBERO average of 98.2% (SOTA); leads on SimplerEnv; data efficiency far exceeds that of the compared methods.
## References
- [[vla-jepa-2026]]
- [[jepa]]
- [[vla-vision-language-action]]
- [[leakage-free-state-prediction]]
- [[latent-world-model]]