SidneyZhang/myWiki

Sidney Zhang 6021dea160

20260625: lots of new content

2026-06-25 14:08:47 +08:00

title: Latent World Model (Robotics)
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: world-model, jepa, robot-learning, latent-representation
sources:


Latent World Model (Embodied)

The Latent World Model is the world-model component of VLA-JEPA. Following the JEPA paradigm, it models state-transition dynamics in latent space.

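Architecture

Target Encoder: V-JEPA2, frozen; produces latent world-state targets from future frames
Predictor: autoregressive Transformer (12 layers, 8 attention heads, 2048-dim)
Attention: bidirectional within a single time step (K latent action tokens + N image latent tokens), causal across time steps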
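
The attention pattern described above (bidirectional inside a time step, causal across time steps) amounts to a block-causal mask. The following is a minimal sketch, not the VLA-JEPA implementation: `PredictorConfig`, `block_causal_mask`, the field names, and the token ordering within a step are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictorConfig:
    """Predictor sizes taken from the list above; field names are illustrative."""
    num_layers: int = 12
    num_heads: int = 8
    dim: int = 2048

    @property
    def head_dim(self) -> int:
        # Per-head width, assuming the model width is split evenly across heads.
        return self.dim // self.num_heads


def block_causal_mask(num_steps: int, k_action: int, n_image: int) -> list[list[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Tokens are laid out step by step. Each step holds k_action latent action
    tokens followed by n_image image latent tokens (the ordering is an assumption).
    """
    tokens_per_step = k_action + n_image
    total = num_steps * tokens_per_step
    step_of = [i // tokens_per_step for i in range(total)]
    # Same step: visible in both directions. Earlier step: visible. Later step: hidden.
    return [[step_of[j] <= step_of[i] for j in range(total)] for i in range(total)]


if __name__ == "__main__":
    # Print a small mask: "x" means attention is allowed, "." means it is blocked.
    for row in block_causal_mask(num_steps=3, k_action=1, n_image=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With this layout, every token in a step sees all K + N tokens of its own step and of all earlier steps, but none from later steps.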

Training objective

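\mathcal{L}_{WM} = \sum_{k=1}^{T} \mathbb{E}_{s_{t_k} \sim F(\cdot)} \left\| \hat{s}_{t_k} - s_{t_k} \right\|

The target encoder F(·) provides the ground-truth world state s_{t_k}, and the predictor learns to predict it as \hat{s}_{t_k}. Here \|\cdot\| denotes a distance in latent space; the specific norm is not stated in this note.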
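
The objective above can be sketched in plain Python as a sum over time steps of the average distance between predicted and target latents. This is an illustrative sketch only: the function name `latent_prediction_loss`, the nested-list data layout, the averaging over tokens as a stand-in for the expectation, and the choice between L1 and L2 distance are all assumptions. Targets are treated as constants, matching the frozen target encoder.

```python
import math


def latent_prediction_loss(predicted, target, norm="l2"):
    """Sum over time steps of the mean token-wise distance between latents.

    predicted, target: lists over T time steps; each step is a list of token
    vectors (lists of floats). `norm` selects the distance ("l1" or "l2");
    which one applies in practice is an assumption of this sketch.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must cover the same time steps")
    if norm not in ("l1", "l2"):
        raise ValueError("norm must be 'l1' or 'l2'")

    total = 0.0
    for pred_step, tgt_step in zip(predicted, target):
        distances = []
        for s_hat, s in zip(pred_step, tgt_step):
            diff = [a - b for a, b in zip(s_hat, s)]
            if norm == "l1":
                distances.append(sum(abs(d) for d in diff))
            else:
                distances.append(math.sqrt(sum(d * d for d in diff)))
        # Average over tokens approximates the expectation for this time step.
        total += sum(distances) / len(distances)
    return total


if __name__ == "__main__":
    predicted = [[[1.0, 0.0]], [[0.0, 0.0]]]  # T = 2 steps, one token each
    target = [[[0.0, 0.0]], [[3.0, 4.0]]]
    print(latent_prediction_loss(predicted, target, norm="l2"))
    print(latent_prediction_loss(predicted, target, norm="l1"))
```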

This objective can be interpreted as ELBO maximization:

\log p(s_{1:T} \mid z_{0:T-1}) \geq \sum_{k=1}^{T} \mathbb{E}\left[\log p_\theta(\hat{s}_{t_k} \mid s_{t_k})\right] - D_{\mathrm{KL}}(F \,\|\, p_\theta^{WM})
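
One way to connect the log-likelihood term of this bound to a distance-based prediction loss, under an assumption not made in the text above: if p_\theta is taken to be an isotropic Gaussian with fixed variance \sigma^2 over a d-dimensional latent (both hypothetical choices), then

\log p_\theta(\hat{s}_{t_k} \mid s_{t_k}) = -\frac{1}{2\sigma^2} \left\| \hat{s}_{t_k} - s_{t_k} \right\|_2^2 - \frac{d}{2} \log(2\pi\sigma^2)

Under that assumption, maximizing the expected log-likelihood is equivalent to minimizing the squared L2 prediction error, since the second term is constant. A Laplace likelihood would give an L1 error instead.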

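Differences from general-purpose world models

Unlike world models such as Dreamer, whose representations are trained with a pixel-reconstruction objective, the Latent World Model predicts targets in a semantic latent space, so it naturally filters out pixel-level noise.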

References