SidneyZhang/myWiki

Sidney Zhang b116710e4c

20260514: add new content

2026-05-14 13:54:52 +08:00

title: GFlowNet Fine-Tuning
created: 2026-05-12
updated: 2026-05-12
type: concept
tags: gflownet, reinforcement-learning, llm-fine-tuning
sources: arxiv:2311.09278, arxiv:2503.18929, arxiv:2402.15211

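GFlowNet Fine-Tuning

GFlowNet fine-tuning is a post-training method for LLMs that uses the training objectives of Generative Flow Networks (GFlowNets). Its core advantages are off-policy compatibility and diversity-preserving sampling.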

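GFlowNets Basics

GFlowNets train a hierarchical generative model to sample in proportion to a given unnormalized density (the reward function): $\pi_\theta(x) \propto R(x)$.

Key distinction: GFlowNets learn to match a distribution rather than to maximize reward, which naturally yields diverse outputs.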
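
To make the contrast between distribution matching and reward maximization concrete, here is a minimal sketch on a toy discrete space. The reward table and the helper names (`target_distribution`, `sample_proportional`) are hypothetical and chosen only for illustration; a real GFlowNet learns a sequential sampler instead of normalizing the reward directly.

```python
import random
from collections import Counter

# Hypothetical toy reward over three candidate outputs (illustrative values).
rewards = {"a": 1.0, "b": 3.0, "c": 6.0}

def target_distribution(rewards):
    """Normalize an unnormalized reward into the target pi(x) = R(x) / Z."""
    z = sum(rewards.values())  # partition function Z
    return {x: r / z for x, r in rewards.items()}

def sample_proportional(rewards, n, seed=0):
    """Draw n samples with probability proportional to R(x)."""
    rng = random.Random(seed)
    items = list(rewards)
    weights = [rewards[x] for x in items]
    return Counter(rng.choices(items, weights=weights, k=n))

pi = target_distribution(rewards)
greedy = max(rewards, key=rewards.get)  # the single output reward maximization returns
counts = sample_proportional(rewards, n=10_000)

print("target distribution:", pi)
print("reward-maximizing output:", greedy)
print("proportional samples:", dict(counts))
```

Reward maximization always returns `c`, while proportional sampling keeps visiting `a` and `b` at rates set by their rewards.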

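Three Main Objectives

Objective	Formula	Notes
Flow Matching (FM)	$\sum_{s' \to s} F(s' \to s) = \sum_{s \to s''} F(s \to s'')$	The most basic objective
Detailed Balance (DB)	$F(s) P_F(s' \mid s) = F(s') P_B(s \mid s')$	
Trajectory Balance (TB)	$\left(\log \frac{Z \prod_t P_F(s_{t+1} \mid s_t)}{R(x)}\right)^2$	Used for LLM fine-tuning (shown with $P_B = 1$, as in autoregressive generation)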
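
As a sanity check on the TB objective, the sketch below evaluates it on a toy autoregressive model with two tokens and sequences of length 2. Each prefix has exactly one parent, so $P_B = 1$ and drops out. The reward table and all function names are hypothetical. The loss vanishes when the forward policy samples exactly in proportion to the reward and $Z$ is the true partition function, and it is positive for a mismatched policy.

```python
import math

# Hypothetical rewards for all length-2 sequences over the tokens "0" and "1".
REWARD = {"00": 1.0, "01": 3.0, "10": 2.0, "11": 2.0}

def tb_loss(log_z, step_log_probs, log_reward):
    """Trajectory balance loss for one trajectory, assuming P_B = 1."""
    return (log_z + sum(step_log_probs) - log_reward) ** 2

def exact_policy(reward):
    """Forward policy whose terminal distribution is exactly R(x) / Z."""
    z = sum(reward.values())
    # P_F(first token): marginal reward mass under each first token.
    first = {t: sum(r for x, r in reward.items() if x[0] == t) / z for t in "01"}
    # P_F(second token | first token): conditional reward mass.
    second = {x: r / (first[x[0]] * z) for x, r in reward.items()}
    return z, first, second

def losses(log_z, first, second):
    """TB loss of every complete sequence under the given policy and log Z."""
    out = {}
    for x, r in REWARD.items():
        steps = [math.log(first[x[0]]), math.log(second[x])]
        out[x] = tb_loss(log_z, steps, math.log(r))
    return out

z, first, second = exact_policy(REWARD)
optimal = losses(math.log(z), first, second)
# A uniform policy with log Z = 0 does not match the reward.
uniform = losses(0.0, {"0": 0.5, "1": 0.5}, {x: 0.5 for x in REWARD})

print("loss at the optimum:", optimal)
print("loss for a uniform policy:", uniform)
```

In LLM fine-tuning the per-step log-probabilities would be the model's token log-probabilities and $\log Z$ a learned or estimated quantity; the toy keeps both in closed form.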

LLM Applications

Hu et al. (ICLR 2024): GFlowNet Fine-Tuning

The first application of GFlowNets to LLM fine-tuning; it exploits their off-policy property to perform KL-regularized RL as inference.
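
The link between GFlowNet sampling and KL-regularized RL can be made explicit with a standard identity. The symbols $r$ (task reward), $\beta$ (regularization strength), and $\pi_{\text{ref}}$ (reference model) are notation assumed here for illustration and do not come from this note.

$$
\pi^* = \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\big[r(y)\big] - \beta\,\mathrm{KL}\big(\pi \,\|\, \pi_{\text{ref}}\big)
\quad\Longrightarrow\quad
\pi^*(y) = \frac{\pi_{\text{ref}}(y)\,\exp\big(r(y)/\beta\big)}{Z},
\qquad
Z = \sum_{y} \pi_{\text{ref}}(y)\,\exp\big(r(y)/\beta\big)
$$

Under these assumptions, choosing $R(y) = \pi_{\text{ref}}(y)\exp(r(y)/\beta)$ makes the GFlowNet target $\pi_\theta(y) \propto R(y)$ coincide with the KL-regularized optimum.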

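Lee et al. (ICLR 2025): Red-Teaming

Uses TB together with MLE smoothing to generate diverse, transferable adversarial attack prompts.

Bartoldson et al. (NeurIPS 2025): TBA

Extends the TB objective to distributed asynchronous RL, achieving a 4× to 50× speedup. See tba and trajectory-balance-objective.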

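Why Are GFlowNets a Good Fit for LLMs?

Off-policy: training does not require data from the current policy → supports replay buffers and asynchronous training
Diversity: the model learns a distribution rather than a maximum → avoids mode collapse
No critic: no value network is needed → sidesteps the difficulty of value estimation in LLMs
Equivalence to REINFORCE: TB_VarGrad gradient = mean-baseline REINFORCE + KL reward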
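
The critic-free, off-policy character of the points above can be sketched with a VarGrad-style TB loss, in which $\log Z$ is not learned but estimated from a batch of responses to the same prompt. This is a minimal illustration under assumptions: the function name `vargrad_tb_loss` and the toy numbers are hypothetical, and the code does not by itself demonstrate the REINFORCE equivalence.

```python
import math
from statistics import mean

def vargrad_tb_loss(log_rewards, log_probs):
    """TB loss with log Z replaced by a batch estimate (VarGrad-style sketch).

    log_rewards[i]: log R(y_i) for response i to one prompt
    log_probs[i]:   log pi_theta(y_i), the summed token log-probabilities
    The responses may come from any sampler, for example a replay buffer.
    """
    # Each response gives one estimate of log Z: log R(y) - log pi_theta(y).
    estimates = [lr - lp for lr, lp in zip(log_rewards, log_probs)]
    log_z_hat = mean(estimates)
    # Mean squared TB residual, i.e. the population variance of the estimates.
    loss = mean((log_z_hat + lp - lr) ** 2 for lr, lp in zip(log_rewards, log_probs))
    return loss, log_z_hat

# Hypothetical rewards for three responses to a single prompt.
toy_rewards = [1.0, 3.0, 6.0]
toy_log_rewards = [math.log(r) for r in toy_rewards]

# A policy that is exactly proportional to the reward: zero loss, log Z recovered.
matched_log_probs = [math.log(r / sum(toy_rewards)) for r in toy_rewards]
matched_loss, matched_log_z = vargrad_tb_loss(toy_log_rewards, matched_log_probs)

# A uniform policy ignores the reward: the log Z estimates disagree, loss > 0.
uniform_log_probs = [math.log(1 / 3)] * 3
uniform_loss, _ = vargrad_tb_loss(toy_log_rewards, uniform_log_probs)

print("matched policy:", matched_loss, matched_log_z)
print("uniform policy:", uniform_loss)
```

No value network appears anywhere: the only learned quantity entering the loss is the policy's own log-probability of each response.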

Related Concepts

trajectory-balance-objective: the TB objective in detail
tba: asynchronous distributed implementation
off-policy-llm-post-training: the off-policy paradigm
bartoldson-tba-2025