20260514: Add new content
54
concepts/gflownet-fine-tuning.md
Normal file
@@ -0,0 +1,54 @@
---
title: "GFlowNet Fine-Tuning"
created: 2026-05-12
updated: 2026-05-12
type: concept
tags: ["gflownet", "reinforcement-learning", "llm-fine-tuning"]
sources: ["arxiv:2311.09278", "arxiv:2503.18929", "arxiv:2402.15211"]
---

# GFlowNet Fine-Tuning

**GFlowNet fine-tuning** post-trains an LLM with the objective functions of Generative Flow Networks (GFlowNets). Its core advantages are **off-policy compatibility** and **diversity-preserving sampling**.
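
## GFlowNets Basics

GFlowNets train a hierarchical generative model to sample objects with probability proportional to a given unnormalized density (the reward function): $\pi_\theta(x) \propto R(x)$.

Key difference: GFlowNets learn **distribution matching** rather than reward maximization, so they produce diverse outputs naturally.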
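
To make the contrast concrete, here is a minimal sketch comparing the target distribution $R(x)/Z$ with a pure reward maximizer. The four candidate outputs and their rewards are hypothetical toy values, and `random.choices` stands in for a perfectly trained sampler; no actual GFlowNet is trained here.

```python
import random
from collections import Counter

# Hypothetical toy rewards over four candidate outputs (illustration only).
rewards = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}


def target_distribution(rewards):
    """Normalize R(x) into the distribution pi(x) = R(x) / Z."""
    z = sum(rewards.values())
    return {x: r / z for x, r in rewards.items()}


def sample_proportional(rewards, n, seed=0):
    """Draw n samples with probability proportional to R(x)."""
    rng = random.Random(seed)
    xs = list(rewards)
    weights = [rewards[x] for x in xs]
    return Counter(rng.choices(xs, weights=weights, k=n))


def greedy_maximizer(rewards):
    """A pure reward maximizer always returns the single best output."""
    return max(rewards, key=rewards.get)


probs = target_distribution(rewards)
counts = sample_proportional(rewards, n=10_000)
best = greedy_maximizer(rewards)

print("target distribution:", probs)
print("empirical counts:   ", dict(counts))
print("reward maximizer:   ", best)
```

The proportional sampler keeps visiting every output, weighted by its reward, while the maximizer collapses onto a single mode.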

### Three Main Objectives

| Objective | Formula | Notes |
|------|------|------|
| Flow Matching (FM) | $\sum_{s' \to s} F(s' \to s) = \sum_{s \to s''} F(s \to s'')$ | The most basic objective |
| Detailed Balance (DB) | $F(s)\,P_F(s' \mid s) = F(s')\,P_B(s \mid s')$ | Forward-backward consistency |
| **Trajectory Balance (TB)** | $\left(\log \frac{Z \prod_t P_F(s_{t+1} \mid s_t)}{R(x)}\right)^2$ | **Used for LLM fine-tuning** (with $P_B = 1$ for autoregressive generation) |
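
The TB row can be checked numerically on a toy problem. The sketch below assumes a two-token vocabulary, length-2 sequences, and hand-picked rewards (all hypothetical); `trajectory_balance_loss` is an illustrative helper, not an API from any library. The residual vanishes when the policy samples proportionally to $R$ and $Z$ equals the total reward.

```python
import math


def trajectory_balance_loss(log_z, log_pf_steps, log_reward, log_pb_steps=None):
    """Squared TB residual for a single trajectory.

    log_pf_steps: per-step log P_F(s_{t+1} | s_t).
    log_pb_steps: per-step log P_B(s_t | s_{t+1}); None means P_B = 1,
    which holds for autoregressive generation (each prefix has one parent).
    """
    log_pb = sum(log_pb_steps) if log_pb_steps is not None else 0.0
    residual = log_z + sum(log_pf_steps) - log_reward - log_pb
    return residual ** 2


# Hypothetical rewards for every length-2 sequence over the vocabulary {a, b}.
rewards = {"aa": 1.0, "ab": 3.0, "ba": 2.0, "bb": 2.0}
z = sum(rewards.values())  # true partition function

# Next-token probabilities keyed by prefix. This policy satisfies pi(x) = R(x) / Z.
matched_policy = {
    "": {"a": 0.5, "b": 0.5},
    "a": {"a": 0.25, "b": 0.75},
    "b": {"a": 0.5, "b": 0.5},
}
# A uniform policy ignores the rewards entirely.
uniform_policy = {prefix: {"a": 0.5, "b": 0.5} for prefix in ("", "a", "b")}


def step_log_probs(policy, x):
    """Per-token log-probabilities of sequence x under a prefix-keyed policy."""
    return [math.log(policy[x[:t]][x[t]]) for t in range(len(x))]


def total_loss(policy, log_z):
    """Sum of TB losses over all four trajectories."""
    return sum(
        trajectory_balance_loss(log_z, step_log_probs(policy, x), math.log(r))
        for x, r in rewards.items()
    )


matched_loss = total_loss(matched_policy, math.log(z))
uniform_loss = total_loss(uniform_policy, math.log(z))
print(f"matched policy TB loss: {matched_loss:.3e}")
print(f"uniform policy TB loss: {uniform_loss:.3f}")
```

In training, `log_z` and the policy parameters would be learned jointly by gradient descent on this loss; here both are fixed by hand to keep the sketch self-contained.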

## LLM Applications

### Hu et al. (ICLR 2024): GFlowNet Fine-Tuning

The first work to apply GFlowNets to LLM fine-tuning, exploiting their off-policy property to carry out inference cast as KL-regularized RL.
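
As a hedged aside on what "KL-regularized RL" means in this setting: the standard objective $\mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ has a well-known closed-form optimum. The symbols $r$ (task reward), $\beta$ (regularization strength), and $\pi_{\mathrm{ref}}$ (reference model) are notation assumed here for illustration, not taken from the paper.

$$
\pi^*(x) = \frac{1}{Z}\,\pi_{\mathrm{ref}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right),
\qquad
Z = \sum_{x} \pi_{\mathrm{ref}}(x)\,\exp\!\left(\frac{r(x)}{\beta}\right)
$$

Setting $R(x) = \pi_{\mathrm{ref}}(x)\exp(r(x)/\beta)$ turns this optimum into the GFlowNet target $\pi_\theta(x) \propto R(x)$, which is why a distribution-matching objective can stand in for a KL-regularized policy-gradient method.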

### Lee et al. (ICLR 2025): Red-Teaming

Uses TB + MLE smoothing to generate diverse, transferable adversarial attack prompts.

### Bartoldson et al. (NeurIPS 2025): TBA

Extends the TB objective to distributed asynchronous RL, achieving 4×–50× speedups. See [[tba|TBA]] and [[trajectory-balance-objective|TB objective]].
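
## Why Are GFlowNets a Good Fit for LLMs?

1. **Off-policy**: training does not require data from the current policy → supports replay buffers and asynchronous training
2. **Diversity**: learns a distribution rather than a maximum → avoids mode collapse
3. **No critic**: no value network is needed → sidesteps the difficulty of value estimation in LLMs
4. **Equivalence with REINFORCE**: TB\_VarGrad gradient = mean-baseline REINFORCE + KL reward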
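
The points above can be illustrated with a small sketch of a VarGrad-style TB loss, under the assumption that $\log Z$ is estimated as the batch mean of $\log R(x) - \log P_F(x)$ over several samples for the same prompt. The function names and toy numbers are hypothetical. The loss needs only rewards and current-policy log-probabilities, so the samples may come from a replay buffer, and no value network appears anywhere.

```python
import math
from statistics import mean


def vargrad_log_z(log_rewards, log_pfs):
    """Estimate log Z as the batch mean of log R(x) - log P_F(x)."""
    return mean(lr - lp for lr, lp in zip(log_rewards, log_pfs))


def tb_vargrad_loss(log_rewards, log_pfs):
    """Mean squared TB residual using the batch estimate of log Z.

    This equals the variance of log R(x) - log P_F(x) across the batch,
    so it is zero exactly when P_F(x) is proportional to R(x) on the batch.
    """
    log_z = vargrad_log_z(log_rewards, log_pfs)
    return mean((log_z + lp - lr) ** 2 for lr, lp in zip(log_rewards, log_pfs))


# Hypothetical buffer of four responses to one prompt. They could have been
# generated by an older policy; only their rewards are stored.
log_rewards = [math.log(r) for r in (1.0, 3.0, 2.0, 2.0)]

# Current-policy log-probabilities in two scenarios.
matched_log_pfs = [math.log(p) for p in (0.125, 0.375, 0.25, 0.25)]  # P_F = R / 8
uniform_log_pfs = [math.log(0.25)] * 4                               # ignores R

matched = tb_vargrad_loss(log_rewards, matched_log_pfs)
uniform = tb_vargrad_loss(log_rewards, uniform_log_pfs)
estimated_log_z = vargrad_log_z(log_rewards, matched_log_pfs)
print(f"matched loss: {matched:.3e}, estimated log Z: {estimated_log_z:.4f}")
print(f"uniform loss: {uniform:.4f}")
```

In a real training loop the log-probabilities would be recomputed under the current model parameters and differentiated; plain floats are used here so the sketch runs without a deep learning framework.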

## Related Concepts

- [[trajectory-balance-objective]]: the TB objective in detail
- [[tba|TBA]]: asynchronous distributed implementation
- [[off-policy-llm-post-training]]: the off-policy paradigm
- [[bartoldson-tba-2025|Paper page]]