20260625: lots of new content
30
concepts/streaming-inference.md
Normal file
@@ -0,0 +1,30 @@
---
title: "Streaming Inference"
created: 2026-06-25
updated: 2026-06-25
type: concept
tags: [inference, streaming, real-time, deployment]
sources:
  - "[[wan-streamer]]"
  - "[[thinker-performer-pipeline]]"
---

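# Streaming Inference

**Streaming inference** is an inference deployment paradigm in which the model consumes input and produces output incrementally, as a stream, rather than waiting for the complete input and processing it as a batch. In real-time interactive scenarios, it is a key technique for achieving low-latency responses.
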
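## Key Elements

1. **Causal constraint**: inference must not access future information; each output may depend only on input that has already arrived.
2. **Incremental state management**: internal state (the KV-cache, for example) is updated as soon as each streaming unit arrives.
3. **Pipeline overlap**: the processing stages (encoding, inference, decoding) execute overlapped across consecutive streaming units.
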
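The first two elements can be illustrated with a toy example. The sketch below is not taken from Wan-Streamer; `StreamingAttention`, `attend`, and `batch_causal` are hypothetical names, and the "model" is a single attention step over scalar features. It only shows that appending to a KV-cache one unit at a time gives the same result as recomputing over the full causal prefix.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query over the given keys/values (scalar features)."""
    scores = [query * k for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

class StreamingAttention:
    """Toy causal attention that keeps a growing KV-cache between calls (hypothetical)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x):
        # Incremental state management: update the cache as soon as the unit arrives.
        self.keys.append(x)
        self.values.append(x)
        # Causal constraint: the cache holds only past and present units,
        # so the output cannot depend on future input.
        return attend(x, self.keys, self.values)

def batch_causal(xs):
    """Reference: recompute every output from the full prefix xs[:t + 1]."""
    return [attend(x, xs[:t + 1], xs[:t + 1]) for t, x in enumerate(xs)]

stream = StreamingAttention()
inputs = [0.5, -1.0, 2.0, 0.25]
streamed = [stream.step(x) for x in inputs]  # one output per arriving unit
```

Each call to `step` does work proportional to the cache length, whereas the batch reference recomputes every prefix from scratch; the outputs match, which is what makes the cached, incremental form usable for streaming.
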
## Implementation in Wan-Streamer

Wan-Streamer's [[thinker-performer-pipeline|Thinker-Performer Pipeline]] splits streaming inference into two overlapping processes that maintain a unified state by exchanging KV-cache, achieving real-time throughput with 160 ms streaming units.

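Why overlapping two processes helps real-time throughput can be seen with a small timing model. This is a rough sketch under stated assumptions: only the 160 ms unit comes from the text; the per-stage times are made up for illustration, each stage is assumed to run on its own resources, and the cost of the KV-cache exchange is ignored.

```python
UNIT_MS = 160  # streaming unit duration stated in the text

def steady_state_period_ms(stage_times_ms, overlapped):
    """Time between consecutive outputs once the pipeline is full."""
    # Overlapped stages work on different units at once, so the slowest stage
    # sets the pace; without overlap every unit pays for all stages in turn.
    return max(stage_times_ms) if overlapped else sum(stage_times_ms)

def is_real_time(stage_times_ms, overlapped, unit_ms=UNIT_MS):
    """True if a unit is finished at least as fast as new units arrive."""
    return steady_state_period_ms(stage_times_ms, overlapped) <= unit_ms

# Hypothetical per-unit processing times for the two processes, in milliseconds.
hypothetical_stages = [100, 120]

sequential_ok = is_real_time(hypothetical_stages, overlapped=False)  # 220 ms per unit
overlapped_ok = is_real_time(hypothetical_stages, overlapped=True)   # 120 ms per unit
```

With these made-up numbers the sequential schedule falls behind the input stream while the overlapped one keeps up, at the price of one extra stage of latency before the first output.
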
## References

- [[wan-streamer]]
- [[thinker-performer-pipeline]]
- [[kv-cache]]