---
title: "Streaming Inference"
created: 2026-06-25
updated: 2026-06-25
type: concept
tags: [inference, streaming, real-time, deployment]
sources:
- "[[wan-streamer]]"
- "[[thinker-performer-pipeline]]"
---
# Streaming Inference
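**Streaming Inference** is an inference deployment paradigm in which the model consumes input and generates output in a streaming (incremental) fashion, rather than waiting for the complete input and then processing it as a batch. In real-time interactive scenarios, streaming inference is a key technique for achieving low-latency responses.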
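## Key Elements
1. **Causal constraint**: inference cannot access future information.
2. **Incremental state management**: internal state (KV-cache, etc.) is updated as soon as each streaming unit arrives.
3. **Pipeline overlap**: the processing stages (encoding, inference, decoding) overlap across consecutive streaming units.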
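
The first two elements can be illustrated with a minimal sketch. It is a toy, not a real model: the "model" is a running mean, and `StreamingState`, `step`, `batch_causal`, and `run_stream` are hypothetical names introduced only for this example. The list `cache` stands in for a KV-cache. The point is that a causal computation gives the same result whether inputs arrive unit by unit or all at once.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StreamingState:
    """Hypothetical stand-in for a KV-cache: every input seen so far."""
    cache: List[float] = field(default_factory=list)


def step(state: StreamingState, unit: List[float]) -> List[float]:
    """Consume one streaming unit and emit its outputs immediately.

    Each output is the mean of all inputs up to and including the
    current position, so it never depends on future inputs.
    """
    outputs = []
    for x in unit:
        state.cache.append(x)  # incremental state update
        outputs.append(sum(state.cache) / len(state.cache))
    return outputs


def batch_causal(inputs: List[float]) -> List[float]:
    """Reference: the same causal computation over the complete input."""
    return [sum(inputs[: i + 1]) / (i + 1) for i in range(len(inputs))]


def run_stream(units: List[List[float]]) -> List[float]:
    """Feed units one at a time, carrying the state between them."""
    state = StreamingState()
    out: List[float] = []
    for unit in units:
        out.extend(step(state, unit))
    return out
```

For example, `run_stream([[1.0, 2.0], [3.0], [4.0, 5.0]])` returns the same list as `batch_causal([1.0, 2.0, 3.0, 4.0, 5.0])`, regardless of how the input is split into units.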
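## Implementation in Wan-Streamer
Wan-Streamer's [[thinker-performer-pipeline|Thinker-Performer Pipeline]] splits streaming inference into two overlapping processes that maintain a unified state by exchanging KV-cache, achieving real-time throughput with 160 ms streaming units.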
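
The general shape of two overlapping stages with a state hand-off can be sketched as follows. This is not Wan-Streamer's implementation. It assumes two threads in place of the two processes, a bounded `queue.Queue` as the exchange channel, and a tuple snapshot as a stand-in for the KV-cache. The names `thinker`, `performer`, and `run_pipeline` are hypothetical; only the 160 ms unit duration comes from the text.

```python
import queue
import threading

UNIT_MS = 160  # streaming unit duration stated in the text


def thinker(units, handoff):
    """Stage 1: update the running state per unit and hand it off."""
    cache = []  # hypothetical stand-in for the shared KV-cache
    for idx, unit in enumerate(units):
        cache.append(unit)
        # Pass an immutable snapshot so stage 2 never sees later updates.
        handoff.put((idx, tuple(cache)))
    handoff.put(None)  # end-of-stream marker


def performer(handoff, results):
    """Stage 2: handle unit i while stage 1 may already work on unit i+1."""
    while True:
        item = handoff.get()
        if item is None:
            break
        idx, snapshot = item
        # Record the unit's start time (ms) and how much state it saw.
        results.append((idx * UNIT_MS, len(snapshot)))


def run_pipeline(units):
    """Run both stages concurrently and return the performer's records."""
    handoff = queue.Queue(maxsize=1)  # bounded: at most one unit in flight
    results = []
    t1 = threading.Thread(target=thinker, args=(units, handoff))
    t2 = threading.Thread(target=performer, args=(handoff, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

The bounded queue keeps the first stage at most one unit ahead of the second, which is one simple way to let the stages overlap without letting their states drift apart. In a pipeline of this kind, sustained real-time throughput requires each stage to finish a unit within the unit duration.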
## References
- [[wan-streamer]]
- [[thinker-performer-pipeline]]
- [[kv-cache]]