---
title: "Hybrid Recall Pipeline (BM25 + Dense)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["information-retrieval", "hybrid-search", "bm25", "dense-retrieval", "rrf"]
sources:
- "[[atlas-agent-memory-architecture-2026]]"
---
# Hybrid Recall Pipeline
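The hybrid recall pipeline of the Atlas memory system: BM25 lexical retrieval and dense semantic retrieval run in parallel as two channels, and their results are fused with RRF, reranked by a cross-encoder, and returned as the top-K.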
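## Four-Stage Pipeline
1. **Verbatim Pre-Recall**: the user's original wording is passed through without LLM rewriting, which protects exact tokens
2. **Parallel dual-channel retrieval**
   - BM25: `multi_match` across `text`/`title`/`name`/`description`/`trigger_text`, with `text` weighted 2×
   - Dense: Jina v5 embeddings + ES `semantic_text` kNN
3. **RRF fusion**: `rank_constant=30` (weights strong signals more heavily), `window_size=max(80, k×8)`
4. **Cross-encoder reranking**: the Jina v2 reranker scores each (query, document) pair, narrowing top-80 → top-K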
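
The retrieval and fusion stages above can be sketched in plain Python. This is a minimal illustration, not the Atlas implementation: the function names `window_size`, `build_bm25_query`, and `rrf_fuse` are hypothetical, the fusion uses the standard reciprocal rank fusion formula `1 / (rank_constant + rank)` with 1-based ranks, and the query body assumes the 2× weight is expressed with the usual `^2` field boost.

```python
from collections import defaultdict

RANK_CONSTANT = 30  # value stated in the pipeline description


def window_size(k: int) -> int:
    """Candidates kept per channel: max(80, k * 8), as in the description."""
    return max(80, k * 8)


def build_bm25_query(user_text: str) -> dict:
    """Lexical query body (hypothetical helper).

    The user's text is passed verbatim; `text^2` boosts the `text` field 2x.
    """
    return {
        "multi_match": {
            "query": user_text,
            "fields": ["text^2", "title", "name", "description", "trigger_text"],
        }
    }


def rrf_fuse(rankings, k, rank_constant=RANK_CONSTANT):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    Each document receives the sum of 1 / (rank_constant + rank) over every
    list it appears in. Returns (doc_id, score) pairs, best first.
    """
    window = window_size(k)
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:window], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:window]


bm25_hits = ["a", "b", "c"]   # toy lexical ranking
dense_hits = ["b", "d", "a"]  # toy semantic ranking
fused = rrf_fuse([bm25_hits, dense_hits], k=5)
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Documents found by both channels (`a`, `b`) rise above documents found by only one, which is the behavior the fusion stage relies on before handing candidates to the reranker.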
## Ablation Contribution Breakdown
| Component | Contribution |
|------|------|
| Dense-only | 0.845 |
| BM25-only | 0.708 |
| Full (hybrid) | 0.89 |
| Reranker (in isolation) | -0.238 |
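
The gap between the full pipeline and each single channel can be read off the table with a little arithmetic. The sketch below assumes the first three rows are scores on the same metric (the note does not name it) and leaves the reranker row out, since it is reported as a signed delta rather than a score.

```python
# Scores copied from the ablation table; the metric itself is not named there.
scores = {"Dense-only": 0.845, "BM25-only": 0.708, "Full (hybrid)": 0.89}

full = scores["Full (hybrid)"]
for name, score in scores.items():
    if name == "Full (hybrid)":
        continue
    absolute_gain = full - score
    relative_gain = absolute_gain / score
    print(f"hybrid vs {name}: +{absolute_gain:.3f} ({relative_gain:.1%})")
```

Under that assumption, the hybrid configuration improves on dense-only by 0.045 and on BM25-only by 0.182.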
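## Key Design Decisions
- **BM25 cannot be omitted**: exact tokens such as version numbers, error codes, and personal names are caught only by lexical retrieval
- **Dense is the workhorse**: semantic intents such as "database preference" have no direct match in the documents
- **Query expansion backfires**: BM25 + dense already cover exact and semantic matching, so LLM paraphrasing only introduces noise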
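
A toy check illustrates why rewriting the query can hurt the lexical channel. The helper `exact_tokens` and both example queries are hypothetical; as a rough stand-in for "version numbers and error codes", the sketch treats any token containing a digit as an exact token.

```python
import re


def exact_tokens(text: str) -> set[str]:
    """Return tokens that contain a digit, e.g. 'v2.3.1' or 'E1042'."""
    candidates = re.findall(r"[A-Za-z0-9_.\-]+", text)
    # Trim punctuation left over from sentence boundaries, such as a final '.'.
    cleaned = (token.strip("._-") for token in candidates)
    return {token for token in cleaned if any(ch.isdigit() for ch in token)}


verbatim = "Why does v2.3.1 fail with E1042?"
paraphrase = "What causes the recent release to crash with an error?"

lost = sorted(exact_tokens(verbatim) - exact_tokens(paraphrase))
print(lost)  # ['E1042', 'v2.3.1']
```

Both identifiers disappear in the paraphrase, so a BM25 query built from it has nothing exact left to match, which is the failure mode Verbatim Pre-Recall is meant to avoid.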
## References
- [[atlas-agent-memory-architecture-2026]]
- [[bm25-financial-retrieval]]
- [[verbatim-pre-recall]]