---
title: "Hybrid Recall Pipeline (BM25 + Dense)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["information-retrieval", "hybrid-search", "bm25", "dense-retrieval", "rrf"]
sources:
  - "[[atlas-agent-memory-architecture-2026]]"
---

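# Hybrid Recall Pipeline

The hybrid recall pipeline of the Atlas memory system: BM25 lexical retrieval and dense semantic retrieval run as two parallel paths, and their results are fused with RRF and reranked by a cross-encoder before the top-K are returned.
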
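## Four-Stage Pipeline

1. **Verbatim Pre-Recall**: the user's original wording is passed through without LLM rewriting, which preserves exact tokens
2. **Dual-path parallel retrieval**:
   - BM25: multi_match across text/title/name/description/trigger_text, with text weighted 2×
   - Dense: Jina v5 embeddings + ES semantic_text kNN
3. **RRF fusion**: rank_constant=30 (gives more weight to strong, top-ranked signals), window_size=max(80, k×8)
4. **Cross-encoder reranking**: the Jina v2 reranker scores each query-document pair in the top-80 and keeps the top-K
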
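Stages 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the Atlas implementation: the function names (`rrf_window_size`, `rrf_fuse`, `rerank`) and the `score_pair` callable are hypothetical, and the fusion uses the standard reciprocal-rank formula with 1-based ranks. Only `rank_constant=30` and `window_size=max(80, k×8)` come from the list above.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_window_size(k: int) -> int:
    """Candidate window per retriever: max(80, k * 8), as given in stage 3."""
    return max(80, k * 8)


def rrf_fuse(rankings: Sequence[Sequence[str]],
             rank_constant: int = 30,
             window_size: int = 80) -> List[str]:
    """Reciprocal rank fusion: score(d) = sum of 1 / (rank_constant + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Only the top `window_size` hits of each retriever take part in fusion.
        for rank, doc_id in enumerate(ranking[:window_size], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(query: str,
           candidates: Sequence[str],
           score_pair: Callable[[str, str], float],
           k: int) -> List[str]:
    """Score every (query, candidate) pair and keep the k best.

    `score_pair` is a stand-in for a cross-encoder call (hypothetical).
    """
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:k]


# Toy rankings standing in for the BM25 and dense result lists.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = rrf_fuse([bm25_hits, dense_hits],
                 rank_constant=30,
                 window_size=rrf_window_size(5))
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

A document that both retrievers rank highly (here `doc_a` and `doc_c`) accumulates two reciprocal terms and rises above documents found by only one path. A smaller `rank_constant` makes the gap between rank 1 and rank 2 larger, which is one way to read the "strong signal" note in stage 3.
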
## Ablation: Contribution Breakdown

| Component | Contribution |
|------|------|
| Dense-only | 0.845 |
| BM25-only | 0.708 |
| Full (hybrid) | 0.89 |
| Reranker (single-point ablation) | -0.238 |

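As a rough reading aid, the gap between the full pipeline and each single-retriever row can be computed directly. This sketch assumes the Dense-only, BM25-only, and Full rows are absolute scores on the same metric (the source does not name the metric); the `gain_over` helper is hypothetical.

```python
# Values copied from the ablation table; the metric is not named in the source.
scores = {"dense_only": 0.845, "bm25_only": 0.708, "full_hybrid": 0.89}


def gain_over(baseline: str, full: str = "full_hybrid") -> float:
    """Absolute gain of the full pipeline over a single-retriever baseline."""
    return round(scores[full] - scores[baseline], 3)


print(gain_over("dense_only"))  # 0.045
print(gain_over("bm25_only"))   # 0.182
```

Under that assumption, the hybrid adds little on top of dense alone and much more on top of BM25 alone, which matches the "dense is the workhorse" framing.
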
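## Key Design Decisions

- **BM25 cannot be dropped**: exact tokens such as version numbers, error codes, and personal names are only caught by lexical retrieval
- **Dense is the workhorse**: semantic intents such as "database preference" have no direct match in the documents
- **Query expansion backfires**: BM25 + dense already cover exact and semantic matching, so LLM paraphrasing only introduces noise
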
## References

- [[atlas-agent-memory-architecture-2026]]
- [[bm25-financial-retrieval]]
- [[verbatim-pre-recall]]