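---
title: "Hybrid Recall Pipeline (BM25 + Dense)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["information-retrieval", "hybrid-search", "bm25", "dense-retrieval", "rrf"]
sources:
  - "[[atlas-agent-memory-architecture-2026]]"
---

# Hybrid Recall Pipeline

The hybrid recall pipeline of the Atlas memory system runs BM25 lexical retrieval and dense semantic retrieval in parallel as two channels. Their results are fused with RRF (Reciprocal Rank Fusion), reranked by a cross-encoder, and returned as the top-K.

## Four-Stage Pipeline

1. **Verbatim Pre-Recall**: the user's original wording is used as the query without LLM rewriting, which preserves exact tokens
2. **Two-channel parallel retrieval**:
   - BM25: multi_match across text/title/name/description/trigger_text, with text weighted 2×
   - Dense: Jina v5 embeddings + ES semantic_text knn
3. **RRF fusion**: rank_constant=30 (gives more weight to strong, top-ranked signals), window_size=max(80, k×8)
4. **Cross-encoder reranking**: the Jina v2 reranker scores each of the top-80 candidates pairwise against the query → top-K

## Ablation Contribution Breakdown

| Component | Contribution |
|-----------|--------------|
| Dense-only | 0.845 |
| BM25-only | 0.708 |
| Full (hybrid) | 0.89 |
| Reranker (in isolation) | -0.238 |

## Key Design Decisions

- **BM25 cannot be dropped**: exact tokens such as version numbers, error codes, and personal names are caught only by lexical retrieval
- **Dense is the workhorse**: semantic intents such as "database preference" have no direct lexical match in the documents
- **Query expansion backfires**: BM25 + dense already cover exact and semantic matching, so LLM paraphrasing only introduces noise

## References

- [[atlas-agent-memory-architecture-2026]]
- [[bm25-financial-retrieval]]
- [[verbatim-pre-recall]]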
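
## Example: RRF Fusion Sketch

The RRF fusion stage of the pipeline can be illustrated with a minimal sketch. It assumes the standard Reciprocal Rank Fusion formula, in which each document scores the sum of `1 / (rank_constant + rank)` over the channels that returned it, with ranks starting at 1. The function names `window_size` and `rrf_fuse`, the tie-breaking rule, and the sample document IDs are hypothetical and chosen only for illustration. This is not necessarily how Atlas or Elasticsearch implements the step.

```python
from collections import defaultdict

RANK_CONSTANT = 30  # value stated in the note; a smaller constant favors top ranks more


def window_size(k: int) -> int:
    """Candidates considered per channel, per the note: max(80, k * 8)."""
    return max(80, k * 8)


def rrf_fuse(rankings, k, rank_constant=RANK_CONSTANT):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: one best-first list of document IDs per channel
              (for example, one from BM25 and one from dense retrieval).
    k:        requested top-K, used here only to derive the window size.
    Returns the fused document IDs, best first, truncated to the window.
    """
    window = window_size(k)
    scores = defaultdict(float)
    for ranking in rankings:
        # Only the first `window` hits of each channel contribute.
        for rank, doc_id in enumerate(ranking[:window], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Sort by descending score; break ties by ID (an assumption, for determinism).
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:window]


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]   # hypothetical lexical results
    dense_hits = ["b", "d", "a"]  # hypothetical semantic results
    print(rrf_fuse([bm25_hits, dense_hits], k=5))
```

In this sketch, documents returned by both channels ("a" and "b") outrank documents returned by only one, which is the behavior the fusion step relies on before candidates are handed to the cross-encoder reranker.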