20260625: lots of new content

2026-06-25 14:08:47 +08:00
parent 91fac5b6fc
commit 6021dea160
375 changed files with 19263 additions and 251 deletions

@@ -0,0 +1,42 @@
---
title: "Hybrid Recall Pipeline (BM25 + Dense)"
created: 2026-06-24
updated: 2026-06-24
type: concept
tags: ["information-retrieval", "hybrid-search", "bm25", "dense-retrieval", "rrf"]
sources:
- "[[atlas-agent-memory-architecture-2026]]"
---
# Hybrid Recall Pipeline
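The hybrid recall pipeline of the Atlas memory system: BM25 lexical retrieval and dense semantic retrieval run as two parallel paths, and their results are fused with RRF and reranked by a cross-encoder before the top-K are returned.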
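## Four-Stage Pipeline
1. **Verbatim Pre-Recall**: the user's original wording is passed through without LLM rewriting, which preserves exact tokens
2. **Dual-path parallel retrieval**
   - BM25: `multi_match` across text/title/name/description/trigger_text, with text weighted 2×
   - Dense: Jina v5 embeddings + ES semantic_text kNN
3. **RRF fusion**: rank_constant=30 (gives more weight to strong, top-ranked signals), window_size=max(80, k×8)
4. **Cross-encoder reranking**: the Jina v2 reranker scores each (query, document) pair for the top-80 candidates → top-K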
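The fusion step in stage 3 can be sketched in plain Python. This is a minimal illustration of standard Reciprocal Rank Fusion using the `rank_constant` and `window_size` values listed above, not the Atlas implementation. The function names and the sample document IDs are assumptions made for the example, and ranks are taken to be 1-based.

```python
from collections import defaultdict

RANK_CONSTANT = 30  # value given in the pipeline description


def window_size(k: int) -> int:
    """Number of candidates taken from each path: max(80, k * 8)."""
    return max(80, k * 8)


def rrf_fuse(rankings, k, rank_constant=RANK_CONSTANT):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (rank_constant + rank)) over the
    lists it appears in, where rank is its 1-based position.
    """
    window = window_size(k)
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking[:window], start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:window]


# Hypothetical results from the two retrieval paths.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([bm25_hits, dense_hits], k=5)
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents that appear in both lists (`doc_a`, `doc_b`) accumulate two reciprocal terms and therefore rank ahead of documents found by only one path. A smaller `rank_constant` makes the score fall off faster with rank, so top positions count for more.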
## Ablation Contribution Breakdown
| Component | Contribution |
|------|------|
| Dense-only | 0.845 |
| BM25-only | 0.708 |
| Full (hybrid) | 0.89 |
| Reranker (single-point ablation) | -0.238 |
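To read the table as gains, the difference between the full hybrid configuration and each single-path configuration can be computed directly. The sketch below only does that arithmetic on the three scores above; the dictionary keys are names chosen for the example, and the table does not state which metric the scores represent.

```python
# Scores copied from the ablation table; the metric is not named there.
scores = {"dense_only": 0.845, "bm25_only": 0.708, "hybrid": 0.89}

# Absolute gain of the full hybrid pipeline over each single path.
gain_over_dense = round(scores["hybrid"] - scores["dense_only"], 3)
gain_over_bm25 = round(scores["hybrid"] - scores["bm25_only"], 3)

print(f"hybrid vs dense-only: +{gain_over_dense}")
print(f"hybrid vs BM25-only:  +{gain_over_bm25}")
```

On these numbers the hybrid configuration adds about 0.045 over dense-only and about 0.182 over BM25-only.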
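## Key Design Decisions
- **BM25 cannot be omitted**: exact tokens such as version numbers, error codes, and personal names are only caught by lexical retrieval
- **Dense is the workhorse**: a semantic intent such as "database preference" has no direct match in the documents
- **Query expansion is counterproductive**: BM25 + dense already cover exact and semantic matching, and LLM paraphrasing introduces noise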
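As an illustration of the BM25 path and of why the query is passed through verbatim, the sketch below builds an Elasticsearch-style `multi_match` request body over the five fields named in the pipeline, with `text` boosted 2× via the `^2` suffix. The helper name, the `size` parameter, and the sample query are assumptions for the example. The `multi_match` type is left at its default because the note does not specify one.

```python
import json

# Fields from the pipeline description; "text^2" applies the 2x boost.
BM25_FIELDS = ["text^2", "title", "name", "description", "trigger_text"]


def build_bm25_query(user_query: str, size: int) -> dict:
    """Build a multi_match request body from the user's verbatim query."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                # No LLM rewriting: exact tokens reach the lexical index as typed.
                "query": user_query,
                "fields": BM25_FIELDS,
            }
        },
    }


# Hypothetical query containing an error code and a version number.
body = build_bm25_query("pg_dump error 42P01 on v15.3", size=80)
print(json.dumps(body, indent=2))
```

Because the string is forwarded unchanged, tokens like `42P01` and `v15.3` stay intact for lexical matching, which is the case the first design decision calls out.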
## References
- [[atlas-agent-memory-architecture-2026]]
- [[bm25-financial-retrieval]]
- [[verbatim-pre-recall]]