myWiki/concepts/multimodal-rag.md
2026-06-01 10:46:01 +08:00

---
title: "Multimodal RAG"
created: 2026-05-21
type: concept
tags: ["rag", "multimodal", "retrieval"]
sources: ["[[when-large-multimodal-models-confront-evolving-knowledge]]"]
---
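# Multimodal RAG
## Definition
Multimodal RAG (MM-RAG) extends [[rag|retrieval-augmented generation]] to multimodal settings: it retrieves external multimodal knowledge to improve the performance of LMMs on knowledge-intensive tasks.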
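
The retrieve-then-augment flow described in the definition can be sketched roughly as follows. This is a minimal illustration, not the pipeline from the source paper: `KnowledgeItem`, `retrieve`, and `build_prompt` are hypothetical names, the embeddings are hand-written toy vectors rather than encoder outputs, and the final call to an LMM is left out.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class KnowledgeItem:
    """One entry in a hypothetical external multimodal knowledge base."""
    text: str               # textual description of the knowledge
    image_ref: str          # placeholder for the associated image
    embedding: List[float]  # toy vector; a real system would use an encoder


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb: List[float], kb: List[KnowledgeItem], k: int = 1) -> List[KnowledgeItem]:
    """Return the k knowledge items most similar to the query embedding."""
    ranked = sorted(kb, key=lambda item: cosine(query_emb, item.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question: str, retrieved: List[KnowledgeItem]) -> str:
    """Prepend the retrieved knowledge to the question before it is sent to an LMM."""
    context = "\n".join(f"- {item.text} (image: {item.image_ref})" for item in retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy knowledge base with made-up entries (assumption, for illustration only).
kb = [
    KnowledgeItem("Entity A was announced in 2025.", "a.jpg", [1.0, 0.0]),
    KnowledgeItem("Entity B is a kind of bird.", "b.jpg", [0.0, 1.0]),
]
top = retrieve([0.9, 0.1], kb, k=1)
prompt = build_prompt("When was Entity A announced?", top)
print(prompt)
```

In a real system the prompt would also carry the retrieved image itself, not just a reference string; the string stands in for it here to keep the sketch self-contained.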
## Three retrieval strategies
| Strategy | Retrieval basis | LLaVA-v1.5 CEM | Qwen-VL-Chat CEM |
|------|---------|---------------|-----------------|
| Text-Only | Text features only | 24.05% | 21.79% |
| Image-Only | Visual features only | 25.25% | 22.31% |
| UniIR | Fused multimodal features | **40.68%** | **32.75%** |
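
To make the difference between the three strategies concrete, here is a toy sketch in which each candidate carries a separate text vector and image vector. It is only an illustration under stated assumptions: the vectors are invented, and the `"fused"` branch simply concatenates the two vectors before computing cosine similarity, which is a stand-in for fusion and not how the trained UniIR retriever works internally.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, candidates, strategy):
    """Return candidate names sorted best-first under the given strategy."""
    def score(c):
        if strategy == "text_only":
            return cosine(query["text"], c["text"])
        if strategy == "image_only":
            return cosine(query["image"], c["image"])
        if strategy == "fused":
            # Naive fusion (assumption): concatenate text and image vectors.
            return cosine(query["text"] + query["image"], c["text"] + c["image"])
        raise ValueError(f"unknown strategy: {strategy}")

    return [c["name"] for c in sorted(candidates, key=score, reverse=True)]


# Invented toy data: A matches only the query text, B matches only the query
# image, and C is a close match on both modalities.
query = {"text": [1.0, 0.0], "image": [1.0, 0.0]}
candidates = [
    {"name": "A", "text": [1.0, 0.0], "image": [0.0, 1.0]},
    {"name": "B", "text": [0.0, 1.0], "image": [1.0, 0.0]},
    {"name": "C", "text": [0.9, 0.1], "image": [0.9, 0.1]},
]

for strategy in ("text_only", "image_only", "fused"):
    print(strategy, rank(query, candidates, strategy))
```

In this constructed case each single-modality ranking puts a candidate first that matches only one modality, while the fused ranking prefers the candidate that matches both. That is one intuition for why fused retrieval can do better; it is not evidence for the numbers in the table.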
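## Key findings
1. MM-RAG outperforms SFT (Full-FT/LoRA), but its best result is only 40.68% CEM, which is **far from ideal**.
2. UniIR, which retrieves with fused multimodal features, significantly outperforms single-modality retrieval.
3. Even when given sufficient context (Sufficient Context), the model still cannot answer perfectly, which indicates that the bottleneck is the ability to **use** the retrieved knowledge rather than the ability to **retrieve** it.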
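
The size of the gaps behind findings 1 and 2 can be read directly off the table. The short script below only does arithmetic on the reported CEM values; the dictionary layout and the `summarize` helper are illustrative choices, not something taken from the source.

```python
# CEM scores (%) copied from the table above.
cem = {
    "LLaVA-v1.5": {"Text-Only": 24.05, "Image-Only": 25.25, "UniIR": 40.68},
    "Qwen-VL-Chat": {"Text-Only": 21.79, "Image-Only": 22.31, "UniIR": 32.75},
}


def summarize(scores):
    """Gain of UniIR over the best single-modality retriever, and the gap to 100%."""
    best_single = max(scores["Text-Only"], scores["Image-Only"])
    return {
        "gain_over_best_single": round(scores["UniIR"] - best_single, 2),
        "headroom_to_100": round(100.0 - scores["UniIR"], 2),
    }


summary = {model: summarize(scores) for model, scores in cem.items()}
for model, stats in summary.items():
    print(model, stats)
```

For LLaVA-v1.5 this gives a gain of 15.43 percentage points over Image-Only with 59.32 points of headroom remaining; for Qwen-VL-Chat the gain is 10.44 points with 67.25 points of headroom.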
## See also
- [[rag|RAG]]
- [[sufficient-context-paradox|Sufficient context paradox]]
- [[evolving-knowledge-injection|Evolving knowledge injection]]