20260625: Lots of new content
raw/papers/longmem-eval-2025.md (normal file, 31 lines added)
@@ -0,0 +1,31 @@
---
title: "LongMemEval: Benchmarking Long-Term Interactive Memory (Raw Archive)"
created: 2026-06-25
updated: 2026-06-25
type: raw
tags: ["memory-benchmark", "chat-assistant", "long-term-memory"]
source: "https://arxiv.org/abs/2410.10813"
---

# LongMemEval — Raw Archive

## Metadata

- **Title**: LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
- **Authors**: Di Wu (UCLA), Hongwei Wang, Wenhao Yu (Tencent AI Lab Seattle), Yuwei Zhang (UC San Diego), Kai-Wei Chang (UCLA), Dong Yu (Tencent AI Lab Seattle)
- **Venue**: ICLR 2025
- **arXiv**: 2410.10813
- **Date**: 2024-10-14 (v1), 2025-03-04 (v2)
- **Category**: cs.CL
- **Code**: https://github.com/xiaowu0162/LongMemEval

## Abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading.
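
To make the five-ability breakdown concrete, here is a minimal sketch of how per-ability accuracy could be tallied. The ability labels follow the abstract; the `(ability, is_correct)` record format and the function name `accuracy_by_ability` are assumptions made for illustration and are not taken from the LongMemEval code.

```python
from collections import defaultdict

# The five abilities named in the abstract.
ABILITIES = (
    "information_extraction",
    "multi_session_reasoning",
    "temporal_reasoning",
    "knowledge_updates",
    "abstention",
)


def accuracy_by_ability(results):
    """Return accuracy per ability.

    results: iterable of (ability, is_correct) pairs. This record format is
    hypothetical; the benchmark's own output format may differ.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ability, is_correct in results:
        if ability not in ABILITIES:
            raise ValueError(f"unknown ability: {ability}")
        totals[ability] += 1
        correct[ability] += int(is_correct)
    # Abilities with no graded questions are left out of the report.
    return {a: correct[a] / totals[a] for a in ABILITIES if totals[a]}
```

Reporting accuracy per ability rather than as a single number shows which kind of memory a system loses first as the history grows.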

## Key Contributions

1. First comprehensive memory benchmark covering five core abilities, including abstention
2. Unified three-stage memory framework (indexing → retrieval → reading) with four control points
3. Empirically validated design optimizations: round granularity, fact-augmented keys, time-aware query expansion
4. Two standard settings: S (~115k tokens) and M (~1.5M tokens)
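
The three stages and the optimizations listed above can be illustrated with a toy sketch. This is not the paper's implementation: token overlap stands in for a real retriever, the facts are supplied by hand rather than extracted by a model, the time range is passed in directly rather than inferred from the query, and every name (`Round`, `retrieve`, `build_prompt`) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Round:
    """One user turn plus the assistant reply (round granularity)."""
    when: date
    text: str
    facts: list = field(default_factory=list)  # hypothetical: pre-extracted facts

    @property
    def key(self):
        # Fact-augmented key: index the raw text together with its facts.
        return " ".join([self.text, *self.facts]).lower()


def tokens(s):
    """Lowercased whitespace tokens; a stand-in for a real text encoder."""
    return set(s.lower().split())


def retrieve(rounds, query, start=None, end=None, k=2):
    """Retrieval stage: rank rounds by token overlap with the query.

    start/end model a time range attached to the query (time-aware
    expansion); rounds outside the range are dropped before ranking.
    """
    q = tokens(query)
    pool = [
        r for r in rounds
        if (start is None or r.when >= start) and (end is None or r.when <= end)
    ]
    ranked = sorted(pool, key=lambda r: len(q & tokens(r.key)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Reading stage: present retrieved rounds in time order to a reader model."""
    ordered = sorted(retrieved, key=lambda r: r.when)
    lines = [f"[{r.when.isoformat()}] {r.text}" for r in ordered]
    return "\n".join(lines + [f"Question: {query}"])
```

In this sketch a round such as "I adopted a puppy today" only matches a query about a dog because the fact "user owns a dog" is part of its key, which is the intuition behind fact-augmented keys.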