This paper asks why supervised fine-tuning (SFT) is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. The authors find that the evolution of these interactions during SFT can explain why SFT is inconsistently effective for LLMs. Specifically: (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. These findings are validated across multiple LLMs and datasets.
Key Concepts
Interaction-based explanation: Decomposing LLM inference patterns into AND-OR interactions between input tokens
Three interaction types: Removed (eliminated during SFT), Preserved (retained throughout), Newly emerged (acquired during SFT)
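The two concepts above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that an AND interaction is the Harsanyi dividend I(S) = sum over T ⊆ S of (-1)^(|S|-|T|) v(T), where v(T) is the model output when only the tokens in T are left unmasked. It omits OR interactions, and it labels an interaction as salient with a simple magnitude threshold. The value functions `v_before` and `v_after`, the threshold `tau`, and the function names are all hypothetical.

```python
from itertools import chain, combinations


def subsets(items):
    """Yield every subset of `items` as a tuple, from the empty set upward."""
    items = tuple(items)
    return chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )


def and_interactions(v, n):
    """Harsanyi AND interaction for every subset S of n input tokens.

    v maps a frozenset of unmasked token indices to a scalar model output.
    The cost grows exponentially in n, so this only suits tiny toy inputs.
    """
    effects = {}
    for S in subsets(range(n)):
        effects[frozenset(S)] = sum(
            (-1) ** (len(S) - len(T)) * v(frozenset(T)) for T in subsets(S)
        )
    return effects


def classify(before, after, tau=0.05):
    """Label each interaction that is salient before and/or after SFT.

    tau is an assumed salience threshold, not a value from the paper.
    """
    labels = {}
    for S in before.keys() | after.keys():
        was_salient = abs(before.get(S, 0.0)) > tau
        is_salient = abs(after.get(S, 0.0)) > tau
        if was_salient and is_salient:
            labels[S] = "preserved"
        elif was_salient:
            labels[S] = "removed"
        elif is_salient:
            labels[S] = "newly emerged"
    return labels


# Toy stand-ins for a model's output before and after SFT (hypothetical).
# Before: an AND pattern on tokens {0, 1} plus a weak effect of token 2.
def v_before(T):
    return 1.0 * ({0, 1} <= T) + 0.3 * (2 in T)


# After: the {0, 1} pattern survives, token 2 alone no longer matters,
# and a new AND pattern on tokens {1, 2} appears.
def v_after(T):
    return 1.0 * ({0, 1} <= T) + 0.5 * ({1, 2} <= T)


labels = classify(and_interactions(v_before, 3), and_interactions(v_after, 3))
for S, label in sorted(labels.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(S), label)
```

In this toy setting, the interaction on tokens {0, 1} is labelled preserved, the one on token {2} removed, and the one on tokens {1, 2} newly emerged. A real analysis would obtain v from masked forward passes of the LLM and would need a more careful salience criterion than a fixed threshold.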