Transformers need glasses! Information Over-Squashing in Language Tasks

Preliminaries

Core Idea

Empirical Observations

Copying

Counting

Theoretical Analysis

Note: the authors' proof strategy is to show that, under finite precision, the attention outputs become indistinguishable as $n$ grows; however, the proof is rather coarse and not rigorous.
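A minimal numerical sketch of this finite-precision effect (an illustrative toy, not the paper's construction): replace an attention head with uniform mean pooling, so a single distinguishing token contributes an $O(1/n)$ signal to the pooled representation. Once $1/n$ drops below the rounding granularity of half precision, two sequences that differ only in their first token produce bitwise-identical representations.

```python
import numpy as np

def pooled_repr(first_token: float, n: int) -> np.float16:
    """Mean-pool n scalar 'token embeddings' and round to half precision.

    Uniform mean pooling stands in for an attention head whose weights are
    nearly uniform over a long sequence; the cast to float16 models the
    finite precision of the forward pass.
    """
    tokens = np.ones(n, dtype=np.float64)
    tokens[0] = first_token           # the two sequences differ only here
    return np.float16(tokens.mean())  # O(1/n) signal meets fixed precision

for n in (1024, 8192):
    a = pooled_repr(0.0, n)  # sequence starting with token 0
    b = pooled_repr(1.0, n)  # sequence starting with token 1
    print(f"n={n}: representations equal? {a == b}")
    # n=1024: still distinguishable; n=8192: 1 - 1/n rounds to 1.0 in float16
```

At n=1024 the gap 1/n is still representable near 1.0 in float16, so the two outputs differ; at n=8192 the gap falls below half the float16 spacing near 1.0 and the representations collapse to the same value.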

Seq-VCR

References

  1. Barbero F., Banino A., Kapturowski S., Kumaran D., Araujo J. G. M., Vitvitskyi A., Pascanu R., and Velickovic P. Transformers need glasses! Information Over-Squashing in Language Tasks. NeurIPS, 2024. [PDF] [Code]
  2. Arefin M. R., Subbaraj G., Gontier N., LeCun Y., Rish I., Shwartz-Ziv R., and Pal C. Seq-VCR: Preventing Collapse in Intermediate Transformer Representations for Enhanced Reasoning. ICLR, 2025. [PDF] [Code]