Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

Background

Core Idea

Attention & One-sided EOT


proof:


Attention Framework

| Mechanism | $\Omega(\bm{p})$ | $\bm{p}^*$ | Notes |
| --- | --- | --- | --- |
| Softmax | $-\tau H(\bm{p})$ | $p_j = \frac{\exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)}{\sum_l \exp(\langle \bm{q}, \bm{k}_l \rangle / \tau)}$ | The most common form of attention |
| Sparsemax | $\frac{1}{2} \sum_{j} p_j^2$ | $p_j = (\langle \bm{q}, \bm{k}_j \rangle - \tau)_+$ | Sparse attention; here $\tau$ is a threshold (not a temperature) chosen so that $\sum_j p_j = 1$ |
| PriorSoftmax | $\tau\,\text{KL}(\bm{p} \Vert \bm{\pi}) = \tau \sum_{j=1}^m p_j \log \frac{p_j}{\pi_j}$ | $p_j = \frac{\pi_j \exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)}{\sum_l \pi_l \exp(\langle \bm{q}, \bm{k}_l \rangle / \tau)}$ | $\bm{\pi}$ is a user-specified prior |
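
The three mechanisms in the table above can be compared numerically. What follows is a minimal NumPy sketch, not the paper's implementation. It assumes each row is the maximizer of $\langle \bm{p}, \bm{s} \rangle - \Omega(\bm{p})$ over the probability simplex, with scores $s_j = \langle \bm{q}, \bm{k}_j \rangle$; that objective is an assumption, since these notes do not spell it out. The function names are hypothetical, and the Sparsemax threshold is found with the standard sort-based simplex projection, which is a swapped-in technique rather than something taken from the source.

```python
import numpy as np

def softmax_attention(scores, tau=1.0):
    """Weights for Omega(p) = -tau * H(p): the usual softmax with temperature tau."""
    z = (scores - scores.max()) / tau      # shift by the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def prior_softmax_attention(scores, prior, tau=1.0):
    """Weights for Omega(p) = tau * KL(p || prior): softmax reweighted by the prior."""
    z = (scores - scores.max()) / tau
    w = prior * np.exp(z)
    return w / w.sum()

def sparsemax_attention(scores):
    """Weights for Omega(p) = 0.5 * sum_j p_j^2: p_j = max(s_j - threshold, 0)."""
    z = np.sort(scores)[::-1]              # scores in descending order
    cumsum = np.cumsum(z)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z > cumsum           # True for indices inside the support
    k_star = k[support][-1]                # size of the support
    threshold = (cumsum[k_star - 1] - 1) / k_star   # makes the weights sum to 1
    return np.maximum(scores - threshold, 0.0)

# Hypothetical example: one query attending over three keys.
q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.3], [0.8, -0.2], [0.1, 0.5]])
scores = keys @ q                          # s_j = <q, k_j> = [1.0, 0.8, 0.1]
prior = np.array([0.2, 0.2, 0.6])          # assumed prior over the keys

print("softmax     :", softmax_attention(scores, tau=0.5))
print("prior       :", prior_softmax_attention(scores, prior, tau=0.5))
print("sparsemax   :", sparsemax_attention(scores))   # approximately [0.6, 0.4, 0.0]
```

With a uniform prior, `prior_softmax_attention` reduces to `softmax_attention`, which matches the table: the KL regularizer then differs from the negative entropy only by a constant. Sparsemax assigns exactly zero weight to the lowest-scoring key here, whereas both softmax variants keep every weight strictly positive.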

Backward

References

  1. Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport. arXiv, 2025.