Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

Preliminaries
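For reference, the standard discrete entropic optimal transport (EOT) problem behind the title (textbook formulation, not quoted from the paper): given a cost matrix $C \in \mathbb{R}^{n \times m}$, marginals $\bm{a} \in \Delta^n$, $\bm{b} \in \Delta^m$, and temperature $\tau > 0$,

$$
\min_{P \geq 0} \; \langle P, C \rangle + \tau \sum_{i,j} P_{ij} \log P_{ij}
\quad \text{s.t.} \quad P \bm{1}_m = \bm{a}, \;\; P^{\top} \bm{1}_n = \bm{b}.
$$

The "one-sided" variant keeps only the source-side constraint $P \bm{1}_m = \bm{a}$ and drops the key-side marginal, which is what lets the problem separate into independent per-query subproblems.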

Core Idea

Attention & One-sided EOT
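The claim, stated in the one-query case (my paraphrase under the standard setup: a unit of mass at the query, cost $-\langle \bm{q}, \bm{k}_j \rangle$ to key $j$, entropic regularization with temperature $\tau$, and only the query-side marginal constrained):

$$
\bm{p}^* = \operatorname*{argmin}_{\bm{p} \in \Delta^m} \; -\sum_{j=1}^m p_j \langle \bm{q}, \bm{k}_j \rangle + \tau \sum_{j=1}^m p_j \log p_j,
\qquad
p_j^* = \frac{\exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)}{\sum_l \exp(\langle \bm{q}, \bm{k}_l \rangle / \tau)}.
$$

That is, one-sided entropic optimal transport of the query mass onto the keys recovers exactly scaled-dot-product (softmax) attention.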


Proof:
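A sketch via the standard Lagrangian argument (the textbook derivation of this fact, not necessarily the one the paper uses): introduce a multiplier $\lambda$ for the normalization constraint,

$$
\mathcal{L}(\bm{p}, \lambda) = -\sum_{j} p_j \langle \bm{q}, \bm{k}_j \rangle + \tau \sum_{j} p_j \log p_j + \lambda \Big( \sum_{j} p_j - 1 \Big),
$$

and set $\partial \mathcal{L} / \partial p_j = -\langle \bm{q}, \bm{k}_j \rangle + \tau (\log p_j + 1) + \lambda = 0$, which gives $p_j \propto \exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)$; the multiplier fixes the normalization, and strict convexity of negative entropy on the simplex makes this stationary point the unique minimizer. $\blacksquare$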


Attention Framework

Each mechanism below returns the attention weights that solve a regularized linear maximization over the probability simplex, $\bm{p}^* = \operatorname{argmax}_{\bm{p} \in \Delta^m} \sum_{j=1}^m p_j \langle \bm{q}, \bm{k}_j \rangle - \Omega(\bm{p})$; the table lists the regularizer $\Omega(\bm{p})$ and the resulting closed form.

| Mechanism | $\Omega(\bm{p})$ | $\bm{p}^*$ | Notes |
| --- | --- | --- | --- |
| Softmax | $-\tau H(\bm{p})$ | $p_j = \frac{\exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)}{\sum_l \exp(\langle \bm{q}, \bm{k}_l \rangle / \tau)}$ | The most common form of attention |
| Sparsemax | $\frac{1}{2} \sum_{j} p_j^2$ | $p_j = (\langle \bm{q}, \bm{k}_j \rangle - \tau)_+$ | Sparse attention; the threshold $\tau$ is chosen so that $\sum_j p_j = 1$ |
| PriorSoftmax | $\tau\,\mathrm{KL}(\bm{p} \Vert \bm{\pi}) = \tau \sum_{j=1}^m p_j \log \frac{p_j}{\pi_j}$ | $p_j = \frac{\pi_j \exp(\langle \bm{q}, \bm{k}_j \rangle / \tau)}{\sum_l \pi_l \exp(\langle \bm{q}, \bm{k}_l \rangle / \tau)}$ | $\bm{\pi}$ is a user-specified prior |
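The three closed forms are easy to compute directly. A minimal NumPy sketch (function names are mine, not the paper's); `sparsemax_attn` uses the standard sort-based projection onto the simplex:

```python
import numpy as np

def softmax_attn(q, K, tau=1.0):
    """Omega(p) = -tau * H(p): the usual scaled-dot-product weights."""
    s = K @ q / tau
    s -= s.max()                      # stabilize the exponentials
    w = np.exp(s)
    return w / w.sum()

def sparsemax_attn(q, K):
    """Omega(p) = 0.5 * sum_j p_j^2: Euclidean projection of the scores onto the simplex."""
    s = K @ q
    u = np.sort(s)[::-1]              # scores in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(u)) + 1) > 0)[0][-1] + 1
    tau = (css[rho - 1] - 1) / rho    # threshold making the weights sum to 1
    return np.maximum(s - tau, 0.0)

def prior_softmax_attn(q, K, pi, tau=1.0):
    """Omega(p) = tau * KL(p || pi): softmax reweighted by the prior pi."""
    s = K @ q / tau
    s -= s.max()
    w = pi * np.exp(s)
    return w / w.sum()

rng = np.random.default_rng(0)
q, K = rng.normal(size=4), rng.normal(size=(6, 4))
pi = np.full(6, 1 / 6)
for p in (softmax_attn(q, K), sparsemax_attn(q, K), prior_softmax_attn(q, K, pi)):
    assert abs(p.sum() - 1) < 1e-9 and (p >= 0).all()
```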

Backward
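The gradient that fits under this heading is the Jacobian of the softmax solution with respect to the scores $s_l = \langle \bm{q}, \bm{k}_l \rangle$ (a standard fact, stated here for reference):

$$
\frac{\partial p_j^*}{\partial s_l} = \frac{1}{\tau} \, p_j^* \left( \delta_{jl} - p_l^* \right),
\qquad
\big( J^{\top} \bm{g} \big)_l = \frac{1}{\tau} \, p_l^* \big( g_l - \langle \bm{g}, \bm{p}^* \rangle \big),
$$

where $\bm{g}$ is the upstream gradient. A quick finite-difference check of the vector-Jacobian product (illustrative NumPy, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
s, g, tau, eps = rng.normal(size=5), rng.normal(size=5), 0.7, 1e-6

def softmax(s, tau):
    w = np.exp((s - s.max()) / tau)
    return w / w.sum()

p = softmax(s, tau)
vjp = p * (g - g @ p) / tau                   # closed-form vector-Jacobian product
num = np.array([(g @ softmax(s + eps * np.eye(5)[l], tau) - g @ p) / eps
                for l in range(5)])           # numerical directional derivatives
assert np.allclose(vjp, num, atol=1e-4)
```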

References

  1. Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport. arXiv, 2025.