Connection Bottleneck in Attention

## Connection Bottleneck in Attention

<h3 id="over-squashing-in-large-language-models">‘Over-Squashing’ in Large Language Models</h3>
<div class="slide-img">
  <img src="https://raw.githubusercontent.com/MTandHJ/blog_source/master/images/20250519155809.png" 
  alt="Image" 
  style="max-width: 100%; height: auto;margin: 0 auto;">
</div>
<ul>
<li><strong>Over-Squashing:</strong> Early tokens have more influence
<ul>
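<li><strong>Representational Collapse:</strong> As the sequence length grows, the representations converge toward one another${}^{\tiny [1]}$</li>
<li><strong>Attention Sink:</strong> LLMs consistently tend to assign a very high attention weight to the &lt;bos&gt; token${}^{\tiny [2,3]}$</li>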
</ul>
</li>
</ul>
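The claim that early tokens have more influence can be illustrated with a toy calculation. The sketch below is a deliberately simplified model, not the setup used in the cited papers: it assumes every token attends uniformly to itself and all earlier tokens, and it ignores value projections, residual connections, and MLPs. The function names are illustrative only.

```python
def uniform_causal_attention(n):
    """Row i attends uniformly to tokens 0..i (lower-triangular, rows sum to 1)."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain square matrix product, to keep the sketch dependency-free."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def last_token_influence(n, num_layers):
    """Weight of each input token in the last token's output after stacking layers.

    Assumption: the same uniform causal attention matrix is applied in every layer.
    """
    attn = uniform_causal_attention(n)
    total = attn
    for _ in range(num_layers - 1):
        total = matmul(attn, total)
    return total[-1]


influence = last_token_influence(n=8, num_layers=4)
for position, weight in enumerate(influence):
    print(f"token {position}: {weight:.4f}")
```

With a single layer every token gets the same weight; once layers are stacked, the weights in the last row decrease with position, because earlier tokens can reach the last token through more paths.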
<div class="slide-ref">
    <div style="width: 100px; height: 1px; background: black; margin-bottom: 5px;"></div>
    <p style="margin: 2px 0;">[1] Barbero F., et al. Transformers need glasses! Information over-squashing in language tasks. NeurIPS, 2024.</p>
    <p style="margin: 2px 0;">[2] Barbero F., et al. Why do LLMs attend to the first token? arXiv, 2025.</p>
    <p style="margin: 2px 0;">[3] Wu X., et al. On the Emergence of Position Bias in Transformers. arXiv, 2025.</p>
</div>
<p>Note:
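In GNNs, over-squashing means that an expanding receptive field leaves each neighbor with only a limited contribution;
in LLMs, over-squashing means that early tokens exert more influence.
Although both may lead to something resembling representational collapse, strictly speaking they should not be conflated.
In fact, whether LLMs suffer from so-called over-squashing at all remains an open question, because causal attention is itself already a recommended remedy in the graph domain.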

<h3 id="copying">Copying</h3>
<div class="slide-cols">

<div class="slide-col-half">
<ul>
<li><strong>First-token</strong> copying:
<ul>
<li><strong>Input:</strong> ‘$\textcolor{red}{0}111\ldots 111$’; <strong>Target:</strong> ‘$0$’</li>
</ul>
</li>
</ul>
</div>

<div class="slide-col-half">
<ul>
<li><strong>Last-token</strong> copying:
<ul>
<li><strong>Input:</strong> ‘$111\ldots 111\textcolor{red}{0}$’; <strong>Target:</strong> ‘$0$’</li>
</ul>
</li>
</ul>
</div>
</div>
<ul>
<li>Gradually <u>append more ‘1’s</u> to increase the sequence length:
<ul>
<li>(B) <span style="color: gray"> Hint: It’s not necessarily a 1, check carefully </span>;</li>
<li>(C) <span style="color: gray"> ‘$0111 \ldots 11$’ is replaced with ‘$0111 \ldots 11 \: 0111 \ldots 11 \: \ldots$’ </span></li>
</ul>
</li>
</ul>
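The inputs for the two copying tasks and their variants are simple to construct. The sketch below shows one possible way to build them; the function names are hypothetical, and the space separator used for variant (C) is an assumption, since the slide does not specify how the repeated blocks are joined.

```python
HINT = "Hint: It’s not necessarily a 1, check carefully"


def first_token_input(num_ones):
    """First-token copying: a single '0' followed by num_ones '1's."""
    return "0" + "1" * num_ones


def last_token_input(num_ones):
    """Last-token copying: num_ones '1's followed by a single '0'."""
    return "1" * num_ones + "0"


def with_hint(sequence):
    """Variant (B): attach the hint text to the input (placement is an assumption)."""
    return f"{sequence}\n{HINT}"


def repeated(sequence, repeats, sep=" "):
    """Variant (C): repeat the whole block; the separator is an assumption."""
    return sep.join([sequence] * repeats)


# Longer sequences are obtained by increasing num_ones.
for num_ones in (4, 8, 16):
    print(first_token_input(num_ones), last_token_input(num_ones))
print(repeated(first_token_input(4), repeats=3))
```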
<div class="slide-img">
  <img src="https://raw.githubusercontent.com/MTandHJ/blog_source/master/images/20250511142544.png" 
  alt="Image" 
  style="max-width: 100%; height: auto;margin: 0 auto;">
</div>
<p>Note:
What makes the copying example interesting is that first-token copying is actually easier than last-token copying.
The ‘over-squashing’ explanation is that the first token is able to exert more influence.

<h3 id="positional-encoding">Positional Encoding</h3>
$$
A_{ij} = \frac{\exp(S_{ij})}{\sum_{j'} \exp(S_{ij'})}, \quad
S_{ij} = \textcolor{blue}{\langle \bm{q}_i, \bm{k}_j \rangle} / \sqrt{d}.
$$<ul>
<li>RoPE (Rotary Positional Encoding)</li>
</ul>
$$
  \textcolor{blue}{\langle \bm{q}_i, \bm{k}_j \rangle} \rightarrow (R_{i, \theta} \bm{q}_i)^T (R_{j, \theta} \bm{k}_j) = \bm{q}_i^T R_{j-i, \theta} \bm{k}_j, \\
  {}\\
  \tiny
  R_{i, \theta} := \left [
  \begin{array}{ccccccc}
  \cos (i\theta_0) & -\sin (i \theta_0) & 0 & 0 & \cdots & 0 & 0 \\
  \sin (i \theta_0) & \cos (i \theta_0) & 0 & 0 & \cdots & 0 & 0 \\
  0 & 0 & \cos (i\theta_1) & -\sin (i \theta_1) & \cdots & 0 & 0 \\
  0 & 0 & \sin (i \theta_1) & \cos (i \theta_1) & \cdots & 0 & 0 \\
  \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
  0 & 0 &  0 & 0 & \cdots & \cos (i \theta_{d/2 - 1}) & -\sin (i \theta_{d / 2 - 1}) \\
  0 & 0 &  0 & 0 & \cdots & \sin (i \theta_{d/2 - 1}) & \cos (i \theta_{d / 2 - 1})  \\
  \end{array}
  \right ].
$$<ul>
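<li>$\theta_k = b^{-2k / d}$ ($k = 0, \ldots, d/2 - 1$) is the basic rotation angle of the $k$-th pair of dimensions; the larger the $b$<span style="color: gray">ase</span>, the smaller the rotation angle.</li>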
</ul>
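The relative-position property of RoPE can be checked numerically. The following is a minimal sketch in pure Python, assuming the block-diagonal rotation shown above; `base=10000.0` is a commonly used default rather than a value given in these slides, and the function names are illustrative.

```python
import math


def rope_angles(dim, base=10000.0):
    """Per-pair rotation angles: base^(-2k/dim) for k = 0 .. dim/2 - 1."""
    return [base ** (-2.0 * k / dim) for k in range(dim // 2)]


def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2k], vec[2k+1]) by position * angle_k."""
    out = []
    for k, theta in enumerate(rope_angles(len(vec), base)):
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out.extend([c * x - s * y, s * x + c * y])
    return out


def rope_score(q, k, i, j, base=10000.0):
    """Scaled attention logit between a query at position i and a key at position j."""
    rq, rk = apply_rope(q, i, base), apply_rope(k, j, base)
    return sum(a * b for a, b in zip(rq, rk)) / math.sqrt(len(q))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Both pairs of positions have the same offset j - i = 4,
# so the two scores agree up to floating-point error.
print(rope_score(q, k, i=3, j=7))
print(rope_score(q, k, i=10, j=14))
```

Shifting both positions by the same amount leaves the score unchanged, which is the sense in which the logit depends only on the offset between query and key.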
<p>Note:
Positional encoding may help alleviate the connection bottleneck.

<h3 id="rope-的高频">High Frequencies in RoPE</h3>
<ul>
<li><strong>Conjecture:</strong> overly large rotation angles cause the contribution of the corresponding dimensions to behave like noise</li>
</ul>
<div class="slide-img">
  <img src="https://raw.githubusercontent.com/MTandHJ/blog_source/master/images/20250512211237.png" 
  alt="Image" 
  style="max-width: 80%; height: auto;margin: 0 auto;">
</div>
<div class="slide-cols">

<div class="slide-col-half">
$$
\underset{\text{Freq}\downarrow \quad \text{Norm} \uparrow}{\xrightarrow{\|\bm{q}_{0:1}\|, \|\bm{q}_{2:3}\|, \cdots, \|\bm{q}_{d-2:d-1}\|}}
$$</div>

<div class="slide-col-half">
<ul>
<li><strong>Exception</strong>: <strong>First</strong> and <strong>Last</strong> Layers</li>
</ul>
</div>
</div>
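The per-pair norms in the diagram are straightforward to compute for a single query vector. The sketch below is only an illustration of the measurement, not the analysis code of the cited paper; the example vector is made up, and `base=10000.0` is an assumed default.

```python
import math


def rope_angles(dim, base=10000.0):
    """Per-pair rotation angles, from highest frequency (pair 0) to lowest."""
    return [base ** (-2.0 * k / dim) for k in range(dim // 2)]


def pair_norms(vec):
    """Euclidean norm of each consecutive pair: (vec[0], vec[1]), (vec[2], vec[3]), ..."""
    return [math.hypot(vec[i], vec[i + 1]) for i in range(0, len(vec), 2)]


# A made-up query vector, used only to show the output format.
q = [0.1, 0.0, 0.3, 0.4, 0.6, 0.8, 3.0, 4.0]
for pair_index, (theta, norm) in enumerate(zip(rope_angles(len(q)), pair_norms(q))):
    print(f"pair {pair_index}: angle per position = {theta:.4f}, norm = {norm:.2f}")
```

Pairs are listed from the highest frequency to the lowest, matching the left-to-right order of the arrow above.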
<div class="slide-ref">
    <div style="width: 100px; height: 1px; background: black; margin-bottom: 5px;"></div>
    <p style="margin: 2px 0;">Barbero F., et al. Round and Round We Go! What makes Rotary Positional Encodings useful? ICLR, 2025.</p>
</div>
<p>Note:
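This is indirect evidence: the authors group the dimensions into pairs and assume that a larger norm indicates a stronger lean toward semantic information.
In the vast majority of layers the high-frequency pairs are assigned only small norms; the exceptions are the first and last few layers.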

<h3 id="counting-314-1">Counting (314 ‘1’s)</h3>
<table>
  <thead>
      <tr>
          <th></th>
          <th style="text-align: center">Context length</th>
          <th style="text-align: center">‘1…’</th>
          <th style="text-align: center">‘1,1…’</th>
          <th style="text-align: center">‘1,1,1,1,1;1,…’</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mistral Medium</td>
          <td style="text-align: center">32k</td>
          <td style="text-align: center">500</td>
          <td style="text-align: center">500</td>
          <td style="text-align: center">500</td>
      </tr>
      <tr>
          <td>Deepseek-R1</td>
          <td style="text-align: center">64k</td>
          <td style="text-align: center">264</td>
          <td style="text-align: center">$\textcircled{\small 1}$</td>
          <td style="text-align: center">299</td>
      </tr>
      <tr>
          <td>GPT-4o</td>
          <td style="text-align: center">128k</td>
          <td style="text-align: center">300</td>
          <td style="text-align: center">300</td>
          <td style="text-align: center">340</td>
      </tr>
      <tr>
          <td>Llama3.3-70b</td>
          <td style="text-align: center">130k</td>
          <td style="text-align: center">‘The string 1111…’</td>
          <td style="text-align: center">150</td>
          <td style="text-align: center">‘…The string is 1,1,1,1,1;1,…’</td>
      </tr>
      <tr>
          <td>o4-mini</td>
          <td style="text-align: center">200k</td>
          <td style="text-align: center">232</td>
          <td style="text-align: center">270</td>
          <td style="text-align: center">319</td>
      </tr>
  </tbody>
</table>
<p style="font-size:1rem">$\textcircled{\small 1}$ Given that, and since the sequence is uniform, the count is the number of '1's, which is the total numbers in the sequence.  Given that, and since counting manually is not feasible, the answer is that all numbers are '1's, hence the count is equal to the number of numbers in the sequence. But since the exact count isn't provided, perhaps the answer is to recognize that every number is '1'.</p>
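The three input formats in the table header can be generated as follows. This is a sketch of one plausible reading of the column labels: the grouping into blocks of five separated by semicolons is inferred from the header ‘1,1,1,1,1;1,…’, and the function names are hypothetical. The surrounding prompt wording is not given in the slides and is therefore omitted.

```python
def plain(n):
    """Format '1…': n ones with no separator."""
    return "1" * n


def comma_separated(n):
    """Format '1,1…': n ones separated by commas."""
    return ",".join(["1"] * n)


def grouped(n, group_size=5):
    """Format '1,1,1,1,1;1,…': comma-separated ones in groups joined by ';'.

    Assumption: groups hold five ones each, with a shorter final group if needed.
    """
    groups = []
    remaining = n
    while remaining > 0:
        size = min(group_size, remaining)
        groups.append(",".join(["1"] * size))
        remaining -= size
    return ";".join(groups)


NUM_ONES = 314
for build in (plain, comma_separated, grouped):
    text = build(NUM_ONES)
    print(build.__name__, len(text), text[:20])
```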