## Graph-enhanced Optimizers for Structure-aware Recommendation Embedding Evolution

Background

$\textcircled{\small 1}$ Embedding $\xrightarrow{\text{vector representations of entities (User, Item)}}$ the foundation of modern recommender systems


$\textcircled{\small 2}$ Multi-source information $\xrightarrow{\text{interactions, category similarity}}$ latent structural constraints

❓How can embedding learning efficiently incorporate these structural priors?


$\textcircled{\small 3}$ Graph structural priors $\underset{\text{extremely large embedding tables}}{\xrightarrow{\text{training randomness: data sampling, dropout}}}$ 😞inefficient information fusion

$\mathbf{E}$: Embedding       $\mathcal{L}$: Loss       $\nabla_{\mathbf{E}} \mathcal{L}$: Gradient


$\textcircled{\small 4}$ Graph + sequence models: overly dependent on specific scenarios, with high training/inference costs

LightGCN: GNN-only       SASRec: Transformer-only

SR-GNN/LESSR/MAERec: GNN-based sequence models

Weighted Adjacency Matrix $\mathbf{A}$

🤔 How can the graph structural prior be formally defined?

$$ \mathcal{G} = (\mathcal{V}, \mathbf{A} = [w_{ij}]), \quad w_{ij} \textcolor{green}{\uparrow} \longrightarrow \text{similarity of nodes } i, j \textcolor{green}{\uparrow}. $$

🤔 How can we quantify the consistency between $\mathbf{X} \in \mathbb{R}^{|\mathcal{V}| \times d}$ and the node similarity encoded by the adjacency matrix $\mathbf{A} \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$?

$$ \mathcal{J}_{smoothness} (\mathbf{X}; \mathcal{G}) := \sum_{i, j \in \mathcal{V}} w_{ij} \left \| \frac{\mathbf{x}_i}{\sqrt{d_i}} - \frac{\mathbf{x}_j}{\sqrt{d_j}} \right \|^2. $$

💡 $\mathcal{J}_{smoothness}\downarrow$ $\longrightarrow$ the more similar two nodes are, the closer their representations
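The smoothness functional can be sketched in NumPy (a minimal illustration, not the paper's code; the graph `A` and features `X` below are random placeholders). It also checks the equivalent Laplacian trace form $2\,\mathrm{tr}\big(\mathbf{X}^\top(\mathbf{I}-\tilde{\mathbf{A}})\mathbf{X}\big)$ with $\tilde{\mathbf{A}} = \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$:

```python
import numpy as np

def smoothness(X, A):
    """J(X) = sum_ij w_ij || x_i/sqrt(d_i) - x_j/sqrt(d_j) ||^2."""
    d = A.sum(axis=1)
    Xn = X / np.sqrt(d)[:, None]             # degree-normalized rows
    diff = Xn[:, None, :] - Xn[None, :, :]   # all pairwise differences
    return (A * (diff ** 2).sum(-1)).sum()

# Random symmetric weighted graph without self-loops, random features.
rng = np.random.default_rng(0)
A = rng.random((5, 5)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
X = rng.standard_normal((5, 3))

# Equivalent trace form: 2 * tr(X^T (I - A_tilde) X).
d = A.sum(1)
A_tilde = A / np.sqrt(np.outer(d, d))
lap_form = 2 * np.trace(X.T @ (np.eye(5) - A_tilde) @ X)
assert np.isclose(smoothness(X, A), lap_form)
```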

Zhou D., et al. Learning with Local and Global Consistency. NeurIPS, 2003.
Chen S., et al. Signal Denoising on Graphs via Graph Filtering. GlobalSIP, 2014.

Structure-aware Embedding Evolution

$$ \Delta \mathbf{E}_{t-1} \xrightarrow{\textcolor{blue}{\psi}(\cdot)} \psi(\Delta \mathbf{E}_{t-1}). $$

$\textcircled{\small 1}$ Structure-aware: $\small \mathcal{J}_{smoothness} \left (\textcolor{blue}{\psi} (\Delta \mathbf{E}) \right) \le \mathcal{J}_{smoothness} \left (\Delta \mathbf{E} \right)$

$\textcircled{\small 2}$ Direction-aware: $\small \left\langle \textcolor{blue}{\psi} (\Delta \mathbf{E}), \Delta \mathbf{E} \right\rangle > 0, \quad \forall \Delta \mathbf{E} \not= \mathbf{0}$

Smoothness vs. Convergence

  • The ideal $\psi$ should balance smoothness against convergence:
$$ \psi^* \left( \Delta \mathbf{E}; \beta \right) = \underset{\Delta}{\text{argmin}} \: (1 - \beta) \| \Delta - \Delta \mathbf{E}\|_F^2 + \beta \mathcal{J}_{smoothness}(\Delta). $$
  • Intuitively,
$$ \text{Smoothness} \xleftarrow{\beta \rightarrow 1} \psi^*(\Delta \mathbf{E}) \xrightarrow{\beta \rightarrow 0} \text{Convergence} $$
  • Closed-form solution: $\psi^*(\Delta \mathbf{E}; \beta) = (1 - \beta) \left(\mathbf{I} - \beta \mathbf{\tilde{A}}\right)^{-1} \Delta \mathbf{E}$
    • 😞The matrix inverse is intractable to compute exactly at scale
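The closed form can be checked numerically on a toy problem (a sketch with random placeholder data; `np.linalg.solve` stands in for the inverse, which is only feasible at this tiny scale). The solution zeroes the gradient of the variational objective above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, beta = 6, 4, 0.7
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
deg = A.sum(1)
A_tilde = A / np.sqrt(np.outer(deg, deg))   # symmetric normalization
dE = rng.standard_normal((n, dim))          # raw embedding update

# Closed-form minimizer: psi* = (1 - beta) (I - beta * A_tilde)^{-1} dE
psi = (1 - beta) * np.linalg.solve(np.eye(n) - beta * A_tilde, dE)

# First-order optimality of
#   (1-beta) ||D - dE||_F^2 + beta * tr(D^T (I - A_tilde) D)
# i.e. (1-beta)(D - dE) + beta (I - A_tilde) D = 0 at D = psi*.
grad = (1 - beta) * (psi - dE) + beta * (np.eye(n) - A_tilde) @ psi
assert np.allclose(grad, 0)
```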

Approximate Solutions

$\textcircled{\small 1}$ $L$-layer iterative approximation:

$$ \hat{\psi}_{iter} (\Delta \mathbf{E}) := \hat{\psi}_L (\Delta \mathbf{E}), \: \hat{\psi}_l (\Delta \mathbf{E}) = \beta \mathbf{\tilde{A}} \hat{\psi}_{l-1} (\Delta \mathbf{E}) + (1 - \beta) \Delta \mathbf{E}, \: \hat{\psi}_0 (\Delta \mathbf{E}) = \Delta \mathbf{E}. $$

$\textcircled{\small 2}$ $L$-layer Neumann series approximation:

$$ \hat{\psi}_{nsa} (\Delta \mathbf{E}) := (1 - \beta) \sum_{l=0}^L \beta^l \mathbf{\tilde{A}}^l \Delta \mathbf{E}. $$

Klicpera J., et al. Predict Then Propagate: Graph Neural Networks Meet Personalized PageRank. ICLR, 2019.
Huang Q., et al. Combining Label Propagation and Simple Models Out-performs Graph Neural Networks. ICLR, 2021.
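Both truncations can be sketched and checked against the exact closed form on a toy graph (an illustration with random placeholder data, not the paper's implementation); both converge to $(1-\beta)(\mathbf{I}-\beta\tilde{\mathbf{A}})^{-1}\Delta\mathbf{E}$ as $L$ grows:

```python
import numpy as np

def psi_iter(dE, A_tilde, beta, L):
    """L-layer fixed-point iteration: psi_l = beta*A~ psi_{l-1} + (1-beta)*dE."""
    out = dE.copy()
    for _ in range(L):
        out = beta * (A_tilde @ out) + (1 - beta) * dE
    return out

def psi_nsa(dE, A_tilde, beta, L):
    """Truncated Neumann series: (1-beta) * sum_{l=0}^{L} beta^l A~^l dE."""
    term, acc = dE.copy(), dE.copy()
    for _ in range(L):
        term = beta * (A_tilde @ term)   # next power of beta * A_tilde
        acc = acc + term
    return (1 - beta) * acc

rng = np.random.default_rng(2)
n, beta = 6, 0.5
A = rng.random((n, n)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
deg = A.sum(1)
A_tilde = A / np.sqrt(np.outer(deg, deg))
dE = rng.standard_normal((n, 3))
exact = (1 - beta) * np.linalg.solve(np.eye(n) - beta * A_tilde, dE)

# With enough layers, both approximations match the exact solution.
assert np.allclose(psi_iter(dE, A_tilde, beta, 50), exact)
assert np.allclose(psi_nsa(dE, A_tilde, beta, 50), exact)
```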

Limitations of the Approximate Solutions

$\textcircled{\small 1}$ Limited smoothness of $\psi_{iter}$:

$\textcircled{\small 2}$ Suboptimal convergence of $\psi_{nsa}$:

😄 SEvo:

$$ \hat{\psi} (\Delta \mathbf{E}; \beta) = \frac{1 - \beta}{\textcolor{orange}{1 - \beta^{L+1}}} \sum_{l=0}^L \beta^l \mathbf{\tilde{A}}^l \Delta \mathbf{E}. $$
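A minimal sketch of the SEvo transform (with placeholder random data). The orange $1/(1-\beta^{L+1})$ factor restores the scale the truncation loses: when $\tilde{\mathbf{A}} = \mathbf{I}$ (an already perfectly smooth direction), the update passes through unchanged, unlike the plain Neumann truncation:

```python
import numpy as np

def sevo(dE, A_tilde, beta, L):
    """Renormalized truncated Neumann series:
    (1-beta)/(1-beta^(L+1)) * sum_{l=0}^{L} beta^l A~^l dE
    """
    term, acc = dE.copy(), dE.copy()
    for _ in range(L):
        term = beta * (A_tilde @ term)
        acc = acc + term
    return (1 - beta) / (1 - beta ** (L + 1)) * acc

n, beta, L = 5, 0.9, 3
dE = np.random.default_rng(3).standard_normal((n, 2))

# With A_tilde = I, SEvo is scale-preserving: sevo(dE) == dE.
assert np.allclose(sevo(dE, np.eye(n), beta, L), dE)
```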

SGD + SEvo

  • SEvo can be applied directly to the updates derived by other optimizers
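As a concrete case, a single SGD step can route its raw update through the SEvo transform before applying it (an illustrative sketch, not the paper's code; `A_tilde` is the normalized adjacency prior and all names here are placeholders):

```python
import numpy as np

def sgd_sevo_step(E, grad, A_tilde, lr=0.1, beta=0.5, L=3):
    """One SGD step whose raw update (the gradient) is smoothed by SEvo.
    Reduces to vanilla SGD when A_tilde = I."""
    term, acc = grad.copy(), grad.copy()
    for _ in range(L):
        term = beta * (A_tilde @ term)
        acc = acc + term
    smoothed = (1 - beta) / (1 - beta ** (L + 1)) * acc
    return E - lr * smoothed

# Sanity check: identity graph -> plain SGD step.
E = np.zeros((4, 2))
g = np.ones((4, 2))
out = sgd_sevo_step(E, g, np.eye(4), lr=0.1, beta=0.5, L=3)
assert np.allclose(out, -0.1 * g)
```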

Adam + SEvo

AdamW + SEvo

Overall Comparisons

Accuracy: a 10+% improvement on average

Efficiency: only a slight increase in training and inference cost

Empirical Analysis

✅ Convergence: the $\textcolor{orange}{1 / (1 - \beta^{L+1})}$ factor accelerates convergence     ✅ Smoothness: smoother as $\beta \rightarrow 1$

Ablation Study

Generality: applicable to SGD/Adam/AdamW

Theorem 1: the simple iterative approximation is feasible only for small $\beta$

AdamW correction: the necessity of correcting sparse gradients

Category-consistency Prior

Category consistency: representations sharing a category are closer to each other

Teacher-model Prior

Knowledge transferability: SEvo itself has a strong capacity for knowledge distillation

Generality: SEvo can be combined with other knowledge distillation methods

Thanks!