VQ-VAE
Vector quantization:
$$ \bm{z} \rightarrow \bm{c}_{k^*}, \quad k^* = \text{argmin}_{k: \bm{c}_k \in \mathcal{C}} \|\bm{c}_k - \bm{z} \|. $$
STE (straight-through estimator):
$$ \bm{q} = \text{STE}(\bm{c}_{k^*}) := \bm{z} + \textcolor{blue}{\text{sg}} \left(\bm{c}_{k^*} - \bm{z} \right) \\ \text{d}\bm{q} = \text{d}{\bm{z}} + \underbrace{\text{d} \:{\text{sg} \left(\bm{c}_{k^*} - \bm{z} \right)}}_{=0} $$
Loss:
$$ \mathcal{L} = \underbrace{\| g(\bm{q}) - \bm{x} \|_F^2}_{\mathcal{L}_{recon}} + \underbrace{ \| \bm{c}_{k^*} - \text{sg} (\bm{z}) \|_F^2 }_{\mathcal{L}_{codebook}} + \underbrace{ \beta \cdot \| \bm{z} - \text{sg} (\bm{c}_{k^*})\|_F^2 }_{\mathcal{L}_{commit}} $$
Note:
- The STE makes the gradient passed back to the encoder somewhat inaccurate;
- The codebook is updated only through $\mathcal{L}_{codebook}$ (the commitment term only pulls $\bm{z}$ toward its code).
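The quantization, STE, and loss above can be condensed into a few lines; below is a minimal single-vector NumPy sketch (function name is mine; the `sg` operator is only indicated in comments, since NumPy has no autodiff):

```python
import numpy as np

def vq_forward(z, codebook, beta=0.25):
    """Vector quantization with a straight-through estimator (STE).

    z:        (d,) encoder output
    codebook: (K, d) code vectors C
    Returns the quantized vector q, the chosen index k*, and the loss.
    """
    # Nearest-neighbour lookup: k* = argmin_k ||c_k - z||
    dists = np.linalg.norm(codebook - z, axis=1)
    k_star = int(np.argmin(dists))
    c = codebook[k_star]

    # STE: q = z + sg(c - z).  In an autodiff framework sg(.) would be
    # detach()/stop_gradient(); numerically q is exactly c.
    q = z + (c - z)

    codebook_loss = np.sum((c - z) ** 2)       # ||c_k* - sg(z)||^2
    commit_loss = beta * np.sum((z - c) ** 2)  # beta * ||z - sg(c_k*)||^2
    return q, k_star, codebook_loss + commit_loss
```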
VQ-GAN
- Image tokenization + next-token prediction $p(s_i | s_{< i}, \textcolor{red}{condition})$
Why Discrete Representation Learning?
✅ Discrete codes are a better fit for generative XXX
$\textcircled{\small 1}$ Easier to use as an extension of an existing vocabulary
$\textcircled{\small 2}$ (Rec) Promises to break the limitation of nearest-neighbor matching
✅ Controllability: manipulable much like natural language
$\textcircled{\small 1}$ Interpret what each code means and manipulate it accordingly
$\textcircled{\small 2}$ (Rec) Diversity in generation
✅ Robustness: efficient information compression yields impressive denoising
Challenges
- Undesirable gradient estimator
- Codebook collapse: low codebook usage
  - Redundancy from codebook vectors that are too close to each other
  - Redundancy from codebook vectors that never get matched to any $\bm{z}$ during training
Note: several well-known problems of VQ-VAE
Solutions
Undesirable Gradient Estimator:
- Gumbel-softmax estimator${}^{\text{[1]}}$;
- Rotation-trick estimator${}^{\text{[2]}}$
Codebook Collapse:
- K-means++ initialization of the codebook${}^{\text{[3]}}$;
- Fixed Codebook${}^{\text{[4]}}$;
- Fixed Codebook + Trainable linear transformation${}^{\text{[5]}}$
Rotation Trick
'Rotate' $\nabla_{q} \mathcal{L}$ into $\nabla_{z} \mathcal{L}$ such that
$$ \angle (\bm{z}, \nabla_z \mathcal{L}) = \angle(\bm{q}, \nabla_q \mathcal{L}). $$
Note: the rotation trick keeps the angle between the gradient and the vector consistent.
Rotation Trick
Equivalently, apply a 'rotation' matrix $R$:
$$ \bm{q} = \text{sg}[\gamma R] \bm{z} + \text{sg}[\bm{c} - \gamma R \bm{z}], \quad \gamma := \frac{\|\bm{c}\|}{\|\bm{z}\|}, \quad \textcolor{red}{R \bm{z} / \|\bm{z}\| = \bm{c} / \|\bm{c}\|} $$
Householder transformation: given a vector $\bm{v}$ and the orthogonal hyperplane through the origin $\bm{v}^{\perp} := \{\bm{u}: \bm{u}^T \bm{v} = 0\}$, the reflection of a vector $\bm{x}$ across $\bm{v}^{\perp}$ is
$$ \underbrace{\Big(I - 2 \frac{\bm{v} \bm{v}^T}{\|\bm{v}\|^2} \Big)}_{\text{Householder matrix } P} \bm{x} $$
Property: if $\bm{x} = \bm{u} + \beta \bm{v}$ with $\bm{u} \in \bm{v}^{\perp}$, then $P\bm{x} = \bm{u} \textcolor{red}{-} \beta \bm{v}$
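The Householder reflection and its sign-flip property can be checked numerically; a small NumPy sketch (the function name is mine):

```python
import numpy as np

def householder(v):
    """Householder matrix P = I - 2 v v^T / ||v||^2: reflection across
    the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)
```

Decomposing $\bm{x}$ into a component inside $\bm{v}^{\perp}$ plus a component along $\bm{v}$, applying $P$ flips only the latter; $P$ is also an involution ($P^2 = I$).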
Reflection
$$ R = \left(I - 2 \frac{\bm{r}\bm{r}^T}{\|\bm{r}\|^2} \right), \quad \bm{r} := \frac{\bm{z}}{\|\bm{z}\|} - \frac{\bm{c}}{\|\bm{c}\|} $$
Rotation
$$ R = \left(I - 2 \frac{\bm{c}\bm{c}^T}{\|\bm{c}\|^2} \right) \left(I - 2 \frac{\bm{r}\bm{r}^T}{\|\bm{r}\|^2} \right), \quad \bm{r} := \frac{\bm{z}}{\|\bm{z}\|} + \frac{\bm{c}}{\|\bm{c}\|} $$
STE vs Rotation vs Reflection
STE: $\nabla_{z} \mathcal{L} \equiv \nabla_{q} \mathcal{L}$
Rotation: the update "behavior" of $\bm{z}$ stays largely consistent with that of $\bm{q}$
Reflection: the update "behavior" of $\bm{z}$ can be highly inconsistent with that of $\bm{q}$
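The rotation construction is easy to verify numerically. A NumPy sketch (function names are mine) that checks $R \bm{z}/\|\bm{z}\| = \bm{c}/\|\bm{c}\|$, $\det R = +1$, and the inner-product invariance of the rotation trick:

```python
import numpy as np

def householder(v):
    """Reflection across the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

def rotation_trick_R(z, c):
    """R = (I - 2 c c^T/||c||^2)(I - 2 r r^T/||r||^2), r = z/||z|| + c/||c||.

    A composition of two reflections, hence a rotation (det R = +1) that
    maps the direction of z onto the direction of c.  Degenerate when z
    and c point in exactly opposite directions (r = 0).
    """
    z_hat = z / np.linalg.norm(z)
    c_hat = c / np.linalg.norm(c)
    return householder(c_hat) @ householder(z_hat + c_hat)
```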
Rotation Trick
🌟 Rotation trick:
$$ \mathbf{q} = \text{sg}\Big[ \frac{\|\bm{c}\|}{\|\bm{z}\|} R \Big] \bm{z} \textcolor{red}{+ 0} $$
🌟 Inner-product invariance (❓$\textcolor{red}{+0}$):
$$ \langle \nabla_{z} \mathcal{L}, \bm{z} \rangle =\langle \frac{\|\bm{c}\|}{\|\bm{z}\|} R^T \nabla_q \mathcal{L}, \bm{z} \rangle =\langle \nabla_q \mathcal{L}, \frac{\|\bm{c}\|}{\|\bm{z}\|} R \bm{z} \rangle =\langle \nabla_q \mathcal{L}, \bm{q} \rangle $$
Residual Quantization (RQ-VAE)
😞 $\text{Size}\textcolor{red}{\downarrow} \longrightarrow$ expressiveness $\textcolor{red}{\downarrow}$ vs $\text{Size}\textcolor{green}{\uparrow} \longrightarrow$ collapse $\textcolor{red}{\uparrow}$
RQ-VAE:
$$ \bm{z} \overset{\phi}{\rightarrow} \textcolor{red}{\bm{c}_{k_1}} \overset{\bm{z} - \bm{c}_{k_1}}{\longrightarrow} \bm{r}_1 \overset{\phi}{\rightarrow} \textcolor{red}{\bm{c}_{k_2}} \overset{\bm{r}_1 - \bm{c}_{k_2}}{\longrightarrow} \bm{r}_2 \rightarrow \cdots $$
Continuous approximation:
$$ \bm{q} = \bm{z} + \text{sg}\Big(\sum_{i=1}^{N} \bm{c}_{k_i} - \bm{z} \Big) $$
Discrete code: $(k_1, k_2, \ldots, k_N)$
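The residual-quantization chain, applied greedily for $N$ levels, can be sketched as follows (a minimal NumPy sketch assuming a single codebook shared across levels, which is one common RQ-VAE configuration):

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """Greedy residual quantization: each level quantizes the residual
    left by the previous level; the sum of chosen codes approximates z."""
    residual = z.copy()
    codes = []
    approx = np.zeros_like(z)
    for _ in range(depth):
        k = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(k)
        approx += codebook[k]
        residual = residual - codebook[k]
    # Training would use q = z + sg(approx - z), as in the formula above.
    return codes, approx
```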
Fixed Codebook
Fix the codebook (size $|\mathcal{C}| = (2 \lfloor L / 2 \rfloor + 1)^d$) to:
$$ \mathcal{C} = \{-\lfloor L / 2 \rfloor, -\lfloor L / 2 \rfloor + 1, \ldots, 0, \ldots, \lfloor L / 2 \rfloor - 1, \lfloor L / 2 \rfloor\}^{d}. $$
For example, with $L = 3, d = 3$:
$$ \mathcal{C} = \{ (-1, -1, -1), (-1, -1, 0), \ldots, (1, 1, 1) \}. $$
Quantization:
$$ \bm{q} = \textcolor{red}{\text{round}} \big( \lfloor L / 2 \rfloor \cdot \textcolor{blue}{\tanh} (\bm{z}) \big). $$
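The fixed-codebook quantizer amounts to bounding each dimension and rounding; a minimal NumPy sketch (in training, the round would be wrapped in an STE):

```python
import numpy as np

def fsq_quantize(z, L=3):
    """Quantize each dimension onto {-floor(L/2), ..., floor(L/2)}:
    tanh bounds the value, round snaps it to the implicit grid."""
    half = L // 2
    return np.round(half * np.tanh(z))
```

For $L = 3$ the scale factor is 1 and this reduces to $\text{round}(\tanh(\bm{z}))$, matching the example above.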
SimVQ
😞 Only a small fraction of codebook vectors receive training in each batch.
😄 SimVQ fixes the codebook and trains only a linear transformation $W$:
$$ \mathcal{C} \longrightarrow \{W \bm{c}_1, W \bm{c}_2, \ldots, W \bm{c}_K\} $$
TIGER
Traditional recommendation (matching):
$$ \bm{e}_u^T \bm{e}_v, \quad v \in \mathcal{V}. $$
Generative recommendation:
TIGER
- Generative recommendation (T5-based):
- Beam search $\overset{?}{\gg}$ approximate nearest neighbor
Beam Search❓
Amazon2014Beauty_1000_LOU
#Users: 12,595 #Items: 75,253
Encoder: All-MiniLM-L12-V2
Attributes: (title, categories, brand)
#Blocks$\textcolor{green}{\uparrow}$ $\longrightarrow$ #Invalids $\textcolor{green}{\downarrow}$
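Decoding a semantic ID $(k_1, \ldots, k_N)$ with beam search can be sketched as follows; `step_logprobs` is a hypothetical stand-in for the trained decoder's next-code distribution (a minimal sketch, not TIGER's actual implementation):

```python
import numpy as np

def beam_search(step_logprobs, depth, beam_size):
    """Beam search over semantic-ID tuples (k_1, ..., k_N).

    step_logprobs(prefix) -> (K,) log-probabilities of the next code,
    given the codes generated so far.
    """
    beams = [((), 0.0)]  # (prefix, accumulated log-probability)
    for _ in range(depth):
        candidates = []
        for prefix, score in beams:
            for k, lp in enumerate(step_logprobs(prefix)):
                candidates.append((prefix + (k,), score + lp))
        # Keep the beam_size highest-scoring prefixes.
        candidates.sort(key=lambda t: t[1], reverse=True)
        beams = candidates[:beam_size]
    return beams
```

A real system would additionally reject generated tuples that do not correspond to any item, which is presumably what the #Invalids count above measures.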
Cold-Start Item Recommendation❓
- Cold-start items can be encoded directly, but…
Note: although VQ supports cold-start items well (they can be encoded quite conveniently), LIGER found that models trained with VQ easily overfit to code combinations seen during training, so cold-start performance is actually particularly poor.
RQ-VAE vs (Hierarchical/Residual) KMeans❓
- What are the advantages of RQ-VAE over (hierarchical/residual) KMeans?
| | HR@1 | HR@5 | HR@10 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|---|
| Random | 0.0025 | 0.0080 | 0.0114 | 0.0052 | 0.0063 |
| KMeans | 0.0038 | 0.0154 | 0.0246 | 0.0096 | 0.0126 |
| STE | 0.0023 | 0.0111 | 0.0188 | 0.0067 | 0.0091 |
| Rotation | 0.0041 | 0.0122 | 0.0195 | 0.0083 | 0.0106 |
| SimVQ | 0.0029 | 0.0092 | 0.0164 | 0.0060 | 0.0083 |
Note: hyperparameters have not been carefully tuned yet.
Semantic Features + Collaborative Signals❓
- Fine-tune the encoder:
Note:
- miniCPM-V-8B aggregates the multimodal information into token vectors $\mathbf{M} \in \mathbb{R}^{N_M \times d_t}$ (per item).
- A QFormer further fuses them into $\mathbf{\tilde{M}} \in \mathbb{R}^{N_{\tilde{M}} \times d_t}$, typically with $N_{\tilde{M}} = 4$ (while $N_M = 1280$).
- A high-quality item-pair dataset $\mathcal{D}_{pair}$ is built from item-item similarities, and item-item contrastive learning then drives this information into the item features.
- In addition, a caption loss is introduced: LLaMA3 predicts the caption, ensuring the features do not lose content information.
Summary
Vector quantization: an elegant tokenizer
Strengths:
- (Encoder-Decoder) A unified discrete representation
- (Rec) A degree of interpretability
- (Rec) Seems to unlock scaling behavior in recommendation scenarios
Weaknesses:
- (Encoder-Decoder) Undesirable gradient estimator
- (Encoder-Decoder) Codebook collapse
- (Rec) Seems weak in cold-start scenarios (how to fix beam search?)
- (Rec) RQ-VAE seems unnecessary
Decoder-Encoder-XXX Vector Quantization
- Experimental results:
| | HR@1 | HR@5 | HR@10 | NDCG@5 | NDCG@10 |
|---|---|---|---|---|---|
| Random | 0.0025 | 0.0080 | 0.0114 | 0.0052 | 0.0063 |
| KMeans | 0.0038 | 0.0154 | 0.0246 | 0.0096 | 0.0126 |
| Rotation | 0.0041 | 0.0122 | 0.0195 | 0.0083 | 0.0106 |
| DEX-VQ | 0.0033 | 0.0126 | 0.0216 | 0.0079 | 0.0107 |