Background: Pre-training & Fine-tuning
Pre-training & Fine-tuning: a winning recipe across many domains
- (CV) ResNet, SimCLR, MAE, ViT …
- (NLP) BERT, GPT …
- (CV & NLP) CLIP, BLIP, SigLIP …
Pre-training & Fine-tuning for Next-Item Recommendation:
- (The promise) 😄 quick deployment; 😄 better generalizability
- (Current reality) 😞 performance falls far short of domain-specific models
Background: Pre-training & Fine-tuning
Next-Item Recommendation:
$$ [v_1, v_2, \ldots, v_t] \rightarrow v_{t+1} $$
Heterogeneity of recommendation data:
- diverse user behaviors
- non-negligible domain gaps
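The next-item formulation above amounts to turning each interaction sequence into (prefix, target) training pairs. A minimal sketch of this construction (the function name `next_item_examples` is illustrative, not from the paper):

```python
def next_item_examples(seq):
    """Turn one interaction sequence [v_1, ..., v_T] into next-item
    training pairs: ([v_1..v_t], v_{t+1}) for every valid prefix."""
    return [(seq[:t], seq[t]) for t in range(1, len(seq))]
```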
RQ I: Transferable Capabilities
Computer Vision:
- (Image classification) feature extraction, pattern recognition …
- (Image generation) data-distribution modeling …
Natural Language Processing:
- (General corpora) semantic understanding, syntactic consistency
- (Math/code) logical reasoning
Transferable Capabilities for Recommendation
- What capabilities does recommendation require? Modeling of long- and short-term interests?
Transferable Capabilities for Recommendation
$$ \begin{array}{rl} \textcircled{\small 1} \text{ Chronologically ordered:} & \mathbb{P}\left(v_{t+1}\,|\,v_t, v_{t-1}, \ldots, v_1; \Theta \right), \\ \textcircled{\small 2} \text{ Partially shuffled:} & \mathbb{P}\left(v_{t+1}\,|\, v_t, \{v_1, v_2, \ldots \}; \Theta \right), \\ \textcircled{\small 3} \text{ Completely shuffled:} & \mathbb{P}\left(v_{t+1}\,|\,\{v_1, v_2, \ldots, v_t\}; \Theta \right) \end{array} $$
$\textcircled{\small 1} \approx \textcircled{\small 2}$: advanced sequential recommenders do not exploit sequential order for more sophisticated reasoning (even though HSTU incorporates timestamp information)
$\textcircled{\small 1}/\textcircled{\small 2} \gtrapprox \textcircled{\small 3}$: the latest interaction is crucial
These conclusions hold regardless of dataset preprocessing, optimization objective, or model capacity
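The three probing conditions can be reproduced by reordering a given interaction sequence before feeding it to the model. A minimal sketch, assuming a helper named `make_variants` (not from the paper):

```python
import random

def make_variants(seq, seed=0):
    """Build the three probing inputs from one chronological sequence:
    ① original order; ② last item fixed, earlier items shuffled;
    ③ all items shuffled (order fully discarded)."""
    rng = random.Random(seed)
    chrono = list(seq)                       # ① chronologically ordered
    prefix = list(seq[:-1])
    rng.shuffle(prefix)
    partial = prefix + [seq[-1]]             # ② partially shuffled
    complete = list(seq)
    rng.shuffle(complete)                    # ③ completely shuffled
    return chrono, partial, complete
```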
Markovian Nature of Next-Item Prediction
- Inference logic of current state-of-the-art sequential recommenders:
  - infer "general user preferences" from the sequence as a whole
  - place particular emphasis on the user's latest interaction
Short-term & Long-term Interests
- One-to-one correspondence:
| Markovian Property | Recommendation Theory |
|---|---|
| User Identification | Long-term Interest$^{[1]}$ |
| Last-Item Attention | Short-term Interest$^{[2]}$ |
- A slight difference: modeling long-term interest is not as complex as commonly claimed
RQ II: Data for a Markovian Reasoner
Next-State Prediction
How can the next state of a Markov chain be inferred from context alone?
$$ s_1, s_2, \ldots, s_t \rightarrow s_{t+1}, \quad s_{n} \in \mathcal{S}, \: \forall n=1,2,\ldots, t+1 $$
Step 1: Estimate the transition probability matrix from $[s_1, s_2, \ldots, s_t]$
Step 2: Identify the current state $s_t$
Step 3: Predict the state with the highest transition probability $s_t \rightarrow ?$
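The three steps can be sketched as a count-based estimator; the Dirichlet-style smoothing via `alpha` is an illustrative assumption, not necessarily the paper's exact procedure:

```python
import numpy as np

def predict_next_state(seq, n_states, alpha=1.0):
    """In-context next-state prediction for a Markov chain.
    Step 1: estimate the transition matrix from observed pairs
            (with a Dirichlet(alpha) smoothing prior, an assumption here).
    Step 2: take the current state s_t = seq[-1].
    Step 3: return the argmax of the estimated transition row."""
    counts = np.full((n_states, n_states), alpha)
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    P = counts / counts.sum(axis=1, keepdims=True)  # row-normalized estimate
    s_t = seq[-1]
    return int(np.argmax(P[s_t])), P
```

Note how the sketch mirrors the two required abilities: the count matrix summarizes the whole sequence adaptively, while the final lookup depends entirely on the current state.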
- A model that excels at next-state prediction must possess:
  - an adaptive sequence-summarization ability;
  - a mechanism that specifically attends to the current state
Markovian Pre-trained Transformer (MPT)
- Next-State Prediction Task:
Markovian Pre-training & Recommendation Fine-tuning
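One plausible way to generate unlimited Markovian pre-training sequences, consistent with the $|\mathcal{S}|$ and Dirichlet-$\alpha$ sensitivity analyses reported later, is to sample a fresh transition matrix per sequence and roll out the chain. A hedged sketch (function name and defaults are assumptions, not the paper's exact recipe):

```python
import numpy as np

def sample_markov_sequences(n_states, alpha, seq_len, n_seqs, seed=0):
    """Synthetic pre-training data: for each sequence, draw a transition
    matrix whose rows are Dirichlet(alpha) samples, then roll out a chain
    of length seq_len from a uniformly random initial state."""
    rng = np.random.default_rng(seed)
    seqs = np.empty((n_seqs, seq_len), dtype=np.int64)
    for i in range(n_seqs):
        P = rng.dirichlet(np.full(n_states, alpha), size=n_states)
        s = rng.integers(n_states)
        for t in range(seq_len):
            seqs[i, t] = s
            s = rng.choice(n_states, p=P[s])
    return seqs
```

Smaller `alpha` yields sparser, more deterministic rows, which is why $\alpha$ (alongside $|\mathcal{S}|$) is a natural knob to study in the sensitivity analysis.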
Experiments
Data Scaling: NDCG@10 vs. #Tokens
$\mathcal{L}_{\text{NSP}}$ gradually decreases as tokens accumulate, with several sharp drops
After roughly $10^{10}$ (about 10B) tokens, performance saturates in most scenarios
The optimal training #Tokens differs across scenarios
A theoretical upper bound exists: the Bayes estimator
Comparison of Inference Mechanisms
Neither the MPT nor the Qwen-2.5 backbone has been trained on recommendation data
MPT attends more strongly to the current token itself
Qwen-2.5's attention map shows little differentiation
MPT even exhibits patterns similar to SASRec+
Sensitivity Analysis: $|\mathcal{S}|$
- Number of states $|\mathcal{S}|$
Sensitivity Analysis: $\alpha$
- $\alpha$ of Dirichlet distribution
Summary
Transferable recommendation capabilities: order-agnostic preference inference & strong focus on the latest interaction
Next-State Prediction: ✅ Controllable ✅ Unlimited
Markovian Pre-trained Transformer (MPT): ✅ efficient ✅ easily transferable