Theoretical
Pareto Multi-Task Learning
Realizes Pareto MTL by restricting each subproblem to a subregion
Multiple-Gradient Descent Algorithm (MGDA) for Multiobjective Optimization
Understanding multi-task / multi-objective optimization from the perspective of gradient fusion
Universal Prompt Tuning for Graph Neural Networks
A feature prompt on the graph is equivalent to a variety of graph prompts
Base of RoPE Bounds Context Length
Discusses how the RoPE base affects the model's ability to perceive similar tokens
Round and Round We Go! What makes Rotary Positional Encodings useful?
Understanding the high and low frequencies of RoPE
1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
A 1-bit SGD optimization method preceded by an Adam warm-up stage
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
GaLore: gradient projection and weight updates in a low-rank space
MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence
MicroAdam: a lightweight optimizer built on gradient sparsification and error compensation