⚙️ Optimizer States (2x model size):
Generality: ✅ No extra hyperparameter tuning ✅ Applicable to arbitrary scenarios
Flexibility: ✅ Not tied to a specific environment
Successful engineering practice: the DeepSeek-V3 training framework ($g \overset{\text{BF16}}{\rightarrow} m,v \overset{\text{FP32}}{\rightarrow} \theta$)
DeepSeek-AI. DeepSeek-V3 Technical Report, 2024.
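A minimal PyTorch-style sketch of this mixed-precision layout; `adam_step_mixed`, its arguments, and the hyperparameter defaults are illustrative assumptions, not the actual DeepSeek-V3 code:

```python
import torch

def adam_step_mixed(theta_fp32, g_bf16, m_bf16, v_bf16, step,
                    lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam step keeping moments in BF16 and a master weight copy in FP32,
    mirroring g -(BF16)-> m, v -(FP32)-> theta."""
    g = g_bf16.float()                                   # upcast for arithmetic
    m = m_bf16.float().mul_(beta1).add_(g, alpha=1 - beta1)
    v = v_bf16.float().mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m / (1 - beta1 ** step)                      # bias correction
    v_hat = v / (1 - beta2 ** step)
    theta_fp32 -= lr * m_hat / (v_hat.sqrt() + eps)      # FP32 master update
    return theta_fp32, m.bfloat16(), v.bfloat16()        # moments stored back in BF16
```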
Quantization:
Dequantization:
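A generic form of these two maps, assuming a shared scale $\Delta$ and a $b$-bit codebook $\{y_1, \dots, y_{2^b}\}$ (the notation used in the swamping analysis below):

Quantize: $k^* = \arg\min_k \left| x/\Delta - y_k \right|$, storing the index $k^*$ in $b$ bits
Dequantize: $\hat{x} = \Delta \cdot y_{k^*}$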
Representation precision: ≈4.3 billion ($2^{32}$) representable values (32-bit) vs. 8 (3-bit) vs. 4 (2-bit)
Quantization range: how can as many elements as possible be quantized together?
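One standard answer is block-wise quantization: split the tensor into fixed-size blocks that each share one absmax scale $\Delta$. A minimal NumPy sketch (the block size, uniform codebook, and function names are illustrative choices):

```python
import numpy as np

def blockwise_quantize(x, bits=3, block=128):
    """Quantize a flat array in blocks, each sharing one absmax scale Δ."""
    xb = x.reshape(-1, block)
    delta = np.abs(xb).max(axis=1, keepdims=True) + 1e-12   # per-block scale Δ
    levels = np.linspace(-1.0, 1.0, 2 ** bits)              # uniform b-bit codebook
    # nearest codebook level for every normalized element
    idx = np.abs(xb[..., None] / delta[..., None] - levels).argmin(axis=-1)
    return idx.astype(np.uint8), delta

def blockwise_dequantize(idx, delta, bits=3):
    levels = np.linspace(-1.0, 1.0, 2 ** bits)
    return (levels[idx] * delta).reshape(-1)                # x̂ = Δ · y_k
```

Larger blocks amortize the cost of storing $\Delta$ over more elements, but a single outlier then inflates $\Delta$ for the whole block, which is exactly the tension behind this question.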
First-/second-order moments: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t$, $\;v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$
Higham N. J. The Accuracy of Floating Point Summation. SIAM Journal on Scientific Computing, 1993.
💡 Summary
❎ Unsigned ❎ $\beta \uparrow$ ❎ $b \downarrow$
Under certain conditions:
In practice, $\beta \ge 0.9$ is a very common setting
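A tiny numerical check of the swamping effect under these conditions, using nearest rounding on a hypothetical unsigned 2-bit grid (the setup is purely illustrative):

```python
import numpy as np

def nearest_round_ema(g=1.0, beta=0.9, bits=2, steps=200):
    """EMA m <- beta*m + (1-beta)*g, re-quantized to a b-bit grid every step."""
    levels = np.linspace(0.0, 1.0, 2 ** bits)         # unsigned uniform codebook
    m = 0.0
    for _ in range(steps):
        m_new = beta * m + (1 - beta) * g
        m = levels[np.abs(m_new - levels).argmin()]   # nearest rounding
    return m

# The increment (1-beta)*g = 0.1 is below half the grid spacing (1/6 ≈ 0.167),
# so nearest rounding snaps m back to 0 every step: the signal is swamped and
# m never moves, even though every incoming gradient equals 1.0.
print(nearest_round_ema())   # -> 0.0
```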
Stochastic signals:
Relaxed conditions:
❎ Fixed $\Delta$
❎ $z \le \Delta$
Assume $y_{k-1} \le x / \Delta \le y_k$:
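Under this assumption, stochastic rounding picks between the two bracketing levels with distance-proportional probabilities (the standard formulation, written in the same notation):

$\mathrm{SR}(x) = \begin{cases} y_k & \text{with prob. } p = \frac{x/\Delta - y_{k-1}}{y_k - y_{k-1}} \\ y_{k-1} & \text{with prob. } 1 - p \end{cases}$

so $\mathbb{E}[\mathrm{SR}(x)] = x/\Delta$ (unbiased, hence no swamping), while $\mathrm{Var}[\mathrm{SR}(x)] = (y_k - x/\Delta)(x/\Delta - y_{k-1})$ peaks at the midpoint of the cell, which is the variance cost discussed next.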
High variance:
✅ Easy to implement
✅ State decay alignment
😄 No Signal Swamping
😞 Requires an extra sign representation (1 bit)
😞 Directly determines the update direction (error-sensitive)
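A quick simulation contrasting stochastic rounding with the nearest-rounding sketch above, on the same hypothetical unsigned 2-bit EMA, illustrates the swamping-free behavior:

```python
import numpy as np

rng = np.random.default_rng(0)
levels = np.linspace(0.0, 1.0, 4)           # unsigned 2-bit grid
beta, g = 0.9, 1.0

def stochastic_round(x):
    k = np.searchsorted(levels, x)          # levels[k-1] <= x <= levels[k]
    lo, hi = levels[k - 1], levels[k]
    p = (x - lo) / (hi - lo)                # distance-proportional probability
    return hi if rng.random() < p else lo

m = 0.0
for _ in range(200):
    m = stochastic_round(beta * m + (1 - beta) * g)

# Nearest rounding stays swamped at 0.0 (see the earlier sketch); the
# stochastically rounded state instead climbs to the true EMA limit g = 1.0.
print(m)
```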
💡 Summary:
$\rightarrow$ Bits $\downarrow$ or $\beta \uparrow$
$\rightarrow$ Quantization errors $\uparrow$
$\rightarrow$ Gradient variance $\uparrow$
$\rightarrow$ Worse convergence
Li H., et al. Convergence of Adam under Relaxed Assumptions. NeurIPS, 2023.
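The first link of this chain is easy to verify numerically; a small, purely illustrative measurement of nearest-rounding error on synthetic Gaussian states with a uniform codebook on $[-1, 1]$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.clip(rng.normal(0.0, 0.3, 100_000), -1.0, 1.0)   # synthetic optimizer states

for bits in (8, 4, 3, 2):
    n = 2 ** bits - 1                                    # uniform grid on [-1, 1]
    xq = np.round((x + 1) / 2 * n) / n * 2 - 1           # nearest rounding
    rms = np.sqrt(np.mean((x - xq) ** 2))
    print(f"{bits} bits: RMS quantization error = {rms:.4f}")
# The RMS error roughly doubles for every bit removed, feeding the chain above.
```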
😒 Traditional methods: $\underset{\text{Training from scratch}}{\xrightarrow{\text{Ultra-Low-Bit}}}$ degeneration/collapse
😊 SOLO: Robust to bits/tasks/models