↗️ Model sizes are growing explosively vs. hardware prices remain stubbornly high
⚙️ Optimizer States (2x model size):
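As a rough illustration, a minimal PyTorch sketch (the layer size and optimizer choice here are arbitrary assumptions) showing that an Adam-style optimizer materializes two FP32 moment tensors per parameter, roughly doubling the memory of the model itself:

```python
import torch

# An Adam-style optimizer keeps two FP32 states per parameter:
# the first moment (exp_avg) and the second moment (exp_avg_sq).
model = torch.nn.Linear(4096, 4096)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One backward + step materializes the optimizer states.
model(torch.randn(1, 4096)).sum().backward()
opt.step()

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = sum(
    t.numel() * t.element_size()
    for s in opt.state.values()
    for t in s.values()
    if torch.is_tensor(t)
)
# Expect roughly state_bytes ≈ 2 * param_bytes (~64 MiB vs. ~128 MiB here).
print(f"params: {param_bytes / 2**20:.1f} MiB, states: {state_bytes / 2**20:.1f} MiB")
```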
Quantization: $q = \mathrm{round}(x / \Delta)$ — store only the low-bit index $q$
Dequantization: $\hat{x} = q \cdot \Delta$ — reconstruct the state, with error $|x - \hat{x}| \le \Delta / 2$
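A minimal sketch of this quantize/dequantize pair, assuming plain symmetric linear quantization with a per-tensor step size $\Delta$ (practical low-bit optimizers often use block-wise or non-uniform variants):

```python
import torch

def quantize(x: torch.Tensor, bits: int = 4):
    """Symmetric linear quantization: store the integer index q = round(x / Δ)."""
    half_range = 2 ** (bits - 1) - 1                 # e.g. 7 levels per side for 4 bits
    delta = x.abs().max() / half_range               # step size Δ from the tensor range
    q = torch.clamp(torch.round(x / delta), -half_range, half_range)
    return q.to(torch.int8), delta

def dequantize(q: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Dequantization: reconstruct x_hat = q * Δ, with error |x - x_hat| ≤ Δ/2."""
    return q.float() * delta

x = torch.randn(8)
q, delta = quantize(x, bits=4)
x_hat = dequantize(q, delta)
print(bool(((x - x_hat).abs() <= delta / 2).all()))  # True: nearest rounding
```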
Higham, N. J. The Accuracy of Floating Point Summation. SIAM Journal on Scientific Computing, 1993.
💡 Summary
Under certain conditions, nearest rounding swamps the incoming signal: the per-step update $(1-\beta)\,g_t$ falls below half a quantization step, so the quantized state never changes
In practice, $\beta \ge 0.9$ is a very common setting, so the incoming signal is scaled down by $(1-\beta) \le 0.1$
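A toy demonstration of the swamping failure mode (the step size $\Delta$, $\beta$, and gradient value below are hypothetical): the exact EMA would converge to $g = 0.4$, but the nearest-rounded state never leaves zero:

```python
# Signal swamping in a quantized EMA: m <- Q(beta * m + (1 - beta) * g).
beta, delta = 0.99, 0.1     # hypothetical decay rate and quantization step
m, g = 0.0, 0.4             # stored state and a constant gradient signal

for _ in range(1000):
    m_exact = beta * m + (1 - beta) * g
    m = round(m_exact / delta) * delta   # nearest-rounding quantization

# (1 - beta) * g = 0.004 < delta / 2 = 0.05, so every update rounds back to 0.0.
print(m)  # 0.0 — the gradient signal was swamped entirely
```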
Stochastic signal: randomize the rounding instead of always rounding to nearest
Relaxed conditions:
❌ Fixed $\Delta$
❌ $z \le \Delta$
Assume $y_{k-1} \le x / \Delta \le y_k$: round up to $y_k$ with probability $\frac{x/\Delta - y_{k-1}}{y_k - y_{k-1}}$, otherwise down to $y_{k-1}$, so that $\mathbb{E}[\hat{x}] = x$ (unbiased)
High variance: unbiasedness is paid for with rounding noise, with per-step variance up to a quarter of the squared level gap
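A minimal sketch of stochastic rounding under the assumption above (a uniform grid is used here for simplicity; with a non-uniform code book, $y_{k-1}$ and $y_k$ are the adjacent levels):

```python
import torch

def stochastic_round(x: torch.Tensor, delta: float) -> torch.Tensor:
    """Round x/Δ up to y_k with probability (x/Δ - y_{k-1}) / (y_k - y_{k-1}).

    On a uniform integer grid the gap y_k - y_{k-1} is 1, so the result is
    unbiased (E[x_hat] = x) at the cost of per-step variance up to Δ²/4.
    """
    z = x / delta
    lo = torch.floor(z)                       # y_{k-1}
    p_up = z - lo                             # probability of rounding up to y_k
    up = (torch.rand_like(z) < p_up).float()
    return (lo + up) * delta

x = torch.full((100_000,), 0.004)             # the same tiny signal swamped above
print(stochastic_round(x, delta=0.1).mean())  # ≈ 0.004: survives in expectation
```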
✅ Easy to implement
✅ State decay alignment
❎ Signal Swamping
☑️ Sign representation
☑️ Descent direction
💡 Summary:
$\rightarrow$ Bits $\downarrow$ or $\beta \uparrow$
$\rightarrow$ Quantization errors $\uparrow$
$\rightarrow$ Gradient variance $\uparrow$
$\rightarrow$ Worse convergence
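To put rough numbers on this chain (a back-of-the-envelope bound, assuming a symmetric uniform quantizer over $[-M, M]$ with $b$ bits and stochastic rounding):

$$\Delta = \frac{2M}{2^b - 1}, \qquad \mathrm{Var}[\hat{x}] \le \frac{\Delta^2}{4} = \frac{M^2}{(2^b - 1)^2}$$

Each bit removed roughly doubles $\Delta$ and quadruples the variance bound, while a larger $\beta$ shrinks the useful signal $(1-\beta)\,g$ relative to $\Delta$, with the same net effect.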
😒 Traditional methods: $\underset{\text{Training from scratch}}{\xrightarrow{\text{Ultra-Low-Bit}}}$ degeneration/collapse
😊 SOLO: Robust to bits/tasks/models