Optimizer States (2x model size):
DeepSeek-v3 training framework: $g \overset{\text{BF16}}{\rightarrow} m, v \overset{\text{FP32}}{\rightarrow} \theta$
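A back-of-the-envelope sketch of why the moments alone cost 2x the model size and what storing them in BF16 instead of FP32 saves; the byte breakdown and the 7B-parameter example are illustrative assumptions, not DeepSeek-v3's exact accounting:

```python
# Rough per-parameter memory accounting for AdamW (illustrative only).
# Assumes BF16 weights/gradients, FP32 master weights, and moments m, v
# stored either in FP32 or in BF16, following the recipe sketched above.
def adamw_bytes_per_param(moment_bytes: int) -> int:
    weights_bf16 = 2
    grads_bf16 = 2
    master_fp32 = 4
    m_state = moment_bytes   # first moment
    v_state = moment_bytes   # second moment
    return weights_bf16 + grads_bf16 + master_fp32 + m_state + v_state

n = 7_000_000_000                                   # e.g. a 7B-parameter model
gib = 1024 ** 3
print(adamw_bytes_per_param(4) * n / gib, "GiB")    # FP32 m, v: ~104 GiB
print(adamw_bytes_per_param(2) * n / gib, "GiB")    # BF16 m, v: ~78 GiB
```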
Quantization:
Dequantization:
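A minimal NumPy sketch of block-wise quantization and dequantization in the spirit of the works cited below; the block size and the linear codebook are simplifying assumptions (the 8-bit optimizer paper uses a non-linear dynamic map), and `quantize`/`dequantize` are illustrative names:

```python
import numpy as np

BLOCK = 2048                              # assumed block size
CODEBOOK = np.linspace(-1.0, 1.0, 256)    # assumed 8-bit linear codebook

def quantize(x: np.ndarray):
    """Per block: rescale by the absmax, then store the index of the
    nearest codebook entry (uint8) plus one scale per block."""
    blocks = x.reshape(-1, BLOCK)
    scale = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    normed = blocks / scale                               # in [-1, 1]
    idx = np.argmin(np.abs(normed[..., None] - CODEBOOK), axis=-1)
    return idx.astype(np.uint8), scale

def dequantize(idx: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Look up the codebook value and undo the per-block scaling."""
    return (CODEBOOK[idx] * scale).ravel()

x = np.random.randn(8 * BLOCK).astype(np.float32)
q, s = quantize(x)
print("max abs error:", np.abs(x - dequantize(q, s)).max())
```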
Dettmers T., et al. 8-bit Optimizers via Block-wise Quantization. ICLR, 2022.
Li B., et al. Memory Efficient Optimizers with 4-bit States. NeurIPS, 2023.
Higham N. J. The Accuracy of Floating Point Summation. SIAM Journal on Scientific Computing, 1993.
Under certain conditions:
In practice, $\beta \ge 0.9$ is a fairly common setting
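For reference, assuming the state in question is the usual exponential moving average (the exact condition above is not reproduced here), the update is

$$m_t = \beta\, m_{t-1} + (1-\beta)\, g_t,$$

so with $\beta = 0.99$ the per-step increment $(1-\beta) g_t$ is roughly two orders of magnitude smaller than the retained term whenever $g_t$ and $m_{t-1}$ are of comparable size, which is what makes low-precision accumulation delicate (cf. Signal Swamping below).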
✗ Fixed $\Delta$
✗ $z \le \Delta$
Assume $\iota_{k-1} \le x / \Delta \le \iota_k$:
High variance:
Easy to implement
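A minimal sketch, assuming these bullets describe stochastic rounding onto a fixed grid of step $\Delta$ (levels $\iota_k \Delta$): round up with probability equal to the fractional position, which is unbiased and easy to implement but adds variance, and which, unlike round-to-nearest, does not silently drop increments below the step:

```python
import numpy as np

rng = np.random.default_rng(0)

def round_nearest(x: np.ndarray, delta: float) -> np.ndarray:
    """Deterministic round-to-nearest onto the grid {k * delta}."""
    return np.round(x / delta) * delta

def round_stochastic(x: np.ndarray, delta: float) -> np.ndarray:
    """Round up to the next grid point with probability equal to the
    fractional position of x/delta, else round down: unbiased
    (E[SR(x)] = x) at the cost of extra variance."""
    lo = np.floor(x / delta)
    p_up = x / delta - lo
    return (lo + (rng.random(x.shape) < p_up)) * delta

# A small increment is lost by nearest rounding but preserved on average
# by stochastic rounding.
delta = 1 / 16
z = np.full(100_000, 0.01)                    # well below delta / 2
print(round_nearest(z, delta).mean())         # 0.0
print(round_stochastic(z, delta).mean())      # ~0.01
```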
State Decay Alignment
✗ Signal Swamping (demo below)
✓ Sign representation
✓ Descent direction
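A small PyTorch demonstration of the swamping failure mode, assuming the state is accumulated in BF16 as in the recipe above; the numbers are chosen only to sit just below BF16's resolution near 1.0:

```python
import torch

# BF16 keeps an 8-bit mantissa, so the gap between 1.0 and the next
# representable value is 2**-8 ~= 0.0039.  An EMA-style increment
# (1 - beta) * g smaller than half this gap is rounded away entirely.
beta, g = 0.999, 1.0
m = torch.tensor(1.0, dtype=torch.bfloat16)
increment = torch.tensor((1 - beta) * g, dtype=torch.bfloat16)  # ~0.001

m_new = m + increment
print(m_new)          # tensor(1., dtype=torch.bfloat16)
print(m_new == m)     # tensor(True): the increment is swamped, the state never moves
```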
$\rightarrow$ Bits $\downarrow$ or $\beta \uparrow$
$\rightarrow$ Quantization errors $\uparrow$
$\rightarrow$ Gradient variance $\uparrow$
$\rightarrow$ Poor convergence
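As a rough way to quantify the first arrow (a standard uniform-quantization noise model, offered as an illustration rather than the analysis behind these notes): for $b$ bits over a range $R$,

$$\Delta = \frac{R}{2^{b}}, \qquad \operatorname{Var}\!\left[x - Q(x)\right] \approx \frac{\Delta^{2}}{12} = \frac{R^{2}}{12 \cdot 4^{b}},$$

so each bit removed roughly quadruples the error variance injected into the state, which the optimizer then sees as extra gradient noise.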