1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed

Preliminaries

Core Idea

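The core mechanism of 1-bit Adam (reference 2 below): run vanilla Adam with full-precision communication for a warmup phase, then freeze the second-moment (variance) term and from that point on communicate only a 1-bit sign-compressed momentum with error compensation, which keeps Adam-like convergence while sharply cutting communication volume. Below is a minimal single-worker sketch of the post-warmup step, assuming NumPy; the function names (`one_bit_compress`, `compressed_momentum_step`) are hypothetical, and the all-reduce between workers and the server-side error compensation described in the paper are omitted.

```python
import numpy as np

def one_bit_compress(x):
    """1-bit compression (sketch): keep only the sign of each element,
    rescaled by the mean absolute value so the tensor's overall
    magnitude is roughly preserved."""
    scale = np.abs(x).mean()          # one fp32 scalar per tensor
    return scale * np.sign(x)         # effectively 1 bit per element + a scale

def compressed_momentum_step(grad, m, error, v_frozen, lr,
                             beta1=0.9, eps=1e-8):
    """One post-warmup step on a single worker (sketch, not the DeepSpeed API).

    m        : momentum carried by the worker
    error    : local error-feedback buffer
    v_frozen : Adam's second moment, frozen at the end of warmup
    Returns the parameter update and the new (m, error) state.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # momentum update, as in Adam
    compressed = one_bit_compress(m + error)    # compress momentum plus past error
    error = (m + error) - compressed            # remember what compression lost
    # `compressed` is what would be all-reduced across workers;
    # here it is used directly in place of the averaged momentum.
    update = lr * compressed / (np.sqrt(v_frozen) + eps)
    return update, m, error
```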

References

  1. Seide F., Fu H., Droppo J., Li G., and Yu D. 1-Bit Stochastic Gradient Descent and Its Application to Data-Parallel Distributed Training of Speech DNNs. INTERSPEECH, 2014. [PDF] [Code]
  2. Tang H., Gan S., Awan A. A., Rajbhandari S., Lian X., Liu J., Zhang C., and He Y. 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. ICML, 2021. [PDF] [Code]