Compressing Gradient Optimizers via Count-Sketches
An ICML 2019 paper by Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, Anshumali Shrivastava
Trained with Activation Checkpointing and Mixed Precision Training (FP16) on Nvidia V100 DGX-1 servers
BERT-Large | Adam | Count-Min Sketch (CMS) - RMSprop |
---|---|---|
Time (Days) | 5.32 | 5.52 |
Size (MB) | 7,097 | 5,133 |
Test Perplexity | 4.04 | 4.18 |
- Install Requirements
  - torch
  - torchvision
  - cupy
  - pynvrtc
- Add optimizers folder to $PYTHONPATH
- ImageNet - ResNet-18
- LM1B - Transformer / LSTM
- Wikitext-2 - LSTM
We support compression for the dense layers of the neural network, whose gradient updates are not sparse. During training, a single fused CUDA kernel updates the auxiliary variables and performs the gradient update for each parameter. The dense kernel is equivalent to the sparse kernel. The main difference is that we explicitly avoid generating the auxiliary variables for the dense layers in global memory. Instead, we access them inside the shared memory of the GPU Streaming Multiprocessor. Without this key feature, our approach would not save any GPU memory for the dense layers. In the sparse case, we assume that the number of non-zero gradient updates is significantly smaller than the size of the auxiliary variable. (See dense_exp_cms.py for more details.)
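
The sparse case can be illustrated with a toy, CPU-only sketch in plain Python. This is not the repository's fused CUDA kernel or its API: the `CountMinSketch` class, the `sketched_rmsprop_step` function, the hash family, and all hyperparameters are assumptions made for illustration. It stores an RMSprop-style second-moment estimate in a small Count-Min Sketch table, and only the indices with non-zero gradients read from or write to it.

```python
import random


class CountMinSketch:
    """Toy Count-Min Sketch for non-negative values (hypothetical helper)."""

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        self.prime = 2_147_483_647  # Mersenne prime 2^31 - 1
        # One (a, b) pair per row: h(i) = ((a * i + b) mod prime) mod width
        self.params = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def _bucket(self, row, index):
        a, b = self.params[row]
        return ((a * index + b) % self.prime) % self.width

    def scale(self, factor):
        # Exponential decay is applied to every cell of the sketch.
        self.table = [[factor * x for x in row] for row in self.table]

    def update(self, index, value):
        # Add the value to one bucket in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, index)] += value

    def query(self, index):
        # Collisions only add non-negative mass, so the minimum over rows
        # is the tightest estimate and never underestimates the true value.
        return min(self.table[row][self._bucket(row, index)]
                   for row in range(self.depth))


def sketched_rmsprop_step(params, sparse_grad, sketch,
                          lr=0.01, beta=0.9, eps=1e-8):
    """params: list of floats; sparse_grad: dict mapping index -> gradient."""
    sketch.scale(beta)
    for i, g in sparse_grad.items():
        sketch.update(i, (1.0 - beta) * g * g)
    for i, g in sparse_grad.items():
        params[i] -= lr * g / (sketch.query(i) ** 0.5 + eps)


# 1000 parameters, but only 3 * 64 = 192 cells of auxiliary state.
params = [0.0] * 1000
sketch = CountMinSketch(depth=3, width=64)
sketched_rmsprop_step(params, {7: 0.5, 420: -2.0}, sketch)
```

Because the sketch can only overestimate the second moment, a collision makes the effective step size smaller rather than larger. In this toy version the decay touches every cell of the table, which is cheap only because the table is much smaller than the parameter vector.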