Hello, thank you for the outstanding work on quantised attention mechanisms. We see a 300% speedup during training on a 4090-based cluster - but that is only because the QKV linears are not receiving gradients on the backward pass.
I've put together a rough implementation that approximates the gradient (a sketch of what I mean is at the end of this post), but the results are not good - the speedup, however, remains mostly intact.
Supporting a quantised backward pass with meaningful gradients - keeping the rounded/quantised values in the computation graph - looks like a significant challenge.
I know the suggestion in another issue thread (unrelated to autograd) was that, since H100s are what get used for large-scale pretraining, it would be most useful to implement the forward pass for H100 before working on the backward.
However, even with an approximated gradient, training with quantised attention is notably faster for 12B models like Flux.
I just wanted to open this issue so that others waiting for news or updates on this challenge have something to track.
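For reference, here is a minimal sketch of the kind of approximation I mean: the quantised kernel runs in the forward, and the backward simply recomputes standard attention in full precision and differentiates that (a straight-through style fallback). `quantized_attention_forward` is just a placeholder stub here, not the actual API of this repo, and no mask or dropout is handled.

```python
import torch
import torch.nn.functional as F


def quantized_attention_forward(q, k, v):
    # Placeholder for the real INT8 attention kernel; swap in the actual call.
    return F.scaled_dot_product_attention(q, k, v)


class ApproxQuantAttention(torch.autograd.Function):
    """Quantised forward, approximate full-precision backward."""

    @staticmethod
    def forward(ctx, q, k, v):
        # Only the full-precision inputs are saved; the quantised
        # intermediates never enter the autograd graph.
        ctx.save_for_backward(q, k, v)
        return quantized_attention_forward(q, k, v)

    @staticmethod
    def backward(ctx, grad_out):
        q, k, v = ctx.saved_tensors
        # Recompute attention in full precision and differentiate that,
        # i.e. a straight-through style approximation of the true gradient.
        with torch.enable_grad():
            q_, k_, v_ = (t.detach().requires_grad_(True) for t in (q, k, v))
            out = F.scaled_dot_product_attention(q_, k_, v_)
            gq, gk, gv = torch.autograd.grad(out, (q_, k_, v_), grad_out)
        return gq, gk, gv
```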
Thanks for your attention! Quantizing the backward pass poses additional, largely unexplored challenges, but it also offers great potential and opportunity. We will try our best to solve this challenge in the future.
@jason-huang03 maybe just use the rematerialization trick to get the ctx for the backward pass? It's basically the same approach FlashAttention takes - then we could work out the backward kernel from there. See the sketch below.
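A plain-PyTorch mock of what I mean - the forward here is unquantised just to keep it readable; the point is only that `ctx` stores q, k, v and the row-wise logsumexp, and the attention probabilities are rematerialized in the backward instead of being saved (a real kernel would of course tile this):

```python
import math
import torch


class RematAttention(torch.autograd.Function):
    """Save only q, k, v and the logsumexp; rebuild P in the backward."""

    @staticmethod
    def forward(ctx, q, k, v):
        scale = 1.0 / math.sqrt(q.shape[-1])
        scores = (q @ k.transpose(-2, -1)) * scale            # (B, H, N, N)
        lse = torch.logsumexp(scores, dim=-1, keepdim=True)   # row-wise logsumexp
        p = torch.exp(scores - lse)                           # softmax probabilities
        out = p @ v
        # P is dropped here and rematerialized in the backward.
        ctx.save_for_backward(q, k, v, lse)
        ctx.scale = scale
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, k, v, lse = ctx.saved_tensors
        scale = ctx.scale
        # Rematerialize the probabilities from q, k and the saved logsumexp.
        scores = (q @ k.transpose(-2, -1)) * scale
        p = torch.exp(scores - lse)
        grad_v = p.transpose(-2, -1) @ grad_out
        grad_p = grad_out @ v.transpose(-2, -1)
        # Softmax backward: dS = P * (dP - rowsum(dP * P))
        grad_scores = p * (grad_p - (grad_p * p).sum(dim=-1, keepdim=True))
        grad_q = (grad_scores @ k) * scale
        grad_k = (grad_scores.transpose(-2, -1) @ q) * scale
        return grad_q, grad_k, grad_v
```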