Feature request
When building a training config for the Trainer, I'd love to be able to do something like:
    training_args = SFTConfig(
        eval_strategy="time",
        eval_minutes=15,
        save_strategy="time",
        save_minutes=30,
    )
Motivation
I always end up using steps as a proxy for time, but this is painful when I run on different GPUs or, most commonly, run the code locally (MacBook Pro) to check that it works and then run it remotely on a beefy GPU. The same number of steps takes very different amounts of wall-clock time on each machine.
For saving, I want to "not lose significant amounts of work", and for me this is always time-based (e.g. I don't want to wait 6 hours only to see the model crash at hour 5 and lose all progress), since both my waiting and my paying for compute are time-based. Similarly for eval: I don't want to eval every minute (expensive!), but I do want to eval often enough to know the model is still progressing. Time-based scheduling solves both.
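To make the idea concrete, here is a minimal sketch of the wall-clock trigger logic I have in mind, using only the standard library. The names `IntervalTrigger` and `simulate` are hypothetical and not part of any existing API; the fake clock is only there so the behavior can be checked without waiting. I'd guess something like this could be called from a `TrainerCallback.on_step_end` hook that sets `control.should_save` or `control.should_evaluate`, but that wiring is an assumption and not shown here.

```python
import time
from typing import Callable, List, Tuple


class IntervalTrigger:
    """Hypothetical helper: fires at most once per `minutes` of wall-clock time."""

    def __init__(self, minutes: float, clock: Callable[[], float] = time.monotonic):
        self.interval_s = minutes * 60.0
        self.clock = clock
        self.last_fired = clock()  # start counting from construction

    def due(self) -> bool:
        """Return True and reset the timer if the interval has elapsed."""
        now = self.clock()
        if now - self.last_fired >= self.interval_s:
            self.last_fired = now
            return True
        return False


def simulate(
    step_seconds: float, total_steps: int, save_minutes: float, eval_minutes: float
) -> Tuple[List[int], List[int]]:
    """Fake training loop on a simulated clock; returns the steps where save/eval fired."""
    fake_now = [0.0]
    clock = lambda: fake_now[0]
    save_trigger = IntervalTrigger(save_minutes, clock)
    eval_trigger = IntervalTrigger(eval_minutes, clock)
    saves, evals = [], []
    for step in range(1, total_steps + 1):
        fake_now[0] += step_seconds  # pretend one training step took this long
        if eval_trigger.due():
            evals.append(step)
        if save_trigger.due():
            saves.append(step)
    return saves, evals


# Fast machine: 1 minute per step. Slow machine: 10 minutes per step.
fast_saves, fast_evals = simulate(60, 60, save_minutes=30, eval_minutes=15)
slow_saves, slow_evals = simulate(600, 6, save_minutes=30, eval_minutes=15)
```

The step numbers at which saves and evals fire differ between the two simulated machines, while the wall-clock spacing stays roughly the same, which is the whole point. One thing I'm unsure about: in multi-GPU runs each process has its own clock, so the decision would probably need to be made on one rank and shared with the others.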
Your contribution
I'm down to write code for this! I don't know the deeper architectural considerations so greatly welcome the community's insights. Thanks :)