Metrics and plots for failed queue experiments #9787
Comments
Related to #9776. It might close that one. Not sure what the behavior will be for experiments that fail during dvc setup.
They are unavailable because we don't make a commit at the end of a failed exp run at all. In the actual experiments code we don't have a concept of failed exps; a finished exp ref is only ever considered to be successful. The current behavior for failed exps is just an extension of the celery queue, which is why it only contains information about the initial queued exp state (plus the celery logs that get exposed in And if the metrics are

Implementing this would be on the level of a substantial feature and not any kind of quick fix.
How do we collect them while the experiment is running, and why can't we do the same after it fails?
We read them directly from the tempdir workspace while the experiment is running. The tempdir gets deleted after execution, so we can't do the same thing for failed experiments unless we want to start retaining the temporary workspace copies for queued experiment runs.
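To make the point above concrete, here is a minimal sketch of why live reads work mid-run but not after a failure. It is not DVC's code: `read_live_metrics`, the `metrics.json` filename, and the metric values are all hypothetical, and the temporary directory only stands in for the queue's tempdir workspace.

```python
import json
import shutil
import tempfile
from pathlib import Path


def read_live_metrics(workdir: Path, name: str = "metrics.json"):
    """Return metrics parsed from the workspace, or None if the file is gone.

    Hypothetical helper for illustration; not a DVC function.
    """
    path = workdir / name
    if not path.exists():
        return None
    return json.loads(path.read_text())


# Simulate a queued run writing metrics into its temporary workspace.
workdir = Path(tempfile.mkdtemp(prefix="exp-"))
(workdir / "metrics.json").write_text(json.dumps({"epoch": 1, "loss": 0.42}))

# While the run is alive, the metrics can be read straight from the workspace.
while_running = read_live_metrics(workdir)

# The run fails or is killed: the workspace is removed and nothing was committed.
shutil.rmtree(workdir)

# With no commit and no workspace, there is nothing left to read.
after_failure = read_live_metrics(workdir)
```

In this sketch `while_running` holds the metrics and `after_failure` is `None`, which mirrors the reported behavior of falling back to the baseline once the experiment dies.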
Discussed with @pmrowla that the best solution seems to be committing and making git refs even for failed experiments (enables us to track everything, apply the results of failed experiments, etc.).

@pmrowla Thinking some more about it, I worry about the performance and what happens if users don't want to wait for the commit to complete, especially if it generates large dvc artifacts. The other option is to not delete the tmpdir until requested by the users, which is simpler but pollutes the workspace with more tmpdir copies.
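The second option (keep the tmpdir until the user asks for removal) could look roughly like the sketch below. This is only an illustration under assumptions: `run_experiment`, `clean_failed`, and the `retained` registry are hypothetical names, not part of DVC, and a process that is hard-killed would never reach the `except` branch, so a real implementation would need to handle that case separately.

```python
import shutil
import tempfile
from pathlib import Path


def run_experiment(task, retained: dict, name: str) -> str:
    """Run task(workdir) in a temp workspace; keep the workspace only on failure.

    Hypothetical sketch; `retained` maps experiment names to kept workspaces.
    """
    workdir = Path(tempfile.mkdtemp(prefix=f"exp-{name}-"))
    try:
        task(workdir)
    except Exception:
        # Keep the files so metrics and plots stay readable after the failure.
        retained[name] = workdir
        return "failed"
    # Success path: results would be committed elsewhere, so the copy can go.
    shutil.rmtree(workdir)
    return "success"


def clean_failed(retained: dict, name: str) -> None:
    """Delete a retained workspace when the user explicitly requests it."""
    shutil.rmtree(retained.pop(name), ignore_errors=True)
```

The trade-off mentioned above shows up directly: every failed run leaves a full workspace copy behind until `clean_failed` is called, whereas committing failed runs would avoid the extra copies at the cost of waiting for the commit.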
Metrics and plots work while queued experiments are running, but if an experiment is killed or fails in the middle, the metrics revert to the baseline and the plots are no longer available.
Here's what I did using https://github.com/dberenbaum/lstm_seq2seq:
After a couple minutes (when the first epoch completes), you should start seeing metrics and plots:
Then kill the experiments and you will see the metrics and plots info dropped:
We should preserve metrics and plots for failed experiments.