Metrics and plots for failed queue experiments #9787

Closed
dberenbaum opened this issue Aug 1, 2023 · 5 comments
Labels
A: experiments Related to dvc exp product: VSCode Integration with VSCode extension

Comments

@dberenbaum
Collaborator

Metrics and plots work while queued experiments are running, but if an experiment is killed or fails in the middle, the metrics revert to the baseline and the plots are no longer available.

Here's what I did using https://github.com/dberenbaum/lstm_seq2seq:

$ dvc exp run --queue -S num_samples=10000 -S model.max_epochs=5 -S 'model.optim.lr=range(0.001,0.01,0.001)'
/Users/dave/micromamba/envs/lstm_seq2seq/lib/python3.11/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.001']}'.
Queued experiment 'tamer-tils' for future execution.
/Users/dave/micromamba/envs/lstm_seq2seq/lib/python3.11/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.002']}'.
Queued experiment 'silky-seam' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.003']}'.
Queued experiment 'techy-tort' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.004']}'.
Queued experiment 'balky-polo' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.005']}'.
Queued experiment 'cruel-jato' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.006']}'.
Queued experiment 'color-stay' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.007']}'.
Queued experiment 'soppy-ludo' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.008']}'.
Queued experiment 'tenty-bice' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.009000000000000001']}'.
Queued experiment 'seely-ados' for future execution.

$ dvc queue start -j 3
Started '3' new experiments task queue workers.

After a couple minutes (when the first epoch completes), you should start seeing metrics and plots:

$ dvc exp show --no-pager
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created    State     Executor   val.loss    val.acc   epoch   step   …   optim_params.lr   …   fra.txt                            …
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  workspace                  -          -         -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   …
  main                       03:48 PM   -         -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 4322af3 [tamer-tils]   -          Running   Dvc-task     4.1266   0.017616       0     15   …   0.001             8   f16099673fd64e9fda1e17927ad02248   …
  ├── c3e7398 [techy-tort]   -          Running   Dvc-task     1.7217   0.020047       0     15   …   0.003             8   f16099673fd64e9fda1e17927ad02248   …
  ├── dfa7fd7 [tenty-bice]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── c23047f [seely-ados]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── d0aabe3 [cruel-jato]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── b339a44 [color-stay]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 50f9238 [soppy-ludo]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 3e84fbd [balky-polo]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  └── d037aaf [silky-seam]   04:13 PM   Failed    -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

$ dvc plots diff tamer-tils techy-tort
file:///private/tmp/lstm_seq2seq/dvc_plots/index.html

Then kill the experiments and you will see the metrics and plots info dropped:

$ dvc queue stop --kill
tamer-tils has been killed.
techy-tort has been killed.
All running tasks in the queue have been killed.Queue workers are stopping.

$ dvc exp show --no-pager
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created    State    Executor   optim_params.lr   latent_dim   model.batch_size   model.latent_dim   …   fra.txt                            …
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  workspace                  -          -        -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   …
  main                       03:48 PM   -        -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── dfa7fd7 [tenty-bice]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── c23047f [seely-ados]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── d0aabe3 [cruel-jato]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── b339a44 [color-stay]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── 50f9238 [soppy-ludo]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── 3e84fbd [balky-polo]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── d037aaf [silky-seam]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── c3e7398 [techy-tort]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  └── 4322af3 [tamer-tils]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

$ dvc plots diff tamer-tils techy-tort
ERROR: unknown Git revision 'tamer-tils'

We should preserve metrics and plots for failed experiments.

@dberenbaum
Collaborator Author

Related to #9776. It might close that one. Not sure what the behavior will be for experiments that fail during dvc setup.

@pmrowla
Contributor

pmrowla commented Aug 2, 2023

They are unavailable because we don't make a commit at the end of a failed exp run. The actual experiments code has no concept of failed exps; a finished exp ref is only ever considered successful. The current behavior for failed exps is just an extension of the celery queue, which is why it only contains information about the initial queued exp state (plus the celery logs that get exposed in dvc queue log).

And if the metrics are cache: true, we will need to force generation of a partial lock file that contains only the available metrics and ignores outputs that are missing because of the failed run. We would also need to account for things like incomplete writes that leave the metrics file corrupted.
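
To illustrate the corrupted-file concern above, here is a minimal sketch of defensive metrics loading. It is not DVC's code: `load_partial_metrics` is a hypothetical helper, and it assumes the metrics file is JSON. It treats a missing file or an interrupted write as "no metrics available" instead of raising.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_partial_metrics(path: Path) -> Optional[Dict[str, Any]]:
    """Return parsed metrics, or None if the file is missing or not valid JSON.

    Hypothetical helper for illustration only; not part of DVC.
    """
    try:
        text = path.read_text(encoding="utf-8")
    except FileNotFoundError:
        # The run failed before any metrics were written.
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Most likely an interrupted write left a truncated file.
        return None
    # Only accept a mapping of metric names to values.
    return data if isinstance(data, dict) else None
```

A partial lock file could then record only the metrics files for which a helper like this returns a value, though how that would fit into DVC's lock file handling is left open here.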

Implementing this would be on the level of a substantial feature and not any kind of quick fix.

@dberenbaum
Collaborator Author

How do we collect them while the experiment is running and why can't we do the same after they fail?

@pmrowla
Contributor

pmrowla commented Aug 3, 2023

> How do we collect them while the experiment is running and why can't we do the same after they fail?

We read them directly from the tempdir workspace while it's running. The tempdir gets deleted after execution so we can't do the same thing for failed experiments, unless we want to start retaining the temporary workspace copies for queued experiment runs.
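
The lifecycle described above can be sketched in a few lines. This is a simplified model, not DVC's implementation: the `metrics.json` filename, the `read_metrics` helper, and the use of `tempfile.TemporaryDirectory` are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Optional, Tuple


def read_metrics(workspace: Path) -> Optional[Dict[str, Any]]:
    """Read metrics straight from a workspace directory, if they exist."""
    metrics_file = workspace / "metrics.json"  # hypothetical filename
    if not metrics_file.exists():
        return None
    return json.loads(metrics_file.read_text(encoding="utf-8"))


def demo() -> Tuple[Optional[Dict[str, Any]], Optional[Dict[str, Any]]]:
    """Return what is readable while the run is alive and after cleanup."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        # The running experiment writes metrics into its temporary workspace.
        (workspace / "metrics.json").write_text(
            json.dumps({"val": {"loss": 4.1266}, "step": 15}), encoding="utf-8"
        )
        while_running = read_metrics(workspace)
    # The temporary directory is deleted on exit, and the metrics go with it.
    after_cleanup = read_metrics(workspace)
    return while_running, after_cleanup
```

In this model the first value holds the metrics and the second is `None`, which mirrors the metrics columns disappearing from `dvc exp show` once the workers are killed.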

@dberenbaum
Collaborator Author

Discussed with @pmrowla that the best solution seems to be committing and creating Git refs even for failed experiments (this enables us to track everything, apply the results of failed experiments, etc.). @pmrowla, thinking some more about it, I worry about performance and about what happens if users don't want to wait for the commit to complete, especially if it generates large DVC artifacts.

The other option is to not delete the tmpdir until requested by the users, which is simpler but pollutes the workspace with more tmpdir copies.
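
As a rough sketch of this second option, the snippet below keeps the temporary workspace when a run fails and deletes it otherwise. It is an illustration under stated assumptions, not a proposal for DVC's internals: `run_in_tempdir` and `keep_on_failure` are hypothetical names, and failure is modelled as a Python exception, whereas a killed worker would need the cleanup decision to be made by the parent process.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_in_tempdir(
    task: Callable[[Path], None], keep_on_failure: bool = True
) -> Optional[Path]:
    """Run `task` in a fresh temporary workspace.

    Returns the retained workspace path if the task failed and the
    workspace was kept; returns None otherwise. Hypothetical helper.
    """
    workspace = Path(tempfile.mkdtemp(prefix="exp-"))
    try:
        task(workspace)
    except Exception:
        if keep_on_failure:
            # Leave metrics and plots on disk until the user asks for removal.
            return workspace
        shutil.rmtree(workspace, ignore_errors=True)
        return None
    # Successful runs are cleaned up as before.
    shutil.rmtree(workspace, ignore_errors=True)
    return None
```

The trade-off mentioned above shows up directly: every failed run leaves a directory behind, so some explicit cleanup step would be needed to remove retained workspaces.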

@dberenbaum dberenbaum added this to DVC Sep 6, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Sep 6, 2023
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Sep 6, 2023
@dberenbaum dberenbaum removed the p1-important Important, aka current backlog of things to do label Mar 5, 2024
@dberenbaum dberenbaum closed this as not planned Apr 24, 2024