Metrics and plots for failed queue experiments #9787

dberenbaum · 2023-08-01T20:18:46Z

Metrics and plots work while queued experiments are running, but if an experiment is killed or fails in the middle, the metrics revert to the baseline and the plots are no longer available.

Here's what I did using https://github.com/dberenbaum/lstm_seq2seq:

$ dvc exp run --queue -S num_samples=10000 -S model.max_epochs=5 -S 'model.optim.lr=range(0.001,0.01,0.001)'
/Users/dave/micromamba/envs/lstm_seq2seq/lib/python3.11/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.001']}'.
Queued experiment 'tamer-tils' for future execution.
/Users/dave/micromamba/envs/lstm_seq2seq/lib/python3.11/site-packages/hydra/_internal/defaults_list.py:251: UserWarning: In 'config': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.002']}'.
Queued experiment 'silky-seam' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.003']}'.
Queued experiment 'techy-tort' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.004']}'.
Queued experiment 'balky-polo' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.005']}'.
Queued experiment 'cruel-jato' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.006']}'.
Queued experiment 'color-stay' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.007']}'.
Queued experiment 'soppy-ludo' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.008']}'.
Queued experiment 'tenty-bice' for future execution.
Queueing with overrides '{'params.yaml': ['num_samples=10000', 'model.max_epochs=5', 'model.optim.lr=0.009000000000000001']}'.
Queued experiment 'seely-ados' for future execution.

$ dvc queue start -j 3
Started '3' new experiments task queue workers.

After a couple minutes (when the first epoch completes), you should start seeing metrics and plots:

$ dvc exp show --no-pager
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created    State     Executor   val.loss    val.acc   epoch   step   …   optim_params.lr   …   fra.txt                            …
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  workspace                  -          -         -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   …
  main                       03:48 PM   -         -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 4322af3 [tamer-tils]   -          Running   Dvc-task     4.1266   0.017616       0     15   …   0.001             8   f16099673fd64e9fda1e17927ad02248   …
  ├── c3e7398 [techy-tort]   -          Running   Dvc-task     1.7217   0.020047       0     15   …   0.003             8   f16099673fd64e9fda1e17927ad02248   …
  ├── dfa7fd7 [tenty-bice]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── c23047f [seely-ados]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── d0aabe3 [cruel-jato]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── b339a44 [color-stay]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 50f9238 [soppy-ludo]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  ├── 3e84fbd [balky-polo]   04:13 PM   Queued    Dvc-task          -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
  └── d037aaf [silky-seam]   04:13 PM   Failed    -                 -          -       -      -   -   0.01              8   f16099673fd64e9fda1e17927ad02248   -
 ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

$ dvc plots diff tamer-tils techy-tort
file:///private/tmp/lstm_seq2seq/dvc_plots/index.html

Then kill the experiments, and the metrics and plots information is dropped:

$ dvc queue stop --kill
tamer-tils has been killed.
techy-tort has been killed.
All running tasks in the queue have been killed.Queue workers are stopping.

$ dvc exp show --no-pager
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  Experiment                 Created    State    Executor   optim_params.lr   latent_dim   model.batch_size   model.latent_dim   …   fra.txt                            …
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
  workspace                  -          -        -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   …
  main                       03:48 PM   -        -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── dfa7fd7 [tenty-bice]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── c23047f [seely-ados]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── d0aabe3 [cruel-jato]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── b339a44 [color-stay]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── 50f9238 [soppy-ludo]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── 3e84fbd [balky-polo]   04:13 PM   Queued   Dvc-task   0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── d037aaf [silky-seam]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  ├── c3e7398 [techy-tort]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
  └── 4322af3 [tamer-tils]   04:13 PM   Failed   -          0.01              8            512                8                  …   f16099673fd64e9fda1e17927ad02248   -
 ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

$ dvc plots diff tamer-tils techy-tort
ERROR: unknown Git revision 'tamer-tils'

We should preserve metrics and plots for failed experiments.

dberenbaum · 2023-08-01T21:33:13Z

Related to #9776. It might close that one. Not sure what the behavior will be for experiments that fail during dvc setup.

pmrowla · 2023-08-02T01:04:29Z

They are unavailable because we don't make a commit at the end of a failed exp run. The actual experiments code has no concept of failed exps: a finished exp ref is only ever considered successful. The current behavior for failed exps is just an extension of the celery queue, which is why it only contains information about the initial queued exp state (plus the celery logs that get exposed in dvc queue log).

And if the metrics are cache: true, we will need to force generating a partial lock file that only contains the available metrics and ignores missing outputs due to the failed run (and account for things like incomplete writes to the metrics file that leave it corrupted).
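The incomplete-write concern can be illustrated with a small sketch. It assumes the metrics are stored as JSON and uses a hypothetical helper, `load_metrics_or_none`, which is not part of DVC. It only shows how a collector might skip a missing or truncated metrics file instead of crashing on it.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_metrics_or_none(path: Path) -> Optional[dict]:
    """Return parsed metrics, or None if the file is missing or unparsable.

    Hypothetical helper (not a DVC API): a killed run may leave a metrics
    file missing or cut off mid-write.
    """
    try:
        data = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    # Only accept a JSON object; anything else is treated as unusable.
    return data if isinstance(data, dict) else None


with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "complete.json"
    truncated = Path(tmp) / "truncated.json"
    complete.write_text('{"val": {"loss": 4.1266, "acc": 0.017616}}')
    # Simulate a write that was interrupted partway through.
    truncated.write_text('{"val": {"loss": 4.12')

    complete_metrics = load_metrics_or_none(complete)
    truncated_metrics = load_metrics_or_none(truncated)
    missing_metrics = load_metrics_or_none(Path(tmp) / "missing.json")
```

Here the complete file parses normally, while the truncated and missing files both come back as `None`, so a partial lock file could record only the metrics that were actually readable.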

Implementing this would be on the level of a substantial feature and not any kind of quick fix.

dberenbaum · 2023-08-02T12:29:23Z

How do we collect them while the experiment is running and why can't we do the same after they fail?

pmrowla · 2023-08-03T09:23:42Z

> How do we collect them while the experiment is running and why can't we do the same after they fail?

We read them directly from the tempdir workspace while it's running. The tempdir gets deleted after execution so we can't do the same thing for failed experiments, unless we want to start retaining the temporary workspace copies for queued experiment runs.
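A toy illustration of that lifecycle, not DVC's actual code: metrics can be read from the temporary workspace while it exists, but once the directory is removed there is nothing left to read. The file name `metrics.json` and the helper `read_live_metrics` are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def read_live_metrics(workspace: Path) -> dict:
    """Read metrics straight from a (hypothetical) run workspace."""
    return json.loads((workspace / "metrics.json").read_text())


workspace_dir = tempfile.TemporaryDirectory()
workspace = Path(workspace_dir.name)

# While the run is in progress, the training code writes metrics here.
(workspace / "metrics.json").write_text(json.dumps({"epoch": 0, "step": 15}))
live = read_live_metrics(workspace)

# After execution the temporary workspace is deleted.
workspace_dir.cleanup()

try:
    read_live_metrics(workspace)
    readable_after_cleanup = True
except FileNotFoundError:
    readable_after_cleanup = False
```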

dberenbaum · 2023-08-03T16:55:50Z

Discussed with @pmrowla: the best solution seems to be committing and creating git refs even for failed experiments (which lets us track everything, apply the results of failed experiments, etc.). @pmrowla Thinking some more about it, I worry about the performance and about what happens if users don't want to wait for the commit to complete, especially if it generates large dvc artifacts.

The other option is to not delete the tmpdir until requested by the users, which is simpler but pollutes the workspace with more tmpdir copies.
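The tmpdir-retention option can be sketched as a context manager that removes the temporary workspace on success but leaves it in place when the run raises. This is a minimal sketch with hypothetical names (`run_workspace`, `retained`, `purge_retained`), not how DVC manages its tempdirs, and it only covers failures raised in-process; a hard kill of the worker would bypass the handler.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, List

# Paths of workspaces kept after a failure, for the user to inspect or purge.
retained: List[Path] = []


@contextmanager
def run_workspace() -> Iterator[Path]:
    """Yield a temp workspace; delete it on success, keep it on failure."""
    path = Path(tempfile.mkdtemp(prefix="exp-"))
    try:
        yield path
    except Exception:
        # Keep the directory so partial metrics and plots stay readable.
        retained.append(path)
        raise
    else:
        shutil.rmtree(path)


def purge_retained() -> None:
    """Delete retained workspaces when the user asks for it."""
    while retained:
        shutil.rmtree(retained.pop(), ignore_errors=True)


# Successful run: the workspace is removed on exit.
with run_workspace() as ok_dir:
    (ok_dir / "metrics.json").write_text("{}")

# Failed run: the workspace and its partial metrics survive.
try:
    with run_workspace() as failed_dir:
        (failed_dir / "metrics.json").write_text('{"epoch": 0}')
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

kept_metrics = (failed_dir / "metrics.json").read_text()
ok_exists = ok_dir.exists()
failed_exists_before_purge = failed_dir.exists()
purge_retained()
failed_exists_after_purge = failed_dir.exists()
```

The `retained` list makes the trade-off visible: every failed run adds another directory that stays on disk until someone calls `purge_retained()`.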

dberenbaum added the product: VSCode and A: experiments labels Aug 1, 2023

dberenbaum mentioned this issue Aug 1, 2023 in iterative/vscode-dvc#4333, "Plots should be more resilient to errors in specific revisions" (Open)

dberenbaum added this to DVC Sep 6, 2023

github-project-automation bot moved this to Backlog in DVC Sep 6, 2023

dberenbaum added the p1-important label Sep 6, 2023

dberenbaum removed the p1-important label Mar 5, 2024

dberenbaum closed this as not planned Apr 24, 2024
