Make sure you have Python installed on Openmind.
Conda works fine for me (I haven't needed Singularity so far), but that may be because I only use standard Python packages without complex dependencies.
I usually keep a tmux session running on Openmind at all times, with 1 or 2 live interactive sessions.
Running python run.py saves checkpoints to save/<path_base>/checkpoints/, where <path_base> is specified by util.get_path_base_from_args given run.py's args.
It always picks up from the latest checkpoint, save/<path_base>/checkpoints/latest.pt.
You can run this on an Openmind interactive session or locally.
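The resume-from-latest behaviour described above can be sketched roughly as follows. This is only an illustration, not the actual run.py: the function names, the state dictionary, and the placeholder "training step" are all hypothetical, and pickle stands in for whatever serialization the real script uses (with PyTorch you would presumably use torch.save and torch.load instead).

```python
import pickle
from pathlib import Path


def get_checkpoint_path(path_base, save_dir="save"):
    # Mirrors the layout described above: save/<path_base>/checkpoints/latest.pt
    return Path(save_dir) / path_base / "checkpoints" / "latest.pt"


def save_checkpoint(path, state):
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first so an interrupted job
    # never leaves a half-written latest.pt behind
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    tmp_path.replace(path)


def load_checkpoint(path):
    # Return None when there is nothing to resume from
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path_base, num_iterations, save_interval, save_dir="save"):
    path = get_checkpoint_path(path_base, save_dir)
    # Resume from the latest checkpoint if one exists, otherwise start fresh
    state = load_checkpoint(path) or {"iteration": 0, "losses": []}
    while state["iteration"] < num_iterations:
        # Placeholder for a real training step (hypothetical loss value)
        state["losses"].append(1.0 / (state["iteration"] + 1))
        state["iteration"] += 1
        if state["iteration"] % save_interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling train twice with the same path_base continues from where the first call stopped rather than starting over.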
On Openmind, run the following commands to install Luke's om-repeat script:
cd ~
git clone [email protected]:insperatum/openmind-tools.git
echo 'export PATH="$HOME/openmind-tools/bin:$PATH"' >> ~/.bashrc
This allows you to queue jobs using standard sbatch by queueing 2h jobs with dependencies until the script finishes.
Since the run script always picks up from the latest checkpoint, this should be fine as long as the --save-interval value is small enough that little progress is lost when a 2h job ends.
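To make the idea of chaining short jobs with dependencies concrete, here is a minimal sketch. It is not how om-repeat is implemented (I have not reproduced its logic); the function names are hypothetical and, unlike om-repeat, it submits a fixed number of jobs. It relies only on the standard sbatch options --time, --dependency, and --parsable.

```python
import subprocess


def build_sbatch_command(script_path, previous_job_id=None, time_limit="02:00:00"):
    # --parsable makes sbatch print just the job id, which is easy to capture
    command = ["sbatch", "--parsable", f"--time={time_limit}"]
    if previous_job_id is not None:
        # afterany: start once the previous job has ended, whatever its exit state
        command.append(f"--dependency=afterany:{previous_job_id}")
    command.append(script_path)
    return command


def submit_chain(script_path, num_jobs, run=subprocess.run):
    # `run` is injectable so the chain can be dry-run without a cluster
    job_ids = []
    previous_job_id = None
    for _ in range(num_jobs):
        command = build_sbatch_command(script_path, previous_job_id)
        result = run(command, capture_output=True, text=True, check=True)
        # On multi-cluster setups --parsable prints "jobid;cluster"; keep the id
        previous_job_id = result.stdout.strip().split(";")[0]
        job_ids.append(previous_job_id)
    return job_ids
```

Each job in the chain starts only after the previous one ends, so a run that resumes from its latest checkpoint makes progress across the whole chain.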
There is a way to catch the termination signal and save a checkpoint at that point, but it's not really necessary.
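For completeness, catching the termination signal could look something like the sketch below. It assumes the scheduler sends SIGTERM before killing the job, which you should check for your cluster; the TerminationFlag class and the train loop are hypothetical and not part of this repo.

```python
import signal


class TerminationFlag:
    """Records that a termination signal arrived (must be created in the main thread)."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.requested = False
        for signum in signals:
            signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Only record the request; the training loop decides when it is safe to save
        self.requested = True


def train(num_iterations, save_checkpoint, flag):
    for iteration in range(1, num_iterations + 1):
        # ... one training step would go here ...
        if flag.requested:
            # Save once and stop instead of being killed mid-step
            save_checkpoint(iteration)
            return iteration
    save_checkpoint(num_iterations)
    return num_iterations
```

Setting a flag in the handler and saving from the main loop avoids writing a checkpoint while the model is in the middle of an update.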
Running python sweep.py --cluster on Openmind queues the jobs defined by sweep.get_sweep_argss.
In this example, I'm looping over 3 random seeds x 2 values for the number of hidden units in the first layer x 2 values for the number of hidden units in the second layer of a multilayer perceptron.
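A grid like this is just a Cartesian product, which could be built as in the sketch below. The function name, the argument names, and the specific values are made up for illustration; the real sweep.get_sweep_argss may be organised differently.

```python
import itertools


def make_sweep_grid(seeds=(1, 2, 3), hidden_units_1=(16, 32), hidden_units_2=(16, 32)):
    # One dictionary of hyperparameters per job: seeds x layer-1 sizes x layer-2 sizes
    return [
        {"seed": seed, "hidden_units_1": h1, "hidden_units_2": h2}
        for seed, h1, h2 in itertools.product(seeds, hidden_units_1, hidden_units_2)
    ]


grid = make_sweep_grid()
print(len(grid))  # 12 jobs: 3 seeds x 2 x 2
```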
Change util.get_path_base_from_args accordingly so that the folder names of your models are sensible.
For instance, if you're sweeping only over the seed while the other hyperparameters are fixed, you can just set path_base to be the seed and not include the other hyperparameters.
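One possible way to build such folder names is sketched below. The helper name, its keys parameter, and the key=value naming scheme are assumptions for illustration, not the actual util.get_path_base_from_args.

```python
import argparse


def path_base_from_args(args, keys):
    # Join only the swept hyperparameters so folder names stay short and readable
    return "_".join(f"{key}={getattr(args, key)}" for key in keys)


args = argparse.Namespace(seed=1, hidden_units_1=16, hidden_units_2=32)
print(path_base_from_args(args, ["seed"]))  # seed=1
print(path_base_from_args(args, ["seed", "hidden_units_1"]))  # seed=1_hidden_units_1=16
```

Listing only the swept keys keeps fixed hyperparameters out of the folder name.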
If you don't provide --cluster, the jobs are just run sequentially on the current machine (interactive session on Openmind or local machine).
Provide the --rm option to remove the save/ folder before running the sweep.
Standard output and standard error logs are saved in save/<path_base>/logs.
Running python plot.py iterates over the folders in save/ and plots losses (and potentially other stuff) to save/<path_base>/...
If you provide the --repeat flag, it does this in an infinite loop.
I usually just start another interactive session and run an infinite plotting script there to track training progress.
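The plotting loop can be sketched as below. This is an assumption-laden outline rather than the actual plot.py: it assumes one folder per run directly under save/, leaves the plotting itself to a hypothetical plot_run callback, and adds pause_seconds and max_rounds parameters (not in the original) so the loop can be throttled and tested.

```python
import argparse
import time
from pathlib import Path


def find_run_dirs(save_dir="save"):
    # Assumes each run lives in its own folder directly under save/
    save_path = Path(save_dir)
    if not save_path.is_dir():
        return []
    return sorted(p for p in save_path.iterdir() if p.is_dir())


def plot_all(plot_run, save_dir="save", repeat=False, pause_seconds=60.0, max_rounds=None):
    rounds = 0
    while True:
        # Re-scan every round so runs started later are picked up too
        for run_dir in find_run_dirs(save_dir):
            plot_run(run_dir)
        rounds += 1
        if not repeat or (max_rounds is not None and rounds >= max_rounds):
            return rounds
        time.sleep(pause_seconds)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--repeat", action="store_true", help="keep re-plotting in a loop")
    return parser.parse_args(argv)
```

With repeat=True and max_rounds left as None, plot_all never returns, which matches leaving it running in a spare interactive session.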