Pav2 is the only test harness I've found that allows me to specify a number of nodes and execute all subsequent jobs on them (thank you). This is achieved as follows:
modes/share.yaml
```yaml
scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
```
However, when looking at the results output, it appears that these jobs are launched serially rather than asynchronously. See below.

[Edited `pav results` output showing launch times.]
Note that all of these tests use a single rank, so they should be able to be launched with srun using the following srun args.
```yaml
slurm:
  srun_extra:
    - --overlap
```
One potential issue is overwhelming SLURM. Perhaps adding another key, e.g. max_queue, that limits the number of asynchronous jobs that can be put in the srun queue at once would be helpful. Perhaps something as follows.
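A sketch of what such a mode file might look like. The max_queue key is hypothetical (it doesn't exist in Pavilion yet, and its placement under the slurm section is an assumption):

```yaml
# modes/share.yaml -- hypothetical sketch; max_queue is a proposed key
scheduler: slurm
schedule:
  nodes: 1
  share_allocation: max
slurm:
  srun_extra:
    - --overlap
  max_queue: 8   # illustrative value: at most 8 tests queued via srun at once
```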
Currently the kickoff scripts simply have a pav _run command for each test to run in an allocation, which is why this is synchronous.
What we need to do is expand pav _run so that it can take multiple tests as an argument and then manage those tests by their max_queue setting. It should look at the number of tasks each test requires via the scheduler variables and count that against the total queue size. Note that the queue size can vary from test to test (unless we make it one of the parameters that forces allocation separation), so the number of running tests will need to be managed dynamically. For example, with tests whose max_queue values are 1, 2, 4, and 12, the size-1 test would run by itself, and then any pair of the size-2, size-4, and size-12 tests could run together.
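The grouping logic described above could be sketched roughly as follows. This is not Pavilion's actual implementation; the Test fields and the rule that a batch is capped by the smallest max_queue among its members are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass
class Test:
    name: str
    tasks: int      # tasks this test needs (from the scheduler variables)
    max_queue: int  # max concurrent tasks this test tolerates


def schedule(pending):
    """Greedily group tests into batches.

    A batch's total task count must stay within the smallest max_queue
    of any test already in the batch (assumed semantics).
    """
    batches = []
    while pending:
        batch, remaining = [], []
        used, limit = 0, float("inf")
        for test in pending:
            new_limit = min(limit, test.max_queue)
            if used + test.tasks <= new_limit:
                batch.append(test)
                used += test.tasks
                limit = new_limit
            else:
                remaining.append(test)
        batches.append(batch)
        pending = remaining
    return batches


tests = [Test("a", 1, 1), Test("b", 1, 2), Test("c", 1, 4), Test("d", 1, 12)]
for batch in schedule(tests):
    print([t.name for t in batch])
```

Run on the 1/2/4/12 example above, this yields the size-1 test alone in the first batch, then a pair of the others together, matching the behavior described in the text.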
I think we need a better name than max_queue. Maybe max_share_tasks?
One quick clarification: the hope is that max_queue sets the limit on active jobs in the queue at any given time. If I need 2000 tests to run on this single node, at most max_queue of them would be in flight at any moment until all 2K tests complete.
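The rolling-limit semantics described here can be sketched with a worker pool: as each test finishes, the next one starts, so the active count never exceeds the limit. The MAX_QUEUE value and run_test function are illustrative stand-ins, not Pavilion code:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_QUEUE = 4  # illustrative limit on simultaneously active tests


def run_test(i):
    # Stand-in for launching one single-rank test via srun.
    return f"test-{i} done"


# At most MAX_QUEUE tests are active at once; the pool refills
# automatically as tests complete, until all 2000 have run.
with ThreadPoolExecutor(max_workers=MAX_QUEUE) as pool:
    results = list(pool.map(run_test, range(2000)))

print(len(results))  # 2000
```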