
Address issue #540 #554

Draft: wants to merge 5 commits into main
Conversation

@DwarKapex (Contributor) commented Feb 15, 2024

Addresses issue #540.

@yhtang (Collaborator) left a comment

Please also configure the submitted job script so that it fails if any individual command in it fails. Otherwise, SLURM will consider the job successful regardless of whether the runner did its work.

Two review comments on .github/workflows/_runner_ondemand_slurm.yaml (outdated, resolved)
@yhtang marked this pull request as draft on February 15, 2024 22:23
@DwarKapex (Contributor, Author) commented:

> Please also configure the submitted job script so that it fails if any individual command in it fails. Otherwise, SLURM will consider the job successful regardless of whether the runner did its work.

IIUC, the SLURM job state reflects what happened to the job, and it covers all scenarios, including failures of individual commands, allocation failures, timeouts, etc. IMO, we are good here.

@DwarKapex self-assigned this on Feb 15, 2024
@DwarKapex changed the title from "[WIP] Address issue #540" to "Address issue #540" on Feb 15, 2024
@DwarKapex marked this pull request as ready for review on February 15, 2024 23:14
@yhtang (Collaborator) commented Feb 16, 2024

> > Please also configure the submitted job script so that it fails if any individual command in it fails. Otherwise, SLURM will consider the job successful regardless of whether the runner did its work.
>
> IIUC, the SLURM job state reflects what happened to the job, and it covers all scenarios, including failures of individual commands, allocation failures, timeouts, etc. IMO, we are good here.

The job state reflects the overall status of the job script, which can differ from the exit status of individual commands within it. For example:

sbatch --parsable <<EOF
#!/bin/bash
#SBATCH -p <partition>
cat non-existing.txt   # fails with a non-zero exit status
EOF

leads to

$ sacct -XP -j <job_id> -o State
State
FAILED

while

sbatch --parsable <<EOF
#!/bin/bash
#SBATCH -p <partition>
cat non-existing.txt   # fails, but...
echo Hello             # ...the last command succeeds, so its exit status becomes the job's
EOF

leads to

$ sacct -XP -j <job_id> -o State
State
COMPLETED

The difference is that in the second example, the exit status of the subsequent echo command replaces that of the failing cat command: by default, a bash script exits with the status of its last command.

A simple fix for this situation is to add set -e -o pipefail at the top of the batch script.
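
For instance, a minimal sketch of the same job with the fix applied (the partition name is a placeholder, as above):

sbatch --parsable <<EOF
#!/bin/bash
#SBATCH -p <partition>
set -e -o pipefail     # abort the script as soon as any command or pipeline stage fails
cat non-existing.txt   # fails and now terminates the script here...
echo Hello             # ...so this never runs and the job state is FAILED
EOF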

What is more important is that the docker run command, which starts a GitHub Actions runner instance, may gracefully handle a registration or runtime error, report it in the log, and then exit without an explicit non-zero status code. I'm not 100% sure what the specific behavior is, so please check the documentation of self-hosted runners and figure out a solution accordingly.
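
As a starting point, here is one possible sketch, not a verified solution: the image name is a placeholder, and whether the runner's main process actually reports registration/runtime errors through its exit code is exactly what needs to be confirmed in the self-hosted runner documentation.

# Hypothetical sketch: capture the container's exit code explicitly
# instead of relying on whatever docker run returns.
cid=$(docker run -d <runner-image>)   # <runner-image> is a placeholder
status=$(docker wait "$cid")          # blocks until the container stops, prints its exit code
docker logs "$cid"                    # surface the runner's log in the SLURM job output
docker rm "$cid" > /dev/null
exit "$status"                        # fail the batch script (and the SLURM job) if non-zero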

@yhtang marked this pull request as draft on February 21, 2024 05:52