
print job logs when the job failed for debugging #591

Merged
wwvela merged 2 commits into main from log on Feb 28, 2025

Conversation

wwvela
Contributor

@wwvela wwvela commented Feb 28, 2025

Issue #, if available:

  • Add job log printing to the NVIDIA training, NVIDIA inference, and Neuron inference tests when a test fails, which makes failures easier to debug.
  • We already print the job logs for Neuron training when it fails:
    if err := waitForJobCompletion(job, cfg); err != nil {
        log.Printf("Job did not complete successfully: %v", err)
        // On failure, fetch the job's logs and print them to aid debugging.
        logsBuf, err := gatherJobLogs(ctx, cfg, "default", "bert-training")
        if err != nil {
            log.Printf("failed to get logs: %v", err)
        } else {
            log.Println(logsBuf.String())
        }
    }

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@wwvela wwvela requested a review from mattcjo February 28, 2025 01:25
@wwvela wwvela merged commit 2f68835 into main Feb 28, 2025
8 of 9 checks passed
@wwvela wwvela deleted the log branch February 28, 2025 01:36