Re-enable environment variables to guard against hangs with Triton gemms #624

Draft
wants to merge 3 commits into base: main


ashors1
Contributor

@ashors1 ashors1 commented Mar 7, 2024

From some local experiments, it appears that setting JAX_SHARE_BINARY_BETWEEN_HOSTS=True and JAX_SHARE_AUTOTUNE_CONFIG_BETWEEN_HOSTS=True no longer leads to failures. This PR attempts to re-enable them in the JAX container.
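For anyone who wants to try the same flags outside the container, here is a minimal sketch in Python. The two variable names are the ones from this PR. The helper `enable_guard_flags` and the `GUARD_FLAGS` tuple are hypothetical names for illustration, and the sketch assumes JAX picks these values up from the process environment at startup, so they would need to be set before `jax` is imported.

```python
import os

# Environment variables named in this PR. GUARD_FLAGS and the helper
# below are hypothetical names used only for this sketch.
GUARD_FLAGS = (
    "JAX_SHARE_BINARY_BETWEEN_HOSTS",
    "JAX_SHARE_AUTOTUNE_CONFIG_BETWEEN_HOSTS",
)


def enable_guard_flags(env=None):
    """Set each flag to "True" unless the caller already set it.

    `env` defaults to os.environ; pass a plain dict to experiment
    without touching the real process environment.
    """
    env = os.environ if env is None else env
    for name in GUARD_FLAGS:
        # setdefault keeps any value that was exported explicitly.
        env.setdefault(name, "True")
    return {name: env[name] for name in GUARD_FLAGS}
```

Calling `enable_guard_flags()` at the top of a launch script, before importing `jax`, would mirror what the container change is meant to do. Using `setdefault` keeps an explicit override (for example `JAX_SHARE_BINARY_BETWEEN_HOSTS=False`) intact, which may be useful while the failures are still being investigated.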

@terrykong
Contributor

Is that praxis PR update required for this JAX-Toolbox PR?

@ashors1
Contributor Author

ashors1 commented Mar 7, 2024

> Is that praxis PR update required for this JAX-Toolbox PR?

It might not be, but I figured it was easier to update everything, since it's unclear whether updating JAX without updating praxis would lead to dependency issues. Also, yesterday's rosetta pax tests passed (https://github.com/NVIDIA/JAX-Toolbox/actions/runs/8185876141), so I figured the praxis bump was safe.

olupton previously requested changes Mar 7, 2024
.github/container/manifest.yaml (outdated, resolved)
@olupton olupton dismissed their stale review March 7, 2024 18:58

Resolved

@ashors1
Contributor Author

ashors1 commented Mar 8, 2024

Looks like I was incorrect; the problems that exist with JAX_SHARE_BINARY_BETWEEN_HOSTS=True and JAX_SHARE_AUTOTUNE_CONFIG_BETWEEN_HOSTS=True are still present. Converting this PR to a draft while we wait for these issues to be resolved.

@ashors1 ashors1 marked this pull request as draft March 8, 2024 04:15