Revert "Attempt to fix FBGEMM CPU build (#3499)" #3528

Closed
@huydhn wants to merge 17 commits

@huydhn huydhn commented Dec 23, 2024

This reverts commit 5c16f4b, which is no longer needed after pytorch/pytorch#143423. I think this will also fix the torchrec CPU build failure: https://github.com/pytorch/FBGEMM/actions/runs/12470608879/job/34806045264?pr=3528#step:18:219

Testing

https://github.com/pytorch/FBGEMM/actions/runs/12470608879

netlify bot commented Dec 23, 2024

Deploy Preview for pytorch-fbgemm-docs failed.

Name Link
🔨 Latest commit 04f3ad2
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/676c5a0708c0d70008867379

@huydhn huydhn requested a review from q10 December 23, 2024 17:28
@facebook-github-bot

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

huydhn commented Dec 23, 2024

Hmm, the CPU build fails when this is reverted, so maybe we need this after all. This is fixed by using the system gcc, which is now gcc-11 as well, the same as the one from conda.
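
A minimal sketch of how one might confirm which gcc major version a build environment picks up, assuming only that the compiler supports the standard `-dumpversion` flag. The helper names `gcc_major` and `compiler_major` are hypothetical and are not part of FBGEMM or Nova.

```python
import shutil
import subprocess
from typing import Optional


def gcc_major(version_text: str) -> int:
    """Return the major version from `gcc -dumpversion` output such as '11' or '11.4.0'."""
    return int(version_text.strip().split(".")[0])


def compiler_major(compiler: str = "gcc") -> Optional[int]:
    """Ask a compiler on PATH for its major version; return None if it is not installed."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-dumpversion"], capture_output=True, text=True, check=True
    )
    return gcc_major(result.stdout)
```

Comparing `compiler_major("gcc")` for the system compiler against the version shipped by conda is one way to check that both are on the same major release.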

@atalman atalman left a comment

lgtm

huydhn commented Dec 25, 2024

I think I have this working for all Linux builds now, but the fix is more complicated than I thought. I need to:

  1. Use gcc-11 from the Nova docker image to fix the GLIBCXX issue.
  2. Use the CUDA installation from Nova so that all the necessary CUDA libraries can be found there. However, the LD_LIBRARY_PATH, CUDNN_INCLUDE_DIR, and CUDNN_LIBRARY env variables need to be updated to point there. It looks like FBGEMM needs to relocate the wheel, similar to what vision does. What do you think? @atalman @q10
  3. The CUDA aarch64 build is different from the CUDA x86 build because it runs on a regular arm runner without an NVIDIA GPU, so it has neither a CUDA installation nor nvidia-smi. This leads to this failure: https://github.com/pytorch/FBGEMM/actions/runs/12485012915/job/34843246877#step:18:242. Luckily, FBGEMM already has logic to handle that, which I reuse. I just realized that there should not be a CUDA aarch64 build at all, because with-cuda is set to false there: https://github.com/pytorch/FBGEMM/blob/main/.github/workflows/build_wheels_linux_aarch64.yml#L34. This looks like a bug in the Nova code; I'm going to push for a fix there.
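
The environment changes in steps 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `lib64` and `include` layout under the CUDA root, pointing CUDNN_LIBRARY at the library directory, and the helper names `cuda_build_env` and `has_nvidia_smi` are all hypothetical.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def cuda_build_env(
    cuda_home: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with CUDA-related variables pointing at cuda_home."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumed layout: <cuda_home>/lib64 and <cuda_home>/include.
    lib_dir = os.path.join(cuda_home, "lib64")
    include_dir = os.path.join(cuda_home, "include")

    # Prepend so the chosen CUDA libraries win over anything already listed.
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    env["CUDNN_INCLUDE_DIR"] = include_dir
    # Assumption: the build accepts the directory that holds the cuDNN library.
    env["CUDNN_LIBRARY"] = lib_dir
    return env


def has_nvidia_smi() -> bool:
    """True if nvidia-smi is on PATH, which is not the case on a plain arm runner."""
    return shutil.which("nvidia-smi") is not None
```

A build script could call `has_nvidia_smi()` first and skip GPU-specific probing when it returns False, then pass `cuda_build_env(...)` as the `env` argument to its subprocess calls.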

huydhn commented Dec 25, 2024

Here is the fix for the CUDA aarch64 wheel build failure: pytorch/test-infra#6112. That build shouldn't have been there in the first place.

@facebook-github-bot

@huydhn merged this pull request in 7d41ee5.

4 participants