Revert "Attempt to fix FBGEMM CPU build (#3499)" #3528

Closed

huydhn wants to merge 17 commits into main from revert-3499

Contributor

huydhn commented Dec 23, 2024

This reverts commit 5c16f4b, which is no longer needed after pytorch/pytorch#143423. I think this will also fix the torchrec CPU build failure: https://github.com/pytorch/FBGEMM/actions/runs/12470608879/job/34806045264?pr=3528#step:18:219

Testing

https://github.com/pytorch/FBGEMM/actions/runs/12470608879


          Revert "Attempt to fix FBGEMM CPU build (#3499)"

59157bf

This reverts commit 5c16f4b.

huydhn requested a review from atalman

December 23, 2024 17:27

facebook-github-bot added the cla signed label

netlify bot commented Dec 23, 2024

❌ Deploy Preview for pytorch-fbgemm-docs failed.

Name	Link
🔨 Latest commit	`04f3ad2`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/676c5a0708c0d70008867379

huydhn requested a review from q10

December 23, 2024 17:28

Contributor

facebook-github-bot commented Dec 23, 2024

@huydhn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

huydhn commented Dec 23, 2024

~~Hmm, CPU build fails when this is reverted, so maybe we need this after all~~ This is fixed by using the system gcc, which is now also gcc-11, the same version as the one from conda.


          Use system gcc compiler

b01e940

huydhn mentioned this pull request

Update gcc version for FBGEMM install in CI pytorch/torchrec#2654

Open

huydhn added 4 commits

December 23, 2024 18:13


          Use CUDA installation from Nova

f67daf9


          Set CUDNN_INCLUDE_DIR and CUDNN_LIBRARY

69b0f7a


          Try one more time

7b20a82


          Fix bash typo

8cae232

atalman approved these changes

Contributor

atalman left a comment

lgtm

huydhn added 5 commits

December 24, 2024 08:23


          Building is too slow

644ab0a


          Try vision approach with LD_LIBRARY_PATH

78c3f5e


          Clean up

852c2fe


          Clean up

5bd9fe4


          NOVA CUDNN is at /usr/local/cuda/lib64

000cf05

q10 approved these changes

huydhn added 2 commits

December 24, 2024 19:00


          Fix libcuda path

a4c5de6


          Add a CUDA fix for aarch64

2f6f76c

Contributor Author

huydhn commented Dec 25, 2024

I think I have this working for all Linux builds now, but the fix is more complicated than I thought. I need to:

1. Use gcc-11 from the Nova docker image to fix the GLIBCXX issue.
2. Use the CUDA installation from Nova so that all the necessary CUDA libraries can be found there. However, the LD_LIBRARY_PATH, CUDNN_INCLUDE_DIR, and CUDNN_LIBRARY env variables need to be updated to point there. It looks like FBGEMM needs to relocate the wheel, similar to what vision does. What do you think? @atalman @q10
3. Handle the CUDA aarch64 build separately. It differs from the CUDA x86 build because it runs on a regular arm runner without an NVIDIA GPU, so it has neither a CUDA installation nor nvidia-smi. This leads to this failure: https://github.com/pytorch/FBGEMM/actions/runs/12485012915/job/34843246877#step:18:242. Luckily, FBGEMM already has logic to fix that, which I reuse. I just realized that there should not be a CUDA aarch64 build at all, because with-cuda is set to false there: https://github.com/pytorch/FBGEMM/blob/main/.github/workflows/build_wheels_linux_aarch64.yml#L34. This looks like a bug in the Nova code, and I'm going to push for a fix there.
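
The CUDA environment setup described above can be sketched roughly as follows. This is only an illustration, not the script from this PR: the function name `configure_cuda_env` is hypothetical, the default prefix `/usr/local/cuda` is taken from the "NOVA CUDNN is at /usr/local/cuda/lib64" commit, and the `include` subdirectory for the cuDNN headers is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FBGEMM): point the build at an existing
# CUDA installation by exporting the three variables named in the comment.
configure_cuda_env() {
  # Assumption: the CUDA prefix defaults to /usr/local/cuda.
  local cuda_home="${1:-/usr/local/cuda}"

  # Assumption: cuDNN headers sit under include/ and libraries under lib64/.
  export CUDNN_INCLUDE_DIR="${cuda_home}/include"
  export CUDNN_LIBRARY="${cuda_home}/lib64"

  # Prepend the CUDA library directory, keeping any existing entries.
  # The ${VAR:+...} expansion avoids a trailing ':' when the variable is unset.
  export LD_LIBRARY_PATH="${cuda_home}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}
```

Calling `configure_cuda_env` with no argument uses the default prefix; passing a path (for example `configure_cuda_env /opt/cuda`) points the same three variables at a different installation.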

huydhn added 2 commits

December 24, 2024 23:48


          Writing correct bash script in one go is hard

373e9f9


          Swap the order?

933102c

huydhn added 2 commits

December 25, 2024 09:52


          Also fix libnvidia-ml.so

e78ab24


          Ready to land

04f3ad2

huydhn mentioned this pull request

[Nova] Honor with-cuda flag when building aarch64 wheel pytorch/test-infra#6112

Open

Contributor Author

huydhn commented Dec 25, 2024

Here is the fix for the CUDA aarch64 wheel build failure: pytorch/test-infra#6112. That build shouldn't have been there in the first place.

facebook-github-bot closed this in

7d41ee5

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Dec 26, 2024

@huydhn merged this pull request in 7d41ee5.

Labels

cla signed Merged