Upgrade to CUDA 12.4 is causing segfaults in 4 Range Profiler Tests #949

Open
sraikund16 opened this issue Jun 13, 2024 · 0 comments

sraikund16 commented Jun 13, 2024

After upgrading the CUDA image to 12.4, the following tests fail with segfaults:
19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)

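The gdb backtrace for CuptiRangeProfilerTest.UserRangeTest shows the crash inside cuptiFinalize, reached from CuptiRBProfilerSession::deInitCupti() while the session is being destroyed: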
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7316317 in ?? () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
(gdb) bt
#0 0x00007ffff7316317 in ?? () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
#1 0x00007ffff72fb35a in cuptiFinalize () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
#2 0x0000000000448a6e in libkineto::CuptiRBProfilerSession::deInitCupti()::{lambda()#1}::operator()() const [clone .isra.0] ()
#3 0x0000000000448cf9 in libkineto::CuptiRBProfilerSession::~CuptiRBProfilerSession() ()
#4 0x000000000041fd20 in libkineto::MockCuptiRBProfilerSession::~MockCuptiRBProfilerSession() ()
#5 0x000000000041d9a7 in CuptiRangeProfilerTest_UserRangeTest_Test::TestBody() ()
#6 0x00000000004cac2d in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#7 0x00000000004bd21e in testing::Test::Run() [clone .part.0] ()
#8 0x00000000004bd935 in testing::TestInfo::Run() [clone .part.0] ()
#9 0x00000000004be395 in testing::TestSuite::Run() [clone .part.0] ()
#10 0x00000000004bfb27 in testing::internal::UnitTestImpl::RunAllTests() ()
#11 0x00000000004bddfb in testing::UnitTest::Run() ()
#12 0x00000000004169a4 in main ()

cc @briancoutinho

aaronenyeshi added a commit to aaronenyeshi/kineto that referenced this issue Jun 14, 2024
…AULT in CUDA 12.4

Summary:
GitHub Actions CI on NVIDIA A10G instances has been failing for these tests since the upgrade to CUDA 12.4:

19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)

We are tracking the issue here: pytorch#949

Differential Revision: D58588836

Pulled By: aaronenyeshi
facebook-github-bot pushed a commit that referenced this issue Jun 14, 2024
…AULT in CUDA 12.4 (#951)

Summary:
Pull Request resolved: #951

GitHub Actions CI on NVIDIA A10G instances has been failing for these tests since the upgrade to CUDA 12.4:

19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)

We are tracking the issue here: #949

Test Plan: CI

Reviewed By: sraikund16

Differential Revision: D58588836

Pulled By: aaronenyeshi

fbshipit-source-id: 4b1c02d18e235d1c8a4f7c5162d59950cfa89dcb