After upgrading the CUDA image to 12.4, the following tests fail with a segfault:
19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)
This is what the backtrace looks like:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7316317 in ?? () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
(gdb) bt
#0 0x00007ffff7316317 in ?? () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
#1 0x00007ffff72fb35a in cuptiFinalize () from /usr/local/cuda-12.4/extras/CUPTI/lib64/libcupti.so.12
#2 0x0000000000448a6e in libkineto::CuptiRBProfilerSession::deInitCupti()::{lambda()#1}::operator()() const [clone .isra.0] ()
#3 0x0000000000448cf9 in libkineto::CuptiRBProfilerSession::~CuptiRBProfilerSession() ()
#4 0x000000000041fd20 in libkineto::MockCuptiRBProfilerSession::~MockCuptiRBProfilerSession() ()
#5 0x000000000041d9a7 in CuptiRangeProfilerTest_UserRangeTest_Test::TestBody() ()
#6 0x00000000004cac2d in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#7 0x00000000004bd21e in testing::Test::Run() [clone .part.0] ()
#8 0x00000000004bd935 in testing::TestInfo::Run() [clone .part.0] ()
#9 0x00000000004be395 in testing::TestSuite::Run() [clone .part.0] ()
#10 0x00000000004bfb27 in testing::internal::UnitTestImpl::RunAllTests() ()
#11 0x00000000004bddfb in testing::UnitTest::Run() ()
#12 0x00000000004169a4 in main ()
…AULT in CUDA 12.4
Summary:
GitHub Actions CI on NVIDIA A10G instances has been failing for these tests since the upgrade to 12.4:
19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)
We are tracking the issue here: pytorch#949
Differential Revision: D58588836
Pulled By: aaronenyeshi
…AULT in CUDA 12.4 (#951)
Summary:
Pull Request resolved: #951
GitHub Actions CI on NVIDIA A10G instances has been failing for these tests since the upgrade to 12.4:
19 - CuptiRangeProfilerApiTest.asyncLaunchUserRange (SEGFAULT)
20 - CuptiRangeProfilerApiTest.asyncLaunchAutoRange (SEGFAULT)
24 - CuptiRangeProfilerTest.UserRangeTest (SEGFAULT)
25 - CuptiRangeProfilerTest.AutoRangeTest (SEGFAULT)
We are tracking the issue here: #949
Test Plan: CI
Reviewed By: sraikund16
Differential Revision: D58588836
Pulled By: aaronenyeshi
fbshipit-source-id: 4b1c02d18e235d1c8a4f7c5162d59950cfa89dcb
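For anyone hitting the same failures locally, one possible way to gate the four tests on the CUDA version while the segfault is being tracked is sketched below. This is only an illustration and not necessarily what the commit above does. The helper names (`cudaMajor`, `cudaMinor`, `shouldSkipRangeProfilerTests`) are hypothetical, and the sketch assumes that only CUDA 12.4 is affected, which the report does not confirm.

```cpp
// CUDA encodes its version as major * 1000 + minor * 10, so CUDA 12.4 is
// reported as 12040 (this is the value of the CUDA_VERSION macro in cuda.h).
constexpr int cudaMajor(int cudaVersion) {
  return cudaVersion / 1000;
}

constexpr int cudaMinor(int cudaVersion) {
  return (cudaVersion % 1000) / 10;
}

// Hypothetical helper: returns true when the CUPTI range profiler tests
// should be skipped. Assumption: only CUDA 12.4 triggers the segfault in
// cuptiFinalize; other versions are treated as unaffected.
constexpr bool shouldSkipRangeProfilerTests(int cudaVersion) {
  return cudaMajor(cudaVersion) == 12 && cudaMinor(cudaVersion) == 4;
}

// Possible use at the top of a GoogleTest body (needs <cuda.h> and gtest):
//
//   if (shouldSkipRangeProfilerTests(CUDA_VERSION)) {
//     GTEST_SKIP() << "cuptiFinalize segfaults on CUDA 12.4";
//   }
```

Keeping the version check in a small `constexpr` function means it can be unit tested without a GPU, and the skip can be dropped in one place once the underlying problem is resolved.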
cc @briancoutinho