Skip to content

Commit

Permalink
avoid calling libkineto::api().client()->stop twice (#1029)
Browse files Browse the repository at this point in the history
Summary:
In normal case,  torch main thread calls `performRunLoopStep` when `profilerStep` is called to collect traces. If  `maxGpuBufferCount_` is set too small, profiling may be stopped early, `profilerThread_` will also call `performRunLoopStep` to collect trace.  If the cpu trace buffer is too big, `collectTrace` may be called twice, and `libkineto::api().client()->stop` will be called twice, which will throw an uncaught `::c10::Error`. Finally, torch process quits with a core file.

Pull Request resolved: #1029

Reviewed By: davidberard98

Differential Revision: D68973735

Pulled By: sraikund16

fbshipit-source-id: 2699a2ef195dae327a07c6376fa3e6329e0f6025
  • Loading branch information
staugust authored and facebook-github-bot committed Feb 3, 2025
1 parent 3f6b67e commit d92080f
Show file tree
Hide file tree
Showing 2 changed files with 27 additions and 3 deletions.
24 changes: 21 additions & 3 deletions libkineto/src/CuptiActivityProfiler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1138,6 +1138,11 @@ void CuptiActivityProfiler::configure(
currentRunloopState_ = RunloopState::Warmup;
}

bool CuptiActivityProfiler::getCollectTraceState() {
std::lock_guard<std::recursive_mutex> guard(collectTraceStateMutex_);
return isCollectingTrace;
}

void CuptiActivityProfiler::collectTrace(
bool collection_done,
const std::chrono::time_point<std::chrono::system_clock>& now) {
Expand Down Expand Up @@ -1314,13 +1319,17 @@ const time_point<system_clock> CuptiActivityProfiler::performRunLoopStep(
VLOG_IF(1, currentIter >= 0)
<< "This state change was invoked by application's step() call";

// currentIter >= 0 means this is an iteration-based collection,
// triggered by pytorch main thread, it should be executed in another
// currentIter >= 0 means this is called from the step() api of
// the profile in pytorch main thread, it should be executed in another
// thread in case pytorch main thread is blocked
if (currentIter >= 0) {
// if collectTraceThread_ is already running, there's no need to
// execute collectTrace twice.
if (!collectTraceThread_) {
// Do not call collectTrace when profilerThread_ is collecting Trace.
// Otherwise, libkineto::api().client()->stop will be called twice,
// which leads to an unrecoverable ::c10:Error at
// disableProfiler
if (!collectTraceThread_ && !getCollectTraceState()) {
std::lock_guard<std::recursive_mutex> guard(mutex_);
collectTraceThread_ = std::make_unique<std::thread>(
&CuptiActivityProfiler::collectTrace,
Expand All @@ -1330,7 +1339,16 @@ const time_point<system_clock> CuptiActivityProfiler::performRunLoopStep(
}
break;
}
// this is executed in profilerThread_
{
std::lock_guard<std::recursive_mutex> guard(collectTraceStateMutex_);
isCollectingTrace = true;
}
collectTrace(collection_done, now);
{
std::lock_guard<std::recursive_mutex> guard(collectTraceStateMutex_);
isCollectingTrace = false;
}
} else if (derivedConfig_->isProfilingByIteration()) {
// nothing to do here
} else if (
Expand Down
6 changes: 6 additions & 0 deletions libkineto/src/CuptiActivityProfiler.h
Original file line number Diff line number Diff line change
Expand Up @@ -429,6 +429,8 @@ class CuptiActivityProfiler {

void checkTimestampOrder(const ITraceActivity* act1);

bool getCollectTraceState();

// On-demand Request Config (should not be modified)
// TODO: remove this config_, dependency needs to be removed from
// finalizeTrace.
Expand Down Expand Up @@ -494,6 +496,10 @@ class CuptiActivityProfiler {
// details.
std::unique_ptr<std::thread> collectTraceThread_{nullptr};

// Add a mutex to protect state for CollectTrace
std::recursive_mutex collectTraceStateMutex_;
bool isCollectingTrace{false};

// Runloop phase
std::atomic<RunloopState> currentRunloopState_{RunloopState::WaitForRequest};

Expand Down

0 comments on commit d92080f

Please sign in to comment.