Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve minion observer stats to capture more granular stages of minion task execution #15118

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

shounakmk219
Copy link
Collaborator

Description

This PR

  • Adds new stats to track using MinionProgressObserver
  • Expose minion endpoint to fetch the task progress stats

Newly added stats

  1. _currentStage : while the _currentState is limited to only hold MinionTaskState, we need more granular idea about which stage the minion task is currently in. This property has hold any task specific stage to give better tracking.
  2. _endTimestamp
  3. _stageTimes : with the new _currentStage property, this property stores a map of stage and its start and total time spent. This stat can help user to identify which stage is taking more time and fine tune the task configs accordingly.

These new stats are leveraged to breakdown the SegmentProcessorFramework observability further into map, reduce and segment generate stage.

Endpoint to fetch the task progress stats

Path : /tasks/subtask/progressStats
Method : GET
Query params : subtaskName
Sample Response:

{
   "Task_RealtimeToOfflineSegmentsTask_d5f1d3cb-53f7-4020-9276-a672ccba6a53_1740399840094_0":{
      "currentState":"SUCCEEDED",
      "taskId":"Task_RealtimeToOfflineSegmentsTask_d5f1d3cb-53f7-4020-9276-a672ccba6a53_1740399840094_0",
      "currentStage":"SUCCEEDED",
      "endTimestamp":1740399840317,
      "startTimestamp":1740399840190,
      "stageTimes":{
         "IN_PROGRESS":{
            "startTimeMs":1740399840191,
            "totalTimeMs":59
         },
         "GENERATE_SEGMENT":{
            "startTimeMs":1740399840256,
            "totalTimeMs":61
         },
         "SUCCEEDED":{
            "startTimeMs":1740399840326,
            "totalTimeMs":0
         },
         "MAP":{
            "startTimeMs":1740399840250,
            "totalTimeMs":6
         },
         "REDUCE":{
            "startTimeMs":1740399840256,
            "totalTimeMs":0
         }
      }
   }
}

@codecov-commenter
Copy link

codecov-commenter commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 68.14815% with 43 lines in your changes missing coverage. Please review.

Project coverage is 63.45%. Comparing base (59551e4) to head (6e22a27).
Report is 1753 commits behind head on master.

Files with missing lines Patch % Lines
...inion/api/resources/PinotTaskProgressResource.java 0.00% 16 Missing ⚠️
...che/pinot/minion/event/MinionProgressObserver.java 71.11% 9 Missing and 4 partials ⚠️
.../core/segment/processing/mapper/SegmentMapper.java 0.00% 7 Missing ⚠️
...e/pinot/spi/tasks/MinionTaskBaseObserverStats.java 85.41% 3 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15118      +/-   ##
============================================
+ Coverage     61.75%   63.45%   +1.70%     
- Complexity      207     1480    +1273     
============================================
  Files          2436     2748     +312     
  Lines        133233   154557   +21324     
  Branches      20636    23831    +3195     
============================================
+ Hits          82274    98079   +15805     
- Misses        44911    49079    +4168     
- Partials       6048     7399    +1351     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.41% <68.14%> (+1.70%) ⬆️
java-21 63.34% <68.14%> (+1.72%) ⬆️
skip-bytebuffers-false 63.43% <68.14%> (+1.68%) ⬆️
skip-bytebuffers-true 63.32% <68.14%> (+35.60%) ⬆️
temurin 63.45% <68.14%> (+1.70%) ⬆️
unittests 63.45% <68.14%> (+1.70%) ⬆️
unittests1 56.07% <58.10%> (+9.18%) ⬆️
unittests2 34.00% <48.88%> (+6.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

String logMessage = "Caught exception while reading data.";
observer.accept(new MinionTaskBaseObserverStats.StatusEntry.Builder()
.withLevel(MinionTaskBaseObserverStats.StatusEntry.LogLevel.ERROR)
.withStatus(logMessage + " Reason : " + e.getMessage())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The space between Reason and : could be avoided. Reason: the error reason.

}
Map<String, MinionTaskBaseObserverStats> progressStatsMap = new HashMap<>();
MinionEventObserver observer = MinionEventObservers.getInstance().getMinionEventObserver(subtaskName);
MinionTaskBaseObserverStats progressStats = observer.getProgressStats();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this produce an NPE if observer is null?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it will, my bad. Will fix it, thanks for catching it!

}
String incomingStage = statusEntry.getStage();
if (_taskProgressStats.getCurrentStage() == null) {
_taskProgressStats.setCurrentStage(incomingStage != null ? incomingStage : MinionTaskState.UNKNOWN.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment on conditions when the incoming stage would be null.


public void start() {
_startTimeMs = System.currentTimeMillis();
_resumeTimeMs = _startTimeMs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the real use of the field _resumeTimeMs? Do we expect the same stage to started more than once?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I was thinking when introducing it. I wanted to keep this flexibility in case we adopt this where a set of stages is ran in a loop, in which case the cumulative stage time will be tracked.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants