[V1][Metrics] Add several request timing histograms #12644

Open · wants to merge 1 commit into main from metrics-v1-prometheus-logger-6

Conversation

markmc (Contributor) commented Feb 1, 2025

Follow on from #12579, part of #10582.

Add the following:

  • vllm:e2e_request_latency_seconds
  • vllm:request_queue_time_seconds
  • vllm:request_inference_time_seconds
  • vllm:request_prefill_time_seconds
  • vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time timestamp recorded by the frontend.
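
For orientation, here's a minimal sketch of how histograms like these could be declared with prometheus_client; the bucket boundaries and label names below are illustrative assumptions, not necessarily what this PR uses:

```python
from prometheus_client import Histogram

# Illustrative buckets (seconds); the actual boundaries in the PR may differ.
_BUCKETS = [0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 30.0, 60.0]

e2e_request_latency = Histogram(
    "vllm:e2e_request_latency_seconds",
    "Histogram of end-to-end request latency in seconds.",
    labelnames=["model_name"], buckets=_BUCKETS)

request_queue_time = Histogram(
    "vllm:request_queue_time_seconds",
    "Histogram of time spent waiting in the queue, in seconds.",
    labelnames=["model_name"], buckets=_BUCKETS)

# ... vllm:request_inference_time_seconds, vllm:request_prefill_time_seconds
# and vllm:request_decode_time_seconds follow the same pattern.

# The frontend would observe one value per finished request, e.g.:
# e2e_request_latency.labels(model_name="my-model").observe(latency_s)
```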

For the rest ... we want to capture (in histograms) precise per-request timing intervals between certain events in the engine core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the engine core freed up as much as possible. We need to calculate these intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per request) as a sequence of timestamped events, and the frontend will calculate intervals and log them. Where we record these timestamped events:

  • QUEUED: scheduler add_request()
  • SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization timestamp recorded on EngineCoreOutputs.
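
To make the flow concrete, here is a rough sketch of how the frontend could turn engine-core timestamps into those intervals; the class and field names are assumptions for illustration, not the actual vLLM definitions:

```python
from dataclasses import dataclass
from enum import IntEnum, auto


class EngineCoreEventType(IntEnum):
    """Hypothetical event types attached to a request's output."""
    QUEUED = auto()
    SCHEDULED = auto()


@dataclass
class EngineCoreEvent:
    type: EngineCoreEventType
    timestamp: float  # monotonic time recorded in the engine core process


@dataclass
class RequestTimes:
    """Per-request timestamps accumulated by the frontend."""
    queued_ts: float = 0.0
    scheduled_ts: float = 0.0
    first_token_ts: float = 0.0


def intervals_from_output(times: RequestTimes,
                          events: list[EngineCoreEvent],
                          new_token_ts: float,
                          is_first_token: bool,
                          is_finished: bool) -> dict[str, float]:
    """Compute histogram intervals. new_token_ts is the implicit NEW_TOKENS
    timestamp taken from EngineCoreOutputs (same engine-core clock)."""
    intervals: dict[str, float] = {}
    for event in events:
        if event.type == EngineCoreEventType.QUEUED:
            times.queued_ts = event.timestamp
        elif event.type == EngineCoreEventType.SCHEDULED:
            times.scheduled_ts = event.timestamp
            intervals["queue_time"] = times.scheduled_ts - times.queued_ts
    if is_first_token:
        times.first_token_ts = new_token_ts
        intervals["prefill_time"] = new_token_ts - times.scheduled_ts
    if is_finished:
        intervals["decode_time"] = new_token_ts - times.first_token_ts
        intervals["inference_time"] = new_token_ts - times.scheduled_ts
    return intervals
```

Every timestamp in a subtraction above comes from the engine core's clock, so the intervals stay meaningful even though they are computed in the frontend process.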

github-actions bot commented Feb 1, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added the v1 label Feb 1, 2025
@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 1, 2025
@markmc markmc force-pushed the metrics-v1-prometheus-logger-6 branch from 5755e17 to aa6b6a9 Compare February 5, 2025 11:48
@markmc markmc marked this pull request as ready for review February 5, 2025 11:49
robertgshaw2-redhat (Collaborator) left a comment

Hey @markmc - left some small nits. This is looking good.

The one thing I am not sure about is that doing the timestamps from the perspective of the AsyncLLM does not quite give us the granularity to make a distinction between queue_time, prefill_time, and inference_time, since if the prompt length is < chunked prefill size (which is usually the case), the timestamp of scheduled_time will be the same as the timestamp of first_token_time (since we will generate the first token in the same step as the first time it is scheduled).

I'm not sure how to get around this without inserting some timing logic into EngineCore which also feels not ideal + brittle. What do you think?

markmc (Contributor, Author) commented Feb 5, 2025

> Hey @markmc - left some small nits. This is looking good.
>
> The one thing I am not sure about is that doing the timestamps from the perspective of the AsyncLLM does not quite give us the granularity to make a distinction between queue_time, prefill_time, and inference_time, since if the prompt length is < chunked prefill size (which is usually the case), the timestamp of scheduled_time will be the same as the timestamp of first_token_time (since we will generate the first token in the same step as the first time it is scheduled).
>
> I'm not sure how to get around this without inserting some timing logic into EngineCore which also feels not ideal + brittle. What do you think?

Ugh, how right you are! Yes, what I'm doing now seems very broken.

So, basically we would want scheduled_time to be before execute_model() is called?

We have:

```
<<< arrival timestamp >>>
  [ queued interval ]
<<< scheduled_timestamp >>>
  [ prefill interval ]
<<< first token timestamp >>>
  [ decode interval ]
<<< finished timestamp >>>
```

And the closest AsyncLLM can get to scheduled_timestamp is like "the completion of the first step that included this request" - we have no visibility into events that occur during a step.

So I guess our options are:

  1. Compare timestamps across processes - e.g. record scheduled_timestamp in the core and use it for interval calculations in the frontend. I would like to avoid this because we should be using monotonic time (unaffected by system clock changes) and you can't compare monotonic timestamps from different processes.
  2. Have the core send back an interval like "scheduling for this request happened N ms before first token"
  3. Calculate these intervals in the core - queue_time would need to be relative to the request's arrival in the core. I guess the computations would need to happen in the schedule() loop (queue time) and the update_from_output() loop (prefill and decode time).
  4. Collect timestamps in the core, but calculate the intervals on the frontend side. This might not be terrible, and likely more accurate?

I'll take a stab at (4), but definitely welcome feedback!
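
To spell out the clock issue behind options (1) and (4), here is a small illustration; the comments describe the intended usage, and this is only a sketch, not the PR's code:

```python
import time

# time.monotonic() has an undefined reference point: only the difference
# between two values taken on the same clock is meaningful, so a timestamp
# recorded in the engine core process can't safely be compared with one
# recorded in the frontend process (the concern with option 1).

# With option (4), both ends of every subtraction come from the engine
# core's clock; here the two calls stand in for the timestamps the core
# would record at schedule() and update_from_output():
scheduled_ts = time.monotonic()
# ... model execution would happen here ...
new_token_ts = time.monotonic()

# The frontend receives both values and only ever computes differences,
# never compares them against its own clock:
inference_interval = new_token_ts - scheduled_ts
print(f"inference interval: {inference_interval:.6f}s")
```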

robertgshaw2-redhat (Collaborator) commented

> > Hey @markmc - left some small nits. This is looking good.
> > The one thing I am not sure about is that doing the timestamps from the perspective of the AsyncLLM does not quite give us the granularity to make a distinction between queue_time, prefill_time, and inference_time, since if the prompt length is < chunked prefill size (which is usually the case), the timestamp of scheduled_time will be the same as the timestamp of first_token_time (since we will generate the first token in the same step as the first time it is scheduled).
> > I'm not sure how to get around this without inserting some timing logic into EngineCore which also feels not ideal + brittle. What do you think?
>
> Ugh, how right you are! Yes, what I'm doing now seems very broken.
>
> So, basically we would want scheduled_time to be before execute_model() is called?
>
> We have:
>
> ```
> <<< arrival timestamp >>>
>   [ queued interval ]
> <<< scheduled_timestamp >>>
>   [ prefill interval ]
> <<< first token timestamp >>>
>   [ decode interval ]
> <<< finished timestamp >>>
> ```
>
> And the closest AsyncLLM can get to scheduled_timestamp is like "the completion of the first step that included this request" - we have no visibility into events that occur during a step.
>
> So I guess our options are:
>
>   1. Compare timestamps across processes - e.g. record scheduled_timestamp in the core and use it for interval calculations in the frontend. I would like to avoid this because we should be using monotonic time (unaffected by system clock changes) and you can't compare monotonic timestamps from different processes.
>   2. Have the core send back an interval like "scheduling for this request happened N ms before first token"
>   3. Calculate these intervals in the core - queue_time would need to be relative to the request's arrival in the core. I guess the computations would need to happen in the schedule() loop (queue time) and the update_from_output() loop (prefill and decode time).
>   4. Collect timestamps in the core, but calculate the intervals on the frontend side. This might not be terrible, and likely more accurate?
>
> I'll take a stab at (4), but definitely welcome feedback!

  • I think (4) is very reasonable if we want to preserve these metrics. Especially if the timestamps are batch level (rather than per-request), the overhead should be very small (which is P0 for EngineCore).
  • The other option would be to just deprecate the concept of prefill_time and only implement metrics that can be computed from the POV of the AsyncLLM. This metric is valuable, but I would not describe prefill_time as an absolute must-have. I'm mostly just concerned about churning our users' Grafana setups. The other negative of this approach is that we might want to implement future metrics that need EngineCore-level granularity, and the setup you describe creates a good framework to build upon.

WDYT? Do you have any experience deprecating telemetry like this?

markmc (Contributor, Author) commented Feb 6, 2025

>   • I think (4) is very reasonable if we want to preserve these metrics. Especially if the timestamps are batch level (rather than per-request), the overhead should be very small (which is P0 for EngineCore).

See the updated PR for what I got to; commit message pasted below.

I might be missing something obvious on how to do this at the batch level, I'm really just thinking about it now. I guess the NEW_TOKENS event (and its timestamp) applies to all of the requests in the EngineCoreOutputs. The SCHEDULED and QUEUED events are only ever associated with a subset of the requests in EngineCoreOutputs though. So ... yeah ... probably can do better on this.

>   • The other option would be to just deprecate the concept of prefill_time and only implement metrics that can be computed from the POV of the AsyncLLM. This metric is valuable, but I would not describe prefill_time as an absolute must-have. I'm mostly just concerned about churning our users' Grafana setups. The other negative of this approach is that we might want to implement future metrics that need EngineCore-level granularity, and the setup you describe creates a good framework to build upon.
>
> WDYT? Do you have any experience deprecating telemetry like this?

I guess I'm being conservative and assuming that removing anything that's in the example dashboard would be disruptive, and so we'd need a good reason to not add it - e.g. if the overhead was too large. So I guess the point of this discussion is to see if we can do it with a reasonable level of overhead.

Commit message below:

We want to capture (in histograms) precise per-request timing
intervals between certain events in the engine core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them.

Where we record these timestamped events:
- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()
- NEW token: scheduler update_from_output()

There will always be a NEW_TOKEN event in each EngineCoreOutput, but there
may also be QUEUED and SCHEDULED events included.

markmc (Contributor, Author) commented Feb 7, 2025

> I might be missing something obvious on how to do this at the batch level, I'm really just thinking about it now. I guess the NEW_TOKENS event (and its timestamp) applies to all of the requests in the EngineCoreOutputs. The SCHEDULED and QUEUED events are only ever associated with a subset of the requests in EngineCoreOutputs though. So ... yeah ... probably can do better on this.

Good call on that, done now 👍
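
For illustration, the batch-level shape could look roughly like this; the type and field names are assumptions for illustration, not the actual vLLM definitions:

```python
import time
from dataclasses import dataclass, field


@dataclass
class EngineCoreOutput:
    request_id: str
    new_token_ids: list[int]
    # Per-request QUEUED/SCHEDULED events, only populated on the iterations
    # where those transitions actually happened.
    events: list | None = None  # list of EngineCoreEvent, see earlier sketch


@dataclass
class EngineCoreOutputs:
    outputs: list[EngineCoreOutput] = field(default_factory=list)
    # A single batch-level timestamp recorded when the outputs for a step
    # are assembled; it serves as the implicit NEW_TOKENS time for every
    # request in the batch, instead of one timestamp per request.
    timestamp: float = field(default_factory=time.monotonic)
```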

@markmc markmc force-pushed the metrics-v1-prometheus-logger-6 branch from 3725022 to 38cf896 Compare February 7, 2025 15:24

mergify bot commented Feb 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 7, 2025
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time
timestamp recorded by the frontend.

For the rest ... we want to capture (in histograms) precise
per-request timing intervals between certain events in the engine
core:

```
  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to first token time) ]
    [ inference interval (relative to scheduled time) ]
  << new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the
engine core freed up as much as possible. We need to calculate these
intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per
request) as a sequence of timestamped events, and the frontend will
calculate intervals and log them. Where we record these timestamped
events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization
timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <[email protected]>
@markmc markmc force-pushed the metrics-v1-prometheus-logger-6 branch from 38cf896 to 37e5b11 Compare February 7, 2025 16:57
markmc (Contributor, Author) commented Feb 7, 2025

Rebased onto the logprobs commit 👍

@mergify mergify bot removed the needs-rebase label Feb 7, 2025