[V1][Metrics] Add several request timing histograms #12644
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀
Force-pushed from 5755e17 to aa6b6a9
Hey @markmc - left some small nits. This is looking good.
The one thing I am not sure about is that taking the timestamps from the perspective of the `AsyncLLM` does not quite give us the granularity to distinguish between `queue_time`, `prefill_time`, and `inference_time`: if the prompt length is less than the chunked prefill size (which is usually the case), the timestamp of `scheduled_time` will be the same as the timestamp of `first_token_time`, since we generate the first token in the same step as the first time the request is scheduled.
I'm not sure how to get around this without inserting some timing logic into `EngineCore`, which also feels not ideal + brittle. What do you think?
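To make the concern concrete, here is a minimal Python sketch (hypothetical field names, not vLLM's actual API) of frontend-side interval math: when the first token is generated in the same engine step that first scheduled the request, the frontend observes `scheduled_time` and `first_token_time` at the same moment, so the prefill interval collapses to zero.

```python
# Minimal sketch (hypothetical names, not vLLM's actual API) of frontend-side
# interval math, illustrating the granularity problem described above.
from dataclasses import dataclass

@dataclass
class FrontendTimestamps:
    arrival_time: float       # recorded when the request reaches the frontend
    scheduled_time: float     # frontend only learns this when output arrives
    first_token_time: float   # recorded when the first token arrives

def frontend_intervals(ts: FrontendTimestamps) -> dict[str, float]:
    return {
        "queue_time": ts.scheduled_time - ts.arrival_time,
        # Degenerates to 0.0 whenever the first token is produced in the same
        # engine step that first scheduled the request (prompt fits in one
        # chunked-prefill step), because both timestamps are taken together.
        "prefill_time": ts.first_token_time - ts.scheduled_time,
    }

# Example: both timestamps observed together on the first output.
print(frontend_intervals(FrontendTimestamps(100.0, 100.8, 100.8)))
# e.g. {'queue_time': ~0.8, 'prefill_time': 0.0}
```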
Ugh, how right you are! Yes, what I'm doing now seems very broken.

So, basically we would want: ...

We have: ...

And the closest: ...

So I guess our options are: ...

I'll take a stab at (4), but definitely welcome feedback!
WDYT? Do you have any experience deprecating telemetry like this?
See the new PR for what I got to, commit message pasted below.

I might be missing something obvious on how to do this at the batch level, I'm really just thinking about it now. I guess the ...
I guess I'm being conservative and assuming that removing anything that's in the example dashboard would be disruptive, and so we'd need a good reason to not add it - e.g. if the overhead was too large. So I guess the point of this discussion is to see if we can do it with a reasonable level of overhead. Commit message below:
Good call on that, done now 👍
Force-pushed from 3725022 to 38cf896
This pull request has merge conflicts that must be resolved before it can be merged.
Follow on from vllm-project#12579, part of vllm-project#10582.

Add the following:

- vllm:e2e_request_latency_seconds
- vllm:request_queue_time_seconds
- vllm:request_inference_time_seconds
- vllm:request_prefill_time_seconds
- vllm:request_decode_time_seconds

e2e_request_latency is calculated relative to the arrival_time timestamp recorded by the frontend. For the rest ... we want to capture (in histograms) precise per-request timing intervals between certain events in the engine core:

```
<< queued timestamp >>
  [ queue interval ]
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to first token time) ]
  [ inference interval (relative to scheduled time) ]
<< new token timestamp (FINISHED) >>
```

We want to collect these metrics in the frontend process, to keep the engine core freed up as much as possible. We need to calculate these intervals based on timestamps recorded by the engine core.

Engine core will include these timestamps in EngineCoreOutput (per request) as a sequence of timestamped events, and the frontend will calculate intervals and log them. Where we record these timestamped events:

- QUEUED: scheduler add_request()
- SCHEDULED: scheduler schedule()

There is an implicit NEW_TOKENS timestamp based on an initialization timestamp recorded on EngineCoreOutputs.

Signed-off-by: Mark McLoughlin <[email protected]>
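For illustration, here is a rough Python sketch of the interval arithmetic the commit message describes, computed in the frontend from engine-core event timestamps. The class and field names are assumptions made for this example, not the actual `EngineCoreOutput` schema.

```python
# Rough sketch of the interval arithmetic described above, computed on the
# frontend from timestamps recorded by the engine core. Class/field names are
# illustrative assumptions, not the actual EngineCoreOutput schema.
from dataclasses import dataclass

@dataclass
class RequestTimestamps:
    queued_ts: float            # QUEUED: recorded in scheduler add_request()
    scheduled_ts: float         # SCHEDULED: recorded in scheduler schedule()
    new_token_ts: list[float]   # one NEW_TOKENS timestamp per output step

def request_intervals(ts: RequestTimestamps, arrival_time: float) -> dict[str, float]:
    first_token_ts = ts.new_token_ts[0]
    finished_ts = ts.new_token_ts[-1]
    return {
        "queue_time": ts.scheduled_ts - ts.queued_ts,
        "prefill_time": first_token_ts - ts.scheduled_ts,
        "decode_time": finished_ts - first_token_ts,      # relative to first token time
        "inference_time": finished_ts - ts.scheduled_ts,  # relative to scheduled time
        "e2e_latency": finished_ts - arrival_time,        # relative to frontend arrival_time
    }
```

Each of these values would then be observed into the corresponding `vllm:*_seconds` histogram by the frontend.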
Force-pushed from 38cf896 to 37e5b11
Rebased onto the logprobs commit 👍