Display Token usage #418

Closed
lukehinds opened this issue Dec 19, 2024 · 4 comments · Fixed by #788

Comments
@lukehinds (Contributor) commented Dec 19, 2024

Can we display the number of tokens used by any given provider? This would be useful for the new Copilot free tier.

A nice extra would be to record the token usage per conversation. This would give users insight into which prompts are more costly and allow for optimization.

kudos @craigmcl for the idea.

@lukehinds (Contributor, Author)
This will require #454 to land first, so let's keep it in the backlog for now.

@aponcedeleonch (Contributor) commented Jan 24, 2025

On initial investigation, the used tokens are listed neither in the request nor in the response from the LLM.

Request

{
  "messages": [...],
  "model": "gpt-4o",
  "temperature": 0.1,
  "top_p": 1,
  "max_tokens": 4096,
  "n": 1,
  "stream": true
}

max_tokens: The maximum number of tokens that can be generated in the chat completion. Reference

Response

[
"{\"id\":\"\",\"created\":0,\"model\":\"\",\"object\":\"chat.completion.chunk\",\"choices\":[]}", 
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\",\"role\":\"assistant\"}}]}", 
....
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"delta\":{\"role\":\"assistant\"}}]}"
]

There are 2 alternatives:

  1. See if the LLM providers list in their response the tokens they have used. At first glance this looks possible, at least for OpenAI.
  2. Use our own tokenizer. We could tokenize the request and response ourselves and calculate the number of used tokens that way. The big drawback is that the tokens we calculate may not match the tokens actually used by the LLM, but it would at least be an approximation (a rough sketch follows below).
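
For alternative 2, here is a minimal sketch of what the tokenizer approach could look like (assuming tiktoken; the helper name is hypothetical):

```python
# Rough illustration of alternative 2: approximate token usage with a local
# tokenizer (tiktoken here). The counts will not exactly match the provider's
# accounting (chat framing adds a few tokens per message), but they give a
# usable ballpark figure.
import tiktoken

def approximate_usage(messages: list[dict], completion: str, model: str = "gpt-4o") -> dict:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a generic encoding for model names tiktoken doesn't know.
        enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = sum(len(enc.encode(m.get("content") or "")) for m in messages)
    completion_tokens = len(enc.encode(completion))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```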

@aponcedeleonch (Contributor) commented Jan 24, 2025

I have been playing around with the APIs. It's possible for all providers: all of them include the token usage automatically when the request is non-streaming. For streaming we need to explicitly request it, except for Anthropic, which already includes it starting with the first chunk.

Anthropic

The token usage comes split across two chunks: one at the beginning and another at the end.

// First chunk
{
  "type": "message_start",
  "message": {
    "id": "msg_011itXmqtd7KHB6adpbDdwWX",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-5-sonnet-20241022",
    "content": [],
    "stop_reason": null,
    "stop_sequence": null,
    "usage": {
      "input_tokens": 10,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 0,
      "output_tokens": 1
    }
  }
}

// Last chunk
{
  "type": "message_delta",
  "delta": {
    "stop_reason": "end_turn",
    "stop_sequence": null
  },
  "usage": {
    "output_tokens": 13
  }
}
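
A minimal sketch of stitching those two chunks together (field names taken from the payloads above; the helper itself is hypothetical):

```python
# Accumulate Anthropic token usage across a stream: input_tokens arrive in
# the message_start event, the final output_tokens in the closing message_delta.
import json

def anthropic_stream_usage(raw_chunks: list[str]) -> dict:
    input_tokens = 0
    output_tokens = 0
    for raw in raw_chunks:
        event = json.loads(raw)
        if event.get("type") == "message_start":
            input_tokens = event["message"].get("usage", {}).get("input_tokens", 0)
        elif event.get("type") == "message_delta":
            # The last delta carries the cumulative output token count.
            output_tokens = event.get("usage", {}).get("output_tokens", output_tokens)
    return {"input_tokens": input_tokens, "output_tokens": output_tokens}
```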

OpenAI, Ollama, VLLM

We need to explicitly request the token usage when the request is set to streaming, which it is most of the time for clients. Note the stream_options field in the following example request.

curl -s -X POST "<api>/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{
        "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
        "stream": true,
        "stream_options": {"include_usage": true},
        "messages": [{"role": "user", "content": "Hello, world"}]
    }'

Response with the token usage in the last chunk; it comes after the chunk with finish_reason: "stop".

{
  "id": "chatcmpl-4933d74a8f8b4a82a855439eeab1ae3d",
  "object": "chat.completion.chunk",
  "created": 1737723773,
  "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
  "choices": [],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 42,
    "completion_tokens": 10
  }
}
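
For reference, the same request through the openai Python client could look like this (endpoint and token are placeholders, as in the curl above):

```python
# With include_usage set, `chunk.usage` is None on every chunk except the
# final one, which carries the prompt/completion/total token counts.
from openai import OpenAI

client = OpenAI(base_url="<api>/v1", api_key="<token>")

stream = client.chat.completions.create(
    model="unsloth/Qwen2.5-Coder-32B-Instruct",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello, world"}],
)

usage = None
for chunk in stream:
    if chunk.usage is not None:  # only the last chunk carries usage
        usage = chunk.usage

print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```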

@aponcedeleonch (Contributor) commented Jan 27, 2025

Costs

  • LiteLLM uses a static file to map a model name to its costs. Link to code, Link to file
  • It's not clear where OpenRouter gets its info, but we can make an API call to get the prices they use: curl "https://openrouter.ai/api/v1/models". Reference link (a quick sketch follows below)
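
A quick sketch of pulling those OpenRouter prices (response shape as documented at the time of writing; prices come back as strings in USD per token, keyed by model id):

```python
# Fetch the model list from OpenRouter and print the per-token prices it uses.
import requests

resp = requests.get("https://openrouter.ai/api/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"][:5]:  # first few models, for illustration
    pricing = model.get("pricing", {})
    print(model["id"], "prompt:", pricing.get("prompt"), "completion:", pricing.get("completion"))
```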

aponcedeleonch added commits that referenced this issue between Jan 27 and Jan 29, 2025
lukehinds pushed a commit that referenced this issue Jan 31, 2025
* Include the token usage for every conversation and workspace

Related: #418

This PR introduces the changes necessary to track the used tokens per request and then process them to return them in the API.

Specific changes:
- Make sure we process the whole stream and record the usage at the very end
- Include the flag `"stream_options": {"include_usage": True},` so the providers respond with the tokens
- Added the necessary processing for the API
- Modified the initial API models to display the tokens and their price correctly

* Moved token recording to the DB

* Changed the token usage code to get its info from a file and added a GHA to fetch the file periodically

* Formatting changes

* Move model cost to a dedicated folder

* Fix problems with copilot streaming
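
As a rough illustration of the "record at the very end" approach described in the commit message (all names here are hypothetical, not the actual codegate implementation):

```python
# Hypothetical wrapper: forward provider chunks to the client unchanged,
# then persist whatever usage payload appeared once the stream is exhausted.
from typing import AsyncIterator, Callable

async def record_usage(
    stream: AsyncIterator[dict], save_usage: Callable[[dict], None]
) -> AsyncIterator[dict]:
    usage = None
    async for chunk in stream:
        if chunk.get("usage"):
            usage = chunk["usage"]
        yield chunk  # pass the chunk through untouched
    if usage is not None:
        save_usage(usage)  # e.g. write prompt/completion tokens to the DB
```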