Add token cost tracking #402
Comments
For example, suppose we send 3 messages, each with 100 tokens, and we get 3 replies, each with 100 tokens, and the system message is 200 tokens. Our first message will be 200+(100+100)=400 tokens, our second message will be 200+(100+100)+(100+100)=600 tokens, and our third message will be 200+(100+100)*3=800 tokens. Given the OpenAI pricing for GPT-4o of $5 for 1M input tokens and $15 for 1M output tokens, and with our 1,500 input tokens (300+500+700 across the three requests) and 300 output tokens, we'll get some cost. Is this your understanding?

As well, when we have very long conversations, there will be a moment when some of the preceding messages may be dropped or summarized to reduce the token usage, or at least get it within the context limit. In other words, the number of tokens in the latest message may not help us fully work out the cost of the request.

It's almost like we need to keep track of each individual API request and the number of (estimated) input and output tokens. If the API can give us the actual number of input tokens, even better, which seems possible with OpenAI if we pass the |
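A minimal Ruby sketch of the arithmetic in the comment above. It assumes every request resends the system message plus all earlier turns (no truncation or summarizing), and it uses the GPT-4o prices quoted in the comment. The method and constant names are hypothetical and not part of the codebase.

```ruby
# Hypothetical helper: totals the tokens billed across a conversation when
# each request resends the system message and the full history so far.
def conversation_token_totals(system_tokens:, user_tokens:, reply_tokens:)
  input_total = 0
  output_total = 0
  context = system_tokens

  user_tokens.zip(reply_tokens).each do |user, reply|
    context += user          # the new user message joins the context
    input_total += context   # the whole context is billed as input
    output_total += reply    # the reply is billed as output
    context += reply         # the reply is resent with the next request
  end

  { input: input_total, output: output_total }
end

# Prices quoted in the comment, in USD per 1M tokens.
INPUT_USD_PER_1M = 5
OUTPUT_USD_PER_1M = 15

# Returns an exact Rational so no float rounding creeps in.
def estimated_cost_usd(totals)
  Rational(totals[:input] * INPUT_USD_PER_1M + totals[:output] * OUTPUT_USD_PER_1M, 1_000_000)
end

totals = conversation_token_totals(
  system_tokens: 200,
  user_tokens: [100, 100, 100],
  reply_tokens: [100, 100, 100]
)
puts totals.inspect
puts estimated_cost_usd(totals).to_f
```

For the three-message example this comes to 1,500 input tokens and 300 output tokens, or about $0.012.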
@matthewbennink Yes, that's why I was thinking we update estimated_price twice. When we are generating a new message, we pass a newly created message into get_next_message_job (it's persisted to the db already) which, in turn, passes it into ai_backend. I think the moment that ai_backend is sending its request to the API, which includes all of the previous messages in the conversation, we can add up the tokens and save the preliminary estimated_cost on the blank message. Then when the response comes back to get_next_message_job, we do a final message.save and we can do one more token cost estimate and add it onto the estimated_price we previously calculated. I didn't know about include_usage, that's cool! The OpenAI gem just passes the hash of params that we send straight on to OpenAI, so it should be supported. |
Do you think it's acceptable to store the estimated price as a float inside the database? It'd be per message, so they'd all be very small values that, added up, may include some amount of rounding error. It might just average out in the end and/or it might be fine as an estimate.

The alternative would be to store the token counts, perhaps storing the input/prompt token count on the "user" messages and the output/completion token count on the "assistant" messages. (I'm not sure if we'd need to represent "tool" messages differently. Are there other message roles I'm missing?) The monthly price estimate would then need to find all of those messages, sum the token counts by language model, and multiply each token count by its respective cost. That doesn't seem like it'd be particularly slow. E.g.,

input_cost = Message.user.created_after(Date.current.beginning_of_month).joins(assistant: :language_model).sum("messages.token_count * language_models.input_cost_per_1m_tokens_in_millionths_of_cents")
output_cost = Message.assistant.created_after(Date.current.beginning_of_month).joins(assistant: :language_model).sum("messages.token_count * language_models.output_cost_per_1m_tokens_in_millionths_of_cents")
total_cost_in_cents = input_cost + output_cost

I'm sure I've gotten some of that wrong, but maybe the idea is there.

I've never had to represent small prices before, so I'm struggling a bit there. I figure we want to find a way to represent, e.g. 1B tokens per 1¢ as a limit, and then you can use an integer to represent the cost of X cents per 1B tokens based on today's prices. So, $5 / 1M tokens might be represented as 500000, $.01 / 1M tokens as 1000, and $.00001 / 1M tokens as 1, which seems like a price point we'll never get to.

I also think it'd be perfectly reasonable to store the costs as floats per 1M or 1B tokens and just go from there. So, $5000 / 1B, $10 / 1B, and $.01 / 1B in the examples above. Given it's just an estimate, it's perhaps worth keeping things simple. But I wanted to lay out the distinction between storing very small prices per message like .00001 USD vs storing integer token values such as 300. Once we have a data type, I'd be happy to open up a PR to keep things moving. |
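The integer representation described above (whole cents per 1B tokens) can be sketched in plain Ruby. The helper names are hypothetical. Prices are passed as strings so that converting them never goes through a float, and the cost is kept as an exact Rational until display time.

```ruby
# Hypothetical conversion: USD per 1M tokens -> integer cents per 1B tokens.
# $1 per 1M tokens = 100 cents per 1M tokens = 100_000 cents per 1B tokens.
def cents_per_1b(usd_per_1m)
  (usd_per_1m.to_r * 100_000).to_i
end

# Exact cost in cents for a token count at an integer cents-per-1B price.
def cost_in_cents(tokens, price_cents_per_1b)
  Rational(tokens * price_cents_per_1b, 1_000_000_000)
end

puts cents_per_1b("5")        # the $5 / 1M example
puts cents_per_1b("0.01")     # the $.01 / 1M example
puts cents_per_1b("0.00001")  # the $.00001 / 1M example

# 300 output tokens at $15 / 1M tokens, in cents.
puts cost_in_cents(300, cents_per_1b("15")).to_f
```

With this scheme the stored prices and token counts are all integers, and any rounding happens only once, when the total is formatted for the user.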
@matthewbennink hmm, my instinct is to just store the estimate. I think it should be fine to store it as a float. Is the concern you're raising that the estimate will somehow be worse if we store it as a float? I don't think I understand that. Or maybe what you're suggesting is that there is some rounding that will inevitably occur by storing small floats, which wouldn't occur if we stored tokens? I guess the key question is: what's the accuracy of floats in a Postgres table? I'm actually not sure of that. I can't think of a time I had to store tiny fractions as a float. That may be worth a little bit of investigating.

I think that storing currency amounts rather than tokens will be a bit easier to deal with. It makes it so we can do a really nice query like

One small improvement: instead of storing a DOLLAR value, store a CENTS value. So maybe the column is named: estimate_in_cents. By shaving off two decimal points we probably get a lot more accuracy, and it's easier for us humans to read 0.03 cents than to read 0.0003 dollars.

And I lean towards each message having a single estimate, where that estimate is the cost of generating that whole message (both the input and output tokens required to generate it). That could also facilitate a future auto-truncation of history when the per-message cost rises above some cutoff.

I don't think you need to think about tool messages any differently than text messages except in one respect:
|
I don't fully understand why we need the cost estimate as a database column. This is a fixed value derived from the token count and the current LLM price. If the LLM provider changes its pricing structure in the middle of a monthly period, one probably needs more than one simple magic number per LLM, but the token counts are the source of truth from which all costs can be derived.

I understand that a per-LLM overall token count (for input/output tokens) could be used as an optimization so that one doesn't need to calculate all tokens for a certain period on the fly. This number could be calculated, e.g., via a background job after each LLM round-trip. The OpenAI cost overview and detail pages are also not exactly real-time, so a slight delay between each LLM round-trip and this calculated DB number should be acceptable.

What would also interest me as a regular user of HostedGPT is not only the effective cost, but also the token count itself. Having the possibility to toggle between them (clicking on the numbers?) would be really nice. For non-English languages, the token count is often much higher. |
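One way to picture "more than one magic number per LLM" is a dated price history, with cost derived from stored token counts at read time. This is only a sketch: the structs are hypothetical stand-ins rather than the app's models, and the second price row is an invented mid-month change used purely for illustration.

```ruby
require "date"

# Hypothetical stand-ins for a price-history row and a per-request usage record.
PricePeriod = Struct.new(:starts_on, :input_usd_per_1m, :output_usd_per_1m)
Usage = Struct.new(:on, :input_tokens, :output_tokens)

# The first row uses the GPT-4o prices quoted earlier in the thread;
# the second row is made up to model a price change mid-month.
PRICE_HISTORY = [
  PricePeriod.new(Date.new(2024, 6, 1), 5, 15),
  PricePeriod.new(Date.new(2024, 6, 15), 4, 12)
].freeze

# Picks the most recent price period that had started by the given date.
# Returns nil for dates before the first period; a real version would handle that.
def price_on(date)
  PRICE_HISTORY.select { |p| p.starts_on <= date }.max_by(&:starts_on)
end

# Derives the cost from token counts, so the counts remain the stored truth.
def cost_usd(usages)
  usages.sum(Rational(0)) do |u|
    price = price_on(u.on)
    Rational(
      u.input_tokens * price.input_usd_per_1m + u.output_tokens * price.output_usd_per_1m,
      1_000_000
    )
  end
end

usages = [
  Usage.new(Date.new(2024, 6, 10), 1_000_000, 0),
  Usage.new(Date.new(2024, 6, 20), 1_000_000, 0)
]
puts cost_usd(usages).to_f
```

The same derived total could be cached per LLM by a background job, as suggested above, without changing what is stored per message.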
@lumpidu Yes, good point on both. We don’t need cost on messages and we could cache things on the LLM. I agree on seeing token count. It could just be a simple paren like “$14.32 (13,729 tokens)” |
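A plain-Ruby sketch of that label format. The helper names are hypothetical, and the cost is assumed to arrive as an integer number of cents. Inside the Rails views, helpers such as number_to_currency and number_with_delimiter would probably do this job instead.

```ruby
# Inserts thousands separators into a non-negative integer, e.g. 13729 -> "13,729".
def with_delimiter(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(",").reverse
end

# Builds the "cost (tokens)" label from integer cents and a token count.
def usage_label(cost_cents:, tokens:)
  format("$%.2f (%s tokens)", cost_cents / 100.0, with_delimiter(tokens))
end

puts usage_label(cost_cents: 1432, tokens: 13_729)
```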
Hi @lumpidu, I wanted to check in on this task and see if you had made any progress on it? And if not, let me know if you're still up for it. |
@krschacht, probably later this week I will dive into it |
I think a very first PR could consist of: internally tracking how much cost ($) every message and conversation has incurred, so that a user can keep a close eye on their total spend this month.
High level:
user.messages.created_after(Date.current.beginning_of_month).sum(:estimated_cost)
Off the top of my head, here is how I think an implementation could go:
Add columns to the messages table such as token_count and price.
Open backend/open_ai.rb and find the point where we're actually calling the API (client.chat) and add this new flag include_usage: true (explained here and documented here with example code here).

I think the key thing to validate is: does this final chunk that includes usage definitely report on both output and input? Hopefully so. Meaning, we submit a request to OpenAI with a bunch of tokens (input) and then it replies with a bunch of tokens (output). The way I'd figure this out is to simply put a breakpoint where the chunks come in, which is the stream_handler method in this same file.

We can double-check the token counting ourselves by putting more breakpoints right where we call client.chat and counting the number of tokens we are submitting. After the message finishes streaming it gets saved to the database, so we can just count the number of tokens in Message.last.content_text.

Once we confirm that this last chunk contains our token count, the content_chunks get passed all the way up to the worker right here so we can set message.token_count.

Anthropic, similarly, includes token counts in their streaming chunks. On this page, if you search for the string "usage", it shows that their first streamed chunk reports the input tokens and their last streamed chunk reports the output tokens: https://docs.anthropic.com/en/api/messages-streaming The Anthropic backend is backend/anthropic.rb
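A hedged sketch of pulling token counts out of already-parsed streaming chunks for both providers. The hash shapes follow the providers' documented streaming formats as I understand them; the helper names, and the idea that the stream handlers hand us plain hashes, are assumptions. Note that in OpenAI's API the flag is nested as stream_options: { include_usage: true }.

```ruby
# OpenAI: with stream_options: { include_usage: true }, the final chunk carries
# a "usage" hash, while the earlier chunks have "usage" set to nil.
def openai_usage(chunks)
  usage = chunks.map { |chunk| chunk["usage"] }.compact.last
  return nil unless usage

  { input: usage["prompt_tokens"], output: usage["completion_tokens"] }
end

# Anthropic: the message_start event reports input tokens, and message_delta
# events report the cumulative output token count.
def anthropic_usage(events)
  start = events.find { |event| event["type"] == "message_start" }
  last_delta = events.select { |event| event["type"] == "message_delta" }.last
  return nil unless start && last_delta

  {
    input: start.dig("message", "usage", "input_tokens"),
    output: last_delta.dig("usage", "output_tokens")
  }
end

# Made-up sample chunks, for illustration only.
openai_chunks = [
  { "choices" => [{ "delta" => { "content" => "Hi" } }], "usage" => nil },
  { "choices" => [], "usage" => { "prompt_tokens" => 9, "completion_tokens" => 12, "total_tokens" => 21 } }
]
anthropic_events = [
  { "type" => "message_start", "message" => { "usage" => { "input_tokens" => 25, "output_tokens" => 1 } } },
  { "type" => "content_block_delta", "delta" => { "type" => "text_delta", "text" => "Hi" } },
  { "type" => "message_delta", "delta" => { "stop_reason" => "end_turn" }, "usage" => { "output_tokens" => 15 } }
]

puts openai_usage(openai_chunks).inspect
puts anthropic_usage(anthropic_events).inspect
```

Both helpers return nil when no usage data arrived, which leaves room for falling back to a local token estimate.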
We can now use SQL to sum the total tokens used during a period, but we want price. I think we do a migration to add a price_per_token column to the language_models table. We can populate the value for all of our language models from these references: openai and anthropic

Back in get_next_ai_message_job, where we are setting token_count, we should also do the math and set message.price

Then I think we can display it somewhere on the person/edit page, which is views/settings/people/_form.html.erb

I think the query will just be Message.created_after(Date.current.beginning_of_month).sum(:price)
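A plain-Ruby sketch of the last two steps: pricing one message from its token counts, then summing the current month. The structs are hypothetical stand-ins for the Rails models. Because providers price input and output tokens differently, the sketch assumes two price fields per model rather than a single price_per_token.

```ruby
require "date"

# Hypothetical stand-ins for the language_models and messages tables.
ModelPrice = Struct.new(:name, :input_usd_per_1m, :output_usd_per_1m)
MessageUsage = Struct.new(:created_on, :model, :input_tokens, :output_tokens) do
  # Price of generating this one message, as an exact Rational in USD.
  def price_usd
    Rational(
      input_tokens * model.input_usd_per_1m + output_tokens * model.output_usd_per_1m,
      1_000_000
    )
  end
end

# Sums the price of every message created from the 1st of the month through today.
def month_to_date_spend(messages, today)
  month_start = Date.new(today.year, today.month, 1)
  messages
    .select { |m| m.created_on >= month_start && m.created_on <= today }
    .sum(Rational(0), &:price_usd)
end

gpt4o = ModelPrice.new("gpt-4o", 5, 15) # prices quoted in the comments
messages = [
  MessageUsage.new(Date.new(2024, 6, 3), gpt4o, 300, 100),
  MessageUsage.new(Date.new(2024, 5, 30), gpt4o, 500, 100) # previous month, excluded
]
puts month_to_date_spend(messages, Date.new(2024, 6, 10)).to_f
```

In the app itself the same total would come from the single SQL sum over message.price shown above.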
This is a Python token cost library that may provide a useful reference
I doubt this price we are tracking will be perfect so we'll display it as an estimated price to the user. It looks like we may need to do some additional calculations for function calling. This should probably be a subsequent PR, but some notes I've collected: