# Display Token usage #418
This will require #454 to land first, so let's keep it in the backlog for now.

---
On initial investigation, the tokens used are listed neither in the request nor in the response from the LLM.

Request:

```json
{
  "messages": [...],
  "model": "gpt-4o",
  "temperature": 0.1,
  "top_p": 1,
  "max_tokens": 4096,
  "n": 1,
  "stream": true
}
```

The response likewise carries no usage information.

There are two alternatives:

---
I have been playing around with the APIs, and it's possible for all providers: all of them include the token usage automatically when the request is non-streaming. For streaming we need to request it explicitly, except for Anthropic, which already includes it starting with the first chunk.

### Anthropic

The token usage arrives split across two chunks: one at the beginning of the stream and one at the end.

First chunk:

```json
{
  "type": "message_start",
  "message": {
    "id": "msg_011itXmqtd7KHB6adpbDdwWX",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-5-sonnet-20241022",
    "content": [],
    "stop_reason": null,
    "stop_sequence": null,
    "usage": {
      "input_tokens": 10,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 0,
      "output_tokens": 1
    }
  }
}
```

Last chunk:

```json
{
  "type": "message_delta",
  "delta": {
    "stop_reason": "end_turn",
    "stop_sequence": null
  },
  "usage": {
    "output_tokens": 13
  }
}
```
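To combine the two chunks, a consumer takes `input_tokens` from the `message_start` event and the final `output_tokens` from the `message_delta` event. A minimal Python sketch (the function name and the parsed-event input are assumptions based on the payloads above, not code from this repo):

```python
from typing import Iterable


def anthropic_stream_usage(events: Iterable[dict]) -> dict:
    """Accumulate token usage from the two usage-bearing Anthropic events.

    `events` is assumed to be the already-parsed JSON payloads of the SSE
    stream, shaped like the chunks shown above.
    """
    usage = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "message_start":
            # First chunk: the prompt-side count lives under message.usage.
            usage["input_tokens"] = event["message"]["usage"].get("input_tokens", 0)
        elif event.get("type") == "message_delta":
            # Last chunk: carries the final output token count.
            usage["output_tokens"] = event.get("usage", {}).get("output_tokens", 0)
    return usage
```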
### OpenAI, Ollama, vLLM

We need to request the token usage explicitly when the request is streaming, which it is most of the time for clients. Note the `stream_options` field in the request:

```bash
curl -s -X POST "<api>/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <token>" \
-d '{
"model": "unsloth/Qwen2.5-Coder-32B-Instruct",
"stream": true,
"stream_options": {"include_usage": true},
"messages": [{"role": "user", "content": "Hello, world"}]
}'
```

The response then includes the token usage in its last chunk, which arrives after the chunk carrying the final `finish_reason` and has an empty `choices` array:

```json
{
  "id": "chatcmpl-4933d74a8f8b4a82a855439eeab1ae3d",
  "object": "chat.completion.chunk",
  "created": 1737723773,
  "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
  "choices": [],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 42,
    "completion_tokens": 10
  }
}
```
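On the consuming side, the usage chunk can be picked out of the stream by looking for the chunk whose `choices` array is empty. A rough Python sketch (the helper name is hypothetical; OpenAI-compatible servers frame streamed chunks as `data: <json>` SSE lines terminated by `data: [DONE]`):

```python
import json
from typing import Iterable, Optional


def stream_usage_from_sse(lines: Iterable[str]) -> Optional[dict]:
    """Return the `usage` object from an OpenAI-style streamed completion.

    Only present when the request was sent with
    `"stream_options": {"include_usage": true}`.
    """
    for line in lines:
        if not line.startswith("data: ") or line.strip() == "data: [DONE]":
            continue
        chunk = json.loads(line[len("data: "):])
        # The usage-bearing chunk is the one with an empty `choices` array.
        if chunk.get("usage") is not None and not chunk.get("choices"):
            return chunk["usage"]
    return None
```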
---

Can we display the number of tokens used by any given provider? This would be useful for the new Copilot free tier.

An extra would be to record the token usage per conversation. This would give users insight into which prompts are more costly and allow them to optimize; a rough sketch of what that could look like follows.
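A minimal sketch of per-conversation recording (all names are hypothetical; the keys assume the OpenAI-style `prompt_tokens`/`completion_tokens` fields, so Anthropic's `input_tokens`/`output_tokens` would need mapping):

```python
from collections import defaultdict

# conversation_id -> running token totals (hypothetical structure).
_usage_by_conversation: defaultdict[str, dict[str, int]] = defaultdict(
    lambda: {"prompt_tokens": 0, "completion_tokens": 0}
)


def record_usage(conversation_id: str, usage: dict) -> None:
    """Add one response's usage to the conversation's running totals."""
    totals = _usage_by_conversation[conversation_id]
    totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
    totals["completion_tokens"] += usage.get("completion_tokens", 0)
```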
Kudos to @craigmcl for the idea.