
Display Token usage #418

Open
lukehinds opened this issue Dec 19, 2024 · 3 comments

Comments

@lukehinds
Contributor

lukehinds commented Dec 19, 2024

Can we display the number of tokens used by any given provider? This would be useful for the new Copilot free tier.

A further enhancement would be to record token usage per conversation. This would give users insight into which prompts are more costly and allow them to optimize.

Kudos to @craigmcl for the idea.

@lukehinds
Contributor Author

This will require #454 to land first, so let's keep it in the backlog for now.

@aponcedeleonch
Contributor

aponcedeleonch commented Jan 24, 2025

From an initial investigation, the used tokens are listed neither in the request nor in the response from the LLM.

Request

{
  "messages": [...],
  "model": "gpt-4o",
  "temperature": 0.1,
  "top_p": 1,
  "max_tokens": 4096,
  "n": 1,
  "stream": true
}

max_tokens: The maximum number of tokens that can be generated in the chat completion. (Reference)

Response

[
"{\"id\":\"\",\"created\":0,\"model\":\"\",\"object\":\"chat.completion.chunk\",\"choices\":[]}", 
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\",\"role\":\"assistant\"}}]}", 
....
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"delta\":{\"role\":\"assistant\"}}]}"
]

There are two alternatives:

  1. See if there's a way the LLM providers list in their response the tokens they have used. At first glance this looks to be possible, at least for OpenAI.
  2. Use our own tokenizer. We could tokenize the request and response ourselves and calculate the number of used tokens that way. The big drawback is that the tokens we calculate with the tokenizer may not match the tokens counted by the LLM, but at least it would be an approximation (see the sketch after this list).
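
As a rough illustration of option 2, here is a minimal sketch using the tiktoken library. The helper name is hypothetical and the fallback encoding is an assumption; the counts are only an approximation of the provider's own accounting.

# Minimal sketch of option 2: approximate token counts with tiktoken.
import tiktoken

def approximate_usage(model: str, messages: list[dict], completion: str) -> dict:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model: fall back to a common encoding (assumption).
        enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    completion_tokens = len(enc.encode(completion))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }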

@aponcedeleonch
Contributor

aponcedeleonch commented Jan 24, 2025

I have been playing around with the APIs. It's possible for all providers. All of them include the token usage automatically if the request is non-streaming. For streaming we need to request it explicitly, except for Anthropic, which already includes it starting with the first chunk.

Anthropic

The token usage comes split across two chunks: one at the beginning and another at the end.

// First chunk
{
  "type": "message_start",
  "message": {
    "id": "msg_011itXmqtd7KHB6adpbDdwWX",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-5-sonnet-20241022",
    "content": [],
    "stop_reason": null,
    "stop_sequence": null,
    "usage": {
      "input_tokens": 10,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 0,
      "output_tokens": 1
    }
  }
}

// Last chunk
{
  "type": "message_delta",
  "delta": {
    "stop_reason": "end_turn",
    "stop_sequence": null
  },
  "usage": {
    "output_tokens": 13
  }
}
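
A minimal sketch of how we could combine those two chunks, assuming events is an iterable of already-parsed JSON event dicts (the function name is hypothetical):

# Sketch: combine Anthropic streaming usage from the first and last chunks.
def anthropic_usage(events) -> dict:
    usage = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "message_start":
            # First chunk (message_start) carries the input token count.
            usage["input_tokens"] = event["message"]["usage"]["input_tokens"]
        elif event.get("type") == "message_delta" and "usage" in event:
            # Last chunk (message_delta) carries the final output token count.
            usage["output_tokens"] = event["usage"]["output_tokens"]
    return usage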

OpenAI, Ollama, VLLM

We need to explicitly request the token usage when the request is set to streaming, which is what clients do most of the time. Note the stream_options field in the following example request:

curl -s -X POST "<api>/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{
        "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
        "stream": true,
        "stream_options": {"include_usage": true},
        "messages": [{"role": "user", "content": "Hello, world"}]
    }'

The response includes the token usage in the last chunk, which comes after the chunk with finish_reason: "stop".

{
  "id": "chatcmpl-4933d74a8f8b4a82a855439eeab1ae3d",
  "object": "chat.completion.chunk",
  "created": 1737723773,
  "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
  "choices": [],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 42,
    "completion_tokens": 10
  }
}
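
For reference, the same request through the openai Python SDK: with include_usage set, intermediate chunks carry usage=None and only the final chunk is populated. The base_url, token, and model below are placeholders matching the curl example above.

# Sketch: request and read streaming token usage via the openai SDK.
from openai import OpenAI

client = OpenAI(base_url="<api>/v1", api_key="<token>")
stream = client.chat.completions.create(
    model="unsloth/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Hello, world"}],
    stream=True,
    stream_options={"include_usage": True},
)
usage = None
for chunk in stream:
    if chunk.usage is not None:  # only the final chunk carries usage
        usage = chunk.usage
print(usage)  # prompt_tokens, completion_tokens, total_tokens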
