
Display Token usage #418

Open
lukehinds opened this issue Dec 19, 2024 · 3 comments

Comments

@lukehinds
Contributor

lukehinds commented Dec 19, 2024

Can we display the number of tokens used by any given provider? This would be useful for the new Copilot free tier.

A further enhancement would be to record token usage per conversation. This would give users insight into which prompts are more costly and allow them to optimize.

Kudos to @craigmcl for the idea.

@lukehinds
Contributor Author

This will require #454 to land first, so let's keep it in the backlog for now.

@aponcedeleonch
Contributor

aponcedeleonch commented Jan 24, 2025

From an initial investigation, the used tokens are listed neither in the request nor in the response from the LLM.

Request

{
  "messages": [...],
  "model": "gpt-4o",
  "temperature": 0.1,
  "top_p": 1,
  "max_tokens": 4096,
  "n": 1,
  "stream": true
}

max_tokens: The maximum number of tokens that can be generated in the chat completion. (Reference)

Response

[
"{\"id\":\"\",\"created\":0,\"model\":\"\",\"object\":\"chat.completion.chunk\",\"choices\":[]}", 
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\",\"role\":\"assistant\"}}]}", 
....
"{\"id\":\"chatcmpl-Ao5A9Sf7Q6WB751oF5OpU7Wmwcfv4\",\"created\":1736499609,\"model\":\"gpt-4o-2024-05-13\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"finish_reason\":\"stop\",\"index\":0,\"delta\":{\"role\":\"assistant\"}}]}"
]

There are two alternatives:

  1. See if there's a way the LLM providers list in their response the tokens they have used. At first glance this looks to be possible, at least for OpenAI.
  2. Use our own tokenizer. We could tokenize the request and response ourselves and calculate the number of used tokens that way. The big drawback is that the tokens we calculate with the tokenizer may not match the tokens counted by the LLM, but at least it would be an approximation (see the sketch after this list).
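
As a rough illustration of option 2, here is a minimal sketch using the tiktoken library. The helper name is hypothetical and the fallback encoding is an assumption; the counts are only an approximation of the provider's own accounting.

# Minimal sketch of option 2: approximate token counts with tiktoken.
import tiktoken

def approximate_usage(model: str, messages: list[dict], completion: str) -> dict:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model: fall back to a common encoding (assumption).
        enc = tiktoken.get_encoding("cl100k_base")
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    completion_tokens = len(enc.encode(completion))
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }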

@aponcedeleonch
Contributor

aponcedeleonch commented Jan 24, 2025

I have been playing around with the APIs. It's possible for all providers. All of them include the token usage automatically if the request is non-streaming. For streaming we need to request it explicitly, except for Anthropic, which already includes it starting with the first chunk.

Anthropic

The token usage comes split across two chunks: one at the beginning and another at the end.

// First chunk
{
  "type": "message_start",
  "message": {
    "id": "msg_011itXmqtd7KHB6adpbDdwWX",
    "type": "message",
    "role": "assistant",
    "model": "claude-3-5-sonnet-20241022",
    "content": [],
    "stop_reason": null,
    "stop_sequence": null,
    "usage": {
      "input_tokens": 10,
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 0,
      "output_tokens": 1
    }
  }
}

// Last chunk
{
  "type": "message_delta",
  "delta": {
    "stop_reason": "end_turn",
    "stop_sequence": null
  },
  "usage": {
    "output_tokens": 13
  }
}
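
A minimal sketch of how we could combine those two chunks, assuming events is an iterable of already-parsed JSON event dicts (the function name is hypothetical):

# Sketch: combine Anthropic streaming usage from the first and last chunks.
def anthropic_usage(events) -> dict:
    usage = {"input_tokens": 0, "output_tokens": 0}
    for event in events:
        if event.get("type") == "message_start":
            # First chunk (message_start) carries the input token count.
            usage["input_tokens"] = event["message"]["usage"]["input_tokens"]
        elif event.get("type") == "message_delta" and "usage" in event:
            # Last chunk (message_delta) carries the final output token count.
            usage["output_tokens"] = event["usage"]["output_tokens"]
    return usage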

OpenAI, Ollama, VLLM

We need to explicitly request the token usage when the request is set to streaming, which is what clients do most of the time. Note the stream_options field in the following example request:

curl -s -X POST "<api>/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer <token>" \
    -d '{
        "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
        "stream": true,
        "stream_options": {"include_usage": true},
        "messages": [{"role": "user", "content": "Hello, world"}]
    }'

The response includes the token usage in the last chunk, which comes after the chunk with finish_reason: "stop".

{
  "id": "chatcmpl-4933d74a8f8b4a82a855439eeab1ae3d",
  "object": "chat.completion.chunk",
  "created": 1737723773,
  "model": "unsloth/Qwen2.5-Coder-32B-Instruct",
  "choices": [],
  "usage": {
    "prompt_tokens": 32,
    "total_tokens": 42,
    "completion_tokens": 10
  }
}
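
For reference, the same request through the openai Python SDK: with include_usage set, intermediate chunks carry usage=None and only the final chunk is populated. The base_url, token, and model below are placeholders matching the curl example above.

# Sketch: request and read streaming token usage via the openai SDK.
from openai import OpenAI

client = OpenAI(base_url="<api>/v1", api_key="<token>")
stream = client.chat.completions.create(
    model="unsloth/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Hello, world"}],
    stream=True,
    stream_options={"include_usage": True},
)
usage = None
for chunk in stream:
    if chunk.usage is not None:  # only the final chunk carries usage
        usage = chunk.usage
print(usage)  # prompt_tokens, completion_tokens, total_tokens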
