read cached content #750
Comments
Hi @Mrxiangli, do you mean the cache for JIT compilation, or the KV cache?
Thank you for the quick response. I mean the KV cache. For example, if I want to gather the KV cache of a request that has finished prefill and migrate it to another GPU, is there an API to retrieve the related KV cache?
Got it. Not yet, but it should be easy to support (e.g. using Triton, or pure PyTorch if you don't care about performance). Would you mind creating a PR to support it? We can put this under https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/page.py
The semantics are:
# k_cache: [num_pages, page_size, num_heads, head_dim] if under NHD layout
# v_cache: [num_pages, page_size, num_heads, head_dim] if under NHD layout
# output_k: [kv_len, num_heads, head_dim]
# output_v: [kv_len, num_heads, head_dim]
for i in range(kv_len[request_idx]):
    # The pages of this request start at kv_indptr[request_idx] in kv_page_indices.
    page_idx = kv_page_indices[kv_indptr[request_idx] + i // page_size]
    pos_idx = i % page_size
    output_k[i] = k_cache[page_idx, pos_idx]
    output_v[i] = v_cache[page_idx, pos_idx]
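For anyone who wants to check the index arithmetic before writing a kernel, here is a minimal dependency-free sketch of the semantics above. It is not an existing FlashInfer API: the function names `request_token_slots` and `gather_request_kv` are hypothetical, and it assumes `kv_indptr[request_idx]` marks where the request's pages start in `kv_page_indices` and that `kv_len` holds one token count per request. Nested lists stand in for the NHD-layout tensors.

```python
from typing import List, Sequence, Tuple


def request_token_slots(
    kv_indptr: Sequence[int],
    kv_page_indices: Sequence[int],
    kv_len: Sequence[int],
    page_size: int,
    request_idx: int,
) -> List[Tuple[int, int]]:
    """Return (page_idx, pos_idx) for every cached token of one request."""
    start = kv_indptr[request_idx]      # first page entry owned by the request
    end = kv_indptr[request_idx + 1]    # one past its last page entry
    n_tokens = kv_len[request_idx]
    # Sanity check: the request's pages must be able to hold all of its tokens.
    if n_tokens > (end - start) * page_size:
        raise ValueError("kv_len exceeds the capacity of the request's pages")
    slots = []
    for i in range(n_tokens):
        page_idx = kv_page_indices[start + i // page_size]
        pos_idx = i % page_size
        slots.append((page_idx, pos_idx))
    return slots


def gather_request_kv(k_cache, v_cache, slots):
    """Copy the request's K/V entries out of the paged caches, in token order."""
    output_k = [k_cache[p][s] for p, s in slots]
    output_v = [v_cache[p][s] for p, s in slots]
    return output_k, output_v


# Toy data: 4 pages of 2 slots each. Every entry is a string label standing in
# for a [num_heads, head_dim] block.
page_size = 2
k_cache = [[f"k{p}.{s}" for s in range(page_size)] for p in range(4)]
v_cache = [[f"v{p}.{s}" for s in range(page_size)] for p in range(4)]
kv_indptr = [0, 2, 4]           # request 0 owns entries [0:2], request 1 owns [2:4]
kv_page_indices = [3, 0, 2, 1]  # physical page ids, in logical order per request
kv_len = [3, 4]                 # request 0 only half-fills its last page

slots = request_token_slots(kv_indptr, kv_page_indices, kv_len, page_size, 0)
output_k, output_v = gather_request_kv(k_cache, v_cache, slots)
print(slots)
print(output_k)
```

With real PyTorch tensors, the same slots could be split into two index tensors (`page_ids`, `pos_ids`), and `k_cache[page_ids, pos_ids]` would do the gather through advanced indexing, producing a `[kv_len, num_heads, head_dim]` tensor. That is the simple, unoptimized route; a Triton kernel would be the faster option mentioned above.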
Thank you for the answers! I will create a PR for this.
Thank you for the great project. Is there any API that can be used to read the cached content?