
read cached content #750

Open
Mrxiangli opened this issue Jan 24, 2025 · 4 comments

@Mrxiangli

Thank you for the great project. Is there any API that can be used to read the cached content?

@yzh119
Collaborator

yzh119 commented Jan 24, 2025

Hi @Mrxiangli, do you mean the cache for JIT compilation, or the KV cache?

@Mrxiangli
Author

Thank you for the quick response. I mean the KV cache. For example, if I want to gather the KV cache of a request that has finished prefill and migrate it to another GPU, is there an API to help retrieve the related KV cache?

@yzh119
Collaborator

yzh119 commented Jan 24, 2025

Got it. Not yet, but it should be easy to support (e.g., using Triton, or pure PyTorch if you don't care about performance). Would you mind creating a PR to support it? We can put it under https://github.com/flashinfer-ai/flashinfer/blob/main/flashinfer/page.py

The semantics are:

# k_cache: [num_pages, page_size, num_heads, head_dim] under NHD layout
# v_cache: [num_pages, page_size, num_heads, head_dim] under NHD layout
# output_k: [kv_len, num_heads, head_dim]
# output_v: [kv_len, num_heads, head_dim]

for i in range(kv_len[request_idx]):
    # kv_indptr[request_idx] is the start of this request's page list;
    # i // page_size selects the page, i % page_size the slot within it.
    page_idx = kv_page_indices[kv_indptr[request_idx] + i // page_size]
    pos_idx = i % page_size
    output_k[i] = k_cache[page_idx, pos_idx]
    output_v[i] = v_cache[page_idx, pos_idx]
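For reference, the per-token loop above can be vectorized with fancy indexing. This is only an illustrative sketch (NumPy here for clarity; the real implementation would operate on torch tensors), and the function name `gather_kv` and its signature are hypothetical, not an existing flashinfer API:

```python
import numpy as np

def gather_kv(k_cache, v_cache, kv_page_indices, kv_indptr, kv_len,
              request_idx, page_size):
    """Gather one request's K/V entries from a paged cache (NHD layout).

    k_cache / v_cache: [num_pages, page_size, num_heads, head_dim]
    Returns output_k / output_v of shape [kv_len, num_heads, head_dim].
    """
    n = int(kv_len[request_idx])
    positions = np.arange(n)
    # Map each token position to (page, slot-within-page) coordinates,
    # offset by the start of this request's page list in kv_page_indices.
    page_idx = kv_page_indices[kv_indptr[request_idx] + positions // page_size]
    pos_idx = positions % page_size
    # Advanced indexing gathers all tokens at once instead of looping.
    output_k = k_cache[page_idx, pos_idx]
    output_v = v_cache[page_idx, pos_idx]
    return output_k, output_v
```

A tiny usage example: with page_size=2 and a request whose pages are [2, 0], token 0 and 1 come from page 2 and token 2 from slot 0 of page 0.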

@Mrxiangli
Author

Thank you for the answer! I will create a PR for this.
