initial gRPC spec changes for supporting index creation and querying #8829

zehiko · 2025-01-27T15:39:25Z

What

Expose creating and querying an index over an entire collection of data.

We add support for:

vector index
inverted index for full text search
btree index

Query data is always represented as a record batch with exactly one column, which carries the query data.
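
To make the single-column convention concrete, a request could carry the encoded record batch roughly as sketched below. This is an illustration only: the package, the message name `QueryIndexSketchRequest`, and both fields are hypothetical and not taken from this PR's spec. The per-index reading of the column in the comments is also an assumption.

```proto
syntax = "proto3";

package sketch.v0;

// Hypothetical request shape, NOT the message defined in this PR.
message QueryIndexSketchRequest {
  // Collection whose index is being queried (assumed field).
  string collection = 1;

  // Encoded record batch that must contain exactly one column.
  // Assumed interpretation of that column per index type:
  //   vector index   -> query embedding(s)
  //   inverted index -> full text search term(s)
  //   btree index    -> lookup value(s)
  bytes query_batch = 2;
}
```

Keeping the payload as one opaque batch would let all three index types share a single request shape, with the server checking the column's type against the index being queried.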

crates/store/re_protos/proto/rerun/v0/remote_store.proto

grtlr · 2025-01-28T14:51:29Z

crates/store/re_protos/proto/rerun/v0/remote_store.proto

@@ -9,6 +9,16 @@ service StorageNode {
    rpc Query(QueryRequest) returns (stream DataframePart) {}
    rpc FetchRecording(FetchRecordingRequest) returns (stream rerun.common.v0.RerunChunk) {}

+    rpc IndexCollection(IndexCollectionRequest) returns (IndexCollectionResponse) {}


Right now everything lives under one service, StorageNode. I wonder if it would be cleaner to create several services that group similar calls logically, maybe something like CatalogService, IndexService, ...? In the future it might be nice to decide at a fine-grained level which services to spin up, for example if we want to distribute load across multiple VMs.

There is still a tight relation between the catalog, the collection, the collection query path, and the collection index query path. How we split it definitely requires a bit of thinking. I'll create an issue, but I won't tackle this as part of this story.
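
For reference, the kind of split suggested above might look like the sketch below. Only the service names `CatalogService` and `IndexService` and the `IndexCollection` rpc come from this thread. The `ListCollections` rpc and all message bodies are hypothetical placeholders, and nothing here reflects a decided design.

```proto
syntax = "proto3";

package sketch.v0;

// Hypothetical: catalog-related calls grouped in their own service.
service CatalogService {
  // Placeholder rpc, not part of this PR.
  rpc ListCollections(ListCollectionsRequest) returns (ListCollectionsResponse) {}
}

// Hypothetical: index creation (and later querying) grouped separately,
// so it could be deployed and scaled on its own.
service IndexService {
  rpc IndexCollection(IndexCollectionRequest) returns (IndexCollectionResponse) {}
}

// Empty placeholder messages so the sketch is self-contained.
message ListCollectionsRequest {}
message ListCollectionsResponse {}
message IndexCollectionRequest {}
message IndexCollectionResponse {}
```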

jleibs · 2025-01-28T14:59:37Z

crates/store/re_protos/proto/rerun/v0/remote_store.proto

@@ -9,6 +9,16 @@ service StorageNode {
    rpc Query(QueryRequest) returns (stream DataframePart) {}
    rpc FetchRecording(FetchRecordingRequest) returns (stream rerun.common.v0.RerunChunk) {}

+    rpc IndexCollection(IndexCollectionRequest) returns (IndexCollectionResponse) {}


In addition to CreateIndex, we probably want an API for ReIndex that only requires the collection + ColumnDescriptor but doesn't need the parameters.

Tracking which recordings are in the index so we know whether reindexing is necessary would be nice to think about, but not necessary yet.

https://github.com/rerun-io/dataplatform/issues/156

This will be tackled in a first follow-up.
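
A minimal sketch of what such a ReIndex call could look like, assuming it takes only the collection and the column descriptor and reuses the parameters stored when the index was created. The service name, the message shapes, and the fields of the placeholder `ColumnDescriptor` are hypothetical. The real definition is left to the follow-up.

```proto
syntax = "proto3";

package sketch.v0;

// Hypothetical service used only for this sketch.
service IndexSketch {
  // Rebuilds an existing index. No index parameters are passed;
  // the assumption is that the server reuses those from creation time.
  rpc ReIndex(ReIndexRequest) returns (ReIndexResponse) {}
}

// Placeholder standing in for the ColumnDescriptor mentioned in the review.
message ColumnDescriptor {
  string component_name = 1;  // assumed field
}

message ReIndexRequest {
  string collection = 1;        // collection whose index is rebuilt (assumed field)
  ColumnDescriptor column = 2;  // identifies the indexed column
}

message ReIndexResponse {}
```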

…le with IndedColumn

This reverts commit 8ab34a8.

zehiko added 4 commits January 24, 2025 11:32

WIP

5048299

rebase

c58ae63

export ComponentName from re_dataframe

23ca90a

grpc spec definition improvements

da5b224

zehiko added the exclude from changelog (PRs with this won't show up in CHANGELOG.md) and remote-store (remote store gRPC API) labels Jan 27, 2025

zehiko requested a review from jleibs January 27, 2025 15:39

zehiko self-assigned this Jan 27, 2025

zehiko marked this pull request as draft January 27, 2025 15:39

few more updates to the spec

463ae84