Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache object stores per scheme + bucket #99

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aykut-bozkurt
Copy link
Collaborator

@aykut-bozkurt aykut-bozkurt commented Jan 16, 2025

In this PR, we cache object stores per scheme (e.g. s3) + bucket (or container) combination in each Postgres session. This will reduce authentication costs by only doing it at the first time.

object_store does not perform sts assume_role to get temp token, so pg_parquet make use of aws sdk to perform it. And then configure object_store with the temp token that it fetched. This is why, pg_parquet checks expiration of the tokens for s3 store. It will fetch the temp token if it expires. (obviously if you configured temp token auth via config)

For azure blob store, we do not need to recreate tokens because object_store handles it.

Closes #93

@@ -4,5 +4,6 @@ trap "echo 'Caught termination signal. Exiting...'; exit 0" SIGINT SIGTERM

# create azurite container
az storage container create -n $AZURE_TEST_CONTAINER_NAME --connection-string $AZURE_STORAGE_CONNECTION_STRING
az storage container create -n ${AZURE_TEST_CONTAINER_NAME}2 --connection-string $AZURE_STORAGE_CONNECTION_STRING
Copy link
Collaborator Author

@aykut-bozkurt aykut-bozkurt Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created one more container to test object store cache per (schema, container) pair.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 39bbefe to 17d4aa8 Compare January 16, 2025 20:19
let expiration_warn_msg =
format!("credentials for {bucket} expired at {expire_at:?}");

let expiration_hint = "New credentials will be automatically retrieved if configured.\n\
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if new credentials will be automatically retrieved, do we need to throw a warning?

Copy link
Collaborator Author

@aykut-bozkurt aykut-bozkurt Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wanted to give hint to refresh the session but seems like a bad spot. (retry should not warn if it succeeds)


if let Some(item) = self.cache.get(&key) {
if item.expired(&key.bucket) {
self.cache.remove(&key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this release all the memory associated with the object store?

Copy link
Collaborator Author

@aykut-bozkurt aykut-bozkurt Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, impl Drop is called for each field in the structs.

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-with-larger-scale branch from b8f9061 to 80f9a11 Compare January 22, 2025 14:00
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 17d4aa8 to 9d90243 Compare January 22, 2025 14:04
Copy link
Collaborator

@marcoslot marcoslot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doing it once per session is still not very ideal. Is there an easy way to share across sessions?

@aykut-bozkurt aykut-bozkurt force-pushed the aykut/numeric-with-larger-scale branch 3 times, most recently from b8856c4 to 852b6af Compare January 24, 2025 14:34
Base automatically changed from aykut/numeric-with-larger-scale to main January 24, 2025 14:43
In this PR, we cache object stores per scheme (e.g. s3) + bucket (container) combination
in each Postgres session. This will reduce authentication costs by only doing it at the first
time.

object_store does not perform sts assume_role to get temp token, so pg_parquet
make use of aws sdk to perform it. And then configure object_store with the temp
token that it fetched. This is why, pg_parquet checks expiration of the tokens for s3 store.
It will fetch the temp token if it expires. (obviously if you configured temp token auth via config)

For azure blob store, we do not need to recreate tokens because object_store handles it.
@aykut-bozkurt aykut-bozkurt force-pushed the aykut/cache-object-stores branch from 9d90243 to a493016 Compare January 24, 2025 14:46
Copy link

codecov bot commented Jan 24, 2025

Codecov Report

Attention: Patch coverage is 96.15385% with 13 lines in your changes missing coverage. Please review.

Project coverage is 91.88%. Comparing base (fa728df) to head (a493016).

Files with missing lines Patch % Lines
src/object_store/object_store_cache.rs 89.76% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #99      +/-   ##
==========================================
+ Coverage   91.82%   91.88%   +0.06%     
==========================================
  Files          77       77              
  Lines       10023    10320     +297     
==========================================
+ Hits         9204     9483     +279     
- Misses        819      837      +18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aykut-bozkurt
Copy link
Collaborator Author

doing it once per session is still not very ideal. Is there an easy way to share across sessions?

Not easily doable as we have ObjectStore object in our cache item, which internally makes heap allocations. (that prevents us from allocating it at Postgres shared memory)

We may try to persist temp tokens across a postgres shared memory hash map. Object stores (client connections inside them) would be created per PG session (once per bucket) but we would reuse the temp tokens across the sessions.

TL;DR; seems hard to share connections but we may try to share temp tokens across the sessions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cache object stores per postgres session
2 participants