
BlobUploader utilities to enable handling of large data in instrumentation #3122

Open

michaelsafyan wants to merge 17 commits into main

Conversation


@michaelsafyan commented on Dec 19, 2024

Description

Provides an experimental library for uploading signals data to blob storage, as a proof of concept to help inform the direction of instrumentation that handles request/response data, with a focus on multimodal GenAI.

Discussion related to this PR:

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Wrote unit tests for the relevant files added.

Does This PR Require a Core Repo Change?

Unsure.

Checklist:

See contributing.md for the style guide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@michaelsafyan changed the title from "[DRAFT] BlobUploader utilities to enable handling of large data in instrumentation" to "BlobUploader utilities to enable handling of large data in instrumentation" on Jan 13, 2025
@michaelsafyan (Author)

Ran tox, but this concerningly attempted to run sudo; I quit the tool at that point. Will file a bug for this behavior.

@michaelsafyan (Author)

Looks like I'm already getting some review comments, so will convert from DRAFT to READY.

@michaelsafyan marked this pull request as ready for review on January 14, 2025
@samuelcolvin (Contributor) left a comment


I've reviewed most of the code and suggested type hints, plus a few simple improvements like f-strings.

But more generally, I suggest this should be fundamentally reconsidered:

The obvious and most scalable way to implement blob uploads today is using pre-signed URLs.

The idea would be:

  1. The client makes a GET or POST request to an endpoint provided by the backend with the file type (optional), name (optional), and size (you could implement support for more specific attributes like width/height etc.).
  2. The backend returns a URL (most likely an S3-style pre-signed URL) and a reference.
  3. The client posts the data to that URL and raises an error if the response is not 2XX.
  4. The client stores the reference in the OTel data.
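
A minimal sketch of the client side of this flow (the endpoint path, field names, and use of the `requests` library are illustrative assumptions, not part of any existing API):

```python
import requests


def upload_blob(backend_url: str, raw_bytes: bytes, content_type: str) -> str:
    """Upload a payload via a backend-issued pre-signed URL; return the reference."""
    # Step 1: ask the backend for an upload URL, sending optional metadata.
    resp = requests.post(
        f"{backend_url}/v1/blob-uploads",  # hypothetical endpoint
        json={"content_type": content_type, "size": len(raw_bytes)},
    )
    resp.raise_for_status()
    body = resp.json()

    # Step 2: the backend hands back a pre-signed URL plus an opaque reference.
    upload_url, reference = body["upload_url"], body["reference"]

    # Step 3: send the payload straight to object storage; fail on non-2XX.
    put = requests.put(
        upload_url, data=raw_bytes, headers={"Content-Type": content_type}
    )
    put.raise_for_status()

    # Step 4: the caller records `reference` in the OTel span/event attributes.
    return reference
```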

There are lots of advantages to this approach IMHO:

  • It means the client needs to implement ZERO logic related to different providers and object stores; it just gets a URL and posts data to it.
  • This pre-signed URL approach is already implemented by S3, GCS, and every other S3-compatible object store, so it should be pretty easy for backends to implement.
  • If backends want to implement things differently, they can; the client logic is completely independent of the signing method, destination URL, etc.

```python
        self._labels[k] = labels[k]

    @staticmethod
    def from_data_uri(uri: str, labels: Optional[dict] = None) -> "Blob":
```
@samuelcolvin (Contributor) commented on Jan 21, 2025:

This would be easier to extend if it were a classmethod that returned `cls(raw_bytes, content_type=content_type, labels=labels)`.

Alternatively, if this class shouldn't be subclassed, it should be marked as final.
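
For illustration, a subclass-friendly version might look like the following (a sketch; the inline base64 data-URI parsing is a stand-in for the PR's actual parsing logic):

```python
import base64
from typing import Optional


class Blob:
    def __init__(
        self,
        raw_bytes: bytes,
        content_type: Optional[str] = None,
        labels: Optional[dict] = None,
    ):
        self.raw_bytes = raw_bytes
        self.content_type = content_type
        self.labels = dict(labels or {})

    @classmethod
    def from_data_uri(cls, uri: str, labels: Optional[dict] = None) -> "Blob":
        # Using cls(...) rather than Blob(...) means a subclass calling
        # SubBlob.from_data_uri(...) gets back a SubBlob instance.
        header, _, payload = uri.partition(",")
        if not header.startswith("data:") or not header.endswith(";base64"):
            raise ValueError("expected a base64-encoded data URI")
        content_type = header[len("data:") : -len(";base64")]
        return cls(
            base64.b64decode(payload), content_type=content_type, labels=labels
        )
```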

Comment on lines +24 to +29

```python
    This object conceptually has the following properties:

      - raw_bytes: the actual data (payload) of the Blob
      - content_type: metadata about the content type (e.g. "image/jpeg")
      - labels: key/value data that can be used to identify and contextualize
        the object such as {"trace_id": "...", "span_id": "...", "filename": ...}
```
@samuelcolvin (Contributor):

this duplicates the docs on the properties.


```python
'traces/12345/spans/56789'
'traces/12345/spans/56789/events/0'
'traces/12345/spans/56789/events/some.event.name'
```
@samuelcolvin (Contributor):

what happens if we want to include some kind of customer or project reference in the path?
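
For instance, the scheme could accept an optional scope segment (hypothetical; neither the parameter nor the resulting format is in the PR):

```python
def blob_path(trace_id: str, span_id: str, scope: str = "") -> str:
    """Build a storage path; `scope` could carry a customer/project prefix,
    e.g. scope="projects/my-project" (hypothetical)."""
    path = f"traces/{trace_id}/spans/{span_id}"
    return f"{scope}/{path}" if scope else path
```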

"""Returns a variant of the Blob with the content type auto-detected if needed."""
if blob.content_type is not None:
return blob
content_type = detect_content_type(blob.raw_bytes)
@samuelcolvin (Contributor):

can't we infer the content type from the labels, instead of inspecting the bytes?
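
A sketch of that alternative, reusing the `Blob` and `detect_content_type` names from the snippet above (the "content_type" label key is an assumption):

```python
def _with_content_type(blob: Blob) -> Blob:
    """Returns a variant of the Blob with content_type filled in."""
    if blob.content_type is not None:
        return blob
    # Prefer a caller-supplied "content_type" label (assumed key) over
    # sniffing the payload; fall back to byte inspection only when absent.
    content_type = blob.labels.get("content_type") or detect_content_type(
        blob.raw_bytes
    )
    return Blob(blob.raw_bytes, content_type=content_type, labels=blob.labels)
```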
