
BlobUploader utilities to enable handling of large data in instrumentation #3122

Open

michaelsafyan wants to merge 17 commits into main

Conversation


@michaelsafyan commented on Dec 19, 2024

Description

Provides an experimental library for uploading signals data to blob storage, as a proof of concept to help inform the direction of instrumentation that handles request/response data, with a focus on multimodal GenAI.

Discussion related to this PR:

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

Wrote unit tests for the relevant files added.

Does This PR Require a Core Repo Change?

Unsure.

Checklist:

See contributing.md for the style guide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@michaelsafyan changed the title from "[DRAFT] BlobUploader utilities to enable handling of large data in instrumentation" to "BlobUploader utilities to enable handling of large data in instrumentation" on Jan 13, 2025
@michaelsafyan (Author)

Ran tox, but this concerningly attempted to run sudo; I quit the tool at that point. Will file a bug for this behavior.

@michaelsafyan (Author)

Looks like I'm already getting some review comments, so will convert from DRAFT to READY.

@michaelsafyan marked this pull request as ready for review on January 14, 2025
@samuelcolvin (Contributor) left a comment


I've reviewed most of the code and suggested type hints, plus a few simple improvements like f-strings.

But more generally, I suggest this should be fundamentally reconsidered:

The obvious and most scalable way to implement blob uploads today is using pre-signed URLs.

The idea would be:

  1. The client makes a GET or POST request to an endpoint provided by the backend with the file type (optional), name (optional), and size (you could implement support for more specific attributes like width/height etc.).
  2. The backend returns a URL (most likely an S3-style pre-signed URL) and a reference.
  3. The client posts the data to that URL and raises an error if the response is not 2XX.
  4. The client stores the reference in the OTel data.
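
A minimal sketch of the client side of this flow (the endpoint path, field names, and use of the `requests` library are illustrative assumptions, not part of any existing API):

```python
import requests


def upload_blob(backend_url: str, raw_bytes: bytes, content_type: str) -> str:
    """Upload a payload via a backend-issued pre-signed URL; return the reference."""
    # Step 1: ask the backend for an upload URL, sending optional metadata.
    resp = requests.post(
        f"{backend_url}/v1/blob-uploads",  # hypothetical endpoint
        json={"content_type": content_type, "size": len(raw_bytes)},
    )
    resp.raise_for_status()
    body = resp.json()

    # Step 2: the backend hands back a pre-signed URL plus an opaque reference.
    upload_url, reference = body["upload_url"], body["reference"]

    # Step 3: send the payload straight to object storage; fail on non-2XX.
    put = requests.put(
        upload_url, data=raw_bytes, headers={"Content-Type": content_type}
    )
    put.raise_for_status()

    # Step 4: the caller records `reference` in the OTel span/event attributes.
    return reference
```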

There are lots of advantages to this approach IMHO:

  • It means the client needs to implement ZERO logic related to different providers and object stores; it just gets a URL and posts data to it.
  • This pre-signed URL approach is already implemented by S3, GCS, and every other S3-compatible object store, so it should be pretty easy for backends to implement.
  • If backends want to implement things differently, they can; the client logic is completely independent of the signing method, destination URL, etc.

```python
        self._labels[k] = labels[k]

    @staticmethod
    def from_data_uri(uri: str, labels: Optional[dict] = None) -> "Blob":
```
@samuelcolvin (Contributor) commented on Jan 21, 2025:

This would be easier to extend if it were a classmethod that returned `cls(raw_bytes, content_type=content_type, labels=labels)`.

Alternatively, if this class shouldn't be subclassed, it should be marked as final.
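
For illustration, a subclass-friendly version might look like the following (a sketch; the inline base64 data-URI parsing is a stand-in for the PR's actual parsing logic):

```python
import base64
from typing import Optional


class Blob:
    def __init__(
        self,
        raw_bytes: bytes,
        content_type: Optional[str] = None,
        labels: Optional[dict] = None,
    ):
        self.raw_bytes = raw_bytes
        self.content_type = content_type
        self.labels = dict(labels or {})

    @classmethod
    def from_data_uri(cls, uri: str, labels: Optional[dict] = None) -> "Blob":
        # Using cls(...) rather than Blob(...) means a subclass calling
        # SubBlob.from_data_uri(...) gets back a SubBlob instance.
        header, _, payload = uri.partition(",")
        if not header.startswith("data:") or not header.endswith(";base64"):
            raise ValueError("expected a base64-encoded data URI")
        content_type = header[len("data:") : -len(";base64")]
        return cls(
            base64.b64decode(payload), content_type=content_type, labels=labels
        )
```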

Comment on lines +24 to +29

```python
    This object conceptually has the following properties:

      - raw_bytes: the actual data (payload) of the Blob
      - content_type: metadata about the content type (e.g. "image/jpeg")
      - labels: key/value data that can be used to identify and contextualize
        the object such as {"trace_id": "...", "span_id": "...", "filename": ...}
```
@samuelcolvin (Contributor):

this duplicates the docs on the properties.


```python
'traces/12345/spans/56789'
'traces/12345/spans/56789/events/0'
'traces/12345/spans/56789/events/some.event.name'
```
@samuelcolvin (Contributor):

what happens if we want to include some kind of customer or project reference in the path?
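
For instance, the scheme could accept an optional scope segment (hypothetical; neither the parameter nor the resulting format is in the PR):

```python
def blob_path(trace_id: str, span_id: str, scope: str = "") -> str:
    """Build a storage path; `scope` could carry a customer/project prefix,
    e.g. scope="projects/my-project" (hypothetical)."""
    path = f"traces/{trace_id}/spans/{span_id}"
    return f"{scope}/{path}" if scope else path
```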

"""Returns a variant of the Blob with the content type auto-detected if needed."""
if blob.content_type is not None:
return blob
content_type = detect_content_type(blob.raw_bytes)
@samuelcolvin (Contributor):

can't we infer the content type from the labels, instead of inspecting the bytes?
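
A sketch of that alternative, reusing the `Blob` and `detect_content_type` names from the snippet above (the "content_type" label key is an assumption):

```python
def _with_content_type(blob: Blob) -> Blob:
    """Returns a variant of the Blob with content_type filled in."""
    if blob.content_type is not None:
        return blob
    # Prefer a caller-supplied "content_type" label (assumed key) over
    # sniffing the payload; fall back to byte inspection only when absent.
    content_type = blob.labels.get("content_type") or detect_content_type(
        blob.raw_bytes
    )
    return Blob(blob.raw_bytes, content_type=content_type, labels=blob.labels)
```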
