-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BlobUploader utilities to enable handling of large data in instrumentation #3122
base: main
Are you sure you want to change the base?
Conversation
...telemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob_uploader.py
Show resolved
Hide resolved
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/content_type.py
Outdated
Show resolved
Hide resolved
Ran |
Looks like I'm already getting some review comments, so will convert from DRAFT to READY. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've review most of the code and suggested type hints, and a few simple improvements like f-strings.
But more generally I suggest this should be fundamentially reconsidered:
The obvious and most scalable way to implement blob uploads today is using pre-signed URLs.
The idea would be:
- the client makes a get or post request to an endpoint provided by the backend with the file type (optional), name (optional) and size (you could implement support for more specific attributes like width/height etc)
- The backend returns a URL (most likely an S3 style pre-signed URL), and a reference
- the client posts the data to that URL, raises and error if the response is not 2XX
- the client stores the reference in the OTEL data
There are lots of advantages of this approach IMHO:
- it means the client needs to implement ZERO logic related to different providers and object stores, it just gets a URL and posts data to it
- this pre-signed url approach is already implement by S3, GCS and ever other S3 compatible object store, so it should be pretty easy for backends to implement
- if backends want to implement things differently, they can, the client logic is completely independent of the signing method, destination URL etc.
self._labels[k] = labels[k] | ||
|
||
@staticmethod | ||
def from_data_uri(uri: str, labels: Optional[dict] = None) -> "Blob": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be easier to extend if this was a classmethod
that returned cls(raw_bytes, content_type=content_type, labels=labels)
.
Alternatively, if this class shouldn't be subclassed, it should be marked as final.
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob.py
Outdated
Show resolved
Hide resolved
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob.py
Outdated
Show resolved
Hide resolved
opentelemetry-instrumentation/src/opentelemetry/instrumentation/_blobupload/api/blob.py
Outdated
Show resolved
Hide resolved
This object conteptually has the following properties: | ||
|
||
- raw_bytes: the actual data (payload) of the Blob | ||
- content_type: metadata about the content type (e.g. "image/jpeg") | ||
- labels: key/value data that can be used to identify and contextualize | ||
the object such as {"trace_id": "...", "span_id": "...", "filename": ...} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this duplicates the docs on the properties.
|
||
'traces/12345/spans/56789' | ||
'traces/12345/spans/56789/events/0' | ||
'traces/12345/spans/56789/events/some.event.name' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if we want to include some kind of customer or project reference in the path?
...nstrumentation/src/opentelemetry/instrumentation/_blobupload/backend/google/gcs/_gcs_impl.py
Outdated
Show resolved
Hide resolved
"""Returns a variant of the Blob with the content type auto-detected if needed.""" | ||
if blob.content_type is not None: | ||
return blob | ||
content_type = detect_content_type(blob.raw_bytes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we infer the content type from the labels, instead of inspecting the bytes?
...entation/src/opentelemetry/instrumentation/_blobupload/utils/simple_blob_uploader_adaptor.py
Outdated
Show resolved
Hide resolved
...entation/src/opentelemetry/instrumentation/_blobupload/utils/simple_blob_uploader_adaptor.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Samuel Colvin <[email protected]>
Description
Provides an experimental library for uploading signals data to blob storage as a proof-of-concept to help inform direction of instrumentation that handles request/response data, with a focus on GenAI multimodal.
Related discussion to this PR:
Type of change
Please delete options that are not relevant.
How Has This Been Tested?
Wrote unit tests for the relevant files added.
Does This PR Require a Core Repo Change?
Unsure.
Checklist:
See contributing.md for styleguide, changelog guidelines, and more.