BlobUploader utilities to enable handling of large data in instrumentation #3122
base: main
Changes from all commits
4341b2e
924cd37
84fe250
dba3aea
8cd6ce1
9906a13
8a4362e
1667374
41b7eea
2b51a15
410099a
d147a79
0a3430e
587e61e
c25a6b8
a7bb5f5
7f88a2b
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
# Blob Uploader Library (Experimental) | ||
|
||
The Blob Uploader library provides an experimental way to | ||
"write-aside" large or sensitive payloads to a blob storage | ||
system, while retaining references to the written-aside destination | ||
in the operations backend where telemetry is being written. | ||
|
||
This is intended primarily for request/response logging, where | ||
typical telemetry backends may be unsuitable for storing the data, | ||
whether because of its size or for privacy reasons. GenAI multi-modal | ||
prompt/response logging is a particularly salient motivation for this | ||
feature, though general HTTP request/response logging is another | ||
case where it applies. | ||
|
||
## Usage: Instrumentation Library | ||
|
||
Instrumentation libraries should provide general hooks for handling | ||
requests/responses (or other large blobs) that should be only | ||
conditionally included in telemetry signals. The hooks should provide | ||
enough context to allow a user of the instrumentation library to | ||
conditionally choose what to do with the content including but not | ||
limited to: dropping, including in the telemetry signal, or writing | ||
to a BlobUploader and retaining a reference to the destination URI. | ||
|
||
For example: | ||
|
||
``` | ||
|
||
class RequestHook(abc.ABC): | ||
|
||
@abc.abstractmethod | ||
def handle_request(self, context, signal, request): | ||
pass | ||
|
||
|
||
class ResponseHook(abc.ABC): | ||
|
||
@abc.abstractmethod | ||
def handle_response(self, context, signal, response): | ||
pass | ||
|
||
|
||
class FooInstrumentationLibrary: | ||
|
||
def __init__(self, | ||
# ..., | ||
request_hook: Optional[RequestHook]=None, | ||
response_hook: Optional[ResponseHook]=None, | ||
# ... | ||
): | ||
|
||
... | ||
``` | ||
|
||
|
||
## Usage: User of Instrumentation Library | ||
|
||
Users of instrumentation libraries can use the Blob Uploader | ||
libraries to implement relevant request/response hooks. | ||
|
||
For example: | ||
|
||
``` | ||
from opentelemetry.instrumentation._blobupload.api import ( | ||
NOT_UPLOADED, | ||
Blob, | ||
BlobUploaderProvider, | ||
get_blob_uploader, | ||
set_blob_uploader_provider) | ||
|
||
|
||
class MyBlobUploaderRequestHook(RequestHook): | ||
# ... | ||
|
||
def handle_request(self, context, signal, request): | ||
if not self.should_upload(context): | ||
return | ||
use_case = self.select_use_case(context, signal) | ||
uploader = get_blob_uploader(use_case) | ||
blob = Blob( | ||
request.raw_bytes, | ||
content_type=request.content_type, | ||
labels=self.generate_blob_labels(context, signal, request)) | ||
uri = uploader.upload_async(blob) | ||
if uri == NOT_UPLOADED: | ||
return | ||
signal.attributes[REQUEST_ATTRIBUTE] = uri | ||
|
||
# ... | ||
|
||
class MyBlobUploaderProvider(BlobUploaderProvider): | ||
|
||
def get_blob_uploader(self, use_case=None): | ||
# ... | ||
|
||
|
||
def main(): | ||
set_blob_uploader_provider(MyBlobUploaderProvider()) | ||
instrumentation_library = FooInstrumentationLibrary( | ||
# ..., | ||
request_hook=MyBlobUploaderRequestHook(), | ||
# ... | ||
) | ||
# ... | ||
|
||
``` | ||
|
||
## Future Work | ||
|
||
As can be seen from the above usage examples, there is quite a | ||
bit of common boilerplate both for instrumentation libraries (e.g. | ||
defining the set of hook interfaces) and for consumers of those | ||
instrumentation libraries (e.g. implementing variants of those hook | ||
interfaces that make use of the BlobUploader libraries). | ||
|
||
A potential future improvement would be to define a common set of | ||
hook interfaces for this use case that can be reused across | ||
instrumentation libraries and to provide simple drop-in | ||
implementations of those hooks that make use of BlobUploader. | ||
|
||
Beyond this, boilerplate to define a custom 'BlobUploaderProvider' | ||
could be reduced by expanding the capabilities of the default | ||
provider, so that most common uses are covered with a minimal | ||
set of environment variables (if optional deps are present). |
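
As a rough illustration of the drop-in hook idea described under Future Work, the sketch below shows one shape such a reusable hook might take. Everything in it is hypothetical: `UploadingRequestHook`, `InMemoryUploader`, the `request_attribute` parameter, and the local `Blob` stand-in are illustrative names defined here so the snippet runs on its own, and the hook signature is simplified relative to the `handle_request(context, signal, request)` form shown earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

NOT_UPLOADED = "/dev/null"  # mirrors the sentinel value defined in this change


@dataclass
class Blob:
    """Local stand-in for the library's Blob (assumption, for illustration)."""

    raw_bytes: bytes
    content_type: Optional[str] = None
    labels: Dict[str, str] = field(default_factory=dict)


class InMemoryUploader:
    """Hypothetical uploader that keeps blobs in a dict instead of remote storage."""

    def __init__(self) -> None:
        self.stored: Dict[str, Blob] = {}

    def upload_async(self, blob: Blob) -> str:
        if not blob.raw_bytes:
            return NOT_UPLOADED  # nothing worth writing aside
        uri = f"memory://blobs/{len(self.stored)}"
        self.stored[uri] = blob
        return uri


class UploadingRequestHook:
    """Hypothetical drop-in hook: upload the payload, keep only a URI reference."""

    def __init__(self, uploader, request_attribute: str = "request.ref") -> None:
        self._uploader = uploader
        self._attribute = request_attribute

    def handle_request(
        self,
        attributes: Dict[str, str],
        raw_bytes: bytes,
        content_type: Optional[str] = None,
    ) -> None:
        uri = self._uploader.upload_async(Blob(raw_bytes, content_type))
        # Only record a reference when the uploader actually accepted the blob.
        if uri != NOT_UPLOADED:
            attributes[self._attribute] = uri


uploader = InMemoryUploader()
hook = UploadingRequestHook(uploader)
span_attributes: Dict[str, str] = {}
hook.handle_request(span_attributes, b'{"prompt": "hi"}', "application/json")
print(span_attributes)
```

With a hook like this, an instrumentation library would only need to pass the signal's attribute mapping and the payload; the choice of uploader stays with the application.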
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Copyright The OpenTelemetry Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Copyright The OpenTelemetry Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
"""Exposes API methods to callers from the package name.""" | ||
|
||
from opentelemetry.instrumentation._blobupload.api.blob import Blob | ||
from opentelemetry.instrumentation._blobupload.api.blob_uploader import ( | ||
BlobUploader, | ||
) | ||
from opentelemetry.instrumentation._blobupload.api.constants import ( | ||
NOT_UPLOADED, | ||
) | ||
from opentelemetry.instrumentation._blobupload.api.content_type import ( | ||
detect_content_type, | ||
) | ||
from opentelemetry.instrumentation._blobupload.api.labels import ( | ||
generate_labels_for_event, | ||
generate_labels_for_span, | ||
generate_labels_for_span_event, | ||
) | ||
from opentelemetry.instrumentation._blobupload.api.provider import ( | ||
BlobUploaderProvider, | ||
get_blob_uploader, | ||
set_blob_uploader_provider, | ||
) | ||
|
||
__all__ = [ | ||
"Blob", | ||
"BlobUploader", | ||
"NOT_UPLOADED", | ||
"detect_content_type", | ||
"generate_labels_for_event", | ||
"generate_labels_for_span", | ||
"generate_labels_for_span_event", | ||
"BlobUploaderProvider", | ||
"get_blob_uploader", | ||
"set_blob_uploader_provider", | ||
] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# Copyright The OpenTelemetry Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
import base64 | ||
import json | ||
from types import MappingProxyType as _frozendict | ||
from typing import Any, Mapping, Optional | ||
|
||
|
||
class Blob: | ||
"""Represents an opaque binary object and associated metadata. | ||
|
||
This object conceptually has the following properties: | ||
|
||
- raw_bytes: the actual data (payload) of the Blob | ||
- content_type: metadata about the content type (e.g. "image/jpeg") | ||
- labels: key/value data that can be used to identify and contextualize | ||
the object such as {"trace_id": "...", "span_id": "...", "filename": ...} | ||
Comment on lines +24 to +29: This duplicates the docs on the properties. |
||
""" | ||
|
||
def __init__( | ||
self, | ||
raw_bytes: bytes, | ||
content_type: Optional[str] = None, | ||
labels: Optional[Mapping[str, str]] = None, | ||
): | ||
"""Initialize the blob with an explicit set of properties. | ||
|
||
Args: | ||
raw_bytes: the required payload | ||
content_type: the MIME type describing the type of data in the payload | ||
labels: additional key/value data about the Blob | ||
""" | ||
self._raw_bytes = raw_bytes | ||
self._content_type = content_type | ||
self._labels = {} | ||
if labels is not None: | ||
if isinstance(labels, dict): | ||
self._labels.update(labels) | ||
else: | ||
for k in labels: | ||
self._labels[k] = labels[k] | ||
|
||
@staticmethod | ||
def from_data_uri(uri: str, labels: Optional[Mapping[str, str]] = None) -> "Blob": | ||
"""Instantiate a blob from a 'data:...' URI. | ||
|
||
Args: | ||
uri: A URI in the 'data:' format. Supports a subset of 'data:' URIs | ||
that encode the data with the 'base64' extension and that include | ||
a content type. Should work with any normal 'image/jpeg', 'image/png', | ||
'application/pdf', 'audio/aac', and many others. DOES NOT SUPPORT | ||
encoding data as percent-encoded text (no "base64"). | ||
|
||
labels: Additional key/value data to include in the constructed Blob. | ||
""" | ||
if not uri.startswith("data:"): | ||
raise ValueError( | ||
'Invalid "uri"; expected "data:" prefix. Found: "{}"'.format( | ||
Review comment: Is there a reason we're using this? In general I think f-strings would be preferred in modern Python. In particular, f-strings are around 2x faster to evaluate:
In [1]: %timeit 'this is a test {}'.format(1)
70.4 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [2]: %timeit f'this is a test {1}'
37.9 ns ± 0.239 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
Review comment: Also, including the full uri in the exception message could lead to extremely long exception messages; I would omit it. |
||
uri | ||
) | ||
) | ||
if ";base64," not in uri: | ||
raise ValueError( | ||
'Invalid "uri"; expected ";base64," section. Found: "{}"'.format( | ||
uri | ||
) | ||
) | ||
data_prefix_len = len("data:") | ||
after_data_prefix = uri[data_prefix_len:] | ||
if ";" not in after_data_prefix: | ||
raise ValueError( | ||
'Invalid "uri"; expected ";" in URI. Found: "{}"'.format(uri) | ||
) | ||
content_type, remaining = after_data_prefix.split(";", 1) | ||
while not remaining.startswith("base64,"): | ||
_, remaining = remaining.split(";", 1) | ||
assert remaining.startswith("base64,") | ||
base64_len = len("base64,") | ||
base64_encoded_content = remaining[base64_len:] | ||
raw_bytes = base64.b64decode(base64_encoded_content) | ||
return Blob(raw_bytes, content_type=content_type, labels=labels) | ||
|
||
@property | ||
def raw_bytes(self) -> bytes: | ||
"""Returns the raw bytes (payload) of this Blob.""" | ||
return self._raw_bytes | ||
|
||
@property | ||
def content_type(self) -> Optional[str]: | ||
"""Returns the content type (or None) of this Blob.""" | ||
return self._content_type | ||
|
||
@property | ||
def labels(self) -> Mapping[str, str]: | ||
"""Returns the key/value metadata of this Blob.""" | ||
return _frozendict(self._labels) | ||
|
||
def __eq__(self, o: Any) -> bool: | ||
return ( | ||
(isinstance(o, Blob)) and | ||
(self.raw_bytes == o.raw_bytes) and | ||
(self.content_type == o.content_type) and | ||
(self.labels == o.labels) | ||
) | ||
|
||
def __repr__(self) -> str: | ||
params = [repr(self._raw_bytes)] | ||
if self._content_type is not None: | ||
params.append(f"content_type={self._content_type!r}") | ||
if self._labels: | ||
params.append("labels={}".format(json.dumps(self._labels, sort_keys=True))) | ||
Review comment: Why are we JSON formatting? This will be much slower and will not lead to the correct repr. |
||
params_string = ", ".join(params) | ||
return "Blob({})".format(params_string) |
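
For readers following the `from_data_uri` logic, here is a minimal standalone sketch of the same parsing steps using only the standard library. The helper name `parse_base64_data_uri` is hypothetical and not part of this change. It follows the reviewer's suggestion of leaving the URI out of the error messages, and it is meant as an illustration of the approach rather than a replacement for the method above.

```python
import base64
from typing import Tuple


def parse_base64_data_uri(uri: str) -> Tuple[str, bytes]:
    """Return (content_type, raw_bytes) for a base64-encoded 'data:' URI."""
    if not uri.startswith("data:"):
        raise ValueError('Invalid "uri"; expected "data:" prefix.')
    # Split once on the base64 marker: everything before it is the media
    # type (plus optional parameters), everything after is the payload.
    header, separator, payload = uri[len("data:"):].partition(";base64,")
    if not separator:
        raise ValueError('Invalid "uri"; expected ";base64," section.')
    # The header may carry extra parameters, e.g. "text/plain;charset=utf-8".
    content_type = header.split(";", 1)[0]
    return content_type, base64.b64decode(payload)


content_type, raw_bytes = parse_base64_data_uri("data:image/png;base64,aGVsbG8=")
print(content_type, raw_bytes)
```

As in the method above, percent-encoded `data:` URIs (those without `;base64,`) are rejected rather than decoded.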
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
# Copyright The OpenTelemetry Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
"""Defines an interface for performing asynchronous blob uploading.""" | ||
|
||
import abc | ||
michaelsafyan marked this conversation as resolved.
|
||
|
||
from opentelemetry.instrumentation._blobupload.api.blob import Blob | ||
from opentelemetry.instrumentation._blobupload.api.constants import ( | ||
NOT_UPLOADED, | ||
) | ||
|
||
|
||
class BlobUploader(abc.ABC): | ||
"""Pure abstract base class representing a component that does blob uploading.""" | ||
|
||
@abc.abstractmethod | ||
def upload_async(self, blob: Blob) -> str: | ||
Review comment: Is there any support for async upload methods? |
||
return NOT_UPLOADED |
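
The name `upload_async` suggests that the call returns a destination URI without waiting for the write to complete. Assuming that reading is right, one way an implementation might behave is sketched below. `ThreadedUploader`, the `memory://` URI scheme, and the dict-backed store are hypothetical, and the local `BlobUploader` stand-in takes raw bytes instead of a `Blob` to keep the snippet self-contained.

```python
import abc
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

NOT_UPLOADED = "/dev/null"  # same sentinel value as in constants.py


class BlobUploader(abc.ABC):
    """Local stand-in for the interface above, taking raw bytes for brevity."""

    @abc.abstractmethod
    def upload_async(self, data: bytes) -> str:
        return NOT_UPLOADED


class ThreadedUploader(BlobUploader):
    """Hypothetical uploader: picks the URI up front, writes in the background."""

    def __init__(self, store: Dict[str, bytes]) -> None:
        self._store = store  # stands in for a real blob storage client
        self._executor = ThreadPoolExecutor(max_workers=1)

    def upload_async(self, data: bytes) -> str:
        if not data:
            return NOT_UPLOADED
        uri = f"memory://{uuid.uuid4().hex}"
        # Schedule the write and return immediately; the caller only needs the URI.
        self._executor.submit(self._store.__setitem__, uri, data)
        return uri

    def shutdown(self) -> None:
        # Blocks until all pending writes have finished.
        self._executor.shutdown(wait=True)


store: Dict[str, bytes] = {}
uploader = ThreadedUploader(store)
uri = uploader.upload_async(b"large payload")
uploader.shutdown()
print(uri in store)
```

Choosing the URI before the write starts is what lets the caller attach a reference to the telemetry signal right away; the trade-off is that a failed background write leaves a dangling reference unless the implementation reports it some other way.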
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Copyright The OpenTelemetry Authors | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
"""Defines constants that are used by the '_blobupload' package.""" | ||
|
||
# Special constant used to indicate that a BlobUploader did not upload. | ||
NOT_UPLOADED = "/dev/null" |
Review comment: If you implemented this as a dataclass (perhaps frozen), you could avoid having to implement:
- `__init__`
- the `@property` methods
- `__eq__`
- probably `__repr__`
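
The dataclass suggestion in the comment above could look roughly like the following. This is a sketch of the reviewer's idea, not the code in this change; the field names simply mirror the existing `Blob` properties.

```python
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class Blob:
    """Sketch of Blob as a frozen dataclass (not the code in this change)."""

    raw_bytes: bytes
    content_type: Optional[str] = None
    labels: Mapping[str, str] = field(default_factory=dict)


# __init__, __eq__ and __repr__ are all generated by the decorator.
first = Blob(b"abc", content_type="text/plain", labels={"span_id": "1"})
second = Blob(b"abc", content_type="text/plain", labels={"span_id": "1"})
print(first == second)
print(first)
```

Two caveats apply to this sketch. `frozen=True` blocks attribute reassignment but does not copy or freeze the `labels` mapping itself, so the read-only view that the current `labels` property returns would still need a `__post_init__` step. The generated `__hash__` also fails if `labels` is a plain `dict`.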