BlobUploader utilities to enable handling of large data in instrumentation #3122

Open: wants to merge 17 commits into base: main
8 changes: 8 additions & 0 deletions opentelemetry-instrumentation/pyproject.toml
@@ -32,6 +32,14 @@ dependencies = [
"packaging >= 18.0",
]

[project.optional-dependencies]
gcs = [
"google-cloud-storage==2.19.0"
]
magic = [
"python-magic==0.4.27"
]

[project.scripts]
opentelemetry-bootstrap = "opentelemetry.instrumentation.bootstrap:run"
opentelemetry-instrument = "opentelemetry.instrumentation.auto_instrumentation:run"
@@ -0,0 +1,123 @@
# Blob Uploader Library (Experimental)

The Blob Uploader library provides an experimental way to
"write-aside" large or sensitive payloads to a blob storage
system, while retaining references to the written-aside destination
in the operations backend where telemetry is being written.

This is intended primarily for request/response logging, where
typical telemetry backends may be unsuitable for storing this data,
whether for size or for privacy reasons. GenAI multi-modal
prompt/response logging is a particularly salient motivation for
this feature, though general HTTP request/response logging is
another case where it applies.

## Usage: Instrumentation Library

Instrumentation libraries should provide general hooks for handling
requests/responses (or other large blobs) that should only be
conditionally included in telemetry signals. The hooks should provide
enough context for a user of the instrumentation library to decide
what to do with the content, including but not limited to: dropping
it, including it in the telemetry signal, or writing it to a
BlobUploader and retaining a reference to the destination URI.

For example:

```
import abc
from typing import Optional


class RequestHook(abc.ABC):

    @abc.abstractmethod
    def handle_request(self, context, signal, request):
        pass


class ResponseHook(abc.ABC):

    @abc.abstractmethod
    def handle_response(self, context, signal, response):
        pass


class FooInstrumentationLibrary:

    def __init__(self,
                 # ...,
                 request_hook: Optional[RequestHook] = None,
                 response_hook: Optional[ResponseHook] = None,
                 # ...
                 ):
        ...
```


## Usage: User of Instrumentation Library

Users of instrumentation libraries can use the Blob Uploader
libraries to implement relevant request/response hooks.

For example:

```
from opentelemetry.instrumentation._blobupload.api import (
    NOT_UPLOADED,
    Blob,
    BlobUploaderProvider,
    get_blob_uploader,
    set_blob_uploader_provider,
)


class MyBlobUploaderRequestHook(RequestHook):
    # ...

    def handle_request(self, context, signal, request):
        if not self.should_upload(context):
            return
        use_case = self.select_use_case(context, signal)
        uploader = get_blob_uploader(use_case)
        blob = Blob(
            request.raw_bytes,
            content_type=request.content_type,
            labels=self.generate_blob_labels(context, signal, request))
        uri = uploader.upload_async(blob)
        if uri == NOT_UPLOADED:
            return
        signal.attributes[REQUEST_ATTRIBUTE] = uri

    # ...


class MyBlobUploaderProvider(BlobUploaderProvider):

    def get_blob_uploader(self, use_case=None):
        # ...
        ...


def main():
    set_blob_uploader_provider(MyBlobUploaderProvider())
    instrumentation_library = FooInstrumentationLibrary(
        # ...,
        request_hook=MyBlobUploaderRequestHook(),
        # ...
    )
    # ...

```
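
To make the uploader and provider roles more concrete, here is a minimal sketch of an in-memory uploader. The `Blob`, `BlobUploader`, and `NOT_UPLOADED` definitions are simplified local stand-ins for the API in this PR so that the snippet runs on its own. `InMemoryBlobUploader`, `InMemoryBlobUploaderProvider`, the `memory://` URI scheme, and the size limit are hypothetical names invented for illustration, not part of the library.

```python
import abc
import uuid
from typing import Dict, Mapping, Optional

# Stand-in for the constant defined in this PR's constants module.
NOT_UPLOADED = "/dev/null"


class Blob:
    """Simplified stand-in for the Blob class in this PR."""

    def __init__(self, raw_bytes: bytes, content_type: Optional[str] = None,
                 labels: Optional[Mapping[str, str]] = None):
        self.raw_bytes = raw_bytes
        self.content_type = content_type
        self.labels = dict(labels or {})


class BlobUploader(abc.ABC):
    """Stand-in for the abstract uploader interface in this PR."""

    @abc.abstractmethod
    def upload_async(self, blob: Blob) -> str:
        return NOT_UPLOADED


class InMemoryBlobUploader(BlobUploader):
    """Hypothetical uploader that keeps blobs in a dict (illustration only)."""

    def __init__(self, max_bytes: int = 1024):
        self._max_bytes = max_bytes
        self.stored: Dict[str, Blob] = {}

    def upload_async(self, blob: Blob) -> str:
        # Refuse oversized payloads and signal that nothing was uploaded.
        if len(blob.raw_bytes) > self._max_bytes:
            return NOT_UPLOADED
        uri = f"memory://{uuid.uuid4().hex}"
        self.stored[uri] = blob
        return uri


class InMemoryBlobUploaderProvider:
    """Hypothetical provider that hands out one uploader per use case."""

    def __init__(self):
        self._uploaders: Dict[Optional[str], InMemoryBlobUploader] = {}

    def get_blob_uploader(self, use_case: Optional[str] = None) -> BlobUploader:
        if use_case not in self._uploaders:
            self._uploaders[use_case] = InMemoryBlobUploader()
        return self._uploaders[use_case]


provider = InMemoryBlobUploaderProvider()
uploader = provider.get_blob_uploader("request_logging")
small_uri = uploader.upload_async(Blob(b"hello", content_type="text/plain"))
large_uri = uploader.upload_async(Blob(b"x" * 2048))
```

A hook would compare the returned value against `NOT_UPLOADED` before attaching it to a signal, as `handle_request` does in the example above.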

## Future Work

As can be seen from the above usage examples, there is quite a
bit of common boilerplate both for instrumentation libraries (e.g.
defining the set of hook interfaces) and for consumers of those
instrumentation libraries (e.g. implementing variants of those hook
interfaces that make use of the BlobUploader libraries).

A potential future improvement would be to define a common set of
hook interfaces for this use case that can be reused across
instrumentation libraries and to provide simple drop-in
implementations of those hooks that make use of BlobUploader.

Beyond this, boilerplate to define a custom 'BlobUploaderProvider'
could be reduced by expanding the capabilities of the default
provider, so that most common uses are covered with a minimal
set of environment variables (if optional deps are present).
@@ -0,0 +1,13 @@
# Copyright The OpenTelemetry Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@@ -0,0 +1,49 @@
# Copyright The OpenTelemetry Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Exposes API methods to callers from the package name."""

from opentelemetry.instrumentation._blobupload.api.blob import Blob
from opentelemetry.instrumentation._blobupload.api.blob_uploader import (
    BlobUploader,
)
from opentelemetry.instrumentation._blobupload.api.constants import (
    NOT_UPLOADED,
)
from opentelemetry.instrumentation._blobupload.api.content_type import (
    detect_content_type,
)
from opentelemetry.instrumentation._blobupload.api.labels import (
    generate_labels_for_event,
    generate_labels_for_span,
    generate_labels_for_span_event,
)
from opentelemetry.instrumentation._blobupload.api.provider import (
    BlobUploaderProvider,
    get_blob_uploader,
    set_blob_uploader_provider,
)

__all__ = [
    "Blob",
    "BlobUploader",
    "NOT_UPLOADED",
    "detect_content_type",
    "generate_labels_for_event",
    "generate_labels_for_span",
    "generate_labels_for_span_event",
    "BlobUploaderProvider",
    "get_blob_uploader",
    "set_blob_uploader_provider",
]
@@ -0,0 +1,125 @@
# Copyright The OpenTelemetry Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import base64
import json
from types import MappingProxyType as _frozendict
from typing import Any, Mapping, Optional


class Blob:
Review comment (Contributor):

If you implemented this as a dataclass (perhaps frozen), you could avoid having to implement:

  • most of __init__
  • the three @property methods
  • __eq__ probably
  • __repr__
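
A minimal sketch of what this suggestion could look like, assuming the same three fields as the class in this PR. This is an illustration rather than the PR's code: the `__post_init__` copy into a read-only mapping is one possible way to keep the labels immutable, and `from_data_uri` is omitted.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Blob:
    """Sketch of Blob as a frozen dataclass (not the code in this PR)."""

    raw_bytes: bytes
    content_type: Optional[str] = None
    labels: Mapping[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Copy the labels into a read-only view so later changes to the
        # caller's mapping do not leak into the blob. object.__setattr__
        # is needed because the dataclass is frozen.
        frozen = MappingProxyType(dict(self.labels or {}))
        object.__setattr__(self, "labels", frozen)


blob = Blob(b"\x89PNG", content_type="image/png", labels={"trace_id": "abc"})
same = Blob(b"\x89PNG", content_type="image/png", labels={"trace_id": "abc"})
```

With this shape, `__init__`, `__eq__`, and `__repr__` are generated, and attribute assignment raises an error. The generated `__repr__` always prints every field, which differs from the hand-written one that omits unset fields.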

    """Represents an opaque binary object and associated metadata.

    This object conceptually has the following properties:

    - raw_bytes: the actual data (payload) of the Blob
    - content_type: metadata about the content type (e.g. "image/jpeg")
    - labels: key/value data that can be used to identify and contextualize
      the object such as {"trace_id": "...", "span_id": "...", "filename": ...}
Review comment (Contributor) on lines +24 to +29:

this duplicates the docs on the properties.

    """

    def __init__(
        self,
        raw_bytes: bytes,
        content_type: Optional[str] = None,
        labels: Optional[Mapping[str, str]] = None,
    ):
        """Initialize the blob with an explicit set of properties.

        Args:
            raw_bytes: the required payload
            content_type: the MIME type describing the type of data in the payload
            labels: additional key/value data about the Blob
        """
        self._raw_bytes = raw_bytes
        self._content_type = content_type
        self._labels = {}
        if labels is not None:
            if isinstance(labels, dict):
                self._labels.update(labels)
            else:
                for k in labels:
                    self._labels[k] = labels[k]

    @staticmethod
    def from_data_uri(uri: str, labels: Optional[Mapping[str, str]] = None) -> "Blob":
        """Instantiate a blob from a 'data:...' URI.

        Args:
            uri: A URI in the 'data:' format. Supports a subset of 'data:' URIs
                that encode the data with the 'base64' extension and that include
                a content type. Should work with any normal 'image/jpeg', 'image/png',
                'application/pdf', 'audio/aac', and many others. DOES NOT SUPPORT
                encoding data as percent-encoded text (no "base64").

            labels: Additional key/value data to include in the constructed Blob.
        """
        if not uri.startswith("data:"):
            raise ValueError(
                'Invalid "uri"; expected "data:" prefix. Found: "{}"'.format(
Review comment (Contributor):

is there a reason we're using .format() not f-strings?

In general I think f-strings would be preferred in modern python.

In particular, f-strings are around 2x faster to evaluate:

In [1]: %timeit 'this is a test {}'.format(1)
70.4 ns ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [2]: %timeit f'this is a test {1}'
37.9 ns ± 0.239 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Review comment (Contributor):

Also, including the full uri in the exception message could lead to extremely long exception messages, I would omit it.

                    uri
                )
            )
        if ";base64," not in uri:
            raise ValueError(
                'Invalid "uri"; expected ";base64," section. Found: "{}"'.format(
                    uri
                )
            )
        data_prefix_len = len("data:")
        after_data_prefix = uri[data_prefix_len:]
        if ";" not in after_data_prefix:
            raise ValueError(
                'Invalid "uri"; expected ";" in URI. Found: "{}"'.format(uri)
            )
        content_type, remaining = after_data_prefix.split(";", 1)
        while not remaining.startswith("base64,"):
            _, remaining = remaining.split(";", 1)
        assert remaining.startswith("base64,")
        base64_len = len("base64,")
        base64_encoded_content = remaining[base64_len:]
        raw_bytes = base64.b64decode(base64_encoded_content)
        return Blob(raw_bytes, content_type=content_type, labels=labels)

    @property
    def raw_bytes(self) -> bytes:
        """Returns the raw bytes (payload) of this Blob."""
        return self._raw_bytes

    @property
    def content_type(self) -> Optional[str]:
        """Returns the content type (or None) of this Blob."""
        return self._content_type

    @property
    def labels(self) -> Mapping[str, str]:
        """Returns the key/value metadata of this Blob."""
        return _frozendict(self._labels)

    def __eq__(self, o: Any) -> bool:
        return (
            (isinstance(o, Blob)) and
            (self.raw_bytes == o.raw_bytes) and
            (self.content_type == o.content_type) and
            (self.labels == o.labels)
        )

    def __repr__(self) -> str:
        params = [repr(self._raw_bytes)]
        if self._content_type is not None:
            params.append(f"content_type={self._content_type!r}")
        if self._labels:
            params.append("labels={}".format(json.dumps(self._labels, sort_keys=True)))
Review comment (Contributor):

why are we JSON formatting? this will be much slower, and not lead to the correct repr.
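
A sketch of a `repr`-based alternative along the lines this comment suggests, using a small stand-in class rather than the PR's `Blob`. `BlobLike` is a hypothetical name used only for illustration.

```python
from typing import Mapping, Optional


class BlobLike:
    """Hypothetical stand-in with the same three fields as Blob."""

    def __init__(self, raw_bytes: bytes, content_type: Optional[str] = None,
                 labels: Optional[Mapping[str, str]] = None):
        self._raw_bytes = raw_bytes
        self._content_type = content_type
        self._labels = dict(labels or {})

    def __repr__(self) -> str:
        params = [repr(self._raw_bytes)]
        if self._content_type is not None:
            params.append(f"content_type={self._content_type!r}")
        if self._labels:
            # Use the dict's own repr, which is a Python literal and
            # needs no JSON encoding step.
            params.append(f"labels={self._labels!r}")
        return f"BlobLike({', '.join(params)})"


text = repr(BlobLike(b"hi", content_type="text/plain", labels={"k": "v"}))
```

One trade-off: `json.dumps(..., sort_keys=True)` gives a key-sorted, deterministic string, whereas the dict repr follows insertion order. If determinism matters (for example in test assertions), the labels could be sorted before formatting.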

        params_string = ", ".join(params)
        return "Blob({})".format(params_string)
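
For reference, the parsing steps in `from_data_uri` can be traced with a standalone sketch. `split_base64_data_uri` is a hypothetical helper that mirrors the logic above but returns a tuple instead of a `Blob`, and the sample URI is made up for the example.

```python
import base64
from typing import Tuple


def split_base64_data_uri(uri: str) -> Tuple[str, bytes]:
    """Split a base64 'data:' URI into (content_type, decoded_bytes)."""
    if not uri.startswith("data:"):
        raise ValueError('expected "data:" prefix')
    if ";base64," not in uri:
        raise ValueError('expected ";base64," section')
    after_prefix = uri[len("data:"):]
    # Everything before the first ";" is treated as the content type.
    content_type, remaining = after_prefix.split(";", 1)
    # Skip optional parameters such as "charset=utf-8" until "base64,".
    while not remaining.startswith("base64,"):
        _, remaining = remaining.split(";", 1)
    return content_type, base64.b64decode(remaining[len("base64,"):])


content_type, payload = split_base64_data_uri(
    "data:text/plain;charset=utf-8;base64,aGVsbG8="
)
```

A percent-encoded URI such as `data:text/plain,hello` has no `;base64,` section and is rejected, which matches the limitation stated in the docstring.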
@@ -0,0 +1,30 @@
# Copyright The OpenTelemetry Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Defines an interface for performing asynchronous blob uploading."""

import abc

from opentelemetry.instrumentation._blobupload.api.blob import Blob
from opentelemetry.instrumentation._blobupload.api.constants import (
    NOT_UPLOADED,
)


class BlobUploader(abc.ABC):
    """Pure abstract base class representing a component that does blob uploading."""

    @abc.abstractmethod
    def upload_async(self, blob: Blob) -> str:
Review comment (Contributor):

is there any support for async upload methods?

        return NOT_UPLOADED
@@ -0,0 +1,18 @@
# Copyright The OpenTelemetry Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Defines constants that are used by the '_blobupload' package."""

# Special constant used to indicate that a BlobUploader did not upload.
NOT_UPLOADED = "/dev/null"