
[receiver/otlp] Review telemetry #11139

Open · mx-psi opened this issue Sep 11, 2024 · 5 comments

@mx-psi (Member) commented Sep 11, 2024

As part of the stabilization of the OTLP receiver, we want to ensure that it has the core self telemetry needed to understand the component and debug common issues. This issue tracks reviewing existing telemetry and defining any new telemetry needed to make the component observable. It should be rare that we add new telemetry as per the 1.0 tenet that "The Collector is already used in production at scale and has been tested in a variety of environments".

@jade-guiton-dd

This comment was marked as outdated.

@mx-psi mx-psi moved this from Blocked to In Progress in Collector: v1 Dec 16, 2024
@jade-guiton-dd jade-guiton-dd moved this from In Progress to Todo in Collector: v1 Dec 16, 2024
@jade-guiton-dd jade-guiton-dd moved this from Todo to In Progress in Collector: v1 Dec 16, 2024
@jade-guiton-dd (Contributor) commented Jan 10, 2025

Now that observability requirements for stable components have been formally defined, we can assess more precisely which of the proposals above will need to be implemented before stabilization.

Unlike for the previous review, I will split this up between the issues for the different OTLP components.

receiver/otlp

  1. How much data the component outputs.

✅ This should be covered by pipeline instrumentation, once implemented.

  2. Other possible discrepancies between input and output, if any.

✅ The receiver does not willfully drop items, create them, or hold them, so this is not applicable.

  3. Processing performance.

⚠️ Currently, the span emitted by receiverhelper only encompasses downstream processing, so the difference in timing between it and the span emitted by otelgrpc/otelhttp can be used to determine the latency. However, this is an implementation detail, and should probably not be relied on.

⚠️ Moreover, receiverhelper does not currently support profiles. This means latency is not observable for them, which would be a blocker if profiles were a stable signal.

➡️ Possible solution: If pipeline instrumentation is implemented in a way that generates spans for each Consume operation (see issue #11743 for discussion), then the difference between that and the otelgrpc/otelhttp span may be used to unambiguously measure latency. In that case, no additional telemetry would be necessary from the receiver or receiverhelper.

➡️ An alternate solution would be to modify the spans emitted by receiverhelper to make sure that users can differentiate between spans covering the entire component operation and spans covering downstream processing. This could be done by modifying its API to create two spans, for example.
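As a rough illustration of that second option, here is a minimal sketch of a helper that emits two nested spans. This is a hypothetical shape, not the current receiverhelper API; the span names are placeholders.

```go
// Minimal sketch (hypothetical helper, not the current receiverhelper API) of the
// "two spans" idea: an outer span covering the whole receive operation and an
// inner span covering only downstream processing, so the receiver's own latency
// can be derived without relying on otelgrpc/otelhttp implementation details.
package receiversketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

func receiveTraces(ctx context.Context, tracer trace.Tracer, next consumer.Traces, td ptrace.Traces) error {
	// Outer span: covers the entire component operation (decoding, validation,
	// and downstream processing), so its duration is the end-to-end latency.
	ctx, outer := tracer.Start(ctx, "receiver/ReceiveTraces")
	defer outer.End()

	// ... request decoding and validation would happen here ...

	// Inner span: covers only the downstream Consume call. The difference
	// between the two spans isolates the receiver's own processing time.
	consumeCtx, inner := tracer.Start(ctx, "receiver/ConsumeTraces")
	err := next.ConsumeTraces(consumeCtx, td)
	if err != nil {
		inner.SetStatus(codes.Error, err.Error())
		outer.SetStatus(codes.Error, err.Error())
	}
	inner.End()
	return err
}
```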

All internal telemetry emitted by a component should have attributes identifying the specific component instance that it originates from. This should follow the same conventions as the pipeline universal telemetry.

In the case of "singleton" receivers like the OTLP receiver, the component ID (eg. otlp or otlp/foo) is enough to identify the instance.

⚠️ The receiver attribute on receiverhelper metrics and the name of receiverhelper spans allow users to identify the source instance. Identification is also possible for spans and metrics generated by otelhttp, by comparing the net.host.port attribute with the config. However, this is not uniform and does not follow the pipeline telemetry conventions.

⚠️ otelgrpc spans and metrics do not give any port information, which prevents identifying the source component (in cases where no receiverhelper span is emitted). This is problematic when there are multiple gRPC endpoints configured.

➡️ To address the most pressing issue, an issue should be filed on otelgrpc to implement the server.address and server.port attributes defined in the semantic conventions.

➡️ My long-term recommendation would be to modify receiverhelper to emit attributes according to the pipeline telemetry conventions, and to inject said attributes into otelgrpc and otelhttp as well. For otelhttp, this can be done with ContextWithLabeler for metrics and WithSpanOptions for spans; for otelgrpc, with WithMetricAttributes and WithSpanAttributes.
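As a rough sketch of what that injection could look like; the attribute key otelcol.component.id is an assumption standing in for whatever the pipeline telemetry conventions settle on, and this is not the Collector's actual wiring:

```go
// Sketch of injecting component-identifying attributes into otelhttp and otelgrpc
// telemetry. The attribute key "otelcol.component.id" is illustrative only.
package receiversketch

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc"
)

// wrapHTTPHandler adds the component ID to otelhttp spans (via WithSpanOptions)
// and to otelhttp metrics (via a Labeler placed in the request context).
func wrapHTTPHandler(h http.Handler, componentID string) http.Handler {
	attrs := []attribute.KeyValue{attribute.String("otelcol.component.id", componentID)}

	instrumented := otelhttp.NewHandler(h, "otlp_receiver",
		otelhttp.WithSpanOptions(trace.WithAttributes(attrs...)),
	)

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		labeler := &otelhttp.Labeler{}
		labeler.Add(attrs...)
		ctx := otelhttp.ContextWithLabeler(r.Context(), labeler)
		instrumented.ServeHTTP(w, r.WithContext(ctx))
	})
}

// grpcServerOptions adds the component ID to otelgrpc spans and metrics.
func grpcServerOptions(componentID string) []grpc.ServerOption {
	attrs := []attribute.KeyValue{attribute.String("otelcol.component.id", componentID)}
	return []grpc.ServerOption{
		grpc.StatsHandler(otelgrpc.NewServerHandler(
			otelgrpc.WithSpanAttributes(attrs...),
			otelgrpc.WithMetricAttributes(attrs...),
		)),
	}
}
```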

HTTP path

  1. How much data the component receives.

✅ The number of HTTP requests can be observed using the count of the http.server.request.duration histogram, or the number of traces created by otelhttp. The size of the requests can also be observed using http.server.request_size.

  2. How much data is dropped because of errors.
  3. Details for error conditions.

There are 3 error modes for HTTP requests:

  • Failure to read request (data dropped because of an internal error):
    Possible reasons: invalid method, invalid content type, failure to read body, failure to unmarshal body
    HTTP 405/415/400 returned (observable using otelhttp span and metric attributes)
    Success span emitted by otelhttp
  • Downstream error (data dropped because of a downstream error):
    HTTP 500 returned
    Error spans emitted by otelhttp and receiverhelper; the latter contains the error message
    otelcol_receiver_refused_xxxx metric emitted, and also recorded by pipeline instrumentation
  • Failure to marshal response (no data dropped, indicative of a bug):
    HTTP 500 returned
    Error span emitted by otelhttp
    Success span emitted by receiverhelper

⚠️ From this breakdown, we can see that it is technically possible to differentiate all three cases, which allows counting failed requests and distinguishing downstream errors from internal errors. However, this requires correlation between multiple data points. Moreover, in case 1, the error details can only be deduced from the status code.

⚠️ For profiles, no spans are currently emitted by receiverhelper, and the downstream error messages are not recorded. This would be a blocker if profiles were not currently experimental.

➡️ A possible improvement would be:

  • Adding an otelcol_receiver_http_requests counter metric, with an outcome attribute similar to that set by pipeline instrumentation. If my amendment of the RFC (Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956) lands, this would have three values (success, failure, rejected), to differentiate between success, internal errors, and downstream errors.
  • Logging the error messages for cases 1 and 3, either as a log record or as a span event on the otelhttp span.

I am unsure whether failures to marshal the response should count as a success or a failure, as no data is dropped. This case should be fairly rare, however, especially since the response message does not typically contain any data.
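For concreteness, here is a minimal sketch of what such a counter could look like with the OpenTelemetry Go metric API. The metric and attribute names mirror the proposal above but are not an existing API.

```go
// Sketch of the proposed otelcol_receiver_http_requests counter with an
// "outcome" attribute. Names and values follow the proposal above and the
// pipeline component telemetry RFC amendment, but nothing here is implemented.
package receiversketch

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

type httpTelemetry struct {
	requests metric.Int64Counter
}

func newHTTPTelemetry(meter metric.Meter) (*httpTelemetry, error) {
	requests, err := meter.Int64Counter(
		"otelcol_receiver_http_requests",
		metric.WithDescription("Number of HTTP requests handled by the receiver, by outcome."),
	)
	if err != nil {
		return nil, err
	}
	return &httpTelemetry{requests: requests}, nil
}

// recordRequest records one request with outcome "success", "failure"
// (internal error), or "rejected" (downstream error).
func (t *httpTelemetry) recordRequest(ctx context.Context, outcome string) {
	t.requests.Add(ctx, 1, metric.WithAttributes(attribute.String("outcome", outcome)))
}
```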

➡️ A more general solution may be to update the receiverhelper API in a way that would allow us to output errors as failed spans in all 3 cases, and to automatically output a metric with the appropriate outcome attribute. This is similar to, and may be combined with, the second suggestion in the "Processing performance" section.

gRPC path

  1. How much data the component receives.

rpc.server.duration, rpc.server.request.size, and the traces created by otelgrpc can be used in the same way as their HTTP equivalents.

  2. How much data is dropped because of errors.
  3. Details for error conditions.

There are 4 error modes for gRPC requests:

  • Basic protocol errors (wrong protocol, wrong path or content-type):
    An error is returned to the client, but no telemetry is emitted.
  • Errors in payload unmarshaling, potentially other internal gRPC errors:
    Error span emitted by otelgrpc
  • Early connection shutdown:
    Success span emitted by otelgrpc (?)
    info level log with no span id emitted
  • Downstream error:
    Error spans emitted by otelgrpc and receiverhelper; both contain the error message
    otelcol_receiver_refused_xxxx metric emitted, and also recorded by pipeline instrumentation

⚠️ Some protocol and connection-related errors are not properly observable. We could consider this data loss, but we could also consider monitoring these to be the responsibility of the client.

➡️ If we decide we want to monitor more types of errors on the Collector side, my recommendation would be to open an issue with otelgrpc to check whether this information is exposed by the "stats" system the gRPC library provides.
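For reference, this is roughly what hooking into that stats system could look like; whether protocol- and connection-level errors actually surface through this interface is exactly what would need to be confirmed with otelgrpc:

```go
// Rough sketch of a custom gRPC stats.Handler that observes RPC-level errors.
// This only demonstrates the hook point; it is not a proposed implementation.
package grpcstatssketch

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/stats"
)

type errorObserver struct{}

func (errorObserver) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context { return ctx }
func (errorObserver) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}
func (errorObserver) HandleConn(context.Context, stats.ConnStats) {}

func (errorObserver) HandleRPC(_ context.Context, s stats.RPCStats) {
	// stats.End carries the final status of the RPC, including any error.
	if end, ok := s.(*stats.End); ok && end.Error != nil {
		log.Printf("gRPC RPC failed: %v", end.Error)
	}
}

// serverOptions shows how the handler would be registered on the server.
func serverOptions() []grpc.ServerOption {
	return []grpc.ServerOption{grpc.StatsHandler(errorObserver{})}
}
```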

@jade-guiton-dd (Contributor) commented Jan 13, 2025

To sum up, it seems to me the only strict blocker for stabilization according to the requirements we defined is:

However, if we want to avoid adding new telemetry or immediately deprecating telemetry post-1.0, I would suggest waiting until:

Note that the above is only true for traces/metrics/logs. The fact that receiverhelper does not currently support profiles is an additional blocker for stabilization of the component for profiles.

@mx-psi mx-psi moved this from In Progress to Blocked in Collector: v1 Jan 13, 2025
@mx-psi (Member, Author) commented Jan 13, 2025

Marking this as blocked until we get a reply on open-telemetry/opentelemetry-go-contrib/issues/6608

@mx-psi (Member, Author) commented Jan 13, 2025

Agreed on the blockers; I think we should file issues for them (maybe the otelgrpc semantic conventions one could be a simple PR?).
