
[receiver/otlp] Review telemetry #11139

Open · mx-psi opened this issue Sep 11, 2024 · 5 comments

@mx-psi (Member) commented Sep 11, 2024

As part of the stabilization of the OTLP receiver, we want to ensure that it has the core self telemetry needed to understand the component and debug common issues. This issue tracks reviewing existing telemetry and defining any new telemetry needed to make the component observable. It should be rare that we add new telemetry as per the 1.0 tenet that "The Collector is already used in production at scale and has been tested in a variety of environments".

@jade-guiton-dd

This comment was marked as outdated.

@mx-psi mx-psi moved this from Blocked to In Progress in Collector: v1 Dec 16, 2024
@jade-guiton-dd jade-guiton-dd moved this from In Progress to Todo in Collector: v1 Dec 16, 2024
@jade-guiton-dd jade-guiton-dd moved this from Todo to In Progress in Collector: v1 Dec 16, 2024
@jade-guiton-dd (Contributor) commented Jan 10, 2025

Now that observability requirements for stable components have been formally defined, we can assess more precisely which of the proposals above will need to be implemented before stabilization.

Unlike for the previous review, I will split this up between the issues for the different OTLP components.

receiver/otlp

  1. How much data the component outputs.

✅ This should be covered by pipeline instrumentation, once implemented.

  2. Other possible discrepancies between input and output, if any.

✅ The receiver does not willfully drop items, create them, or hold them, so this is not applicable.

  3. Processing performance.

⚠️ Currently, the span emitted by receiverhelper only encompasses downstream processing, so the difference in timing between it and the span emitted by otelgrpc/otelhttp can be used to determine the latency. However, this is an implementation detail, and should probably not be relied on.

⚠️ Moreover, receiverhelper does not currently support profiles. This means latency is not observable for them, which would be a blocker if profiles were a stable signal.

➡️ Possible solution: If pipeline instrumentation is implemented in a way that generates spans for each Consume operation (see issue #11743 for discussion), then the difference between that and the otelgrpc/otelhttp span may be used to unambiguously measure latency. In that case, no additional telemetry would be necessary from the receiver or receiverhelper.

➡️ An alternate solution would be to modify the spans emitted by receiverhelper to make sure that users can differentiate between spans covering the entire component operation and spans covering downstream processing. This could be done by modifying its API to create two spans, for example.
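As a rough illustration of that second option, here is a minimal sketch of a helper that emits two nested spans. This is a hypothetical shape, not the current receiverhelper API; the span names are placeholders.

```go
// Minimal sketch (hypothetical helper, not the current receiverhelper API) of the
// "two spans" idea: an outer span covering the whole receive operation and an
// inner span covering only downstream processing, so the receiver's own latency
// can be derived without relying on otelgrpc/otelhttp implementation details.
package receiversketch

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/ptrace"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

func receiveTraces(ctx context.Context, tracer trace.Tracer, next consumer.Traces, td ptrace.Traces) error {
	// Outer span: covers the entire component operation (decoding, validation,
	// and downstream processing), so its duration is the end-to-end latency.
	ctx, outer := tracer.Start(ctx, "receiver/ReceiveTraces")
	defer outer.End()

	// ... request decoding and validation would happen here ...

	// Inner span: covers only the downstream Consume call. The difference
	// between the two spans isolates the receiver's own processing time.
	consumeCtx, inner := tracer.Start(ctx, "receiver/ConsumeTraces")
	err := next.ConsumeTraces(consumeCtx, td)
	if err != nil {
		inner.SetStatus(codes.Error, err.Error())
		outer.SetStatus(codes.Error, err.Error())
	}
	inner.End()
	return err
}
```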

All internal telemetry emitted by a component should have attributes identifying the specific component instance that it originates from. This should follow the same conventions as the pipeline universal telemetry.

In the case of "singleton" receivers like the OTLP receiver, the component ID (eg. otlp or otlp/foo) is enough to identify the instance.

⚠️ The receiver attribute on receiverhelper metrics and the name of receiverhelper spans allow users to identify the source instance. Identification is also possible for spans and metrics generated by otelhttp, by comparing the net.host.port attribute with the config. However, this is not uniform and does not follow the pipeline telemetry conventions.

⚠️ otelgrpc spans and metrics do not give any port information, which prevents identifying the source component (in cases where no receiverhelper span is emitted). This is problematic when there are multiple gRPC endpoints configured.

➡️ To address the most pressing issue, an issue should be filed on otelgrpc to implement the server.address and server.port attributes defined in the semantic conventions.

➡️ My long-term recommendation would be to modify receiverhelper to emit attributes according to the pipeline telemetry conventions, and to inject said attributes into otelgrpc and otelhttp as well. For otelhttp, this can be done with ContextWithLabeler for metrics and WithSpanOptions for spans; for otelgrpc, with WithMetricAttributes and WithSpanAttributes.
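As a rough sketch of what that injection could look like; the attribute key otelcol.component.id is an assumption standing in for whatever the pipeline telemetry conventions settle on, and this is not the Collector's actual wiring:

```go
// Sketch of injecting component-identifying attributes into otelhttp and otelgrpc
// telemetry. The attribute key "otelcol.component.id" is illustrative only.
package receiversketch

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc"
)

// wrapHTTPHandler adds the component ID to otelhttp spans (via WithSpanOptions)
// and to otelhttp metrics (via a Labeler placed in the request context).
func wrapHTTPHandler(h http.Handler, componentID string) http.Handler {
	attrs := []attribute.KeyValue{attribute.String("otelcol.component.id", componentID)}

	instrumented := otelhttp.NewHandler(h, "otlp_receiver",
		otelhttp.WithSpanOptions(trace.WithAttributes(attrs...)),
	)

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		labeler := &otelhttp.Labeler{}
		labeler.Add(attrs...)
		ctx := otelhttp.ContextWithLabeler(r.Context(), labeler)
		instrumented.ServeHTTP(w, r.WithContext(ctx))
	})
}

// grpcServerOptions adds the component ID to otelgrpc spans and metrics.
func grpcServerOptions(componentID string) []grpc.ServerOption {
	attrs := []attribute.KeyValue{attribute.String("otelcol.component.id", componentID)}
	return []grpc.ServerOption{
		grpc.StatsHandler(otelgrpc.NewServerHandler(
			otelgrpc.WithSpanAttributes(attrs...),
			otelgrpc.WithMetricAttributes(attrs...),
		)),
	}
}
```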

HTTP path

  1. How much data the component receives.

✅ The number of HTTP requests can be observed using the count of the http.server.request.duration histogram, or the number of traces created by otelhttp. The size of the requests can also be observed using http.server.request_size.

  2. How much data is dropped because of errors.
  3. Details for error conditions.

There are 3 error modes for HTTP requests:

  • Failure to read request (data dropped because of an internal error):
    Possible reasons: invalid method, invalid content type, failure to read body, failure to unmarshal body
    HTTP 405/415/400 returned (observable using otelhttp span and metric attributes)
    Success span emitted by otelhttp
  • Downstream error (data dropped because of a downstream error):
    HTTP 500 returned
    Error spans emitted by otelhttp and receiverhelper; the latter contains the error message
    otelcol_receiver_refused_xxxx metric emitted, and also recorded by pipeline instrumentation
  • Failure to marshal response (no data dropped, indicative of a bug):
    HTTP 500 returned
    Error span emitted by otelhttp
    Success span emitted by receiverhelper

⚠️ From this breakdown, we can see that it is technically possible to differentiate all three cases, which allows counting failed requests and distinguishing downstream errors from internal errors. However, this requires correlation between multiple data points. Moreover, in case 1, the error details can only be deduced from the status code.

⚠️ For profiles, no spans are currently emitted by receiverhelper, and the downstream error messages are not recorded. This would be a blocker if profiles were not currently experimental.

➡️ A possible improvement would be:

  • Adding an otelcol_receiver_http_requests counter metric, with an outcome attribute similar to that set by pipeline instrumentation. If my amendment of the RFC (Amend Pipeline Component Telemetry RFC to add a "rejected" outcome #11956) lands, this would have three values (success, failure, rejected), to differentiate between success, internal errors, and downstream errors.
  • Logging the error messages for cases 1 and 3, either as a log record or as a span event on the otelhttp span.

I am unsure whether failures to marshal the response should count as a success or a failure, as no data is dropped. This case should be fairly rare, however, especially since the response message does not typically contain any data.
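For concreteness, here is a minimal sketch of what such a counter could look like with the OpenTelemetry Go metric API. The metric and attribute names mirror the proposal above but are not an existing API.

```go
// Sketch of the proposed otelcol_receiver_http_requests counter with an
// "outcome" attribute. Names and values follow the proposal above and the
// pipeline component telemetry RFC amendment, but nothing here is implemented.
package receiversketch

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

type httpTelemetry struct {
	requests metric.Int64Counter
}

func newHTTPTelemetry(meter metric.Meter) (*httpTelemetry, error) {
	requests, err := meter.Int64Counter(
		"otelcol_receiver_http_requests",
		metric.WithDescription("Number of HTTP requests handled by the receiver, by outcome."),
	)
	if err != nil {
		return nil, err
	}
	return &httpTelemetry{requests: requests}, nil
}

// recordRequest records one request with outcome "success", "failure"
// (internal error), or "rejected" (downstream error).
func (t *httpTelemetry) recordRequest(ctx context.Context, outcome string) {
	t.requests.Add(ctx, 1, metric.WithAttributes(attribute.String("outcome", outcome)))
}
```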

➡️ A more general solution may be to update the receiverhelper API in a way that would allow us to output errors as failed spans in all 3 cases, and to automatically output a metric with the appropriate outcome attribute. This is similar to, and may be combined with, the second suggestion in the "Processing performance" section.

gRPC path

  1. How much data the component receives.

rpc.server.duration, rpc.server.request.size, and the traces created by otelgrpc can be used in the same way as their HTTP equivalents.

  2. How much data is dropped because of errors.
  3. Details for error conditions.

There are 4 error modes for gRPC requests:

  • Basic protocol errors (wrong protocol, wrong path or content-type):
    An error is returned to the client, but no telemetry is emitted.
  • Errors in payload unmarshaling, potentially other internal gRPC errors:
    Error span emitted by otelgrpc
  • Early connection shutdown:
    Success span emitted by otelgrpc (?)
    info level log with no span id emitted
  • Downstream error:
    Error spans emitted by otelgrpc and receiverhelper; both contain the error message
    otelcol_receiver_refused_xxxx metric emitted, and also recorded by pipeline instrumentation

⚠️ Some protocol and connection-related errors are not properly observable. We could consider this data loss, but we could also consider monitoring these to be the responsibility of the client.

➡️ If we decide we want to monitor more types of errors on the Collector side, my recommendation would be to open an issue with otelgrpc to check whether this information is exposed by the "stats" system the gRPC library provides.
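For reference, this is roughly what hooking into that stats system could look like; whether protocol- and connection-level errors actually surface through this interface is exactly what would need to be confirmed with otelgrpc:

```go
// Rough sketch of a custom gRPC stats.Handler that observes RPC-level errors.
// This only demonstrates the hook point; it is not a proposed implementation.
package grpcstatssketch

import (
	"context"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/stats"
)

type errorObserver struct{}

func (errorObserver) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context { return ctx }
func (errorObserver) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}
func (errorObserver) HandleConn(context.Context, stats.ConnStats) {}

func (errorObserver) HandleRPC(_ context.Context, s stats.RPCStats) {
	// stats.End carries the final status of the RPC, including any error.
	if end, ok := s.(*stats.End); ok && end.Error != nil {
		log.Printf("gRPC RPC failed: %v", end.Error)
	}
}

// serverOptions shows how the handler would be registered on the server.
func serverOptions() []grpc.ServerOption {
	return []grpc.ServerOption{grpc.StatsHandler(errorObserver{})}
}
```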

@jade-guiton-dd (Contributor) commented Jan 13, 2025

To sum up, it seems to me the only strict blocker for stabilization according to the requirements we defined is:

However, if we want to avoid adding new telemetry or immediately deprecating telemetry post-1.0, I would suggest waiting until:

Note that the above is only true for traces/metrics/logs. The fact that receiverhelper does not currently support profiles is an additional blocker for stabilization of the component for profiles.

@mx-psi mx-psi moved this from In Progress to Blocked in Collector: v1 Jan 13, 2025
@mx-psi (Member, Author) commented Jan 13, 2025

Marking this as blocked until we get a reply on open-telemetry/opentelemetry-go-contrib/issues/6608

@mx-psi (Member, Author) commented Jan 13, 2025

Agreed on the blockers; I think we should file issues for them (maybe the otelgrpc semantic conventions one could be a simple PR?).
