Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reason dimension to exporter and receiver failure metrics #10158

Conversation

0x006EA1E5
Copy link

Description

Adds reason attribute to otelcol_exporter_send_failed_* and otelcol_receiver_refused_* metrics

Link to tracking issue

Fixes #10157

Testing

TODO

Documentation

TODO

@0x006EA1E5 0x006EA1E5 requested review from a team and djaglowski May 15, 2024 13:57
Copy link

linux-foundation-easycla bot commented May 15, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: 0x006EA1E5 / name: Greg Eales (59f17fd)

@0x006EA1E5 0x006EA1E5 force-pushed the exporter-failure-reason-attribute branch from 80426d9 to 47a2db5 Compare May 15, 2024 13:59
@0x006EA1E5 0x006EA1E5 force-pushed the exporter-failure-reason-attribute branch from 47a2db5 to 59f17fd Compare May 15, 2024 14:03
@bogdandrutu
Copy link
Member

In general things that are high cardinality like generic "errors" are not best suited for metrics, and usually they should just be recorded like logs or span attributes.

@0x006EA1E5
Copy link
Author

0x006EA1E5 commented May 16, 2024

In general things that are high cardinality like generic "errors" are not best suited for metrics, and usually they should just be recorded like logs or span attributes.

The suggestion is to use the GRPC status code, not the actual error text, i.e., https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

So the cardinality is around 17 at most.

Status code is commonly used as a metric dimension, for example for http metrics.

And typically, (in my experience of the collector), the actual number of statuses seen in responses will be much lower, so time series will not be generated for most of the possible values. Actually, in my experience, the error will normally be UNAVAILABLE, with a much lower number of UNKNOWN, DEADLINE_EXCEEDED, and RESOURCE_EXHAUSTED, so I wouldn't expect cardinality to increase by so much overall.

I'm also suggesting that we only add this dimension when the telemetry is configured as LevelDetailed, so users worried about cardinality have some control here.

@bogdandrutu
Copy link
Member

The suggestion is to use the GRPC status code, not the actual error text, i.e., https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L37

Why not accept the code then?

@0x006EA1E5
Copy link
Author

Why not accept the code then?

I don't understand. Do you mean use the numeric status code instead of the status code text?

@0x006EA1E5
Copy link
Author

I don't mind using either the numeric code, or the equivalent name, although I would think the name would be a bit more informative / easier to read.

And as the exporter could be using either GRPC or HTTP (or potentially another protocol), then the GRPC status code number may be a bit confusing.

@0x006EA1E5
Copy link
Author

Is there anything I can do to progress this?

@atoulme
Copy link
Contributor

atoulme commented Jun 5, 2024

You can join a SIG meeting, we have one for the collector in 10 minutes. It runs weekly.

@codeboten
Copy link
Contributor

There's an otep that discusses additional details around monitoring a telemetry pipeline open-telemetry/oteps#259, might be worth taking a look there as well

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jun 21, 2024
Copy link
Contributor

github-actions bot commented Jul 6, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jul 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add 'reason' attribute to otelcol_exporter_send_failed_* metrics
4 participants