Add SDK span telemetry metrics #1631

JonasKunz · 2024-11-29T10:06:39Z

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations, it seems like only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.

I'd like to start out with just span related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and traces based on the discussion results on this PR.

Prior work

This PR can be seen as a follow up to the closed OTEP 259:

The OTEP originally superseded Add processed/exported Span metrics. #184, which initially focused only on SDK exporter metrics.
Add processed/exported Span metrics. #184 was closed in favor of the predecessor of OTEP 259 to instead allow monitoring of entire "pipelines" using unified metrics across SDK exporters and collector components : A pipeline starts with an SDK exporter and goes through the processing of collector(s)
Finally, that OTEP was closed in favor of this collector RFC which adds metrics just for the collector components.

So we kind of have gone full circle: The discussion started with just SDK metrics (only for exporters), going to an approach to unify the metrics across SDK-exporters and collector, which then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

There have been concerns about both using the same metrics for both: How do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an Otel-SDK used in the collector for e.g. tracing the collector itself?
Though many concepts between the collector and SDK share the same name, they are not the same thing (to my knowledge, I'm not a collector expert): For example processors in the collector are designed to form pipelines potentially mutating the data as it passes through. In contrast, SDK span processor don't form pipelines (at least not visible to the SDK, those would be hidden custom implementations). Instead SDK span processors are merely observers with multiple callbacks for the span lifecycle. So it would feel like "shoehorning" things into the same metric, even though they are not the same concepts.
Separating collector and SDK metrics makes their evolution and reaching agreements a lot easier: When using separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined just using the SDK spec. If combine both in shared metrics, those will have to be always be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
I have a hard time finding benefits of sharing metrics for SDK and collector: The main benefit I find would of course be easier dashboarding / analysis. However, I do think having to look at two sets of metrics to do so is a fine tradeoff, considering the difficulties with the unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

Gauge queueSize, value is the current size of the queue
- Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
- This metric currently causes collisions if two BatchSpanProcessor instances are used
Counter processedSpans, value is the number of spans submitted to the Processor
- Attribute spanProcessorType=BatchSpanProcessor
- Attribute dropped (boolean), true for the number of spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the BatchLogRecordProcessor just span replaced everywhere with log

Exporter metrics

Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a type attribute.
Also the metric names are dependent on a "name" and "transport" defined by the exporter. For OTLP those are:

exporterName=otlp
transport is one of grpc, http (= protobuf) or http-json

The transport is used just for the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport>

Based on that, the following metrics are exposed:

Counter <exporterName>.exporter.seen: The number of records (spans, metrics or logs) submitted to the exporter
- Attribute type: one of span, metric or log
Counter <exporterName>.exporter.exported: The number of records (spans, metrics or logs) actually exported (or failed)
- Attribute type: one of span, metric or log
- Attribute success (boolean): false for exporter failures

Merge requirement checklist

CONTRIBUTING.md guidelines followed.
Change log entry added, according to the guidelines in When to add a changelog entry.
- If your PR does not need a change log, start the PR title with [chore]
schema-next.yaml updated with changes to existing conventions.

model/telemetry/metrics.yaml

model/telemetry/registry.yaml

model/telemetry/metrics.yaml

lmolkova · 2024-12-03T00:55:56Z

Related #1580

model/telemetry/registry.yaml

JonasKunz · 2025-01-14T12:08:08Z

Did you consider making processor/exporter metrics generic to the signal type (work for logs, metrics, or spans)?

@dashpole I considered it. For processors I discarded the idea, because log / metrics / span processors work quite differently and have different specifications which can evolve independently, so I'd see it rather as a risk for the future evolution of the metrics to couple them together.

For the exporters I'm less opinionated, because their tasks/processing aren't really that dependent on the type of telemetry they handle. Here I could see using a single metric with an attribute to identify the type of telemetry e.g. signal_type=spans/logs/metrics.

Upsides of using a single exporter metric for all signal types:

Easier dashboarding, e.g. no need to sum across metrics to get an "all data exported" counter
Trivial to add new signal types (e.g. profiling) IFF they match the current metric design

Downsides:

It kind of prohibits later addition of signal-specific attributes to the metrics, which allow deeper analysis if desired (e.g. number of exported spans by span type). These attributes would then be always empty for other signal types, which to me is a symptom of incorrect metric design.
If a new signal type is added which doesn't match the current metric design, things get ugly

For me the upside was just not big enough and therefore I decided to propose to go with the slightly more verbose, but less risky route. The number of signal types isn't going to explode, so having to duplicate some dashboarding work for signals is okay IMO.

But this isn't a strong opinion on my side, if you feel different I'm happy to change it.

model/otel/registry.yaml

dashpole

Super excited for this!

docs/otel/sdk-metrics.md

dashpole · 2025-01-23T03:33:37Z

model/otel/registry.yaml

+        brief: >
+          A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
+        note: |
+          The attribute value MUST follow a `<type-name>/<instance-counter>` pattern, e.g. `batching_span_processor/0`.


I just implemented this in open-telemetry/opentelemetry-go#6153. I would personally prefer if these were separated into separate attributes for the type and the instance

I dropped the component typename as attribute per @jmacd remarks in this comment.

I'm happy to reintroduce it if you see any relevant use-case for it, like @jmacd I couldn't come up with any.

However, I would in that case just add the typename as additional attribute and keep the otel.sdk.component.name as is, because I feel having just a numeric ID is not very intuitive when inspecting attribute values.

I generally find it harder to use attributes that are joined in that way. E.g. querying for type=foo is simpler than querying for type=foo/.*. It can also be difficult to group by the type if it is joined with an ID. In PromQL, I would need to use label_replace() with a regex extract the prefix in order to group by it.

Generally, I don't expect the ID to be useful for users except as a way to distinguish different instantiations of the exporter/processor/sdk. Given these components don't have a user-provided "name" or "Id" like the collector components do, it seems like it will generally be a UUID, or pointer, and won't be easy for users to consume.

it will generally be a UUID, or pointer

It must not be of that kind, because that will blow up the cardinatlity considering application restarts. That's specifically why I prescribed the isntance-counter approach in the current PR state.

I generally find it harder to use attributes that are joined in that way. E.g. querying for type=foo is simpler than querying for type=foo/.*.

I understand. Then I would suggest to

Add otel.sdk.component.type

Keep otel.sdk.component.name but be less presriptive:

It MUST be a low-cardinality, unique within SDK instance identifier for the component.

It MAY be generated based on the <type-name>/<instance-counter> that is enforced in the current state of the PR

WDYT? Also cc @jmacd

Added otel.sdk.component.type in e737b7d.

Note that I've also changed the values for exporter to be specific to the kind of data: E.g. otlp_http_exporter changed to otlp_http_span_exporter. The reason is that the value needs to be unique for the component type, but exporters for logs, spans and metrics are different components, so otlp_http_exporter would have caused an overlap here.

That doesn't seem necessary to me. The metrics names are already split by signal type (i.e. contains "span"), so there is no need to split the exporter name by signal type.

JonasKunz added 3 commits November 29, 2024 11:03

Added SDK span telemetry metrics

8f2b666

Fix formatting

8b2a1db

Fix yamllint

8bbea82

JonasKunz force-pushed the sdk-telemetry branch from 04f924f to 8bbea82 Compare November 29, 2024 10:26

JonasKunz added 2 commits November 29, 2024 11:26

Merge remote-tracking branch 'otel/main' into sdk-telemetry

e15696f

Changelog

cef63f2

JonasKunz commented Nov 29, 2024

View reviewed changes

model/telemetry/metrics.yaml Outdated Show resolved Hide resolved

JonasKunz marked this pull request as ready for review November 29, 2024 10:40

JonasKunz requested review from a team as code owners November 29, 2024 10:40