Add SDK span telemetry metrics #1631
Related #1580
@dashpole I considered it. For processors I discarded the idea, because log / metrics / span processors work quite differently and have different specifications which can evolve independently, so I'd see it rather as a risk for the future evolution of the metrics to couple them together. For the exporters I'm less opinionated, because their tasks/processing aren't really that dependent on the type of telemetry they handle. Here I could see using a single metric with an attribute to identify the type of telemetry e.g. Upsides of using a single exporter metric for all signal types:
Downsides:
For me the upside was just not big enough, so I decided to propose the slightly more verbose but less risky route. The number of signal types isn't going to explode, so having to duplicate some dashboarding work per signal is okay IMO. But this isn't a strong opinion on my side; if you feel differently, I'm happy to change it.
Super excited for this!
model/otel/registry.yaml
brief: >
  A name uniquely identifying the instance of the OpenTelemetry SDK component within its containing SDK instance.
note: |
  The attribute value MUST follow a `<type-name>/<instance-counter>` pattern, e.g. `batching_span_processor/0`.
I just implemented this in open-telemetry/opentelemetry-go#6153. I would personally prefer if these were separated into separate attributes for the type and the instance
I dropped the component type name as an attribute per @jmacd's remarks in this comment.
I'm happy to reintroduce it if you see any relevant use case for it; like @jmacd, I couldn't come up with any.
However, in that case I would just add the type name as an additional attribute and keep `otel.sdk.component.name` as is, because I feel having just a numeric ID is not very intuitive when inspecting attribute values.
I generally find it harder to use attributes that are joined in that way. E.g. querying for `type=foo` is simpler than querying for `type=foo/.*`. It can also be difficult to group by the type if it is joined with an ID. In PromQL, I would need to use `label_replace()` with a regex to extract the prefix in order to group by it.
Generally, I don't expect the ID to be useful for users except as a way to distinguish different instantiations of the exporter/processor/sdk. Given these components don't have a user-provided "name" or "Id" like the collector components do, it seems like it will generally be a UUID, or pointer, and won't be easy for users to consume.
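To illustrate the querying concern, here is a minimal Python sketch (not PromQL) that sums hypothetical series by component type. The sample values and the attribute keys `name` and `type` are made up for illustration. With a joined name, the type has to be recovered with a regex, which is roughly the role `label_replace()` would play in PromQL; with a separate type attribute, the grouping key is read directly.

```python
import re
from collections import defaultdict

# Hypothetical series as (attributes, value) pairs; all values are illustrative.
JOINED = [
    ({"name": "batching_span_processor/0"}, 10),
    ({"name": "batching_span_processor/1"}, 5),
    ({"name": "simple_span_processor/0"}, 2),
]
SEPARATE = [
    ({"type": "batching_span_processor", "name": "batching_span_processor/0"}, 10),
    ({"type": "batching_span_processor", "name": "batching_span_processor/1"}, 5),
    ({"type": "simple_span_processor", "name": "simple_span_processor/0"}, 2),
]


def sum_by_type_joined(series):
    """Group by type when only the joined `<type>/<counter>` name is available."""
    totals = defaultdict(int)
    for attrs, value in series:
        # The type prefix must be extracted with a regex before grouping.
        match = re.match(r"^(.*)/\d+$", attrs["name"])
        key = match.group(1) if match else attrs["name"]
        totals[key] += value
    return dict(totals)


def sum_by_type_separate(series):
    """Group by type when the type is its own attribute."""
    totals = defaultdict(int)
    for attrs, value in series:
        totals[attrs["type"]] += value
    return dict(totals)
```

Both functions return the same totals for the sample data; the difference is only in how much work the query side has to do.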
> it will generally be a UUID, or pointer

It must not be of that kind, because that would blow up the cardinality once application restarts are taken into account. That's specifically why I prescribed the instance-counter approach in the current PR state.

> I generally find it harder to use attributes that are joined in that way. E.g. querying for type=foo is simpler than querying for type=foo/.*.

I understand. Then I would suggest to
- Add `otel.sdk.component.type`
- Keep `otel.sdk.component.name` but be less prescriptive:
  - It MUST be a low-cardinality identifier for the component that is unique within the SDK instance.
  - It MAY be generated based on the `<type-name>/<instance-counter>` pattern that is enforced in the current state of the PR

WDYT? Also cc @jmacd
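A minimal sketch of the instance-counter idea, assuming one counter per component type inside each SDK instance. `ComponentNamer` and the helper functions are hypothetical names for illustration, not part of any SDK, and the sketch ignores thread safety. It also contrasts the counter scheme with UUID-based names by simulating restarts: a fresh namer stands in for a fresh process.

```python
import uuid
from collections import defaultdict


class ComponentNamer:
    """Hands out `<type-name>/<instance-counter>` names within one SDK instance."""

    def __init__(self):
        # Next counter value per component type; starts at 0 for each type.
        self._next_index = defaultdict(int)

    def next_name(self, type_name):
        index = self._next_index[type_name]
        self._next_index[type_name] += 1
        return f"{type_name}/{index}"


def counter_name():
    # A new namer per call simulates a new process after a restart.
    return ComponentNamer().next_name("batching_span_processor")


def uuid_name():
    # Alternative scheme: a random ID per instantiation.
    return f"batching_span_processor/{uuid.uuid4()}"


def names_seen_across_restarts(restarts, make_name):
    """Collect the distinct attribute values produced over several restarts."""
    return {make_name() for _ in range(restarts)}
```

Under these assumptions, 100 simulated restarts yield a single distinct name with the counter scheme and 100 distinct names with UUIDs, which is the cardinality concern raised above.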
Added `otel.sdk.component.type` in e737b7d.
Note that I've also changed the values for exporters to be specific to the kind of data: e.g. `otlp_http_exporter` changed to `otlp_http_span_exporter`. The reason is that the value needs to be unique for the component type, but exporters for logs, spans and metrics are different components, so `otlp_http_exporter` would have caused an overlap here.
That doesn't seem necessary to me. The metrics names are already split by signal type (i.e. contains "span"), so there is no need to split the exporter name by signal type.
Changes
With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.
We checked the SDK implementations; it seems that only the Java SDK currently has some health metrics implemented.
This PR took some inspiration from those and is intended to improve and therefore supersede them.
I'd like to start out with just span-related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and metrics based on the discussion results on this PR.
Prior work
This PR can be seen as a follow up to the closed OTEP 259:
So we kind of have gone full circle: The discussion started with just SDK metrics (only for exporters), going to an approach to unify the metrics across SDK-exporters and collector, which then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).
In my opinion, it is a good thing to separate the collector and SDK self-metrics:
Existing Metrics in Java SDK
For reference, here is what the existing health metrics currently look like in the Java SDK:
Batch Span Processor metrics
- `queueSize`, value is the current size of the queue
  - `spanProcessorType`=`BatchSpanProcessor` (there was a former `ExecutorServiceSpanProcessor` which has been removed)
  - [...] `BatchSpanProcessor` instances are used
- `processedSpans`, value is the number of spans submitted to the Processor
  - `spanProcessorType`=`BatchSpanProcessor`
  - `dropped` (`boolean`), `true` for the number of spans which could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the `BatchLogRecordProcessor`, just with `span` replaced everywhere with `log`.
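As a rough illustration of how `queueSize` and `processedSpans` with the `dropped` attribute relate to each other, here is a toy bounded queue in Python. It is not the Java SDK implementation: the class and method names are hypothetical, and it assumes a span is counted at the moment it is handed to the processor, as the description above states. The real SDK may record the non-dropped count at a different point.

```python
from collections import Counter, deque


class BoundedSpanQueue:
    """Toy stand-in for a batch span processor's queue (illustrative only)."""

    def __init__(self, max_queue_size):
        self.max_queue_size = max_queue_size
        self._queue = deque()
        # processedSpans, keyed by the value of the `dropped` attribute.
        self.processed_spans = Counter()

    def on_end(self, span):
        """Accept a finished span, or drop it if the queue is full."""
        if len(self._queue) >= self.max_queue_size:
            self.processed_spans[True] += 1  # dropped=true
            return
        self._queue.append(span)
        self.processed_spans[False] += 1  # dropped=false

    def queue_size(self):
        """Value a `queueSize` gauge would report right now."""
        return len(self._queue)

    def export_batch(self, max_batch_size):
        """Remove and return up to max_batch_size spans, oldest first."""
        batch = []
        while self._queue and len(batch) < max_batch_size:
            batch.append(self._queue.popleft())
        return batch
```

With a queue size of 2 and three submitted spans, the third is counted under `dropped=true` while `queueSize` stays at 2 until a batch is exported.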
Exporter metrics
Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a `type` attribute.
Also the metric names are dependent on a "name" and "transport" defined by the exporter. For OTLP those are:
- `exporterName`=`otlp`
- `transport` is one of `grpc`, `http` (= protobuf) or `http-json`

The transport is used just for the instrumentation scope name: `io.opentelemetry.exporters.<exporterName>-<transport>`
Based on that, the following metrics are exposed:
- Counter `<exporterName>.exporter.seen`: The number of records (spans, metrics or logs) submitted to the exporter
  - `type`: one of `span`, `metric` or `log`
- Counter `<exporterName>.exporter.exported`: The number of records (spans, metrics or logs) actually exported (or failed)
  - `type`: one of `span`, `metric` or `log`
  - `success` (boolean): `false` for exporter failures

Merge requirement checklist
[chore]