WIP: pipeline monitoring otep #259
- `otelcol_outgoing_items`: Exported, dropped, and discarded items (Collector)
- `otelcol_incoming_items`: Received and inserted data items (Collector)
- `otelsdk_outgoing_items`: Exported, dropped, and discarded items (SDK)
I'm pretty agnostic as to whether these should use periods or underscores. I kept underscores for now but please let me know if I should change them.
@TylerHelmuth, you left feedback on Josh's PR about this. Please let me know your preference.

### Retries

*WIP: add details*
I think it would be nice if there was some kind of conservation property we could maintain in the presence of retries, where we have N errors and need to report a single success/failure status. It seems to me that something similar will also apply to the fanout-consumer logic.
Should we have a separate metric for the extra fanout factor, which is `N-1` in both of these cases? This number will be needed somewhere to understand pipeline conservation through fanout and retries, I think.
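To make the `N-1` idea concrete, here is a minimal sketch of the bookkeeping, assuming each attempt (a retry, or a delivery to one fanout branch) is counted separately from the single final status. The `attemptTally` type and its methods are hypothetical names for illustration and are not part of the OTEP or the Collector.

```go
package main

import "fmt"

// attemptTally is a hypothetical counter pair, not an OTEP-defined metric.
// attempts counts every try; finals counts the single status reported per item.
type attemptTally struct {
	attempts int
	finals   int
}

// Attempt records one try (a retry or one fanout branch).
func (a *attemptTally) Attempt() { a.attempts++ }

// Final records the single success/failure status reported upstream.
func (a *attemptTally) Final() { a.finals++ }

// Extra is the "extra factor": attempts that did not produce their own
// reported status. For one item with N attempts this is N-1.
func (a attemptTally) Extra() int { return a.attempts - a.finals }

func main() {
	var tally attemptTally
	// One item: two failed attempts followed by a final attempt.
	for i := 0; i < 3; i++ {
		tally.Attempt()
	}
	tally.Final()
	fmt.Println(tally.attempts, tally.finals, tally.Extra()) // 3 1 2
}
```

With a counter like `Extra` exported alongside the item counts, the conservation check becomes attempts == finals + extra, for retries and for fanout alike.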

### Recommended conventional attributes

- `otel.error` (boolean): This is true or false depending on whether the
@TylerHelmuth, you recommended adding `pipeline` to the `otel.` prefix here (i.e. `otel.pipeline.error`). Do you mean for all of the attributes below? I'm not sure how these are all attributes of an otel pipeline.
- `otel.error` (boolean): This is true or false depending on whether the
  outcome is considered a failure or a success. See the chart below.
- `otel.outcome` (string): This describes the outcome in a more specific
  way than `otel.error`.
@kristinapathak As mentioned at the SIG, I am interested in adding a similar attribute to the `otelcol_exporter_send_failed_*` metrics as part of open-telemetry/opentelemetry-collector#10158. I would be happy to use `outcome` if that is thought best.
My one concern is that I see that attribute is used on another metric we use in our org, specifically the Micrometer-generated `http.server.requests` metric, see here for the ENUM values. I see that this is not defined for the http semconv, but I just thought it worth noting here for reference.
I can see how `outcome` will be used by a variety of metrics. My hope is that with the `otel` prefix (i.e. `otel.outcome`) there is no conflict with the attribute name.
Do the outcome values defined here work to solve open-telemetry/opentelemetry-collector#10157 or is more detail needed? If others are also in favor of this attribute, my hope is that you can update your PR to match this. 🙂
Overall this proposal strikes me as more intuitive than previous iterations but I still have a few questions regarding collector pipelines.
the OpenTelemetry Collector as Collector pipelines. A Collector can contain
multiple Collector pipelines which can contain multiple segments. Each segment
> A Collector can contain multiple Collector pipelines which can contain multiple segments.
I'm unclear what this is saying. Is it saying that a single Collector pipeline contains multiple segments?
If so, how are those segments defined? For example, in the following pipeline, what are the segments?
receivers: [r1, r2]
processors: [p1, p2]
exporters: [e1, e2]
The segments would be defined as:
r1, r2, p1, p2, e1
r1, r2, p1, p2, e2
If I understand correctly, each collector pipeline contains a segment per exporter, where each of these segments contains all the receivers and all the processors of the pipeline, in addition to the exporter.
Can we update the language in this section to state this more clearly? Currently, it reads as "a receiver, zero or more processors, and an exporter" which doesn't appear accurate.
Can we also state "Components can be a part of multiple segments" before describing the relationship between segments and collector pipelines, since it's a prerequisite to understanding?
exporters. If the pipeline is synchronous, the outcome for the incoming item is
recorded based on the rules in the below order:

1. If there is a permanent error, that is used as the outcome. If there are
   multiple permanent errors, choose them in the following order:
   `rejected`, `deferred:rejected`, `unknown`, `deferred:unknown`.
2. If there is a transient error, that is used as the outcome. If there are
   multiple transient errors, choose them in the following order:
   `dropped`, `deferred:dropped`, `timeout`, `deferred:timeout`, `exhausted`,
   `deferred:exhausted`, `retryable`, `deferred:retryable`.
Since these rules are described as applying to synchronous pipelines, should they include deferred outcomes?
Oops that's a good point! Deferred doesn't belong here
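With the deferred outcomes removed as agreed above, the two quoted rules could be sketched roughly as follows. This only covers the permanent and transient rules shown in the excerpt; what to record when no exporter reports an error is outside the excerpt, so the sketch just returns `false` in that case. `pickOutcome` and the two order slices are hypothetical names.

```go
package main

import "fmt"

// Priority orders from the quoted rules, minus the deferred:* entries.
var (
	permanentOrder = []string{"rejected", "unknown"}
	transientOrder = []string{"dropped", "timeout", "exhausted", "retryable"}
)

// pickOutcome returns the highest-priority error outcome among the
// per-exporter outcomes. Permanent errors win over transient ones.
// It returns false when none of the listed error outcomes is present.
func pickOutcome(outcomes []string) (string, bool) {
	seen := make(map[string]bool, len(outcomes))
	for _, o := range outcomes {
		seen[o] = true
	}
	for _, order := range [][]string{permanentOrder, transientOrder} {
		for _, o := range order {
			if seen[o] {
				return o, true
			}
		}
	}
	return "", false
}

func main() {
	fmt.Println(pickOutcome([]string{"timeout", "rejected", "dropped"})) // rejected true
	fmt.Println(pickOutcome([]string{"retryable", "timeout"}))           // timeout true
}
```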

Additional examples of these outcomes can be found in the Appendix.

### Collector Pipelines With Multiple Exporter Components
There are some other arrangements of components which I'm trying to fit to the proposed model. Some of them may be worth including in this document as well. Here's how I'm understanding them:
- A single collector pipeline with multiple exporters. As already noted, this includes a fanout point. The document describes how to aggregate outcomes from multiple exporters, effectively ensuring that `Incoming(Segment) == Outgoing(Segment)`.
- A single collector pipeline with multiple receivers. This is fairly straightforward and common. The incoming items are just summed together. Probably doesn't require a dedicated section but we could include it for symmetry.
- A single receiver shared by multiple collector pipelines. This is the other type of fanout point in the collector. From here there are some similar considerations to the first case, but instead of fanning out to multiple exporters, we instead fanout to entire pipelines. What is the relationship between an item arriving at such a receiver and the outcomes of passing it to multiple pipelines? For example, one pipeline may successfully export the item while the other encounters an error which propagates back to the receiver. Does this count as "origin:received" for both pipelines and then each pipeline show a different outcome?
- A single exporter shared by multiple collector pipelines. This is the other type of merge point in the collector. In this case, I think synchronous outcomes can resolve back to a specific pipeline, but async outcomes may not be related to any specific pipeline. For example, if 2 pipelines each send 10 items to an exporter, which then batches all 20 items together into a single export request, the outcome may be "20 deferred:rejected", but it would be incorrect for either pipeline to incorporate that count directly. Is there any way to handle this? Otherwise, maybe this is just a caveat for interpreting deferred outcomes.
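For reference, the conservation property from the first bullet, `Incoming(Segment) == Outgoing(Segment)`, amounts to a check like the one below. This is only a sketch: `conserved` is a hypothetical helper, the counts are made up, and it assumes outgoing items are broken down by outcome. It does not address the shared-exporter caveat, where deferred counts cannot be attributed to a single pipeline.

```go
package main

import "fmt"

// conserved reports whether every incoming item of a segment is accounted
// for by exactly one outgoing outcome, i.e.
// Incoming(Segment) == Outgoing(Segment).
func conserved(incoming int64, outgoingByOutcome map[string]int64) bool {
	var total int64
	for _, n := range outgoingByOutcome {
		total += n
	}
	return incoming == total
}

func main() {
	// Illustrative counts only.
	outgoing := map[string]int64{"rejected": 12, "dropped": 5, "timeout": 3}
	fmt.Println(conserved(20, outgoing)) // true: all 20 items accounted for
	fmt.Println(conserved(21, outgoing)) // false: one item unaccounted for
}
```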
I wonder about adding to alternatives considered:
The collector's logic for `fanoutconsumer` uses `multierr`, building on Go's `Unwrap() []error` idiom to enclose all the errors. In my opinion, it would be nice to see `fanoutconsumer` manage the logic of deciding how to transform multiple errors into a single error, so that we could specify that `fanoutconsumer` dictates the N-to-1 problem and the observability mechanism just follows whatever it decides.
This looks great, thank you @kristinapathak.
I think I would approve it if all the "WIP" sections were fleshed out, but they're minor, and I think this document stands in sufficient detail for an implementation to be prototyped. Probably the next step is to prototype these metrics in the collector and an SDK.
*WIP: Figure this out. This is a bit subjective. What does an end user expect
when calculating total items dropped in failure?*
To me, the idea of "total" signifies starting from the beginning of the pipeline at the SDKs, meaning to use a sum of all the SDK-inserted items and compare against some point later in the pipeline to measure how many original items are lost somehow. This could mean looking at a gateway collector's exporter counts and comparing to the SDK-inserted counts, for example.
…ms (#11144) This updates the existing metric points that were recently added to use signal as an attribute instead of separating the metric name. It follows the suggestions in [otep 259](open-telemetry/oteps#259) for the metric and attribute names. Putting this in draft to get some feedback from @djaglowski before moving forward with this change. Signed-off-by: Alex Boten <[email protected]>
Apologies to @kristinapathak, I think we should close this. See open-telemetry/opentelemetry-collector#11311
@jmacd those only cover the collector, right? Is the plan that a separate PR that focuses on the SDK metrics only be opened?
A continuation of @jmacd's work on #238 and #249
The goal of this OTEP is to define a semantic convention for metrics that provide information on the flow of data through a pipeline, providing insights both between and within segments of telemetry pipelines.
WIP.
Main focuses currently: