diff --git a/oteps/0000-template.md b/oteps/0000-template.md new file mode 100644 index 00000000000..131f23cf52c --- /dev/null +++ b/oteps/0000-template.md @@ -0,0 +1,41 @@ +# Replace this with your awesome OTEP title + +Short (one sentence) summary, e.g., something that would be appropriate for a [CHANGELOG](https://keepachangelog.com/) or release notes. + +## Motivation + +Why should we make this change? What new value would it bring? What use cases does it enable? + +## Explanation + +Explain the proposed change as though it was already implemented and you were explaining it to a user. Depending on which layer the proposal addresses, the "user" may vary, or there may even be multiple. + +We encourage you to use examples, diagrams, or whatever else makes the most sense! + +## Internal details + +From a technical perspective, how do you propose accomplishing the proposal? In particular, please explain: + +* How the change would impact and interact with existing functionality +* Likely error modes (and how to handle them) +* Corner cases (and how to handle them) + +While you do not need to prescribe a particular implementation - indeed, OTEPs should be about **behaviour**, not implementation! - it may be useful to provide at least one suggestion as to how the proposal *could* be implemented. This helps reassure reviewers that implementation is at least possible, and often helps them inspire them to think more deeply about trade-offs, alternatives, etc. + +## Trade-offs and mitigations + +What are some (known!) drawbacks? What are some ways that they might be mitigated? + +Note that mitigations do not need to be complete *solutions*, and that they do not need to be accomplished directly through your proposal. A suggested mitigation may even warrant its own OTEP! + +## Prior art and alternatives + +What are some prior and/or alternative approaches? For instance, is there a corresponding feature in OpenTracing or OpenCensus? What are some ideas that you have rejected? + +## Open questions + +What are some questions that you know aren't resolved yet by the OTEP? These may be questions that could be answered through further discussion, implementation experiments, or anything else that the future may bring. + +## Future possibilities + +What are some future changes that this proposal would enable? diff --git a/oteps/0001-telemetry-without-manual-instrumentation.md b/oteps/0001-telemetry-without-manual-instrumentation.md new file mode 100644 index 00000000000..8e4d2d9b1bc --- /dev/null +++ b/oteps/0001-telemetry-without-manual-instrumentation.md @@ -0,0 +1,113 @@ +# (Open) Telemetry Without Manual Instrumentation + +_Cross-language requirements for automated approaches to extracting portable telemetry data with zero source code modification._ + +## Motivation + +The purpose of OpenTelemetry is to make robust, portable telemetry a built-in feature of cloud-native software. For some software and some situations, that instrumentation can literally be part of the source code. In other situations, it’s not so simple: for example, we can’t necessarily edit or even recompile some of our software, the OpenTelemetry instrumentation only exists as a plugin, or instrumentation just never rises to the top of the priority list for a service-owner. Furthermore, there is occasionally a desire to disable instrumentation for a single plugin or module at runtime, again without requiring developers to make changes to source code. 
+ +One way to navigate situations like these is with a software layer that adds OpenTelemetry instrumentation to a service without modifying the source code for that service. (In the conventional APM world, these software layers are often called “agents”, though that term is overloaded and ambiguous so we try avoid it in this document.) + +### Why “cross-language”? + +Many people have correctly observed that “agent” design is highly language-dependent. This is certainly true, but there are still higher-level “product” objectives for OpenTelemetry that can guide the design choices we make across languages and help users form a consistent impression of what OpenTelemetry provides (and what it does not). + +### Suggested reading + +* This GitHub issue: [Propose an "Auto-Instrumentation SIG"](https://github.com/open-telemetry/community/pull/87) +* [Rough notes from the June 11, 2019 meeting](https://docs.google.com/document/d/1ix0WtzB5j-DRj1VQQxraoqeUuvgvfhA6Sd8mF5WLNeY/edit) following this ^^ issue +* The [rough draft for this RFC](https://docs.google.com/document/d/1sovSQIGdxXtsauxUNp4qUMEIJZzObdukzPT52eyPCHM/edit#), including the comments + +## Proposed guidelines + +### Requirements + +Without further ado, here are a set of requirements for “official” OpenTelemetry efforts to accomplish zero-source-code-modification instrumentation (i.e., “OpenTelemetry agents”) in any given language: + +* _Manual_ source code modifications "very strongly discouraged", with an exception for languages or environments that leave no credible alternatives. Any code changes must be trivial and `O(1)` per source file (rather than per-function, etc). +* Licensing must be permissive (e.g., ASL / BSD) +* Packaging must allow vendors to “wrap” or repackage the portable (OpenTelemetry) library into a single asset that’s delivered to customers + * That is, vendors do not want to require users to comprehend both an OpenTelemetry package and a vendor-specific package +* Explicit, whitebox OpenTelemetry instrumentation must interoperate with the “automatic” / zero-source-code-modification / blackbox instrumentation. 
+ * If the blackbox instrumentation starts a Span, whitebox instrumentation must be able to discover it as the active Span (and vice versa) + * Relatedly, there also must be a way to discover and avoid potential conflicts/overlap/redundancy between explicit whitebox instrumentation and blackbox instrumentation of the same libraries/packages + * That is, if a developer has already added the “official” OpenTelemetry plugin for, say, gRPC, then when the blackbox instrumentation effort adds gRPC support, it should _not_ “double-instrument” it and create a mess of extra spans/etc +* From the standpoint of the actual telemetry being gathered, the same standards and expectations (about tagging, metadata, and so on) apply to "whitebox" instrumentation and automatic instrumentation +* The code in the OpenTelemetry package must not take a hard dependency on any particular vendor/vendors (that sort of functionality should work via a plugin or registry mechanism) + * Further, the code in the OpenTelemetry package must be isolated to avoid possible conflicts with the host application (e.g., shading in Java, etc) + +### Nice-to-have properties + +* Run-time integration (vs compile-time integration) +* Automated and modular testing of individual library/package plugins + * Note that this also makes it easy to test against multiple different versions of any given library +* A fully pluggable architecture, where plugins can be registered at runtime without requiring changes to the central repo at github.com/open-telemetry + * E.g., for ops teams that want to write a plugin for a proprietary piece of legacy software they are unable to recompile +* Augemntation of whitebox instrumentation by blackbox instrumentation (or, perhaps, vice versa). That is, not only can the trace context be shared by these different flavors of instrumentation, but even things like in-flight Span objects can be shared and co-modified (e.g., to use runtime interposition to grab local variables and attach them to a manually-instrumented span). + +## Trade-offs and mitigations + +Approaching a problem this language-specific at the cross-language altitude is intrinsically challenging since "different languages are different" – e.g., in Go there is no way to perform the kind of runtime interpositioning that's possible in Python, Ruby, or even Java. + +There is also a school of thought that we should only be focusing on the bits and bytes that actually escape the running process and ignore how that's actually accomplished. This has a certain elegance to it, but it also runs afoul of the need to have manual instrumentation interoperate with the zero-touch instrumentation, especially when it comes to the (shared) distributed context itself. + +## Proposal + +### What is our desired end state for OpenTelemetry end-users? + +To reiterate much of the above: + +* First and foremost, **portable OpenTelemetry instrumentation can be installed without manual source code modification** +* There’s one “clear winner” when it comes to portable, automatic instrumentation; just like with OpenTracing and OpenCensus, this is a situation where choice is not necessarily a good thing. End-users who wish to contribute instrumentation plugins should not have their enthusiasm and generosity diluted across competing projects. +* As much as such a thing is possible, consistency across languages +* Broad coverage / “plugin support” +* Broad vendor support for OpenTelemetry +* All other things being equal, get all of these ^^ benefits ASAP! 
+ +### What's the basic proposal? + +Given the desired end state, the Datadog tracers seem like the closest-fit, permissively-licensed option out there today. We asked Datadog's leadership whether they would be interested in donating that code to OpenTelemetry, and they were receptive to the idea. (I.e., this would not be a "hard fork" that must be maintained in parallel forever) + +### The overarching (technical) process, per-language + +* Start with [the Datadog `dd-trace-foo` tracers](https://github.com/DataDog) +* For each language: + * Fork the Datadog `datadog/dd-trace-foo` repo into a `open-telemetry/auto-instr-foo` OpenTelemetry repo (exact naming TBD) + * In parallel: + * The `dd-trace-foo` codebases already do a good job separating Datadog-specific functionality from general-purpose functionality. Where needed, make that boundary even more explicit through an API (or "SPI", really). + * Create a new `dd-trace-foo` lib that wraps `auto-instr-foo` and includes the Datadog-specific pieces factored out above + * Show that it’s also possible to bind to arbitrary OpenTelemetry-based tracers to the above API/SPI + * Declare the forked `auto-instr-foo` repository ready for production beta use + * For some (ideally brief) period: + * When new plugins are added to Datadog's (original) repo, merge them over into the `auto-instr-foo` repo + * Allow Datadog end-users to bind to either for some period of time (ultimately Datadog's decision on timeline here, and does not prevent other tracers from using `auto-instr-foo`) + * Finally, when the combination of `auto-instr-foo` and a Datadog wrapper is functionally equivalent to the `dd-trace-foo` mainline, the latter can be safely replaced by the former. + * Note that, by design, this is not expected to affect Datadog end-users + * Moved repo is GA’d: all new plugins (and improvements to the auto-instrumentation core) happen in the `auto-instr-foo` repo + +There are some languages that will have OpenTelemetry support before they have Datadog `dd-trace-foo` support. In those situations, we will fall back to the requirements in this OTEP and leave the technical determinations up to the language SIG and the OpenTelemetry TC. + +### Governance of the auto-instrumentation libraries + +Each `auto-instr-foo` repository must have at least one [Maintainer](https://github.com/open-telemetry/community/blob/master/community-membership.md#maintainer) in common with the main `opentelemetry-foo` language repository. There are no other requirements or constraints about the set of maintainers/approvers for the main language repository and the respective auto-instrumentation repository; in particular, there may be maintainers/approvers of the main language repository that are not maintainers/approvers for the auto-instrumentation repository, and vice versa. + +### Mini-FAQ about this proposal + +**Will this be the only auto-instrumentation story for OpenTelemetry?** It need not be. The auto-instrumentation libraries described above will have no privileged access to OpenTelemetry APIs, and as such they have no exclusive advantage over any other auto-instrumentation libraries. + +**What about auto-instrumenting _Project X_? Why aren't we using that instead??** First of all, there's nothing preventing any of us from taking great ideas from _Project X_ and incorporating them into these auto-instrumentation libraries. We propose that we start with the Datadog codebases and iterate from there as need be. 
If there are improvements to be made in any given language, they will be welcomed by all. + +## Prior art and alternatives + +There are many proprietary APM language agents – no need to survey them all here. There is a much smaller list of "APM agents" (or other auto-instrumentation efforts) that are already permissively-licensed OSS. For instance, when we met to discuss options for JVM (longer notes [here](https://docs.google.com/document/d/1ix0WtzB5j-DRj1VQQxraoqeUuvgvfhA6Sd8mF5WLNeY/edit#heading=h.kjctiyv4rxup)), we came away with the following list: + +* [Honeycomb's Java beeline](https://github.com/honeycombio/beeline-java) +* [Datadog's Java tracer](https://github.com/datadog/dd-trace-java) +* [Glowroot](https://glowroot.org/) +* [SpecialAgent](https://github.com/opentracing-contrib/java-specialagent) + +The most obvious "alternative approach" would be to choose "starting points" independently in each language. This has several problems: + +* Higher likelihood of "hard forks": we want to avoid an end state where two projects (the OpenTelemetry version, and the original version) evolve – and diverge – independently +* Higher likelihood of "concept divergence" across languages: while each language presents unique requirements and challenges, the Datadog auto-instrumentation libraries were written by a single organization with some common concepts and architectural requirements (they were also written to be OpenTracing-compatible, which greatly increases our odds of success given the similarities to OpenTelemetry) +* Datadog would also like a uniform strategy here, and this donation requires their consent (unless we want to do a hard fork, which is suboptimal for everyone). So starting with the Datadog libraries in "all but one" (or "all but two", etc) languages makes this less palatable for them diff --git a/oteps/0005-global-init.md b/oteps/0005-global-init.md new file mode 100644 index 00000000000..6c8ec8e0378 --- /dev/null +++ b/oteps/0005-global-init.md @@ -0,0 +1,110 @@ +# Global SDK initialization + +**Status**: proposed + +Specify the behavior of OpenTelemetry APIs and implementations at startup. + +## Motivation + +OpenTelemetry is designed with a separation between the API and the +SDK which implements it, allowing an application to configure and bind +any compatible SDK at runtime. OpenTelemetry is designed to support +"zero touch" instrumentation for third party libraries through the use +of a global instance. + +In many programming environments, it is possible for libraries of code +to auto-initialize, allowing them to begin operation concurrently with +the main program, e.g., while initializing static program state. This +presents a set of opposing requirements: (1) the API supports a +configurable SDK; (2) third party libraries may use OpenTelemetry +without configuration. + +## Explanation + +There are several acceptable ways to address this situation. The +feasibility of each approach varies by language. The implementation +must select one of the following strategies: + +### Service provider mechanism + +Where the language provides a commonly accepted way to inject SDK +components, it should be preferred. The Java SPI supports loading and +configuring the global SDK before it is first used, and because of +this property the service provider mechanism case leaves little else +to specify. 
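+As a rough illustration of this mechanism, the sketch below resolves an SDK through `java.util.ServiceLoader` before the API is first used; the `Tracer`/`TracerFactory` names are placeholders for whatever SPI surface the specification ends up defining, not an existing API.
+
+```java
+import java.util.ServiceLoader;
+
+// Placeholder SPI surface; none of these names are mandated by this proposal.
+interface Tracer {}
+
+interface TracerFactory {
+  Tracer getTracer(String name);
+}
+
+// Used when no SDK has registered an implementation via META-INF/services.
+final class NoopTracerFactory implements TracerFactory {
+  @Override
+  public Tracer getTracer(String name) {
+    return new Tracer() {}; // no-op tracer
+  }
+}
+
+public final class GlobalTelemetry {
+  // Resolved once, before the API is first used; falls back to the no-op
+  // factory when no SDK is present on the classpath.
+  static final TracerFactory TRACER_FACTORY = load();
+
+  private static TracerFactory load() {
+    for (TracerFactory factory : ServiceLoader.load(TracerFactory.class)) {
+      return factory; // first registered provider wins
+    }
+    return new NoopTracerFactory();
+  }
+}
+```
+
+An SDK would then advertise its factory implementation in a `META-INF/services` entry so that it is discovered before any instrumentation runs.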
+ +### Explicit initializer + +When it is not possible to ensure the SDK is installed and configured +before the API is first used, loading the SDK is handed off to the +user "at the right time", as stated in [Ruby issue +19](https://github.com/open-telemetry/opentelemetry-ruby/issues/19). +In this case, a number of requirements must be specified, as discussed +next. + +## Requirements: Explicit initializer + +OpenTelemetry specifies that the default implementation is +non-operational (i.e., a "no-op"), requiring that API method calls +result in effectively zero instrumentation overhead. We expect third +party libraries to use the global SDK before it is installed, which is +addressed in a requirement stated below. + +The explicit initializer method should take independent `Tracer` and +`Meter` objects (e.g., `opentelemetry.Init(Tracer, Meter)`). The SDK +may be installed no more than once. After the first SDK installed, +subsequent calls to the explicit initializer shall log console +warnings. + +In common language, uses of the global SDK instance (i.e., the Tracer +and Meter) must "begin working" once the SDK is installed, with the +following stipulations: + +### Tracer + +There may be loss of spans at startup. + +Spans that are started before the SDK is installed are not recovered, +they continue as No-op spans. + +### Meter + +There may be loss of metrics at startup. + +Metric SubMeasure objects (i.e., metrics w/ predefined labels) +initialized before the SDK is installed will redirect to the global +SDK after it is installed. + +### Concrete types + +Keys, tags, attributes, labels, resources, span context, and +distributed context are specified as pure API objects, therefore do +not depend on the SDK being installed. + +## Trade-offs and mitigations + +### Testing support + +Testing should be performed without depending on the global SDK. + +### Synchronization + +Since the global Tracer and Meter objects are required to begin +working once the SDK is installed, there is some implied +synchronization overhead at startup, overhead we expect to fall after +the SDK is installed. We recommend explicitly installing a No-op SDK +to fully disable instrumentation, as this approach will have a lower +overhead than leaving the OpenTelemetry library uninitialized. + +## Prior art and alternatives + +As an example that does not qualify as "commonly accepted", see [Go +issue 52](https://github.com/open-telemetry/opentelemetry-go/issues/52) +which demonstrates using the Go `plugin` package to load a +configurable SDK prior to first use. + +## Open questions + +What other options should be passed to the explicit global initializer? + +Is there a public test for "is the SDK installed; is it a no-op"? diff --git a/oteps/0007-no-out-of-band-reporting.md b/oteps/0007-no-out-of-band-reporting.md new file mode 100644 index 00000000000..0addd963e0c --- /dev/null +++ b/oteps/0007-no-out-of-band-reporting.md @@ -0,0 +1,61 @@ +# Remove support to report out-of-band telemetry from the API + +## TL;DR + +This section tries to summarize all the changes proposed in this RFC: + +1. Remove API requirement to support reporting out-of-band telemetry. +2. Move Resource to SDK, API will always report telemetry for the current application so no need to +allow configuring the Resource in any instrumentation. +3. New APIs should be designed without this requirement. 
+ +## Motivation + +Currently the API package is designed with the goal of supporting out-of-band telemetry reporting, but this requirement forces a lot of trade-offs and unnecessarily complicated APIs (e.g. `Resource` must be exposed in the API package to allow telemetry to be associated with the source of the telemetry). + +Reporting out-of-band telemetry is required for the OpenTelemetry ecosystem, but it can be done via a few other options that do not require the API package: + +* With the OpenTelemetry Service, users can write a simple [receiver][otelsvc-receiver] that parses and produces the OpenTelemetry data. +* Using the SDK's exporter framework, users can write OpenTelemetry data directly. + +## Internal details + +Here is a list of decisions and trade-offs related to supporting out-of-band reporting: + +1. Add the `Resource` concept into the API. + * For example, in the create-metric call we need to allow users to specify the resource, see [here][create-metric]. The developer who writes the instrumentation has no knowledge of where the monitored resource is deployed, so there is no way to configure the right resource. +2. [RFC](./trace/0002-remove-spandata.md) removes support for reporting SpanData. + * This will require the trace API to support configuring all possible fields via the API; for example, it would need to allow users to set a pre-generated `SpanId`, which can be avoided if we do not support out-of-band reporting. +3. Sampling logic for out-of-band spans will get very complicated because it would be incorrect to sample these data. +4. Without out-of-band reporting, associating the source of the telemetry with the telemetry data gets very simple: all data produced by one instance of the API implementation belongs to only one Application. + +This can be rephrased as: one API implementation instance can report telemetry about only the current Application. + +### Resource changes + +This RFC does not suggest removing the `Resource` concept or modifying any API in this interface, it only suggests moving this concept to the SDK level. + +Every implementation of the API (the SDK in OpenTelemetry's case) will have one `Resource` that describes the running Application. There may be cases where multiple Applications run in the same binary (e.g. a Java application server); every application will have its own SDK instance configured with its own `Resource`.
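+For illustration only, the sketch below shows the intent with hypothetical SDK-level types (`SdkTelemetry`, the `Resource` constructor and the labels are invented for this example, not an existing API): each application gets its own SDK instance, and that instance carries the `Resource`, so the API itself never needs to expose one.
+
+```java
+import java.util.Map;
+
+// Hypothetical SDK-level types used only to illustrate the proposal.
+final class Resource {
+  final Map<String, String> labels;
+  Resource(Map<String, String> labels) { this.labels = labels; }
+}
+
+final class SdkTelemetry {
+  final Resource resource;
+  // All telemetry produced through this SDK instance is attributed to `resource`.
+  SdkTelemetry(Resource resource) { this.resource = resource; }
+}
+
+class ApplicationServerExample {
+  public static void main(String[] args) {
+    // One application server hosting two applications: each application is
+    // wired to its own SDK instance and therefore to its own Resource.
+    SdkTelemetry checkoutSdk = new SdkTelemetry(new Resource(Map.of("service.name", "checkout")));
+    SdkTelemetry inventorySdk = new SdkTelemetry(new Resource(Map.of("service.name", "inventory")));
+  }
+}
+```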
+ +## Related Issues + +* [opentelemetry-specification/62](https://github.com/open-telemetry/opentelemetry-specification/issues/62) +* [opentelemetry-specification/61](https://github.com/open-telemetry/opentelemetry-specification/issues/61) + +[otelsvc-receiver]: https://github.com/open-telemetry/opentelemetry-service#config-receivers +[create-metric]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/metrics/api.md#create-metric diff --git a/oteps/0016-named-tracers.md b/oteps/0016-named-tracers.md new file mode 100644 index 00000000000..a6e918f261a --- /dev/null +++ b/oteps/0016-named-tracers.md @@ -0,0 +1,140 @@ +# Named Tracers and Meters + +_Associate Tracers and Meters with the name and version of the instrumentation library which reports telemetry data by parameterizing the API which the library uses to acquire the Tracer or Meter._ + +## Suggested reading + +* [Proposal: Tracer Components](https://github.com/open-telemetry/opentelemetry-specification/issues/10) +* [Global Instance discussions](https://github.com/open-telemetry/opentelemetry-specification/labels/global%20instance) +* [Proposal: Add a version resource](https://github.com/open-telemetry/oteps/pull/38) + +## Motivation + +The mechanism of "Named Tracers and Meters" proposed here is motivated by the following scenarios: + +### Faulty or expensive instrumentation + +For an operator of an application using OpenTelemetry, there is currently no way to influence the amount of data produced by instrumentation libraries. Instrumentation libraries can easily "spam" backend systems, deliver bogus data, or -- in the worst case -- crash or slow down applications. These problems might even occur suddenly in production environments because of external factors such as increasing load or unexpected input data. + +### Instrumentation library identification + +If an instrumentation library hasn't implemented [semantic conventions](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#semantic-conventions) correctly or those conventions change over time, it's currently hard to interpret and sanitize data produced by it selectively. The produced Spans or Metrics cannot later be associated with the library which reported them, either in the processing pipeline or the backend. + +### Disable instrumentation of pre-instrumented libraries + +It is the eventual goal of OpenTelemetry that library vendors implement the OpenTelemetry API, obviating the need to auto-instrument their library. An operator should be able to disable the telemetry that is built into some database driver or other library and provide their own integration if the built-in telemetry is lacking in some way. This should be possible even if the developer of that database driver has not provided a configuration to disable telemetry. + +## Solution + +This proposal attempts to solve the stated problems by introducing the concept of: + +* _Named Tracers and Meters_ which are associated with the **name** (e.g. _"io.opentelemetry.contrib.mongodb"_) and **version** (e.g._"semver:1.0.0"_) of the library which acquired them. +* A `TracerProvider` / `MeterProvider` as the only means of acquiring a Tracer or Meter. + +Based on the name and version, a Provider could provide a no-op Tracer or Meter to specific instrumentation libraries, or a Sampler could be implemented that discards Spans or Metrics from certain libraries. 
Also, by providing custom Exporters, Span or Metric data could be sanitized before it gets processed in a back-end system. However, this is beyond the scope of this proposal, which only provides the fundamental mechanisms. + +## Explanation + +From a user perspective, working with _Named Tracers / Meters_ and `TracerProvider` / `MeterProvider` is conceptually similar to how e.g. the [Java logging API](https://docs.oracle.com/javase/7/docs/api/java/util/logging/Logger.html#getLogger(java.lang.String)) and logging frameworks like [log4j](https://www.slf4j.org/apidocs/org/slf4j/LoggerFactory.html) work. In analogy to requesting Logger objects through LoggerFactories, an instrumentation library would create specific Tracer / Meter objects through a TracerProvider / MeterProvider. + +New Tracers or Meters can be created by providing the name and version of an instrumentation library. The version (following the convention proposed in ) is basically optional but _should_ be supplied since only this information enables following scenarios: + +* Only a specific range of versions of a given instrumentation library need to be suppressed, while other versions are allowed (e.g. due to a bug in those specific versions). +* Go modules allow multiple versions of the same middleware in a single build so those need to be determined at runtime. + +```java +// Create a tracer/meter for a given instrumentation library in a specific version. +Tracer tracer = OpenTelemetry.getTracerProvider().getTracer("io.opentelemetry.contrib.mongodb", "semver:1.0.0"); +Meter meter = OpenTelemetry.getMeterProvider().getMeter("io.opentelemetry.contrib.mongodb", "semver:1.0.0"); +``` + +These factories (`TracerProvider` and `MeterProvider`) replace the global `Tracer` / `Meter` singleton objects as ubiquitous points to request Tracer and Meter instances. + + The _name_ used to create a Tracer or Meter must identify the _instrumentation_ libraries (also referred to as _integrations_) and not the library being instrumented. These instrumentation libraries could be libraries developed in an OpenTelemetry repository, a 3rd party implementation, or even auto-injected code (see [Open Telemetry Without Manual Instrumentation OTEP](https://github.com/open-telemetry/oteps/blob/master/text/0001-telemetry-without-manual-instrumentation.md)). See also the examples for identifiers at the end. +If a library (or application) has instrumentation built-in, it is both the instrumenting and instrumented library and should pass its own name here. In all other cases (and to distinguish them from that case), the distinction between instrumenting and instrumented library is very important. For example, if an HTTP library `com.example.http` is instrumented by either `io.opentelemetry.contrib.examplehttp`, then it is important that the Tracer is not named `com.example.http`, but `io.opentelemetry.contrib.examplehttp` after the actual instrumentation library. + +If no name (null or empty string) is specified, following the suggestions in ["error handling proposal"](https://github.com/open-telemetry/opentelemetry-specification/pull/153), a "smart default" will be applied and a default Tracer / Meter implementation is returned. + +### Examples (of Tracer and Meter names) + +Since Tracer and Meter names describe the libraries which use those Tracers and Meters, their names should be defined in a way that makes them as unique as possible. 
+The name of the Tracer / Meter should represent the identity of the library, class or package that provides the instrumentation. + +Examples (based on existing contribution libraries from OpenTracing and OpenCensus): + +* `io.opentracing.contrib.spring.rabbitmq` +* `io.opentracing.contrib.jdbc` +* `io.opentracing.thrift` +* `io.opentracing.contrib.asynchttpclient` +* `io.opencensus.contrib.http.servlet` +* `io.opencensus.contrib.spring.sleuth.v1x` +* `io.opencesus.contrib.http.jaxrs` +* `github.com/opentracing-contrib/go-amqp` (Go) +* `github.com/opentracing-contrib/go-grpc` (Go) +* `OpenTracing.Contrib.NetCore.AspNetCore` (.NET) +* `OpenTracing.Contrib.NetCore.EntityFrameworkCore` (.NET) + +## Internal details + +By providing a `TracerProvider` / `MeterProvider` and _Named Tracers / Meters_, a vendor or OpenTelemetry implementation gains more flexibility in providing Tracers and Meters and which attributes they set in the resulting Spans and Metrics that are produced. + +On an SDK level, the SpanData class and its Metrics counterpart are extended with a `getLibraryResource` function that returns the resource associated with the Tracer / Meter that created it. + +## Glossary of Terms + +### Instrumentation library + +Also known as the trace/metrics reporter, this may be either a library/module/plugin provided by OpenTelemetry that instruments an existing library, a third party integration which instruments some library, or a library that has implemented the OpenTelemetry API in order to instrument itself. In any case, the instrumentation library is the library which provides tracing and metrics data to OpenTelemetry. + +examples: + +* `@opentelemetry/plugin-http` +* `io.opentelemetry.redis` +* `redis-client` (in this case, `redis-client` has instrumented itself with the OpenTelemetry API) + +### Tracer / Meter name and version + +When an instrumentation library acquires a Tracer/Meter, it provides its own name and version to the Tracer/Meter Provider. This name/version two-tuple is said to be the Tracer/Meter's _name_ and _version_. Note that this is the name and version of the library which acquires the Tracer/Meter, and not the library it is monitoring. In cases where the library is instrumenting itself using the OpenTelemetry API, they may be the same. + +example: If the `http` version `semver:3.0.0` library is being instrumented by a library with the name `io.opentelemetry.contrib.http` and version `semver:1.3.2`, then the tracer name and version are also `io.opentelemetry.contrib.http` and `semver:1.3.2`. If that same `http` library has built-in instrumentation through use of the OpenTelemetry API, then the tracer name and version would be `http` and `semver:3.0.0`. + +### Meter namespace + +Meter name is used as a namespace for all metrics created by it. This allows a telemetry library to register a metric using any name, such as `latency`, without worrying about collisions with a metric registered under the same name by a different library. + +example: The libraries `redis` and `io.opentelemetry.redis` may both register metrics with the name `latency`. These metrics can still be uniquely identified even though they have the same name because they are registered under different namespaces (`redis` and `io.opentelemetry.redis` respectively). In this case, the operator may disable one of these metrics because they are measuring the same thing. 
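+The following Java fragment makes this concrete. The `getMeter` calls follow the API proposed above; the version strings and the `createMetric` call are placeholders, since the metrics API surface is defined elsewhere.
+
+```java
+// Two different libraries acquire their own Meters from the provider.
+Meter redisMeter = OpenTelemetry.getMeterProvider().getMeter("redis", "semver:4.0.0");
+Meter otelRedisMeter = OpenTelemetry.getMeterProvider().getMeter("io.opentelemetry.redis", "semver:1.0.0");
+
+// Both libraries can create a metric named "latency" without colliding, because
+// each metric is namespaced by the Meter (and thus the library) that created it.
+// `createMetric` is a placeholder for the actual metric-creation API.
+redisMeter.createMetric("latency");
+otelRedisMeter.createMetric("latency");
+```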
+ +## Prior art and alternatives + +This proposal originates from an `opentelemetry-specification` proposal on [components](https://github.com/open-telemetry/opentelemetry-specification/issues/10) since having a concept of named Tracers would automatically enable determining this semantic `component` property. + +Alternatively, instead of having a `TracerProvider`, existing (global) Tracers could return additional indirection objects (called e.g. `TraceComponent`), which would be able to produce spans for specifically named traced components. + +```java +TraceComponent traceComponent = OpenTelemetry.Tracing.getTracer().componentBuilder("io.opentelemetry.contrib.mongodb", "semver:1.0.0"); +Span span = traceComponent.spanBuilder("someMethod").startSpan(); +``` + +Overall, this would not change a lot compared to the `TracerProvider` since the levels of indirection until producing an actual span are the same. + +Instead of setting the `component` property based on the given Tracer names, those names could also be used as _prefixes_ for produced span names (e.g. ``). However, with regard to data quality and semantic conventions, a dedicated `component` set on spans is probably preferred. + +Instead of using plain strings as an argument for creating new Tracers, a `Resource` identifying an instrumentation library could be used. Such resources must have a _version_ and a _name_ label (there could be semantic convention definitions for those labels). This implementation alternative mainly depends on the availability of the `Resource` data type on an API level (see ). + +```java +// Create resource for given instrumentation library information (name + version) +Map libraryLabels = new HashMap<>(); +libraryLabels.put("name", "io.opentelemetry.contrib.mongodb"); +libraryLabels.put("version", "1.0.0"); +Resource libraryResource = Resource.create(libraryLabels); +// Create tracer for given instrumentation library. +Tracer tracer = OpenTelemetry.getTracerProvider().getTracer(libraryResource); +``` + +Those given alternatives could be applied to Meters and Metrics in the same way. + +## Future possibilities + +Based on the Resource information identifying a Tracer or Meter these could be configured (enabled / disabled) programmatically or via external configuration sources (e.g. environment). + +Based on this proposal, future "signal producers" (i.e. logs) can use the same or a similar creation approach. diff --git a/oteps/0035-opentelemetry-protocol.md b/oteps/0035-opentelemetry-protocol.md new file mode 100644 index 00000000000..c3578849c50 --- /dev/null +++ b/oteps/0035-opentelemetry-protocol.md @@ -0,0 +1,464 @@ +# OpenTelemetry Protocol Specification + +**Author**: Tigran Najaryan, Omnition Inc. + +OpenTelemetry Protocol (OTLP) specification describes the encoding, transport and delivery mechanism of telemetry data between telemetry sources, intermediate nodes such as collectors and telemetry backends. 
+ +## Table of Contents + +- [Motivation](#motivation) +- [Protocol Details](#protocol-details) + - [Export Request and Response](#export-request-and-response) + - [OTLP over gRPC](#otlp-over-grpc) + - [Export Response](#export-response) + - [Throttling](#throttling) + - [gRPC Service Definition](#grpc-service-definition) + - [Other Transports](#other-transports) +- [Implementation Recommendations](#implementation-recommendations) + - [Multi-Destination Exporting](#multi-destination-exporting) +- [Trade-offs and mitigations](#trade-offs-and-mitigations) + - [Request Acknowledgements](#request-acknowledgements) + - [Duplicate Data](#duplicate-data) + - [Partial Success](#partial-success) +- [Future Versions and Interoperability](#future-versions-and-interoperability) +- [Prior Art, Alternatives and Future Possibilities](#prior-art-alternatives-and-future-possibilities) +- [Open Questions](#open-questions) +- [Appendix A - Protocol Buffer Definitions](#appendix-a---protocol-buffer-definitions) +- [Appendix B - Performance Benchmarks](#appendix-b---performance-benchmarks) + - [Throughput - Sequential vs Concurrent](#throughput---sequential-vs-concurrent) + - [CPU Usage - gRPC vs WebSocket/Experimental](#cpu-usage---grpc-vs-websocketexperimental) + - [Benchmarking Raw Results](#benchmarking-raw-results) +- [Glossary](#glossary) +- [Acknowledgements](#acknowledgements) + +## Motivation + +OTLP is a general-purpose telemetry data delivery protocol designed in the scope of OpenTelemetry project. It is an incremental improvement of OpenCensus protocol. Compared to OpenCensus protocol OTLP has the following improvements: + +- Ensures high reliability of data delivery and clear visibility when the data cannot be delivered. OTLP uses acknowledgements to implement reliable delivery. + +- It is friendly to Level 7 Load Balancers and allows them to correctly map imbalanced incoming traffic to a balanced outgoing traffic. This allows to efficiently operate large networks of nodes where telemetry data generation rates change over time. + +- Allows backpressure signalling from telemetry data destinations to sources. This is important for implementing reliable multi-hop telemetry data delivery all the way from the source to the destination via intermediate nodes, each having different processing capacity and thus requiring different data transfer rates. + +## Protocol Details + +OTLP defines the encoding of telemetry data and the protocol used to exchange data between the client and the server. + +This specification defines how OTLP is implemented over [gRPC](https://grpc.io/) and specifies corresponding [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview) schema. Future extensions to OTLP may define implementations over other transports. For details of gRPC service definition see section [gRPC Transport](#grpc-service-definition). + +OTLP is a request/response style protocols: the clients send requests, the server replies with corresponding responses. This document defines one requests and response type: `Export`. + +### Export Request and Response + +After establishing the underlying transport the client starts sending telemetry data using `Export` requests. 
The client continuously sends a sequence of `Export` requests to the server and expects to receive a response to each request: + +![Request-Response](images/otlp-request-response.png) + +_Note: this protocol is concerned with reliability of delivery between one pair of client/server nodes and aims to ensure that no data is lost in transit between the client and the server. Many telemetry collection systems have intermediary nodes that the data must travel across until reaching the final destination (e.g. application -> agent -> collector -> backend). End-to-end delivery guarantees in such systems are outside the scope of OTLP. The acknowledgements described in this protocol happen between a single client/server pair and do not span intermediary nodes in multi-hop delivery paths._ + +#### OTLP over gRPC + +For the gRPC transport, OTLP uses Unary RPC to send export requests and receive responses. + +After sending a request the client MAY wait until the response is received from the server. In that case there will be at most one request in flight that is not yet acknowledged by the server. + +![Unary](images/otlp-sequential.png) + +Sequential operation is recommended when simplicity of implementation is desirable and when the client and the server are connected via a very low-latency network, for example when the client is an instrumented application and the server is an OpenTelemetry Service running as a local daemon. + +Implementations that need to achieve high throughput SHOULD support concurrent Unary calls. The client SHOULD send new requests without waiting for responses to earlier requests, essentially creating a pipeline of in-flight requests that have not yet been acknowledged. + +![Streaming](images/otlp-concurrent.png) + +The number of concurrent requests SHOULD be configurable. + +The maximum achievable throughput is `max_concurrent_requests * max_request_size / (network_latency + server_response_time)`. For example, if a request can contain at most 100 spans, network roundtrip latency is 200 ms and server response time is 300 ms, then the maximum achievable throughput with one concurrent request is `100 spans / (200ms+300ms)`, or 200 spans per second. In high-latency networks, or when the server response time is high, achieving good throughput therefore requires either very large requests or many concurrent requests. + +If the client is shutting down (e.g. when the containing process wants to exit) the client will optionally wait until all pending acknowledgements are received or until an implementation-specific timeout expires. This ensures reliable delivery of telemetry data. The client implementation SHOULD expose an option to turn this waiting on and off during shutdown. + +If the client is unable to deliver a certain request (e.g. a timer expired while waiting for acknowledgements) the client SHOULD record the fact that the data was not delivered. + +#### Export Response + +The server may respond to an export request with either a success or an error. + +The success response indicates that the telemetry data was successfully processed by the server. If the server receives an empty request (a request that does not carry any telemetry data) the server SHOULD respond with success. + +When using the gRPC transport, the success response is returned via an `ExportResponse` message.
+ +When an error is returned by the server it falls into 2 broad categories: retryable and not-retryable: + +- Retryable errors indicate that processing of telemetry data failed and the client SHOULD record the error and may retry exporting the same data. This can happen when the server is temporarily unable to process the data. + +- Not-retryable errors indicate that processing of telemetry data failed and the client MUST NOT retry sending the same telemetry data. The telemetry data MUST be dropped. This can happen, for example, when the request contains bad data and cannot be deserialized or otherwise processed by the server. The client SHOULD maintain a counter of such dropped data. + +When using gRPC transport the server SHOULD indicate retryable errors using code [Unavailable](https://godoc.org/google.golang.org/grpc/codes) and MAY supply additional [details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) using [RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40) containing 0 value of RetryDelay. Here is a sample Go code to illustrate: + +```go + // Do this on server side. + st, err := status.New(codes.Unavailable, "Server is unavailable"). + WithDetails(&errdetails.RetryInfo{RetryDelay: &duration.Duration{Seconds: 0}}) + if err != nil { + log.Fatal(err) + } + + return st.Err() +``` + +To indicate not-retryable errors the server is recommended to use code [InvalidArgument](https://godoc.org/google.golang.org/grpc/codes) and MAY supply additional [details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) using [BadRequest](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L119). Other gRPC status code may be used if it is more appropriate. Here is a sample Go code to illustrate: + +```go + // Do this on server side. + st, err := status.New(codes.InvalidArgument, "Invalid Argument"). + WithDetails(&errdetails.BadRequest{}) + if err != nil { + log.Fatal(err) + } + + return st.Err() +``` + +The server MAY use other gRPC codes to indicate retryable and not-retryable errors if those other gRPC codes are more appropriate for a particular erroneous situation. The client SHOULD interpret gRPC status codes as retryable or not-retryable according to the following table: + +|gRPC Code|Retryable?| +|---------|----------| +|CANCELLED|Yes| +|UNKNOWN|No| +|INVALID_ARGUMENT|No| +|DEADLINE_EXCEEDED|Yes| +|NOT_FOUND|No| +|ALREADY_EXISTS|No| +|PERMISSION_DENIED|No| +|UNAUTHENTICATED|No| +|RESOURCE_EXHAUSTED|Yes| +|FAILED_PRECONDITION|No| +|ABORTED|Yes| +|OUT_OF_RANGE|Yes| +|UNIMPLEMENTED|No| +|INTERNAL|No| +|UNAVAILABLE|Yes| +|DATA_LOSS|Yes| + +When retrying, the client SHOULD implement a backoff strategy. An exception to this is the Throttling case explained below, which provides explicit instructions about retrying interval. + +#### Throttling + +OTLP allows backpressure signalling. + +If the server is unable to keep up with the pace of data it receives from the client then it SHOULD signal that fact to the client. The client MUST then throttle itself to avoid overwhelming the server. 
+ +To signal backpressure when using gRPC transport, the server SHOULD return an error with code [Unavailable](https://godoc.org/google.golang.org/grpc/codes) and MAY supply additional [details via status](https://godoc.org/google.golang.org/grpc/status#Status.WithDetails) using [RetryInfo](https://github.com/googleapis/googleapis/blob/6a8c7914d1b79bd832b5157a09a9332e8cbd16d4/google/rpc/error_details.proto#L40). Here is a sample Go code to illustrate: + +```go + // Do this on server side. + st, err := status.New(codes.Unavailable, "Server is unavailable"). + WithDetails(&errdetails.RetryInfo{RetryDelay: &duration.Duration{Seconds: 30}}) + if err != nil { + log.Fatal(err) + } + + return st.Err() + + ... + + // Do this on client side. + st := status.Convert(err) + for _, detail := range st.Details() { + switch t := detail.(type) { + case *errdetails.RetryInfo: + if t.RetryDelay.Seconds > 0 || t.RetryDelay.Nanos > 0 { + // Wait before retrying. + } + } + } +``` + +When the client receives this signal it SHOULD follow the recommendations outlined in documentation for `RetryInfo`: + +``` +// Describes when the clients can retry a failed request. Clients could ignore +// the recommendation here or retry when this information is missing from error +// responses. +// +// It's always recommended that clients should use exponential backoff when +// retrying. +// +// Clients should wait until `retry_delay` amount of time has passed since +// receiving the error response before retrying. If retrying requests also +// fail, clients should use an exponential backoff scheme to gradually increase +// the delay between retries based on `retry_delay`, until either a maximum +// number of retires have been reached or a maximum retry delay cap has been +// reached. +``` + +The value of `retry_delay` is determined by the server and is implementation dependant. The server SHOULD choose a `retry_delay` value that is big enough to give the server time to recover, yet is not too big to cause the client to drop data while it is throttled. + +#### gRPC Service Definition + +`Export` requests and responses are delivered using unary gRPC calls. + +This is OTLP over gRPC Service definition: + +``` +service UnaryExporter { + rpc ExportTraces(TraceExportRequest) returns (ExportResponse) {} + rpc ExportMetrics(MetricExportRequest) returns (ExportResponse) {} +} +``` + +Appendix A contains Protocol Buffer definitions for `TraceExportRequest`, `MetricExportRequest` and `ExportResponse`. + +### Other Transports + +OTLP can work over any other transport that supports message request/response capabilities. Additional transports supported by OTLP can be specified in future RFCs that extend OTLP. + +## Implementation Recommendations + +### Multi-Destination Exporting + +When the telemetry data from one client must be sent to more than one destination server there is an additional complication that must be accounted for. When one of the servers acknowledges the data and the other server does not (yet) acknowledges the client needs to make a decision about how to move forward. + +In such situation the the client SHOULD implement queuing, acknowledgement handling and retrying logic per destination. This ensures that servers do not block each other. The queues SHOULD reference shared, immutable data to be sent, thus minimizing the memory overhead caused by having multiple queues. 
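+The following Java sketch illustrates one way to structure this; the `Batch` and `Destination` types, the queue size and the retry delay are all illustrative assumptions rather than part of the protocol.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.LinkedBlockingQueue;
+
+// A Batch is an immutable, already-encoded export request shared by all queues.
+final class Batch {
+  final byte[] encodedRequest;
+  Batch(byte[] encodedRequest) { this.encodedRequest = encodedRequest; }
+}
+
+interface Destination {
+  // Returns true once the destination has acknowledged the batch.
+  boolean export(Batch batch) throws InterruptedException;
+}
+
+final class MultiDestinationExporter {
+  // One bounded queue per destination; every queue references the same Batch objects.
+  private final List<BlockingQueue<Batch>> queues = new ArrayList<>();
+
+  MultiDestinationExporter(List<Destination> destinations) {
+    for (Destination destination : destinations) {
+      BlockingQueue<Batch> queue = new LinkedBlockingQueue<>(1000);
+      queues.add(queue);
+      Thread worker = new Thread(() -> drain(destination, queue));
+      worker.setDaemon(true);
+      worker.start();
+    }
+  }
+
+  // Enqueue the shared batch for every destination; a slow destination only
+  // fills its own queue and never blocks the others.
+  void export(Batch batch) {
+    for (BlockingQueue<Batch> queue : queues) {
+      if (!queue.offer(batch)) {
+        // Queue full: record the drop for this destination (not shown).
+      }
+    }
+  }
+
+  private static void drain(Destination destination, BlockingQueue<Batch> queue) {
+    try {
+      while (true) {
+        Batch batch = queue.take();
+        while (!destination.export(batch)) {
+          Thread.sleep(1000); // not yet acknowledged: back off and retry this destination only
+        }
+      }
+    } catch (InterruptedException e) {
+      Thread.currentThread().interrupt();
+    }
+  }
+}
+```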
+ +![Multi-Destination Exporting](images/otlp-multi-destination.png) + +This ensures that all destination servers receive the data regardless of their speed of reception (within the available limits imposed by the size of the client-side queue). + +## Trade-offs and mitigations + +### Request Acknowledgements + +#### Duplicate Data + +In edge cases (e.g. on reconnections, network interruptions, etc) the client has no way of knowing if recently sent data was delivered if no acknowledgement was received yet. The client will typically choose to re-send such data to guarantee delivery, which may result in duplicate data on the server side. This is a deliberate choice and is considered to be the right tradeoff for telemetry data. + +### Partial Success + +The protocol does not attempt to communicate partial reception success from the server to the client (i.e. when part of the data can be received by the server and part of it cannot). Attempting to do so would complicate the protocol and implementations significantly and is left out as a possible future area of work. + +## Future Versions and Interoperability + +OTLP will evolve and change over time. Future versions of OTLP must be designed and implemented in a way that ensures that clients and servers that implement different versions of OTLP can interoperate and exchange telemetry data. Old clients must be able to talk to new servers and vice versa. If new versions of OTLP introduce new functionality that cannot be understood and supported by nodes implementing the old versions of OTLP the protocol must regress to the lowest common denominator from functional perspective. + +When possible the interoperability SHOULD be ensured between all versions of OTLP that are not declared obsolete. + +OTLP does not use explicit protocol version numbering. OTLP's interoperability of clients and servers of different versions is based on the following concepts: + +1. OTLP (current and future versions) defines a set of capabilities, some of which are mandatory, others are optional. Clients and servers must implement mandatory capabilities and can choose implement only a subset of optional capabilities. + +2. For minor changes to the protocol future versions and extension of OTLP are encouraged to use the ability of Protocol Buffers to evolve message schema in backwards compatible manner. Newer versions of OTLP may add new fields to messages that will be ignored by clients and servers that do not understand these fields. In many cases careful design of such schema changes and correct choice of default values for new fields is enough to ensure interoperability of different versions without nodes explicitly detecting that their peer node has different capabilities. + +3. More significant changes must be explicitly defined as new optional capabilities in future RFCs. Such capabilities SHOULD be discovered by client and server implementations after establishing the underlying transport. The exact discovery mechanism SHOULD be described in future RFCs which define the new capabilities and typically can be implemented by making a discovery request/response message exchange from the client to server. The mandatory capabilities defined by this specification are implied and do not require a discovery. The implementation which supports a new, optional capability MUST adjust its behavior to match the expectation of a peer that does not support a particular capability. + +The current version of OTLP is the initial version that describes mandatory capabilities only. 
Implementations of this specification SHOULD NOT attempt to detect the capabilities of their peers and should operate as defined in this document. + +## Prior Art, Alternatives and Future Possibilities + +We considered using gRPC streaming instead of Unary RPC calls. This would require implementations to manually close and reopen streams periodically in order to be L7 Load Balancer friendly. A reference implementation using gRPC streaming showed that it results in significantly more complex and error-prone code without significant benefits. Because of this, Unary RPC was chosen. + +OTLP is an evolution of the OpenCensus protocol based on the research and testing of its modifications in production at Omnition. The modifications include changes to data formats (see RFC0059), the use of Unary RPC and the backpressure signaling capability. + +OTLP uses Protocol Buffers for data encoding. Two other encodings were considered as alternative approaches: FlatBuffers and Capnproto. Both alternatives were rejected. FlatBuffers was rejected because it lacks required functionality in all languages except C++, particularly verification of decoded data and the ability to mutate in-memory data. Capnproto was rejected because it is not yet considered production ready, its API is not yet stable and, like FlatBuffers, it lacks the ability to mutate in-memory data. + +Both FlatBuffers and Capnproto are worth re-evaluating for future versions of the OpenTelemetry protocol if they overcome their currently known limitations. + +It is also worth researching transports other than gRPC. Other transports are not included in this RFC due to time limitations. + +An experimental implementation of OTLP over WebSockets exists and was researched as an alternative. WebSockets were not chosen as the primary transport for OTLP due to the lack or immaturity of certain capabilities (such as [lack of universal support](https://github.com/gorilla/websocket#gorilla-websocket-compared-with-other-packages) for the [RFC 7692](https://tools.ietf.org/html/rfc7692) message compression extension). Despite these limitations the experimental implementation demonstrated good performance, and a WebSocket transport will be considered for inclusion in a future OTLP Extensions RFC. + +## Open Questions + +One of the goals for the telemetry protocol is reducing CPU usage and memory pressure in garbage-collected languages. These goals were not addressed as part of this RFC and remain open. One promising future way to address them is finding a more CPU- and memory-efficient encoding mechanism. + +Another goal for the telemetry protocol is achieving high compression ratios for telemetry data while keeping CPU consumption low. OTLP uses the compression provided by the gRPC transport. No further improvements to compression were considered as part of this RFC; they are a future area of work. + +## Appendix A - Protocol Buffer Definitions + +This is the Protocol Buffers schema for the `Export` request and response: + +``` +// A request from client to server containing trace data to export. +message TraceExportRequest { + // Telemetry data. An array of ResourceSpans. + repeated ResourceSpans resourceSpans = 2; +} + +// A request from client to server containing metric data to export. +message MetricExportRequest { + // Telemetry data. An array of ResourceMetrics. + repeated ResourceMetrics resourceMetrics = 2; +} + +// A response to ExportRequest. +message ExportResponse { + // Response is an empty message. +} + +// A list of spans from a Resource.
+message ResourceSpans { + Resource resource = 1; + repeated Span spans = 2; +} + +// A list of metrics from a Resource. +message ResourceMetrics { + Resource resource = 1; + repeated Metric metrics = 2; +} +``` + +`Span`, `Metric` and `Resource` schema definitions are defined in RFCNNNN (RFC number to be defined and linked from here). + +## Appendix B - Performance Benchmarks + +Benchmarking of OTLP vs other telemetry protocols was done using [reference implementation in Go](https://github.com/tigrannajaryan/exp-otelproto). + +### Throughput - Sequential vs Concurrent + +Using 20 concurrent requests shows the following throughput advantage in benchmarks compared to sequential for various values of network roundtrip latency: + +``` ++-----------+-----------------------+ ++ Latency | Concurrent/Sequential | ++ | Throughput Factor | ++-----------+-----------------------+ ++ 0.02 ms | 1.7 | ++ 2 ms | 2.1 | ++ 20 ms | 4.9 | ++ 200 ms | 6.9 | ++-----------+-----------------------+ +``` + +Benchmarking is done using Export requests each carrying 500 spans, each span containing 10 small attributes. + +### CPU Usage - gRPC vs WebSocket/Experimental + +Experimental implementation using WebSocket transport demonstrated about 30% less CPU usage on small batches compared to gRPC transport and about 7% less CPU usage on large batches. + +This shows that exploring different transports with less overhead is a promising future direction. + +### Benchmarking Raw Results + +The following is the benchmarking result, running on on a system with i7 7500U processor, 16 GB RAM. (Note that the benchmarking script sets "performance" CPU governor during execution and sets nice value of the process for more consistent results). + +``` +==================================================================================== +Legend: +GRPC/Stream/LBTimed/Sync - GRPC, streaming, load balancer friendly, close stream every 30 sec, with ack +GRPC/Stream/LBTimed/Async/N - OTLP Streaming. GRPC, N streams, load balancer friendly, close stream every 30 sec, with async ack +GRPC/Unary - OTLP Unary. One request per batch, load balancer friendly, with ack +GRPC/Unary/Async - GRPC, unary async request per batch, load balancer friendly, with ack +GRPC/OpenCensus - OpenCensus protocol, streaming, not load balancer friendly, without ack +GRPC/OpenCensusWithAck - OpenCensus-like protocol, streaming, not load balancer friendly, with ack +GRPC/Stream/NoLB - GRPC, streaming, not load balancer friendly, with ack +GRPC/Stream/LBAlways/Sync - GRPC, streaming, load balancer friendly, close stream after every batch, with ack +GRPC/Stream/LBSrv/Async - OTLP Streaming. 
Load balancer friendly, server closes stream every 30 sec or 1000 batches, with async ack +WebSocket/Stream/Sync - WebSocket, streaming, unknown load balancer friendliness, with sync ack +WebSocket/Stream/Async - WebSocket, streaming, unknown load balancer friendliness, with async ack +WebSocket/Stream/Async/zlib - WebSocket, streaming, unknown load balancer friendliness, with async ack, zlib compression + + +8000 small batches, 100 spans per batch, 4 attrs per span +GRPC/Stream/LBTimed/Async/1 800000 spans, CPU time 12.4 sec, wall time 5.3 sec, 645.7 batches/cpusec, 1510.0 batches/wallsec +GRPC/Stream/LBTimed/Async/10 800000 spans, CPU time 12.3 sec, wall time 3.9 sec, 650.9 batches/cpusec, 2058.4 batches/wallsec +GRPC/Unary 800000 spans, CPU time 15.3 sec, wall time 9.5 sec, 523.2 batches/cpusec, 840.0 batches/wallsec +GRPC/Unary/Async 800000 spans, CPU time 14.1 sec, wall time 4.0 sec, 565.8 batches/cpusec, 1986.3 batches/wallsec +GRPC/OpenCensus 800000 spans, CPU time 21.7 sec, wall time 10.6 sec, 368.7 batches/cpusec, 751.5 batches/wallsec +GRPC/OpenCensusWithAck 800000 spans, CPU time 23.4 sec, wall time 19.0 sec, 342.3 batches/cpusec, 420.8 batches/wallsec +GRPC/Stream/NoLB 800000 spans, CPU time 13.6 sec, wall time 9.4 sec, 588.2 batches/cpusec, 848.7 batches/wallsec +GRPC/Stream/LBAlways/Sync 800000 spans, CPU time 16.1 sec, wall time 10.0 sec, 495.7 batches/cpusec, 798.8 batches/wallsec +GRPC/Stream/LBTimed/Sync 800000 spans, CPU time 13.7 sec, wall time 9.5 sec, 585.7 batches/cpusec, 845.1 batches/wallsec +GRPC/Stream/LBSrv/Async 800000 spans, CPU time 12.7 sec, wall time 12.5 sec, 628.9 batches/cpusec, 639.8 batches/wallsec +WebSocket/Stream/Sync 800000 spans, CPU time 8.4 sec, wall time 8.3 sec, 949.0 batches/cpusec, 965.3 batches/wallsec +WebSocket/Stream/Async 800000 spans, CPU time 9.4 sec, wall time 5.4 sec, 852.0 batches/cpusec, 1492.0 batches/wallsec +WebSocket/Stream/Async/zlib 800000 spans, CPU time 23.3 sec, wall time 16.5 sec, 343.8 batches/cpusec, 484.0 batches/wallsec + +800 large batches, 500 spans per batch, 10 attrs per span +GRPC/Stream/LBTimed/Async/1 400000 spans, CPU time 11.4 sec, wall time 7.1 sec, 70.2 batches/cpusec, 113.1 batches/wallsec +GRPC/Stream/LBTimed/Async/10 400000 spans, CPU time 12.2 sec, wall time 5.8 sec, 65.8 batches/cpusec, 138.4 batches/wallsec +GRPC/Unary 400000 spans, CPU time 10.7 sec, wall time 9.6 sec, 74.7 batches/cpusec, 83.2 batches/wallsec +GRPC/Unary/Async 400000 spans, CPU time 11.9 sec, wall time 5.6 sec, 67.0 batches/cpusec, 141.8 batches/wallsec +GRPC/OpenCensus 400000 spans, CPU time 23.9 sec, wall time 14.1 sec, 33.5 batches/cpusec, 56.8 batches/wallsec +GRPC/OpenCensusWithAck 400000 spans, CPU time 22.0 sec, wall time 21.1 sec, 36.4 batches/cpusec, 38.0 batches/wallsec +GRPC/Stream/NoLB 400000 spans, CPU time 10.7 sec, wall time 9.8 sec, 74.9 batches/cpusec, 81.8 batches/wallsec +GRPC/Stream/LBAlways/Sync 400000 spans, CPU time 11.5 sec, wall time 10.2 sec, 69.9 batches/cpusec, 78.2 batches/wallsec +GRPC/Stream/LBTimed/Sync 400000 spans, CPU time 11.1 sec, wall time 10.2 sec, 71.9 batches/cpusec, 78.4 batches/wallsec +GRPC/Stream/LBSrv/Async 400000 spans, CPU time 11.3 sec, wall time 7.0 sec, 70.5 batches/cpusec, 113.6 batches/wallsec +WebSocket/Stream/Sync 400000 spans, CPU time 10.3 sec, wall time 10.1 sec, 78.0 batches/cpusec, 79.4 batches/wallsec +WebSocket/Stream/Async 400000 spans, CPU time 10.5 sec, wall time 7.2 sec, 76.2 batches/cpusec, 111.2 batches/wallsec +WebSocket/Stream/Async/zlib 400000 spans, CPU time 
29.0 sec, wall time 22.1 sec, 27.6 batches/cpusec, 36.1 batches/wallsec + +2ms network roundtrip latency +800 large batches, 500 spans per batch, 10 attrs per span +GRPC/Stream/LBTimed/Async/1 400000 spans, CPU time 11.1 sec, wall time 7.0 sec, 71.9 batches/cpusec, 114.9 batches/wallsec +GRPC/Stream/LBTimed/Async/10 400000 spans, CPU time 11.4 sec, wall time 5.4 sec, 70.5 batches/cpusec, 148.0 batches/wallsec +GRPC/Unary 400000 spans, CPU time 11.5 sec, wall time 11.8 sec, 69.5 batches/cpusec, 68.1 batches/wallsec +GRPC/Unary/Async 400000 spans, CPU time 11.3 sec, wall time 5.3 sec, 70.5 batches/cpusec, 150.4 batches/wallsec +GRPC/OpenCensus 400000 spans, CPU time 23.1 sec, wall time 13.6 sec, 34.6 batches/cpusec, 58.7 batches/wallsec +GRPC/OpenCensusWithAck 400000 spans, CPU time 21.9 sec, wall time 22.6 sec, 36.6 batches/cpusec, 35.4 batches/wallsec +GRPC/Stream/NoLB 400000 spans, CPU time 11.1 sec, wall time 11.6 sec, 72.3 batches/cpusec, 69.2 batches/wallsec +GRPC/Stream/LBAlways/Sync 400000 spans, CPU time 11.5 sec, wall time 11.6 sec, 69.8 batches/cpusec, 68.9 batches/wallsec +GRPC/Stream/LBTimed/Sync 400000 spans, CPU time 11.3 sec, wall time 11.7 sec, 71.0 batches/cpusec, 68.2 batches/wallsec +GRPC/Stream/LBSrv/Async 400000 spans, CPU time 11.1 sec, wall time 6.9 sec, 72.0 batches/cpusec, 115.1 batches/wallsec +WebSocket/Stream/Sync 400000 spans, CPU time 10.8 sec, wall time 12.0 sec, 74.1 batches/cpusec, 66.5 batches/wallsec +WebSocket/Stream/Async 400000 spans, CPU time 10.6 sec, wall time 7.2 sec, 75.5 batches/cpusec, 111.8 batches/wallsec +WebSocket/Stream/Async/zlib 400000 spans, CPU time 28.6 sec, wall time 21.9 sec, 27.9 batches/cpusec, 36.6 batches/wallsec + +20ms network roundtrip latency +400 large batches, 500 spans per batch, 10 attrs per span +GRPC/Stream/LBTimed/Async/1 200000 spans, CPU time 6.2 sec, wall time 4.1 sec, 64.9 batches/cpusec, 96.7 batches/wallsec +GRPC/Stream/LBTimed/Async/10 200000 spans, CPU time 6.2 sec, wall time 3.0 sec, 64.0 batches/cpusec, 132.9 batches/wallsec +GRPC/Unary 200000 spans, CPU time 6.2 sec, wall time 13.5 sec, 64.3 batches/cpusec, 29.6 batches/wallsec +GRPC/Unary/Async 200000 spans, CPU time 5.9 sec, wall time 3.0 sec, 68.0 batches/cpusec, 132.9 batches/wallsec +GRPC/OpenCensus 200000 spans, CPU time 12.6 sec, wall time 7.5 sec, 31.8 batches/cpusec, 53.3 batches/wallsec +GRPC/OpenCensusWithAck 200000 spans, CPU time 12.0 sec, wall time 19.5 sec, 33.4 batches/cpusec, 20.5 batches/wallsec +GRPC/Stream/NoLB 200000 spans, CPU time 5.9 sec, wall time 13.3 sec, 68.3 batches/cpusec, 30.0 batches/wallsec +GRPC/Stream/LBAlways/Sync 200000 spans, CPU time 5.9 sec, wall time 13.3 sec, 68.0 batches/cpusec, 30.2 batches/wallsec +GRPC/Stream/LBTimed/Sync 200000 spans, CPU time 5.8 sec, wall time 13.3 sec, 69.3 batches/cpusec, 30.1 batches/wallsec +GRPC/Stream/LBSrv/Async 200000 spans, CPU time 5.5 sec, wall time 3.7 sec, 73.4 batches/cpusec, 107.3 batches/wallsec +WebSocket/Stream/Sync 200000 spans, CPU time 5.8 sec, wall time 14.6 sec, 69.4 batches/cpusec, 27.4 batches/wallsec +WebSocket/Stream/Async 200000 spans, CPU time 5.5 sec, wall time 3.9 sec, 72.3 batches/cpusec, 102.1 batches/wallsec +WebSocket/Stream/Async/zlib 200000 spans, CPU time 14.7 sec, wall time 11.2 sec, 27.3 batches/cpusec, 35.7 batches/wallsec + +200ms network roundtrip latency +40 large batches, 500 spans per batch, 10 attrs per span +GRPC/Stream/LBTimed/Async/1 20000 spans, CPU time 0.5 sec, wall time 3.1 sec, 74.1 batches/cpusec, 12.7 batches/wallsec 
+GRPC/Stream/LBTimed/Async/10 20000 spans, CPU time 0.7 sec, wall time 3.1 sec, 61.5 batches/cpusec, 12.8 batches/wallsec +GRPC/Unary 20000 spans, CPU time 0.6 sec, wall time 9.9 sec, 65.6 batches/cpusec, 4.0 batches/wallsec +GRPC/Unary/Async 20000 spans, CPU time 0.6 sec, wall time 3.6 sec, 65.6 batches/cpusec, 11.1 batches/wallsec +GRPC/OpenCensus 20000 spans, CPU time 1.1 sec, wall time 3.5 sec, 35.1 batches/cpusec, 11.3 batches/wallsec +GRPC/OpenCensusWithAck 20000 spans, CPU time 1.2 sec, wall time 10.2 sec, 32.8 batches/cpusec, 3.9 batches/wallsec +GRPC/Stream/NoLB 20000 spans, CPU time 0.6 sec, wall time 9.5 sec, 67.8 batches/cpusec, 4.2 batches/wallsec +GRPC/Stream/LBAlways/Sync 20000 spans, CPU time 0.6 sec, wall time 9.5 sec, 63.5 batches/cpusec, 4.2 batches/wallsec +GRPC/Stream/LBTimed/Sync 20000 spans, CPU time 0.6 sec, wall time 9.5 sec, 66.7 batches/cpusec, 4.2 batches/wallsec +GRPC/Stream/LBSrv/Async 20000 spans, CPU time 0.5 sec, wall time 3.3 sec, 74.1 batches/cpusec, 12.0 batches/wallsec +WebSocket/Stream/Sync 20000 spans, CPU time 0.6 sec, wall time 13.5 sec, 69.0 batches/cpusec, 3.0 batches/wallsec +WebSocket/Stream/Async 20000 spans, CPU time 0.5 sec, wall time 6.1 sec, 74.1 batches/cpusec, 6.5 batches/wallsec +WebSocket/Stream/Async/zlib 20000 spans, CPU time 1.5 sec, wall time 2.0 sec, 26.3 batches/cpusec, 19.8 batches/wallsec + + +400 large batches, 500 spans per batch, 10 attrs per span +200ms network roundtrip latency +GRPC/OpenCensus 200000 spans, CPU time 11.9 sec, wall time 10.1 sec, 33.6 batches/cpusec, 39.6 batches/wallsec +GRPC/Stream/LBTimed/Async/1 200000 spans, CPU time 5.3 sec, wall time 9.5 sec, 76.0 batches/cpusec, 41.9 batches/wallsec +GRPC/Stream/LBTimed/Async/10 200000 spans, CPU time 6.4 sec, wall time 8.9 sec, 62.3 batches/cpusec, 44.7 batches/wallsec +GRPC/Unary/Async 200000 spans, CPU time 5.8 sec, wall time 12.0 sec, 68.6 batches/cpusec, 33.3 batches/wallsec +WebSocket/Stream/Async 200000 spans, CPU time 5.3 sec, wall time 11.2 sec, 75.3 batches/cpusec, 35.7 batches/wallsec +WebSocket/Stream/Async/zlib 200000 spans, CPU time 15.1 sec, wall time 12.0 sec, 26.5 batches/cpusec, 33.4 batches/wallsec +==================================================================================== +``` + +## Glossary + +There are 2 parties involved in telemetry data exchange. In this document the party that is the source of telemetry data is called the `Client`, the party that is the destination of telemetry data is called the `Server`. + +![Client-Server](images/otlp-client-server.png) + +Examples of a Client are instrumented applications or sending side of telemetry collectors, examples of Servers are telemetry backends or receiving side of telemetry collectors (so a Collector is typically both a Client and a Server depending on which side you look from). + +Both the Client and the Server are also a `Node`. This term is used in the document when referring to either one. + +## Acknowledgements + +Special thanks to Owais Lone who helped to conduct experiments with Load Balancers, to Paulo Janotti, Bogdan Drutu and Yuri Shkuro for thoughtful discussions around the protocol. diff --git a/oteps/0038-version-semantic-attribute.md b/oteps/0038-version-semantic-attribute.md new file mode 100644 index 00000000000..9c7f28797b4 --- /dev/null +++ b/oteps/0038-version-semantic-attribute.md @@ -0,0 +1,32 @@ +# Version Semantic Attribute + +Add a standard `version` semantic attribute. 
+ +## Motivation + +When creating trace data or metrics, it can be extremely useful to know the specific version of the code that emitted the span or measurement being viewed. However, versions can mean different things +to different systems and users. In addition, downstream analysis systems may wish to expose +functionality related to the type of a version (such as detecting when versions are newer or older). +To support this, we should standardize a `version` attribute with optional hints as to the type of the +version. + +## Explanation + +A `version` is a semantic attribute that can be applied to other resources, such as `Service`, +`Component`, `Library`, `Device`, `Platform`, etc. A `version` attribute is optional, but recommended. +The definition of a `version` is a key-value attribute pair of `string` to `string`, with naming schemas +available to hint at the type of a version, such as the following: + +`version=semver:1.2.3` (a semantic version) +`version=git:8ae73a` (a Git SHA hash) +`version=0.0.4.2.20190921` (an untyped version) + +## Internal details + +Since this is just an attribute pair, no special handling is required, although SDKs may provide helper methods +to construct schema-appropriate values. + +## Prior art and alternatives + +Tagging service resources with their version is generally suggested by analysis tools -- see [JAEGER_TAGS](https://www.jaegertracing.io/docs/1.8/client-features/) for an example -- but lacks standardization. diff --git a/oteps/0066-separate-context-propagation.md b/oteps/0066-separate-context-propagation.md new file mode 100644 index 00000000000..d63362bbfe5 --- /dev/null +++ b/oteps/0066-separate-context-propagation.md @@ -0,0 +1,639 @@ +# Context Propagation: A Layered Approach + +* [Motivation](#motivation) +* [OpenTelemetry layered architecture](#opentelemetry-layered-architecture) + * [Cross-Cutting Concerns](#cross-cutting-concerns) + * [Observability API](#observability-api) + * [Correlations API](#correlations-api) + * [Context Propagation](#context-propagation) + * [Context API](#context-api) + * [Propagation API](#propagation-api) +* [Prototypes](#prototypes) +* [Examples](#examples) + * [Global initialization](#global-initialization) + * [Extracting and injecting from HTTP headers](#extracting-and-injecting-from-http-headers) + * [Simplify the API with automated context propagation](#simplify-the-api-with-automated-context-propagation) + * [Implementing a propagator](#implementing-a-propagator) + * [Implementing a concern](#implementing-a-concern) + * [The scope of current context](#the-scope-of-current-context) + * [Referencing multiple contexts](#referencing-multiple-contexts) + * [Falling back to explicit contexts](#falling-back-to-explicit-contexts) +* [Internal details](#internal-details) +* [FAQ](#faq) + +![drawing](img/0066_context_propagation_overview.png) + +A proposal to refactor OpenTelemetry into a set of separate cross-cutting concerns which +operate on a shared context propagation mechanism. + +## Motivation + +This RFC addresses the following topics: + +### Separation of concerns + +* A cleaner package layout results in an easier-to-learn system. It is possible to + understand Context Propagation without needing to understand Observability. +* Allow for multiple types of context propagation, each self-contained with + different rules. For example, TraceContext may be sampled, while + CorrelationContext never is. +* Allow Observability and Context Propagation to have different defaults.
The Observability system ships with a no-op implementation and a pluggable SDK, while + the context propagation system ships with a canonical, working implementation. + +### Extensibility + +* A clean separation allows the context propagation mechanisms to be used on + their own, so they may be consumed by other systems which do not want to + depend on an observability tool for their non-observability concerns. +* Allow developers to create new applications for context propagation. For + example: A/B testing, authentication, and network switching. + +## OpenTelemetry layered architecture + +The design of OpenTelemetry is based on the principles of [aspect-oriented programming](https://en.wikipedia.org/wiki/Aspect-oriented_programming), +adapted to the needs of distributed systems. + +Some concerns "cut across" multiple abstractions in a program. Logging +exemplifies aspect orientation because a logging strategy necessarily affects +every logged part of the system. Logging thereby "cross-cuts" across all logged +classes and methods. Distributed tracing takes this strategy to the next level, +and cross-cuts across all classes and methods in all services in the entire +transaction. This requires a distributed form of the same aspect-oriented +programming principles in order to be implemented cleanly. + +OpenTelemetry approaches this by separating its design into two layers. The top +layer contains a set of independent **cross-cutting concerns**, which intertwine +with a program's application logic and cannot be cleanly encapsulated. All +concerns share an underlying distributed **context propagation** layer, for +storing state and accessing data across the lifespan of a distributed +transaction. + +## Cross-Cutting Concerns + +### Observability API + +Distributed tracing is one example of a cross-cutting concern. Tracing code is +interleaved with regular code, and ties together independent code modules which +would otherwise remain encapsulated. Tracing is also distributed, and requires +transaction-level context propagation in order to execute correctly. + +The various observability APIs are not described here directly. However, in this new +design, all observability APIs would be modified to make use of the generalized +context propagation mechanism described below, rather than the tracing-specific +propagation system they use today. + +Note that OpenTelemetry API calls should *always* be given access to the entire +context object, and never just a subset of the context, such as the value in a +single key. This allows the SDK to make improvements and leverage additional +data that may be available, without changes to all of the call sites. + +The following are notes on the API, and not meant as final. + +**`StartSpan(context, options) -> context`** +When a span is started, a new context is returned, with the new span set as the +current span. + +**`GetSpanPropagator() -> (HTTP_Extractor, HTTP_Injector)`** +When a span is extracted, the extracted value is stored in the context separately +from the current span. + +### Correlations API + +In addition to trace propagation, OpenTelemetry provides a simple mechanism for +propagating indexes, called the Correlations API. Correlations are +intended for indexing observability events in one service with attributes +provided by a prior service in the same transaction. This helps to establish a +causal relationship between these events.
For example, it can help determine that a +particular browser version is associated with a failure in an image processing +service. + +The Correlations API is based on the [W3C Baggage specification](https://www.w3.org/TR/baggage/), +and implements the protocol as it is defined in that working group. Few +details are provided here, as it is outside the scope of this OTEP to finalize +this API. + +While Correlations can be used to prototype other cross-cutting concerns, this +mechanism is primarily intended to convey values for the OpenTelemetry +observability systems. + +For backwards compatibility, OpenTracing Baggage is propagated as Correlations +when using the OpenTracing bridge. New concerns with different criteria should +be modeled separately, using the same underlying context propagation layer as +building blocks. + +The following is an example API, and not meant as final. + +**`GetCorrelation(context, key) -> value`** +To access the value for a label set by a prior event, the Correlations API +provides a function which takes a context and a key as input, and returns a +value. + +**`SetCorrelation(context, key, value) -> context`** +To record the value for a label, the Correlations API provides a function which +takes a context, a key, and a value as input, and returns an updated context +which contains the new value. + +**`RemoveCorrelation(context, key) -> context`** +To delete a label, the Correlations API provides a function +which takes a context and a key as input, and returns an updated context which +no longer contains the selected key-value pair. + +**`ClearCorrelations(context) -> context`** +To avoid sending any labels to an untrusted process, the Correlations API +provides a function to remove all Correlations from a context. + +**`GetCorrelationPropagator() -> (HTTP_Extractor, HTTP_Injector)`** +To deserialize the previous labels set by prior processes, and to serialize the +current total set of labels and send them to the next process, the Correlations +API provides a function which returns a Correlation-specific implementation of +the `HTTP_Extractor` and `HTTP_Injector` functions found in the Propagation API. + +## Context Propagation + +### Context API + +Cross-cutting concerns access data in-process using the same, shared context +object. Each concern uses its own namespaced set of keys in the context, +containing all of the data for that cross-cutting concern. + +The following is an example API, and not meant as final. + +**`CreateKey(name) -> key`** +To allow concerns to control access to their data, the Context API uses keys +which cannot be guessed by third parties which have not been explicitly handed +the key. It is recommended that concerns mediate data access via an API, rather +than provide direct public access to their keys. + +**`GetValue(context, key) -> value`** +To access the local state of a concern, the Context API provides a function +which takes a context and a key as input, and returns a value. + +**`SetValue(context, key, value) -> context`** +To record the local state of a cross-cutting concern, the Context API provides a +function which takes a context, a key, and a value as input, and returns a +new context which contains the new value. Note that the new value is not present +in the old context. + +**`RemoveValue(context, key) -> context`** +RemoveValue returns a new context with the key cleared. Note that the removed +value still remains present in the old context.
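+
+To make the immutability described above concrete, here is a minimal, illustrative Go sketch of such a context type. It is not part of the proposal; the names simply mirror the functions listed above, and a real implementation would differ.
+
+```go
+package contextsketch
+
+// Key is an unguessable handle created by CreateKey; each call to
+// CreateKey yields a distinct pointer, so keys cannot collide.
+type Key *string
+
+// Context is an immutable key-value store; every write returns a copy.
+type Context map[Key]interface{}
+
+// CreateKey returns a new key for a named concern.
+func CreateKey(name string) Key { return Key(&name) }
+
+// GetValue returns the value stored under key, or nil if it is absent.
+func GetValue(ctx Context, key Key) interface{} { return ctx[key] }
+
+// SetValue returns a new Context containing the value; ctx is unchanged.
+func SetValue(ctx Context, key Key, value interface{}) Context {
+	next := make(Context, len(ctx)+1)
+	for k, v := range ctx {
+		next[k] = v
+	}
+	next[key] = value
+	return next
+}
+
+// RemoveValue returns a new Context without the key; ctx is unchanged.
+func RemoveValue(ctx Context, key Key) Context {
+	next := make(Context, len(ctx))
+	for k, v := range ctx {
+		if k != key {
+			next[k] = v
+		}
+	}
+	return next
+}
+```
+
+Because SetValue and RemoveValue copy rather than mutate, handles to earlier contexts keep their original values, which is the property the examples later in this document rely on.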
+ +#### Optional: Automated Context Management + +When possible, the OpenTelemetry context should automatically be associated +with the program execution context. Note that some languages do not provide any +facility for setting and getting a current context. In these cases, the user is +responsible for managing the current context. + +**`GetCurrent() -> context`** +To access the context associated with program execution, the Context API +provides a function which takes no arguments and returns a Context. + +**`SetCurrent(context)`** +To associate a context with program execution, the Context API provides a +function which takes a Context. + +### Propagation API + +Cross-cutting concerns send their state to the next process via propagators: +functions which read and write context into RPC requests. Each concern creates a +set of propagators for every type of supported medium - currently only HTTP +requests. + +The following is an example API, and not meant as final. + +**`Extract(context, []http_extractor, headers) -> context`** +In order to continue transmitting data injected earlier in the transaction, +the Propagation API provides a function which takes a context, a set of +HTTP_Extractors, and a set of HTTP headers, and returns a new context which +includes the state sent from the prior process. + +**`Inject(context, []http_injector, headers) -> headers`** +To send the data for all concerns to the next process in the transaction, the +Propagation API provides a function which takes a context, a set of +HTTP_Injectors, and a set of HTTP headers, and returns the headers with an +HTTP header representation of the context added. + +**`HTTP_Extractor(context, headers) -> context`** +Each concern must implement an HTTP_Extractor, which can locate the headers +containing the http-formatted data, and then translate the contents into an +in-memory representation, set within the returned context object. + +**`HTTP_Injector(context, headers) -> headers`** +Each concern must implement an HTTP_Injector, which can take the in-memory +representation of its data from the given context object, and add it to an +existing set of HTTP headers. + +#### Optional: Global Propagators + +It may be convenient to create a list of propagators during program +initialization, and then access these propagators later in the program. +To facilitate this, global injectors and extractors are optionally available. +However, there is no requirement to use this feature. + +**`GetExtractors() -> []http_extractor`** +To access the global extractors, the Propagation API provides a function which +returns the list of extractors. + +**`SetExtractors([]http_extractor)`** +To update the global extractors, the Propagation API provides a function which +takes a list of extractors. + +**`GetInjectors() -> []http_injector`** +To access the global injectors, the Propagation API provides a function which +returns the list of injectors. + +**`SetInjectors([]http_injector)`** +To update the global injectors, the Propagation API provides a function which +takes a list of injectors. + +## Prototypes + +**Erlang:** +**Go:** +**Java:** +**Python:** +**Ruby:** +**C#/.NET:** + +## Examples + +It might be helpful to look at some examples, written in pseudocode. Note that +the pseudocode only uses simple functions and immutable values. Most mutable, +object-oriented languages will use objects, such as a Span object, to encapsulate +the context object and hide it from the user in most cases.
+ +Let's describe +a simple scenario, where `service A` responds to an HTTP request from a `client` +with the result of a request to `service B`. + +``` +client -> service A -> service B +``` + +Now, let's assume the `client` in the above system is version v1.0. With version +v2.0 of the `client`, `service A` must call `service C` instead of `service B` +in order to return the correct data. + +``` +client -> service A -> service C +``` + +In this example, we would like `service A` to decide on which backend service +to call, based on the client version. We would also like to trace the entire +system, in order to understand whether requests to `service C` are slower or faster +than requests to `service B`. What might `service A` look like? + +### Global initialization + +First, during program initialization, `service A` configures correlation and tracing +propagation, and includes them in the global list of injectors and extractors. +Let's assume this tracing system is configured to use B3, and has a specific +propagator for that format. Initializing the propagators might look like this: + +```php +func InitializeOpentelemetry() { + // create the propagators for tracing and correlations. + bagExtract, bagInject = Correlations::HTTPPropagator() + traceExtract, traceInject = Tracer::B3Propagator() + + // add the propagators to the global list. + Propagation::SetExtractors(bagExtract, traceExtract) + Propagation::SetInjectors(bagInject, traceInject) +} +``` + +### Extracting and injecting from HTTP headers + +These propagators can then be used in the request handler for `service A`. The +tracing and correlations concerns use the context object to handle state without +breaking the encapsulation of the functions they are embedded in. + +```php +func ServeRequest(context, request, project) -> (context) { + // Extract the context from the HTTP headers. Because the list of + // extractors includes a trace extractor and a correlations extractor, the + // contents for both systems are extracted from the request headers into the + // returned context. + extractors = Propagation::GetExtractors() + context = Propagation::Extract(context, extractors, request.Headers) + + // Start a span, setting the parent to the span context received from + // the client process. The new span will then be in the returned context. + context = Tracer::StartSpan(context, [span options]) + + // Determine the version of the client, in order to handle the data + // migration and allow new clients access to a data source that older + // clients are unaware of. + version = Correlations::GetCorrelation( context, "client-version") + + switch( version ){ + case "v1.0": + context, data = FetchDataFromServiceB(context) + case "v2.0": + context, data = FetchDataFromServiceC(context) + } + + context = request.Response(context, data) + + // End the current span + Tracer::EndSpan(context) + + return context +} + +func FetchDataFromServiceB(context) -> (context, data) { + request = NewRequest([request options]) + + // Inject the contexts to be propagated. Note that there is no direct + // reference to tracing or correlations. + injectors = Propagation::GetInjectors() + request.Headers = Propagation::Inject(context, injectors, request.Headers) + + // make an http request + data = request.Do() + + return context, data +} +``` + +### Simplify the API with automated context propagation + +In the version of the pseudocode above, we assume that the context object is +explicit, and is passed to and returned from every function as an ordinary parameter.
+This is cumbersome, and in many languages, a mechanism exists which allows +context to be propagated automatically. + +In this version of pseudocode, assume that the current context can be stored as +a thread local, and is implicitly passed to and returned from every function. + +```php +func ServeRequest(request, project) { + extractors = Propagation::GetExtractors() + Propagation::Extract(extractors, request.Headers) + + Tracer::StartSpan([span options]) + + version = Correlations::GetCorrelation("client-version") + + switch( version ){ + case "v1.0": + data = FetchDataFromServiceB() + case "v2.0": + data = FetchDataFromServiceC() + } + + request.Response(data) + + Tracer::EndSpan() +} + +func FetchDataFromServiceB() -> (data) { + request = newRequest([request options]) + + injectors = Propagation::GetInjectors() + Propagation::Inject(request.Headers) + + data = request.Do() + + return data +} +``` + +### Implementing a propagator + +Digging into the details of the tracing system, what might the internals of a +span context propagator look like? Here is a crude example of extracting and +injecting B3 headers, using an explicit context. + +```php + func B3Extractor(context, headers) -> (context) { + context = Context::SetValue( context, + "trace.parentTraceID", + headers["X-B3-TraceId"]) + context = Context::SetValue( context, + "trace.parentSpanID", + headers["X-B3-SpanId"]) + return context + } + + func B3Injector(context, headers) -> (headers) { + headers["X-B3-TraceId"] = Context::GetValue( context, "trace.parentTraceID") + headers["X-B3-SpanId"] = Context::GetValue( context, "trace.parentSpanID") + + return headers + } +``` + +### Implementing a concern + +Now, have a look at a crude example of how StartSpan might make use of the +context. Note that this code must know the internal details about the context +keys in which the propagators above store their data. For this pseudocode, let's +assume again that the context is passed implicitly in a thread local. + +```php + func StartSpan(options) { + spanData = newSpanData() + + spanData.parentTraceID = Context::GetValue( "trace.parentTraceID") + spanData.parentSpanID = Context::GetValue( "trace.parentSpanID") + + spanData.traceID = newTraceID() + spanData.spanID = newSpanID() + + Context::SetValue( "trace.parentTraceID", spanData.traceID) + Context::SetValue( "trace.parentSpanID", spanData.spanID) + + // store the spanData object as well, for in-process propagation. Note that + // this key will not be propagated, it is for local use only. + Context::SetValue( "trace.currentSpanData", spanData) + + return + } +``` + +### The scope of current context + +Let's look at a couple other scenarios related to automatic context propagation. + +When are the values in the current context available? Scope management may be +different in each language, but as long as the scope does not change (by +switching threads, for example) the current context follows the execution of +the program. This includes after a function returns. Note that the context +objects themselves are immutable, so explicit handles to prior contexts will not +be updated when the current context is changed. 
+ +```php +func Request() { + emptyContext = Context::GetCurrent() + + Context::SetValue( "say-something", "foo") + secondContext = Context::GetCurrent() + + print(Context::GetValue("say-something")) // prints "foo" + + DoWork() + + thirdContext = Context::GetCurrent() + + print(Context::GetValue("say-something")) // prints "bar" + + print( emptyContext.GetValue("say-something") ) // prints "" + print( secondContext.GetValue("say-something") ) // prints "foo" + print( thirdContext.GetValue("say-something") ) // prints "bar" +} + +func DoWork(){ + Context::SetValue( "say-something", "bar") +} +``` + +### Referencing multiple contexts + +If context propagation is automatic, does the user ever need to reference a +context object directly? Sometimes. Even when automated context propagation is +an available option, there is no restriction which says that concerns must only +ever access the current context. + +For example, if a concern wants to merge the data from two contexts, at +least one of them will not be the current context. + +```php +mergedContext = MergeCorrelations( Context::GetCurrent(), otherContext) +Context::SetCurrent(mergedContext) +``` + +### Falling back to explicit contexts + +Sometimes, supplying an additional version of a function which uses explicit +contexts is necessary, in order to handle edge cases. For example, in some cases +an extracted context is not intended to be set as the current context. An +alternate extract method can be added to the API to handle this. + +```php +// Most of the time, the extract function operates on the current context. +Extract(headers) + +// When a context needs to be extracted without changing the current +// context, fall back to the explicit API. +otherContext = ExtractWithContext(Context::GetCurrent(), headers) +``` + +## Internal details + +![drawing](img/0066_context_propagation_details.png) + +### Example Package Layout + +``` + Context + ContextAPI + Observability + Correlations + CorrelationAPI + HttpInjector + HttpExtractor + Metrics + MetricAPI + Trace + TracerAPI + HttpInjector + HttpExtractor + Propagation + Registry + HttpInjectorInterface + HttpExtractorInterface +``` + +### Edge Cases + +There are some complications that can arise when managing a span context extracted off the wire and in-process spans for tracer operations that operate on an implicit parent. In order to ensure that a context key references an object of the expected type and that the proper implicit parent is used, the following conventions have been established: + +### Extract + +When extracting a remote context, the extracted span context MUST be stored separately from the current span. + +### Default Span Parentage + +When a new span is created from a context, a default parent for the span can be assigned. The order of assignment is as follows: + +* The current span. +* The extracted span. +* The root span. + +### Inject + +When injecting a span to send over the wire, the default order of +assignment is as follows: + +* The current span. +* The extracted span. + +### Default HTTP headers + +OpenTelemetry currently uses two standard header formats for context propagation. +Their properties and requirements are integrated into the OpenTelemetry APIs. + +**Span Context -** The OpenTelemetry Span API is modeled on the `traceparent` +and `tracestate` headers defined in the [W3C Trace Context specification](https://www.w3.org/TR/trace-context/).
+ +**Correlation Context -** The OpenTelemetry Correlations API is modeled on the +`Baggage` headers defined in the [W3C Baggage specification](https://www.w3.org/TR/baggage/). + +### Context management and in-process propagation + +In order for Context to function, it must always remain bound to the execution +of code it represents. By default, this means that the programmer must pass a +Context down the call stack as a function parameter. However, many languages +provide automated context management facilities, such as thread locals. +OpenTelemetry should leverage these facilities when available, in order to +provide automatic context management. + +### Pre-existing context implementations + +In some languages, a single, widely used context implementation exists. In other +languages, there may be too many implementations, or none at all. For example, +Go has the `context.Context` object, and widespread conventions for how to +pass it down the call stack. Java has MDC, along with several other context +implementations, but none are so widely used that their presence can be +guaranteed or assumed. + +In the cases where an extremely clear, pre-existing option is not available, +OpenTelemetry should provide its own context implementation. + +## FAQ + +### What about complex propagation behavior? + +Some OpenTelemetry proposals have called for more complex propagation behavior, +such as falling back to extracting B3 headers if W3C Trace Context headers +are not found. "Fallback propagators" and other complex behavior can be modeled as +implementation details behind the Propagator interface. Therefore, the +propagation system itself does not need to provide a mechanism for chaining +together propagators or other additional facilities. + +## Prior art and alternatives + +Prior art: + +* OpenTelemetry distributed context +* OpenCensus propagators +* OpenTracing spans +* gRPC context + +## Risks + +The Correlations API is related to the [W3C Baggage](https://www.w3.org/TR/baggage/) +specification. Work on this specification has begun, but is not complete. While +unlikely, it is possible that this W3C specification could diverge from the +design or guarantees needed by the Correlations API. + +## Future possibilities + +Cleanly splitting OpenTelemetry into Aspects and a Context Propagation layer may +allow us to move the Context Propagation layer into its own, stand-alone +project. This may facilitate adoption, by allowing us to share Context +Propagation with gRPC and other projects. diff --git a/oteps/0083-component.md b/oteps/0083-component.md new file mode 100644 index 00000000000..e29e6d90341 --- /dev/null +++ b/oteps/0083-component.md @@ -0,0 +1,117 @@ +# InstrumentationLibrary + +Introducing the notion of `InstrumentationLibrary` as a first-class concept in +OpenTelemetry, which describes the named `Tracer|Meter` in the data model. + +## Motivation + +The main motivation for this is to better model telemetry coming from different +instrumentation libraries inside an Application (`Resource`). This change +affects only the OpenTelemetry protocol and exporters, and it does not require +any API changes. + +The proposal is to make `InstrumentationLibrary` a first-class concept in +OpenTelemetry, with a scope defined by the `named Tracer|Meter`. It describes +all telemetry generated by one `named Tracer|Meter`.
+ +## Explanation + +``` + Application ++--------------------------------------------------------------------------+ +| TracerProvider(Resource)                                                  | +| MeterProvider(Resource)                                                   | +|                                                                          | +| Instrumentation Library 1           Instrumentation Library 2            | +| +--------------------------------+ +--------------------------------+   | +| | Tracer(InstrumentationLibrary) | | Tracer(InstrumentationLibrary) |   | +| | Meter(InstrumentationLibrary)  | | Meter(InstrumentationLibrary)  |   | +| +--------------------------------+ +--------------------------------+   | +|                                                                          | +| Instrumentation Library 3           Instrumentation Library 4            | +| +--------------------------------+ +--------------------------------+   | +| | Tracer(InstrumentationLibrary) | | Tracer(InstrumentationLibrary) |   | +| | Meter(InstrumentationLibrary)  | | Meter(InstrumentationLibrary)  |   | +| +--------------------------------+ +--------------------------------+   | +|                                                                          | ++--------------------------------------------------------------------------+ +``` + +Every application has one `TracerProvider` and one `MeterProvider` which have a +`Resource` associated with them that describes the Application. + +Every instrumentation library has one `Tracer` and one `Meter` which have an +`InstrumentationLibrary` associated with them that describes the instrumentation +library. + +In the case of multi-application deployments like Spring Boot (or old Java +application servers), every Application will have its own `TracerProvider` and +`MeterProvider` instances. + +## Internal details + +This proposal affects only the OpenTelemetry protocol, and proposes a way to +represent the telemetry data in a structured way. +For example, here is the protobuf definition for metrics: + +```proto +// Resource information. +message Resource { + // Set of labels that describe the resource. + repeated opentelemetry.proto.common.v1.AttributeKeyValue attributes = 1; +} + +// InstrumentationLibrary is a message representing the instrumentation library +// information such as the fully qualified name and version. +message InstrumentationLibrary { + string name = 1; + string version = 2; + // Can be extended with attributes. +} + +// A collection of InstrumentationLibraryMetrics from a Resource. +message ResourceMetrics { + // The Resource for the metrics in this message. + // If this field is not set then no Resource is known. + Resource resource = 1; + + // A list of metrics that originate from a Resource. + repeated InstrumentationLibraryMetrics instrumentation_library_metrics = 2; +} + +// A collection of Metrics from an InstrumentationLibrary. +message InstrumentationLibraryMetrics { + // The Instrumentation library information for the metrics in this message. + // If this field is not set then no InstrumentationLibrary is known. + InstrumentationLibrary instrumentation_library = 1; + + // A list of metrics that originate from an instrumentation library. + repeated Metric metrics = 2; +} +``` + +## Trade-offs and mitigations + +Adding a new concept into the OpenTelemetry protocol and the exporter framework +may be overkill, but this concept maps easily to an already existing concept +in the API, and it is easy to explain to users. + +## Prior art and alternatives + +An alternative approach was proposed in the proto [PR comment]( +https://github.com/open-telemetry/opentelemetry-proto/pull/94#discussion_r369952371) +which suggested enclosing `Resource` at multiple levels, including the +named `Tracer|Meter`. + +## Open questions + +1.
Should we support more than name and version for the InstrumentationLibrary +now? + +## Future possibilities + +In the future, the `InstrumentationLibrary` can be extended to support multiple +properties (attributes) that apply to the specific instance of the +instrumenting library. +Also, information about the instrumented library could be added, but that will require additional consideration about grouping, like grouping by the pair (instrumenting lib, instrumented lib) instead of just by the instrumenting lib. diff --git a/oteps/0099-otlp-http.md b/oteps/0099-otlp-http.md new file mode 100644 index 00000000000..bbeeb571e72 --- /dev/null +++ b/oteps/0099-otlp-http.md @@ -0,0 +1,171 @@ +# OTLP/HTTP: HTTP Transport Extension for OTLP + +This is a proposal to add an HTTP Transport extension for +[OTLP](0035-opentelemetry-protocol.md) (OpenTelemetry Protocol). + +## Table of Contents + +* [Motivation](#motivation) +* [OTLP/HTTP Protocol Details](#otlphttp-protocol-details) + * [Request](#request) + * [Response](#response) + * [Success](#success) + * [Failures](#failures) + * [Throttling](#throttling) + * [All Other Responses](#all-other-responses) + * [Connection](#connection) + * [Parallel Connections](#parallel-connections) +* [Prior Art and Alternatives](#prior-art-and-alternatives) + +## Motivation + +OTLP can currently be communicated only via one transport: gRPC. While using +gRPC has certain benefits, there are also drawbacks: + +- Some users have infrastructure limitations that make gRPC-based protocol + usage impossible. For example, AWS ALB does not support gRPC connections. + +- gRPC is a relatively big dependency, which some clients are not willing to + take. Plain HTTP is a smaller dependency and is built into the standard + libraries of many programming languages. + +## OTLP/HTTP Protocol Details + +This proposal keeps the existing specification of OTLP over gRPC transport +(OTLP/gRPC for short) and defines an additional way to use the OTLP protocol over +HTTP transport (OTLP/HTTP for short). OTLP/HTTP uses the same ProtoBuf payload +that is used by OTLP/gRPC and defines how this payload is communicated over HTTP +transport. + +OTLP/HTTP uses HTTP POST requests to send telemetry data from clients to +servers. Implementations MAY use HTTP/1.1 or HTTP/2 transports. Implementations +that use HTTP/2 transport SHOULD fall back to HTTP/1.1 transport if an HTTP/2 +connection cannot be established. + +### Request + +Telemetry data is sent via an HTTP POST request. + +The default URL path for requests that carry trace data is `/v1/traces` (for +example, the full URL when connecting to the "example.com" server will be +`https://example.com/v1/traces`). The request body is a ProtoBuf-encoded +[`ExportTraceServiceRequest`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/trace/v1/trace_service.proto#L38) +message. + +The default URL path for requests that carry metric data is `/v1/metrics`, and the +request body is a ProtoBuf-encoded +[`ExportMetricsServiceRequest`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/metrics/v1/metrics_service.proto#L35) +message. + +The client MUST set the "Content-Type: application/x-protobuf" request header. The +client MAY gzip the content, and in that case SHOULD include the "Content-Encoding: +gzip" request header. The client MAY include the "Accept-Encoding: gzip" request +header if it can receive gzip-encoded responses.
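+
+As a concrete, non-normative illustration, a client written in Go might construct such a request as shown below. This is a sketch only: it assumes the ProtoBuf-encoded `ExportTraceServiceRequest` bytes are already available in `payload`, and the `postTraces` function name and endpoint handling are illustrative rather than part of this proposal.
+
+```go
+package otlphttpclient
+
+import (
+	"bytes"
+	"compress/gzip"
+	"fmt"
+	"net/http"
+)
+
+// postTraces sends an already-serialized ExportTraceServiceRequest payload
+// to an OTLP/HTTP endpoint, gzip-compressing the request body.
+func postTraces(endpoint string, payload []byte) error {
+	var body bytes.Buffer
+	zw := gzip.NewWriter(&body)
+	if _, err := zw.Write(payload); err != nil {
+		return err
+	}
+	if err := zw.Close(); err != nil {
+		return err
+	}
+
+	req, err := http.NewRequest(http.MethodPost, endpoint+"/v1/traces", &body)
+	if err != nil {
+		return err
+	}
+	// Headers described above: required Content-Type plus optional gzip handling.
+	req.Header.Set("Content-Type", "application/x-protobuf")
+	req.Header.Set("Content-Encoding", "gzip")
+	req.Header.Set("Accept-Encoding", "gzip")
+
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode != http.StatusOK {
+		return fmt.Errorf("export failed: %s", resp.Status)
+	}
+	return nil
+}
+```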
+ +Non-default URL paths for requests MAY be configured on the client and server +sides. + +### Response + +#### Success + +On success the server MUST respond with `HTTP 200 OK`. Response body MUST be +ProtoBuf-encoded +[`ExportTraceServiceResponse`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/trace/v1/trace_service.proto#L47) +message for traces and +[`ExportMetricsServiceResponse`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/metrics/v1/metrics_service.proto#L44) +message for metrics. + +The server MUST set "Content-Type: application/x-protobuf" response header. If +the request header "Accept-Encoding: gzip" is present in the request the server +MAY gzip-encode the response and set "Content-Encoding: gzip" response header. + +The server SHOULD respond with success no sooner than after successfully +decoding and validating the request. + +#### Failures + +If the processing of the request fails the server MUST respond with appropriate +`HTTP 4xx` or `HTTP 5xx` status code. See sections below for more details about +specific failure cases and HTTP status codes that should be used. + +Response body for all `HTTP 4xx` and `HTTP 5xx` responses MUST be a +ProtoBuf-encoded +[Status](https://godoc.org/google.golang.org/genproto/googleapis/rpc/status#Status) +message that describes the problem. + +This specification does not use `Status.code` field and the server MAY omit +`Status.code` field. The clients are not expected to alter their behavior based +on `Status.code` field but MAY record it for troubleshooting purposes. + +The `Status.message` field SHOULD contain a developer-facing error message as +defined in `Status` message schema. + +The server MAY include `Status.details` field with additional details. Read +below about what this field can contain in each specific failure case. + +#### Bad Data + +If the processing of the request fails because the request contains data that +cannot be decoded or is otherwise invalid and such failure is permanent then the +server MUST respond with `HTTP 400 Bad Request`. The `Status.details` field in +the response SHOULD contain a +[BadRequest](https://github.com/googleapis/googleapis/blob/d14bf59a446c14ef16e9931ebfc8e63ab549bf07/google/rpc/error_details.proto#L166) +that describes the bad data. + +The client MUST NOT retry the request when it receives `HTTP 400 Bad Request` +response. + +#### Throttling + +If the server receives more requests than the client is allowed or the server is +overloaded the server SHOULD respond with `HTTP 429 Too Many Requests` or +`HTTP 503 Service Unavailable` and MAY include +["Retry-After"](https://tools.ietf.org/html/rfc7231#section-7.1.3) header with a +recommended time interval in seconds to wait before retrying. + +The client SHOULD honour the waiting interval specified in "Retry-After" header +if it is present. If the client receives `HTTP 429` or `HTTP 503` response and +"Retry-After" header is not present in the response then the client SHOULD +implement an exponential backoff strategy between retries. + +#### All Other Responses + +All other HTTP responses that are not explicitly listed in this document should +be treated according to HTTP specification. + +If the server disconnects without returning a response the client SHOULD retry +and send the same request. 
The client SHOULD implement an exponential backoff +strategy between retries to avoid overwhelming the server. + +### Connection + +If the client is unable to connect to the server, the client SHOULD retry the +connection using an exponential backoff strategy between retries. The interval +between retries must have a random jitter. + +The client SHOULD keep the connection alive between requests. + +Server implementations MAY handle OTLP/gRPC and OTLP/HTTP requests on the same +port and multiplex the connections to the corresponding transport handler based +on the "Content-Type" request header. + +### Parallel Connections + +To achieve higher total throughput, the client MAY send requests using several +parallel HTTP connections. In that case, the maximum number of parallel +connections SHOULD be configurable. + +## Prior Art and Alternatives + +An HTTP/1.1+WebSocket transport was also considered. An experimental implementation +of OTLP over WebSocket transport has shown that it typically has better +performance than a plain HTTP transport implementation (WebSocket uses less CPU and +achieves higher throughput on high-latency connections). However, WebSocket transport +requires a slightly more complicated implementation, and WebSocket libraries are +less ubiquitous than plain HTTP, which may make implementation in certain +languages difficult or impossible. + +HTTP/1.1+WebSocket transport may be considered as a future transport for +high-performance use cases, as it exhibits better performance than OTLP/gRPC and +OTLP/HTTP. diff --git a/oteps/0110-z-pages.md b/oteps/0110-z-pages.md new file mode 100644 index 00000000000..e9eac4e053a --- /dev/null +++ b/oteps/0110-z-pages.md @@ -0,0 +1,37 @@ +# zPages: general direction (#110) + +Make zPages a standard OpenTelemetry component. + +## Motivation + +Self-introspection debug pages, or zPages, are in-process web pages that display collected data from the process they are attached to. They are used to provide in-process diagnostics without the need for any backend to examine traces or metrics. Various implementations of zPages are widely used in many environments. A standard, extensible implementation of zPages in OpenTelemetry will benefit everybody. + +## Explanation + +This OTEP is a request to get general approval for zPages development as an experimental feature [open-telemetry/opentelemetry-specification#62](https://github.com/open-telemetry/opentelemetry-specification/pull/632). See [opencensus.io/zpages](https://opencensus.io/zpages/) for an overview of zPages. + +## Internal details + +Implementation of zPages includes multiple components - data collection (sampling, filtering, configuration), storage and aggregation, and a framework to expose this data. + +This is a request for approval of the general direction. There are a few principles for the development: + +1. zPages MUST NOT be hardcoded into the OpenTelemetry SDK. +2. The OpenTelemetry implementation of zPages MUST be split into two separate components - one for data, another for rendering - so that, for example, data providers can also be integrated into other rendering frameworks. +3. zPages SHOULD be built as a framework that provides a way to extend the information exposed from the process, ideally all the way to replacing the OpenTelemetry SDK with an alternative source of information. + +## Trade-offs and mitigations + +We may discover that implementing zPages as vendor-specific or user-specific plugins is preferable. Based on initial investigation, an extensible standard implementation will benefit everybody.
+ +## Prior art and alternatives + +[opencensus.io/zpages](https://opencensus.io/zpages/) + +## Open questions + +N/A + +## Future possibilities + +N/A diff --git a/oteps/0111-auto-resource-detection.md b/oteps/0111-auto-resource-detection.md new file mode 100644 index 00000000000..c4fad09e728 --- /dev/null +++ b/oteps/0111-auto-resource-detection.md @@ -0,0 +1,178 @@ +# Automatic Resource Detection + +Introduce a mechanism to support auto-detection of resources. + +## Motivation + +Resource information, i.e. attributes associated with the entity producing +telemetry, can currently be supplied to tracer and meter providers or appended +in custom exporters. In addition to this, it would be useful to have a mechanism +to automatically detect resource information from the host (e.g. from an +environment variable or from aws, gcp, etc metadata) and apply this to all kinds +of telemetry. This will in many cases prevent users from having to manually +configure resource information. + +Note there are some existing implementations of this already in the SDKs (see +[below](#prior-art-and-alternatives)), but nothing currently in the +specification. + +## Explanation + +In order to apply auto-detected resource information to all kinds of telemetry, +a user will need to configure which resource detector(s) they would like to run +(e.g. AWS EC2 detector). + +If multiple detectors are configured, and more than one of these successfully +detects a resource, the resources will be merged according to the Merge +interface already defined in the specification, i.e. the earliest matched +resource's attributes will take precedence. Each detector may be run in +parallel, but to ensure deterministic results, the resources must be merged in +the order the detectors were added. + +A default implementation of a detector that reads resource data from the +`OTEL_RESOURCE` environment variable will be included in the SDK. The +environment variable will contain of a list of key value pairs, and these are +expected to be represented in a format similar to the [W3C +Baggage](https://github.com/w3c/baggage/blob/master/baggage/HTTP_HEADER_FORMAT.md#header-content), +except that additional semi-colon delimited metadata is not supported, i.e.: +`key1=value1,key2=value2`. If the user does not specify any resource, this +detector will be run by default. + +Custom resource detectors related to specific environments (e.g. specific cloud +vendors) must be implemented as packages separate to the core SDK, and users +will need to import these separately. + +## Internal details + +As described above, the following will be added to the Resource SDK +specification: + +- An interface for "detectors", to retrieve resource information +- Specification for a global function to merge resources returned by a set of + detectors +- Details of the "from environment variable" detector implementation as + described above +- Specification that default detection (from environment variable) runs once on + startup, and is used by all tracer & meter providers by default if no custom + resource is supplied + +### Usage + +The following example in Go creates a tracer and meter provider that uses +resource information automatically detected from AWS or GCP: + +Assumes a dependency has been added on the `otel/api`, `otel/sdk`, +`otel/awsdetector`, and `otel/gcpdetector` packages. 
+ +```go +resource, _ := sdkresource.Detect(ctx, 5 * time.Second, awsdetector.ec2, gcpdetector.gce) +tp := sdktrace.NewProvider(sdktrace.WithResource(resource)) +mp := push.New(..., push.WithResource(resource)) +``` + +### Components + +#### Detector + +The `Detector` interface will simply contain a `Detect` function that returns a +Resource. + +The `Detect` function should contain a mechanism to timeout and cancel the +operation. If a detector is not able to detect a resource, it must return an +uninitialized resource such that the result of each call to `Detect` can be +merged. + +#### Global Function + +The SDK will also provide a global `Detect` function. This will take a timeout +duration and a set of detectors that should be run and merged in order as +described in the intro, and return a resource. + +### Error Handling + +In the case of one or more detectors raising an error, there are two reasonable +options: + +1. Ignore that detector, and continue with a warning (likely meaning we will + continue without expected resource information) +2. Crash the application (raise a panic) + +The user can decide how to recover from failure. + +## Trade-offs and mitigations + +- This OTEP proposes storing Vendor resource detection packages outside of the + SDK. This ensures the SDK is free of vendor specific code. Given the + relatively straightforward & minimal amount of code generally needed to + perform resource detection, and the relatively small number of cloud + providers, we may instead decide its okay for all the resource detection code + to live in the SDK directly. + - If we do allow Vendor resource detection packages in the SDK, we presumably + need to restrict these to not being able to use non-trivial libraries +- This OTEP proposes only performing environment variable resource detection by + default. Given the relatively small number of cloud providers, we may instead + decide its okay to run all detectors by default. This raises the question of + if any restrictions would need to be put on this, and how we would handle this + in the future if the number of Cloud Providers rises. It would be difficult to + back out of running these by default as that would lead to a breaking change. +- This OTEP proposes a global function the user calls with the detectors they + want to run, and then expects the user to pass these into the providers. An + alternative option (that was previously proposed in this OTEP) would be to + supply a set of detectors directly to the metric or trace provider instead of, + or as an additional option to, a static resource. That would result in + marginally simpler setup code where the user doesn't need to call `AutoDetect` + themselves. Another advantage of this approach is that its easier to specify + default detectors and override these separately to any static resource the + user may want to provide. On the downside, this approach adds the complexity + of having to deal with the merging the detected resource with a static + resource if provided. It also potentially adds a lot of complexity around how + to avoid having detectors run multiple times since they will be configured for + each provider. Avoiding having to specify detectors for tracer & meter + providers is the primary reason for not going with that option in the end. +- The attribute proto now supports arrays & maps. 
We could support parsing this + out of the `OTEL_RESOURCE` environment variable similar to how Correlation + Context supports semi colon lists of keys & key-value pairs, but the added + complexity is probably not worthwhile implementing unless someone has a good + use case for this. +- In the case of an error at resource detection time, another alternative would + be to start a background thread to retry following some strategy, but it's not + clear that there would be much value in doing this, and it would add + considerable unnecessary complexity. + +## Prior art and alternatives + +This proposal is largely inspired by the existing OpenCensus specification, the +OpenCensus Go implementation, and the OpenTelemetry JS implementation. For +reference, see the relevant section of the [OpenCensus +specification](https://github.com/census-instrumentation/opencensus-specs/blob/master/resource/Resource.md#populating-resources) + +### Existing OpenTelemetry implementations + +- Resource detection implementation in JS SDK + [here](https://github.com/open-telemetry/opentelemetry-js/tree/master/packages/opentelemetry-resources): + The JS implementation is very similar to this proposal. This proposal states + that the SDK will allow detectors to be passed into telemetry providers + directly instead of just having a global `DetectResources` function which the + user will need to call and pass in explicitly. In addition, vendor specific + resource detection code is currently in the JS resource package, so this would + need to be separated. +- Environment variable resource detection in Java SDK + [here](https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk-extensions/autoconfigure/src/main/java/io/opentelemetry/sdk/autoconfigure/ResourceConfiguration.java): + This implementation does not currently include a detector interface, but is + used by default for tracer and meter providers + +## Open questions + +- Does this interfere with any other upcoming specification changes related to + resources? +- If custom detectors need to live outside the core repo, what is the + expectation regarding where they should be hosted? +- Also see the [Trade-offs and mitigations](#trade-offs-and-mitigations) section + +## Future possibilities + +When the Collector is run as an agent, the same interface, shared with the Go +SDK, could be used to append resource information detected from the host to all +kinds of telemetry in a Processor (probably as an extension to the existing +Resource Processor). This would require a translation from the SDK resource to +the collector's internal representation of a resource. diff --git a/oteps/0119-standard-system-metrics.md b/oteps/0119-standard-system-metrics.md new file mode 100644 index 00000000000..5e79f8c38be --- /dev/null +++ b/oteps/0119-standard-system-metrics.md @@ -0,0 +1,149 @@ +# Standard names for system/runtime metric instruments + +This OTEP proposes a set of standard names, labels, and semantic conventions for common system/runtime metric instruments in OpenTelemetry. The instrument names proposed here are common across the supported operating systems and runtime environments. Also included are general semantic conventions for system/runtime metrics including those not specific to a particular OS or runtime. + +This OTEP is largely based on the existing implementation in the OpenTelemetry Collector's [Host Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector/tree/1ad767e62f3dff6f62f32c7360b6fefe0fbf32ff/receiver/hostmetricsreceiver). 
The proposed names aim to make system/runtime metrics unambiguous and easily discoverable. See [OTEP #108](https://github.com/open-telemetry/oteps/pull/108/files) for additional motivation.
+
+## Trade-offs and mitigations
+
+When naming a metric instrument, there is a trade-off between discoverability and ambiguity. For example, a metric called `system.cpu.load_average` is very discoverable, but the meaning of this metric is ambiguous. [Load average](https://en.wikipedia.org/wiki/Load_(computing)) is well defined on UNIX, but is not a standard metric on Windows. While discoverability is important, names must be unambiguous.
+
+## Prior art
+
+There are already a few implementations of system and/or runtime metric collection in OpenTelemetry:
+
+- **[OTEP #108](https://github.com/open-telemetry/oteps/pull/108/files)**
+  * Provides high level guidelines around naming metric instruments.
+  * Came out of a [prior proposal](https://docs.google.com/spreadsheets/d/1WlStcUe2eQoN1y_UF7TOd6Sw7aV_U0lFcLk5kBNxPsY/edit#gid=0) for system metrics.
+- **Collector**
+  * [Host Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector/tree/1ad767e62f3dff6f62f32c7360b6fefe0fbf32ff/receiver/hostmetricsreceiver) generates metrics about the host system when run as an agent.
+  * Currently is the most comprehensive implementation.
+  * Collects system metrics for CPU, memory, swap, disks, filesystems, network, and load.
+  * There are plans to collect process metrics for CPU, memory, and disk I/O.
+  * Makes good use of labels rather than defining individual metrics.
+  * [Overview of collected metrics](https://docs.google.com/spreadsheets/d/11qSmzD9e7PnzaJPYRFdkkKbjTLrAKmvyQpjBjpJsR2s).
+- **Go**
+  * Go [has instrumentation](https://github.com/open-telemetry/opentelemetry-go-contrib/tree/master/instrumentation/runtime) to collect runtime metrics for GC, heap use, and goroutines.
+  * This package does not export metrics with labels, instead exporting individual metrics.
+  * [Overview of collected metrics](https://docs.google.com/spreadsheets/d/1r50cC9ass0A8SZIg2ZpLdvZf6HmQJsUSXFOu-rl4yaY/edit#gid=0).
+- **Python**
+  * Python [has instrumentation](https://github.com/open-telemetry/opentelemetry-python-contrib/tree/main/instrumentation/opentelemetry-instrumentation-system-metrics) to collect some system and runtime metrics.
+  * Collects system CPU, memory, and network metrics.
+  * Collects runtime CPU, memory, and GC metrics.
+  * Makes use of labels, similar to the Collector.
+  * [Overview of collected metrics](https://docs.google.com/spreadsheets/d/1r50cC9ass0A8SZIg2ZpLdvZf6HmQJsUSXFOu-rl4yaY/edit#gid=0).
+
+## Semantic Conventions
+
+The following semantic conventions aim to keep naming consistent. Not all possible metrics are covered by these conventions, but they provide guidelines for most of the cases in this proposal:
+
+- **usage** - an instrument that measures an amount used out of a known total amount should be called `entity.usage`. For example, `system.filesystem.usage` for the amount of disk space used. A measure of the amount of an unlimited resource consumed (such as time or data) is differentiated from **usage**; see **time** and **io** below.
+- **utilization** - an instrument that measures a *value ratio* of usage (like percent, but in the range `[0, 1]`) should be called `entity.utilization`. For example, `system.memory.utilization` for the ratio of memory in use.
+- **time** - an instrument that measures passage of time should be called `entity.time`.
For example, `system.cpu.time` with varying values of label `state` for idle, user, etc. +- **io** - an instrument that measures bidirectional data flow should be called `entity.io` and have labels for direction. For example, `system.network.io`. +- Other instruments that do not fit the above descriptions may be named more freely. For example, `system.swap.page_faults` and `system.network.packets`. Units do not need to be specified in the names since they are included during instrument creation, but can be added if there is ambiguity. + +## Internal details + +The following standard metric instruments should be used in libraries instrumenting system/runtime metrics (here is a [spreadsheet](https://docs.google.com/spreadsheets/d/1r50cC9ass0A8SZIg2ZpLdvZf6HmQJsUSXFOu-rl4yaY/edit#gid=973941697) of the tables below). + +In the tables below, units of `1` refer to a ratio value that is always in the range `[0, 1]`. Instruments that measure an integer count of something use semantic units like `packets`, `errors`, `faults`, etc. + +### Standard System Metrics - `system.` + +--- + +#### `system.cpu.` + +**Description:** System level processor metrics. +|Name |Units |Instrument Type |Value Type|Label Key|Label Values | +|----------------------|-------|-----------------|----------|---------|-----------------------------------| +|system.cpu.time |seconds|SumObserver |Double |state |idle, user, system, interrupt, etc.| +| | | | |cpu |1 - #cores | +|system.cpu.utilization|1 |UpDownSumObserver|Double |state |idle, user, system, interrupt, etc.| +| | | | |cpu |1 - #cores | + +#### `system.memory.` + +**Description:** System level memory metrics. +|Name |Units|Instrument Type |Value Type|Label Key|Label Values | +|-------------------------|-----|-----------------|----------|---------|------------------------| +|system.memory.usage |bytes|UpDownSumObserver|Int64 |state |used, free, cached, etc.| +|system.memory.utilization|1 |UpDownSumObserver|Double |state |used, free, cached, etc.| + +#### `system.swap.` + +**Description:** System level swap/paging metrics. +|Name |Units |Instrument Type |Value Type|Label Key|Label Values| +|----------------------------|----------|-----------------|----------|---------|------------| +|system.swap.usage |pages |UpDownSumObserver|Int64 |state |used, free | +|system.swap.utilization |1 |UpDownSumObserver|Double |state |used, free | +|system.swap.page\_faults |faults |SumObserver |Int64 |type |major, minor| +|system.swap.page\_operations|operations|SumObserver |Int64 |type |major, minor| +| | | | |direction|in, out | + +#### `system.disk.` + +**Description:** System level disk performance metrics. +|Name |Units |Instrument Type|Value Type|Label Key|Label Values| +|----------------------------|----------|---------------|----------|---------|------------| +|system.disk.io|bytes |SumObserver |Int64 |device |(identifier)| +| | | | |direction|read, write | +|system.disk.operations |operations|SumObserver |Int64 |device |(identifier)| +| | | | |direction|read, write | +|system.disk.time |seconds |SumObserver |Double |device |(identifier)| +| | | | |direction|read, write | +|system.disk.merged |1 |SumObserver |Int64 |device |(identifier)| +| | | | |direction|read, write | + +#### `system.filesystem.` + +**Description:** System level filesystem metrics. 
+|Name |Units|Instrument Type |Value Type|Label Key|Label Values | +|-----------------------------|-----|-----------------|----------|---------|--------------------| +|system.filesystem.usage |bytes|UpDownSumObserver|Int64 |device |(identifier) | +| | | | |state |used, free, reserved| +|system.filesystem.utilization|1 |UpDownSumObserver|Double |device |(identifier) | +| | | | |state |used, free, reserved| + +#### `system.network.` + +**Description:** System level network metrics. +|Name |Units |Instrument Type |Value Type|Label Key|Label Values | +|-------------------------------|-----------|-----------------|----------|---------|----------------------------------------------------------------------------------------------| +|system.network.dropped\_packets|packets |SumObserver |Int64 |device |(identifier) | +| | | | |direction|transmit, receive | +|system.network.packets |packets |SumObserver |Int64 |device |(identifier) | +| | | | |direction|transmit, receive | +|system.network.errors |errors |SumObserver |Int64 |device |(identifier) | +| | | | |direction|transmit, receive | +|system.network.io|bytes |SumObserver |Int64 |device |(identifier) | +| | | | |direction|transmit, receive | +|system.network.connections |connections|UpDownSumObserver|Int64 |device |(identifier) | +| | | | |protocol |tcp, udp, [others](https://en.wikipedia.org/wiki/Transport_layer#Protocols) | +| | | | |state |[e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation)| + +#### OS Specific System Metrics - `system.{os}.` + +Instrument names for system level metrics specific to a certain operating system should be prefixed with `system.{os}.` and follow the hierarchies listed above for different entities like CPU, memory, and network. For example, an instrument for counting the number of Linux merged disk operations (see [here](https://unix.stackexchange.com/questions/462704/iostat-what-is-exactly-the-concept-of-merge) and [here](https://man7.org/linux/man-pages/man1/iostat.1.html)) could be named `system.linux.disk.merged_operations`, reusing the `disk` name proposed above. + +### Standard Runtime Metrics - `runtime.` + +--- + +Runtime environments vary widely in their terminology, implementation, and relative values for a given metric. For example, Go and Python are both garbage collected languages, but comparing heap usage between the two runtimes directly is not meaningful. For this reason, this OTEP does not propose any standard top-level runtime metric instruments. See [OTEP #108](https://github.com/open-telemetry/oteps/pull/108/files) for additional discussion. + +#### Runtime Specific Metrics - `runtime.{environment}.` + +Runtime level metrics specific to a certain runtime environment should be prefixed with `runtime.{environment}.` and follow the semantic conventions outlined in [Semantic Conventions](#semantic-conventions). For example, Go runtime metrics use `runtime.go.` as a prefix. + +Some programming languages have multiple runtime environments that vary significantly in their implementation, for example [Python has many implementations](https://wiki.python.org/moin/PythonImplementations). For these languages, consider using specific `environment` prefixes to avoid ambiguity, like `runtime.cpython.` and `runtime.pypy.`. + +## Open questions + +- Should the individual runtimes have their specific naming conventions in the spec? +- Is it ok to include instruments specific to an OS (or OS family) under a top-level prefix, as long as they are unambiguous? 
For example, naming inode related instruments, which of the below is preferred?
+  1. Top level: `system.filesystem.inodes.*`
+  2. UNIX family level: `system.unix.filesystem.inodes.*`
+  3. One for each UNIX OS: `system.linux.filesystem.inodes.*`, `system.freebsd.filesystem.inodes.*`, `system.netbsd.filesystem.inodes.*`, etc.
diff --git a/oteps/0122-otlp-http-json.md b/oteps/0122-otlp-http-json.md
new file mode 100644
index 00000000000..fe0b10cadd0
--- /dev/null
+++ b/oteps/0122-otlp-http-json.md
@@ -0,0 +1,144 @@
+# OTLP: JSON Encoding for OTLP/HTTP
+
+This is a proposal to add an HTTP transport extension supporting JSON serialization for
+[OTLP](0035-opentelemetry-protocol.md) (OpenTelemetry Protocol).
+
+## Table of Contents
+
+* [Motivation](#motivation)
+* [OTLP/HTTP+JSON Protocol Details](#otlphttpjson-protocol-details)
+  * [JSON Mapping](#json-mapping)
+  * [Request](#request)
+  * [Response](#response)
+    * [Success](#success)
+    * [Failures](#failures)
+    * [Throttling](#throttling)
+    * [All Other Responses](#all-other-responses)
+  * [Connection](#connection)
+  * [Parallel Connections](#parallel-connections)
+
+## Motivation
+
+Protobuf is a relatively big dependency, which some clients are not willing to take. For example, web JS and iOS/Android clients, where the size of the installation package may be limited and a protobuf dependency is unwanted. Plain JSON is a smaller dependency and is built into the standard libraries of many programming languages.
+
+## OTLP/HTTP+JSON Protocol Details
+
+OTLP/HTTP+JSON will be consistent with the [OTLP/HTTP](0099-otlp-http.md) specification except that the payload will use JSON instead of protobuf.
+
+### JSON Mapping
+
+Use the standard proto3 [JSON Mapping](https://developers.google.com/protocol-buffers/docs/proto3#json) for mapping between protobuf and JSON. `trace_id` and `span_id` are base64-encoded in OTLP/HTTP+JSON, not hex-encoded.
+
+### Request
+
+Telemetry data is sent via an HTTP POST request.
+
+The default URL path for requests that carry trace data is `/v1/traces` (for
+example the full URL when connecting to the "example.com" server will be
+`https://example.com/v1/traces`). The request body is a JSON-encoded
+[`ExportTraceServiceRequest`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/trace/v1/trace_service.proto#L38)
+message.
+
+The default URL path for requests that carry metric data is `/v1/metrics` and the
+request body is a JSON-encoded
+[`ExportMetricsServiceRequest`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/metrics/v1/metrics_service.proto#L35)
+message.
+
+The client MUST set the "Content-Type: application/json" request header. The
+client MAY gzip the content and in that case SHOULD include the "Content-Encoding:
+gzip" request header. The client MAY include the "Accept-Encoding: gzip" request
+header if it can receive gzip-encoded responses.
+
+Non-default URL paths for requests MAY be configured on the client and server
+sides.
+
+### Response
+
+#### Success
+
+On success the server MUST respond with `HTTP 200 OK`.
Response body MUST be +JSON-encoded +[`ExportTraceServiceResponse`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/trace/v1/trace_service.proto#L47) +message for traces and +[`ExportMetricsServiceResponse`](https://github.com/open-telemetry/opentelemetry-proto/blob/e6c3c4a74d57f870a0d781bada02cb2b2c497d14/opentelemetry/proto/collector/metrics/v1/metrics_service.proto#L44) +message for metrics. + +The server MUST set "Content-Type: application/json" response header. If +the request header "Accept-Encoding: gzip" is present in the request the server +MAY gzip-encode the response and set "Content-Encoding: gzip" response header. + +The server SHOULD respond with success no sooner than after successfully +decoding and validating the request. + +#### Failures + +If the processing of the request fails the server MUST respond with appropriate +`HTTP 4xx` or `HTTP 5xx` status code. See sections below for more details about +specific failure cases and HTTP status codes that should be used. + +Response body for all `HTTP 4xx` and `HTTP 5xx` responses MUST be a +JSON-encoded +[Status](https://godoc.org/google.golang.org/genproto/googleapis/rpc/status#Status) +message that describes the problem. + +This specification does not use `Status.code` field and the server MAY omit +`Status.code` field. The clients are not expected to alter their behavior based +on `Status.code` field but MAY record it for troubleshooting purposes. + +The `Status.message` field SHOULD contain a developer-facing error message as +defined in `Status` message schema. + +The server MAY include `Status.details` field with additional details. Read +below about what this field can contain in each specific failure case. + +#### Bad Data + +If the processing of the request fails because the request contains data that +cannot be decoded or is otherwise invalid and such failure is permanent then the +server MUST respond with `HTTP 400 Bad Request`. The `Status.details` field in +the response SHOULD contain a +[BadRequest](https://github.com/googleapis/googleapis/blob/d14bf59a446c14ef16e9931ebfc8e63ab549bf07/google/rpc/error_details.proto#L166) +that describes the bad data. + +The client MUST NOT retry the request when it receives `HTTP 400 Bad Request` +response. + +#### Throttling + +If the server receives more requests than the client is allowed or the server is +overloaded the server SHOULD respond with `HTTP 429 Too Many Requests` or +`HTTP 503 Service Unavailable` and MAY include +["Retry-After"](https://tools.ietf.org/html/rfc7231#section-7.1.3) header with a +recommended time interval in seconds to wait before retrying. + +The client SHOULD honour the waiting interval specified in "Retry-After" header +if it is present. If the client receives `HTTP 429` or `HTTP 503` response and +"Retry-After" header is not present in the response then the client SHOULD +implement an exponential backoff strategy between retries. + +#### All Other Responses + +All other HTTP responses that are not explicitly listed in this document should +be treated according to HTTP specification. + +If the server disconnects without returning a response the client SHOULD retry +and send the same request. The client SHOULD implement an exponential backoff +strategy between retries to avoid overwhelming the server. + +### Connection + +If the client is unable to connect to the server the client SHOULD retry the +connection using exponential backoff strategy between retries. 
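+
+A minimal sketch of such a retry loop, ignoring the "Retry-After" header for brevity; `send` is a hypothetical function that performs a single OTLP/HTTP+JSON export, and the back-off interval grows exponentially and includes the random jitter required below:
+
+```go
+package main
+
+import (
+	"context"
+	"math/rand"
+	"time"
+)
+
+// sendWithRetry retries a single export request with exponential backoff and
+// random jitter. "send" is a hypothetical function that performs one
+// OTLP/HTTP+JSON request and returns an error on failure.
+func sendWithRetry(ctx context.Context, send func(context.Context) error) error {
+	backoff := time.Second
+	const maxBackoff = 30 * time.Second
+	for {
+		if err := send(ctx); err == nil {
+			return nil
+		}
+		// Jitter avoids many clients retrying in lockstep after an outage.
+		jitter := time.Duration(rand.Int63n(int64(backoff)))
+		select {
+		case <-time.After(backoff + jitter):
+		case <-ctx.Done():
+			return ctx.Err()
+		}
+		if backoff < maxBackoff {
+			backoff *= 2
+		}
+	}
+}
+```
+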
The interval between retries must include a random jitter.
+
+The client SHOULD keep the connection alive between requests.
+
+Server implementations MAY handle OTLP/gRPC, OTLP/HTTP, and OTLP/HTTP+JSON requests on the same
+port and multiplex the connections to the corresponding transport handler based
+on the "Content-Type" request header.
+
+### Parallel Connections
+
+To achieve higher total throughput the client MAY send requests using several
+parallel HTTP connections. In that case the maximum number of parallel
+connections SHOULD be configurable.
diff --git a/oteps/0143-versioning-and-stability.md b/oteps/0143-versioning-and-stability.md
new file mode 100644
index 00000000000..ba46423f3aa
--- /dev/null
+++ b/oteps/0143-versioning-and-stability.md
@@ -0,0 +1,165 @@
+# Versioning and stability for OpenTelemetry clients
+
+OpenTelemetry is a large project with strict compatibility requirements. This proposal defines the stability guarantees offered by the OpenTelemetry clients, along with a versioning and lifecycle proposal which defines how we meet those requirements.
+
+Language implementations are expected to follow this proposal exactly, unless a language or package manager convention interferes significantly. Implementations must take this cross-language proposal, and produce a language-specific proposal which details how these requirements will be met.
+
+Note: In this document, the term OpenTelemetry specifically refers to the OpenTelemetry clients. It does not refer to the specification or the Collector.
+
+## Design goals
+
+**Ensure that end users stay up to date with the latest release.**
+We want all users to stay up to date with the latest version of OpenTelemetry. We do not want to create hard breaks in support, of any kind, which leave users stranded on older versions. It must always be possible to upgrade to the latest minor version of OpenTelemetry, without creating compilation or runtime errors.
+
+**Never create a dependency conflict between packages which rely on different versions of OpenTelemetry. Avoid breaking all stable public APIs.**
+Backwards compatibility is a strict requirement. Instrumentation APIs cannot create a version conflict, ever. Otherwise, OpenTelemetry cannot be embedded in widely shared libraries, such as web frameworks. Code written against older versions of the API must work with all newer versions of the API. Transitive dependencies of the API cannot create a version conflict. The OpenTelemetry API cannot depend on "foo" if there is any chance that any library or application may require a different, incompatible version of "foo." A library using OpenTelemetry should never become incompatible with other libraries due to a version conflict in one of OpenTelemetry's dependencies. Theoretically, APIs can be deprecated and eventually removed, but this is a process measured in years and we have no plans to do so.
+
+**Allow for multiple levels of package stability within the same release.**
+Provide maintainers a clear process for developing new, experimental APIs alongside stable APIs. Different packages within the same release may have different levels of stability. This means that an implementation wishing to release stable tracing today must ensure that experimental metrics are factored out in such a way that breaking changes to the metrics API do not destabilize the trace API packages.
+
+## Relevant architecture
+
+![Cross cutting concerns](img/0143_cross_cutting.png)
+
+At the highest architectural level, OpenTelemetry is organized into signals.
Each signal provides a specialized form of observability. For example, tracing, metrics, and baggage are three separate signals. Signals share a common subsystem – context propagation – but they function independently from each other.
+
+Each signal provides a mechanism for software to describe itself. A codebase, such as an API handler or a database client, takes a dependency on various signals in order to describe itself. OpenTelemetry instrumentation code is then mixed into the other code within that codebase. This makes OpenTelemetry a **cross-cutting concern** - a piece of software which must be mixed into many other pieces of software in order to provide value. Cross-cutting concerns, by their very nature, violate a core design principle – separation of concerns. As a result, OpenTelemetry requires extra care and attention to avoid creating issues for the codebases which depend upon these cross-cutting APIs.
+
+OpenTelemetry is designed to separate the portion of each signal which must be imported as cross-cutting concerns from the portions of OpenTelemetry which can be managed independently. OpenTelemetry is also designed to be an extensible framework. To accomplish these goals, each signal consists of four types of packages:
+
+**API -** API packages consist of the cross-cutting public interfaces used for instrumentation. Any portion of OpenTelemetry which 3rd-party libraries and application code depend upon is considered part of the API. To manage different levels of stability, every signal has its own, independent API package. These individual APIs may also be bundled up into a shared global API, for convenience.
+
+**SDK -** The implementation of the API. The SDK is managed by the application owner. Note that the SDK includes additional public interfaces which are not considered part of the API package, as they are not cross-cutting concerns. These public interfaces are defined as **constructors** and **plugin interfaces**. Examples of plugin interfaces include the SpanProcessor, Exporter, and Sampler interfaces. Examples of constructors include configuration objects, environment variables, and SDK builders. Application owners may interact with SDK constructors; plugin authors may interact with SDK plugin interfaces. Instrumentation authors must never directly reference any SDK package of any kind, only the API.
+
+**Semantic Conventions -** A schema defining the attributes which describe common concepts and operations which the signal observes. Note that unlike the API or SDK, stable conventions for all signals may be placed in the same package, as they are often useful across different signals.
+
+**Contrib –** plugins and instrumentation that make use of the API or SDK interfaces, but are not part of the core packages necessary for running OTel. The term "contrib" specifically refers to the plugins and instrumentation maintained by the OpenTelemetry organization outside of the SDK; it does not refer to third party plugins hosted elsewhere, or core plugins which are required to be part of the SDK release, such as OTLP Exporters and TraceContext Propagators. **API Contrib** refers to packages which depend solely upon the API; **SDK Contrib** refers to packages which also depend upon the SDK.
+
+## Signal lifecycle
+
+OpenTelemetry is structured around signals. Each signal represents a coherent, stand-alone set of functionality. Each signal follows a lifecycle.
+
+![API Lifecycle](img/0143_api_lifecycle.png)
+
+### Lifecycle stages
+
+**Experimental –** Breaking changes and performance issues may occur. Components may not be feature-complete. The experiment may be discarded.
+
+**Stable –** Stability guarantees apply, based on component type (API, SDK, Conventions, and Contrib). Long term dependencies may now be taken against these packages.
+
+**Deprecated –** this signal has been replaced but still retains the same stability guarantees.
+
+**Removed -** a deprecated signal is no longer supported, and is removed.
+
+All signal components may become stable together, or one by one in the following order: API, Semantic Conventions, API Contrib, SDK, SDK Contrib.
+
+When transitioning from experimental to stable to deprecated, packages **should not move or otherwise break how they are imported by users**. Do NOT use an "experimental" directory or package suffix.
+
+Optionally, package **version numbers** MAY include a suffix, such as -alpha, -beta, -rc, or -experimental, to differentiate stable and experimental packages.
+
+### Stability
+
+Once a signal component is marked as stable, the following rules apply until the end of that signal’s existence.
+
+**API Stability -**
+No backward-incompatible changes to the API are allowed unless the major version number is incremented. All existing API calls must continue to compile and function against all future minor versions of the same major version. ABI compatibility for the API may be offered on a language by language basis.
+
+**SDK Stability -**
+Public portions of the SDK must remain backwards compatible. There are two categories: **plugin interfaces** and **constructors**. Examples of plugins include the SpanProcessor, Exporter, and Sampler interfaces. Examples of constructors include configuration objects, environment variables, and SDK builders.
+
+ABI compatibility for SDK plugin interfaces and constructors may be offered on a language by language basis.
+
+**Semantic Conventions Stability -**
+Semantic Conventions may not be removed once they are stable. New conventions may be added to replace usage of older conventions, but the older conventions are never removed; they will only be marked as deprecated in favor of the newer ones.
+
+**Contrib Stability -**
+Plugins and instrumentation are kept up to date, and are released simultaneously with (or shortly after) the latest release of the API. The goal is to ensure users can update to the latest version of OpenTelemetry, and not be held back by the plugins that they depend on.
+
+Public portions of contrib packages (constructors, configuration, interfaces) must remain backwards compatible. ABI compatibility for contrib packages may be offered on a language by language basis.
+
+Telemetry produced by contrib instrumentation must also remain stable and backwards compatible, to avoid breaking alerts and dashboards. This means that existing data may not be mutated or removed without a major version bump. Additional data may be added. This applies to spans, metrics, resources, attributes, events, and any other data types that OpenTelemetry emits.
+
+### Deprecation
+
+In theory, signals could be replaced. When this happens, they are marked as deprecated.
+
+Code is only marked as deprecated when the replacement becomes stable. Deprecated code still abides by the same support guarantees as stable code. Deprecated APIs remain stable and backwards compatible.
+
+### Removal
+
+Packages are end-of-life’d by being removed from the release.
The release then makes a major version bump.
+
+We currently have no plans for deprecating signals or creating a major version past v1.0.
+
+For clarity, it is still possible to create a new, backwards-incompatible version of an existing type of signal without actually moving to v2.0 and breaking support. Allow me to explain.
+
+Imagine we develop a new, better tracing API - let's call it AwesomeTrace. We will never mutate the current tracing API into AwesomeTrace. Instead, AwesomeTrace would be added as an entirely new signal which coexists and interoperates with the current tracing signal. This would make adding AwesomeTrace a minor version bump, *not* v2.0. v2.0 would mark the end of support for current tracing, not the addition of AwesomeTrace. And we don't want to ever end that support, if we can help it.
+
+This is not actually a theoretical example. OpenTelemetry already supports two tracing APIs: OpenTelemetry and OpenTracing. We invented a new tracing API, but continue to support the old one.
+
+## Version Numbers
+
+OpenTelemetry follows [semver 2.0](https://semver.org/) conventions, with the following distinction.
+
+OpenTelemetry clients have four components: API, Semantic Conventions, SDK, and Contrib.
+
+For the purposes of versioning, all code within a component is treated as if it were part of a single package, and versioned with the same version number, except for Contrib, which may be a collection of packages versioned separately.
+
+* All packages within the API share the same version number. API packages for all signals version together, across all signals. Signals do not have separate version numbers. There is one version number that applies to all signals that are included in the API release that is labeled with that particular version number.
+* All packages within the SDK share the same version number. SDK packages for all signals version together, across all signals. There is one version number that applies to all signals that are included in the SDK release that is labeled with that particular version number.
+* All Semantic Conventions are contained within a single package with a single version number.
+* Each contrib package has its own version.
+* The API, SDK, Semantic Conventions, and contrib components are not required to share a version number. For example, the latest version of `opentelemetry-python-api` may be at v1.2.3, while the latest version of `opentelemetry-python-sdk` may be at v2.3.1.
+* Different language implementations do not need to have matching version numbers. For example, it is fine to have `opentelemetry-python-api` at v1.2.8 when `opentelemetry-java-api` is at v1.3.2.
+* Language implementations do not need to match the version of the specification they implement. For example, it is fine for v1.8.2 of `opentelemetry-python-api` to implement v1.1.1 of the specification.
+
+**Exception:** in some languages, package managers may react poorly to experimental packages having a version higher than 0.X. In these cases, a language-specific workaround is required. Go, Ruby, and JavaScript are examples.
+
+**Major version bump**
+Major version bumps only occur when there is a breaking change to a stable interface, or the removal of deprecated signals.
+
+OpenTelemetry values long term support. The expectation is that we will version to v1.0 once the first set of packages are declared stable. OpenTelemetry will then remain at v1.0 for years. There are no plans for a v2.0 of OpenTelemetry at this time.
Additional stable packages, such as metrics and logs, will be added as minor version bumps.
+
+**Minor version bump**
+Most changes to OpenTelemetry result in a minor version bump.
+
+* New backward-compatible functionality added to any component.
+* Breaking changes to internal SDK components.
+* Breaking changes to experimental signals.
+* New experimental packages are added.
+* Experimental packages become stable.
+
+**Patch version bump**
+Patch versions make no changes which would require recompilation or potentially break application code. The following are examples of patch fixes.
+
+* Bug fixes which don't require a minor version bump per the rules above.
+* Security fixes.
+* Documentation.
+
+Currently, OpenTelemetry does NOT have plans to backport bug and security fixes to prior minor versions. Security and bug fixes are only applied to the latest minor version. We are committed to making it feasible for end users to stay up to date with the latest version of OpenTelemetry.
+
+## Long Term Support
+
+![long term support](img/0143_long_term.png)
+
+### API support
+
+Major versions of the API will be supported for a minimum of **three years** after the release of the next major API version. Support covers the following areas.
+
+API stability, as defined above, will be maintained.
+
+A version of the SDK which supports the last major version of the API will continue to be maintained during this period. Bug and security fixes will be backported. Additional feature development is not guaranteed.
+
+Contrib packages available when the API is versioned will continue to be maintained for the duration of this period. Bug and security fixes will be backported. Additional feature development is not guaranteed.
+
+### SDK Support
+
+SDK stability, as defined above, will be maintained for a minimum of **one year** after the release of the next major SDK version.
+
+### Contrib Support
+
+Contrib stability, as defined above, will be maintained for a minimum of **one year** after the release of the next major version of a contrib package.
+
+## OpenTelemetry GA
+
+The term “OpenTelemetry GA” refers to the point at which a stable version of both tracing and metrics has been released in at least three languages.
diff --git a/oteps/0147-upgrade-procedures.md b/oteps/0147-upgrade-procedures.md
new file mode 100644
index 00000000000..ef3b94e620a
--- /dev/null
+++ b/oteps/0147-upgrade-procedures.md
@@ -0,0 +1,56 @@
+# The OpenTelemetry approach to upgrading
+
+Managing widely distributed software at scale requires careful design related to backwards compatibility, versioning, and upgrading. The OpenTelemetry approach is described below. If you are planning on using OpenTelemetry, it can be helpful to understand how we approach this problem.
+
+## Component Overview
+
+To facilitate smooth upgrading and long term support, OpenTelemetry clients are factored into several components. We use the following terms in the rest of this document.
+
+Packages is a generic term for units of code which reference each other via some form of dependency management. Note that every programming language has a different approach to dependency management, and may use a different term, such as module or library, to represent this concept.
+
+The API refers to the set of software packages that contain all of the interfaces and constants needed to write OpenTelemetry instrumentation. An implementation of the API may be registered during application startup.
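+
+For illustration, a minimal Go sketch of that registration step, assuming the OpenTelemetry Go API (`go.opentelemetry.io/otel`) and SDK (`go.opentelemetry.io/otel/sdk/trace`) packages:
+
+```go
+package main
+
+import (
+	"context"
+
+	"go.opentelemetry.io/otel"
+	sdktrace "go.opentelemetry.io/otel/sdk/trace"
+)
+
+func main() {
+	// Register an SDK implementation behind the API at application startup.
+	tp := sdktrace.NewTracerProvider()
+	defer func() { _ = tp.Shutdown(context.Background()) }()
+	otel.SetTracerProvider(tp)
+
+	// Instrumentation only ever talks to the API, which resolves to whatever
+	// implementation was registered above.
+	tracer := otel.Tracer("example/instrumentation")
+	_, span := tracer.Start(context.Background(), "startup")
+	span.End()
+}
+```
+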
If no other implementation is registered, the API registers a no-op implementation by default. + +The SDK refers to a framework which implements the API, provided by the OpenTelemetry project. While alternative API implementations may be written to handle special cases, we expect most users to install the SDK when running OpenTelemetry. + +Plugin Interfaces refer to extension points provided by the SDK. These include interfaces for controlling sampling, exporting data, and various other lifecycle hooks. Note that these interfaces are not part of the API. They are part of the SDK. + +Instrumentation refers to any code which calls the API. This includes the instrumentation provided by the OpenTelemetry project, third party instrumentation, plus application code and libraries which instrument themselves natively. + +Plugins refer to any package which implements an SDK Plugin Interface. This includes the plugins provided by the OpenTelemetry project, plus third party plugins. + +There is an important distinction between Plugins and Instrumentation. Plugins implement the Plugin Interfaces. Instrumentation calls the API. This difference is relevant to OpenTelemetry’s approach to upgrading. + +## The OpenTelemetry upgrade path + +Before we get into the design requirements, here’s how upgrading actually works. Note that all OpenTelemetry components (API, SDK, plugins, instrumentation) have separate version numbers. + +### API changes + +When new functionality is added to the OpenTelemetry API, a new minor version of the API is released. These API changes are always additive and backwards compatible from the perspective of existing Instrumentation packages which import and call prior versions. Instrumentation written against all prior minor versions of the API continues to work, and may be composed together into the same application without creating a dependency conflict. + +API implementations are expected to always target the latest version of the API. When a new version of the API is released, a version of the SDK which supports the API is released in tandem. New versions of the API are not expected to support older versions of the SDK. + +### SDK changes + +Bugs fixes, security patches, and performance improvements are released as patch versions of the SDK. Support for new versions of the API are released as minor versions. New Plugin Interfaces and configuration options are also released as minor versions. + +Breaking changes to Plugin Interfaces are handled through deprecation. Instead of breaking a Plugin Interface, a new interface is created and the existing interface is marked as deprecated. Plugins which target the deprecated interface continue to work, and the SDK provides default implementations of any missing functionality. After one year, the deprecated Plugin Interface is removed as a major version release of the SDK. + +## Design requirements and explanations + +This approach to upgrading solves two critical design requirements, while minimizing the maintenance overhead associated with legacy code. + +* Callers of the API are never broken. +* Users of the SDK can easily upgrade to the latest version. + +Indefinite support for existing Instrumentation is a critical feature of OpenTelemetry. Millions of lines of code are expected to be written against the API. This includes shared libraries which ship with integrated OpenTelemetry instrumentation. These libraries must be able to compose together into applications without OpenTelemetry creating a dependency conflict. 
While some Instrumentation will be updated to the latest version of the API, such as that provided by the OpenTelemetry project, other Instrumentation will never be updated.
+
+Consuming new Instrumentation may require users to upgrade to the latest version of the SDK. If it were not easy to perform this upgrade, the OpenTelemetry project would be forced to support older versions of the SDK, as well as older versions of the entire Instrumentation ecosystem.
+
+This would be an enormous maintenance effort at best. But, as the OpenTelemetry project only controls a portion of that ecosystem, it is also infeasible. OpenTelemetry cannot require that libraries with native instrumentation support multiple versions of the API. Ensuring that application owners and operators can upgrade to the latest version of the SDK resolves this issue.
+
+The primary blocker to upgrading the SDK is out-of-date Plugins. If a new version of the SDK were to break existing Plugin Interfaces, no user would be able to upgrade their SDK until the Plugins they depend on had been upgraded. Users could be caught between instrumentation they depend on requiring a version of the API which is not compatible with the version of the SDK which supports their Plugins.
+
+By following a deprecation pattern with Plugin Interfaces, we create a one year window in which the Plugin ecosystem can upgrade after the release of a new SDK. We believe this is sufficient time for any Plugin which is actively maintained to make an upgrade, and for defunct Plugins to be identified and replaced.
+
+By ensuring that the SDK can be easily upgraded, we also provide a path for application owners and operators to rapidly consume critical bug fixes and security patches, without the need to backport these patches across a large number of prior SDK versions.
diff --git a/oteps/0149-exponential-histogram.md b/oteps/0149-exponential-histogram.md
new file mode 100644
index 00000000000..5a76ed129ce
--- /dev/null
+++ b/oteps/0149-exponential-histogram.md
@@ -0,0 +1,162 @@
+# Add exponential bucketing to histogram protobuf
+
+Add exponential bucketing to the histogram protobuf definition.
+
+## Motivation
+
+Currently, the OTEL protobuf [protocol](https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/metrics/v1/metrics.proto) only supports explicit bound buckets. Each bucket's bound and count must be explicitly defined. This is inefficient for transporting buckets whose bounds follow a pattern, such as exponential (i.e. log scale) buckets. More importantly, without bucket pattern info, the receiver may not be able to optimize its processing of these buckets. With a protocol that also supports exponential buckets, the bounds can be encoded with just a few parameters, regardless of the number of buckets, and the receiver can optimize processing based on the knowledge that these buckets are exponential.
+
+The explicit bucket type will be kept, as a fallback for arbitrary bucket bounds. For example, Prometheus histograms often come with arbitrary user defined bounds.
+
+Exponential buckets are chosen to be added because they are very good at representing [long tail](https://en.wikipedia.org/wiki/Long_tail) distributions, which are common in OTEL target applications like response time measurement. Exponential buckets (i.e. log scale buckets) need far fewer buckets than linear scale buckets to cover the wide range of a long tail distribution. For example, covering a range from 1 ms to 100 s at a few percent relative error takes on the order of a few hundred exponential buckets, whereas linear buckets 1 ms wide would require 100,000.
Furthermore, percentiles/quantiles can be computed from exponential buckets with constant relative error across the full range.
+
+## Explanation
+
+Exponential buckets will be added. In general, bucket bounds are in the form of:
+
+```
+bound = base ^ exponent
+```
+
+where base is a parameter of a bound series, and exponent is an integer. Note that exponent may be positive, negative, or 0. Such bounds are also commonly known as "log scale bounds".
+
+## Internal details
+
+Proposed message type to be added:
+
+```
+message ExponentialBuckets {
+  double base = 1;
+  double zero_count = 2; // Count of values exactly at zero.
+  ExponentialBucketCounts positive_value_counts = 3;
+  ExponentialBucketCounts negative_value_counts = 4; // Negative values are bucketed with their absolute values
+}
+
+// "repeated double bucket_counts" represents an array of N numbers from bucket_counts[0] to bucket_counts[N-1].
+// With index i starting at 0, ending at N-1, ExponentialBucketCounts defines N buckets, where
+// bucket[i].start_bound = base ^ (i + exponent_offset)
+// bucket[i].end_bound = base ^ (i + 1 + exponent_offset)
+// bucket[i].count = bucket_counts[i]
+message ExponentialBucketCounts {
+  sint32 exponent_offset = 1; // offset may be negative.
+  repeated double bucket_counts = 2;
+}
+```
+
+Notes:
+
+* ExponentialBuckets will be added as one of the "oneof" bucket types in [#272](https://github.com/open-telemetry/opentelemetry-proto/pull/272)
+* Per [#257](https://github.com/open-telemetry/opentelemetry-proto/issues/257), only a histogram accepting "double" will be defined.
+* Per [#259](https://github.com/open-telemetry/opentelemetry-proto/issues/259), the bucket counts type is "double".
+
+## Trade-offs and mitigations
+
+Simplicity is a main design goal. The format targets the most common scenarios. For now, histograms not conforming to ExponentialBuckets may be encoded as explicit buckets. If a histogram type is common enough, a new bucket type may be added in the future.
+
+The following are restrictions of ExponentialBuckets:
+
+* Buckets for positive and negative values must have the same "base".
+* Buckets must cover the full value range. In the future, ExponentialBucketCounts might add an overflow_count and an underflow_count for counts above the highest bucket and below the lowest bucket, respectively. However, an overflow or underflow bucket breaks the "pure log scale" property. "Rescale" is preferred when reducing memory cost. See the "merge" section below.
+* ExponentialBucketCounts is designed for dense buckets. Between the lowest and the highest bucket, most buckets are expected to be non-empty. This is the common case in telemetry applications.
+* A "reference" that multiplies onto all bounds is not included. It is always implicitly 1.
+
+## Prior art and alternatives
+
+[#226](https://github.com/open-telemetry/opentelemetry-proto/pull/226) tried to add multiple histogram types at once. This OTEP reduces the scope to exponential histograms only, and the complexity of the new format is lower because we have now decided that histogram types do not share the count fields (see [#259](https://github.com/open-telemetry/opentelemetry-proto/issues/259)).
+
+## Open questions
+
+### Toward universally mergeable histograms
+
+Merging histograms of different types, or even the same type, but with different parameters remains an issue.
There are lengthy discussions in [#226](https://github.com/open-telemetry/opentelemetry-proto/pull/226#issuecomment-776526864).
+
+Some merge methods may introduce artifacts (information not present in the original data). Generally, splitting a bucket introduces artifacts. For example, when using linear interpolation to split a bucket, we are assuming a uniform distribution within the bucket. "Uniform distribution" is information not present in the original data. Merging buckets, on the other hand, does not introduce artifacts. Merging buckets with identical bounds from two histograms is totally artifact free. Merging multiple adjacent buckets in one histogram is also artifact free, but it does reduce the resolution of the histogram. Whether such a merge is "lossy" is arguable. Because of this ambiguity, the term "lossy" is not used in this doc.
+
+For exponential histograms, if base1 = base2 ^ N, where N is an integer, the two histograms can be merged without artifacts. Furthermore, we can introduce a series of bases where
+
+```
+base = referenceBase ^ (2 ^ baseScale)
+```
+
+Any two histograms using bases from the series can be merged without artifacts. This approach is well known and in use by multiple vendors, including [Google internal use](https://github.com/open-telemetry/opentelemetry-proto/pull/226#issuecomment-737496026) and the [New Relic Distribution Metric](https://docs.newrelic.com/docs/telemetry-data-platform/ingest-manage-data/understand-data/metric-data-type). It is also described in the [UDDSketch paper](https://arxiv.org/pdf/2004.08604.pdf).
+
+Such a "2 to 1" binary merge has the following benefits:
+
+* Any 2 histograms in the series can be merged without artifacts. This is a very attractive property.
+* A single histogram may be shrunk by 2x using a 2 to 1 merge, at the cost of increasing base to base^2. When facing the choice between "reduced histogram resolution" and "blowing up application memory", shrinking is the obvious choice.
+
+A histogram producer may implement "auto scale" to control memory cost. With a reasonable default config on target relative error and max number of buckets, the producer could operate in an "automagic" fashion. The producer can start with a base at the target resolution, and dynamically change the scale if the incoming data's range would make the histogram exceed the memory limit. [New Relic](https://docs.newrelic.com/docs/telemetry-data-platform/ingest-manage-data/understand-data/metric-data-type) and [Google](https://github.com/open-telemetry/opentelemetry-proto/pull/226#issuecomment-737496026) have implemented such logic for internal use. Open source versions from these companies are planned.
+
+The main disadvantage of a scaled exponential histogram is not supporting an arbitrary base. The base can only increase by squaring, or decrease by taking the square root. Unless a user's target relative error is exactly on the series, they have to choose the next smaller base, which costs more space for the target. But in return, you get universally mergeable histograms, which seems like a reasonable trade off. As shown in the discussions below, the user typically has a choice of around 1%, 2%, or 4% error. Since an error target is rarely precise science, choosing from the limited menu does not add much burden to the user.
+
+**If we can agree on a "referenceBase", then we have universally mergeable histograms.** In [#226](https://github.com/open-telemetry/opentelemetry-proto/pull/226#issuecomment-777922339), a referenceBase of 2 was proposed.
The general form for the series is + +``` +base = 2 ^ (2 ^ baseScale) +``` + +where baseScale is an integer. + +* When baseScale = 0, base is exactly at referenceBase +* When baseScale > 0, 2^baseScale reference base buckets are merged into one +* When baseScale < 0, a reference base bucket (log2 bucket here) is sub-divided into 2^(-baseScale) subBuckets + +In practice, the most interesting range of baseScale is around -4, where percentile relative error is around a few percent. The following table shows bases of interest. Here relative error is computed from "squareRoot(base) - 1". This assumes that percentile calculation returns the log scale mid point of a bucket to minimize relative error. + +``` +scale #subBuckets base relative_error +-3 8 1.090507733 4.43% +-4 16 1.044273782 2.19% +-5 32 1.021897149 1.09% +``` + +The [comment in #226](https://github.com/open-telemetry/opentelemetry-proto/pull/226#issuecomment-777922339) has more details on why 2 was chosen as the reference base. In summary, "computers like binary numbers". If we are going to choose a reference base, why not make it 2 ("10" in binary)? + +### Base10 vs. base2 + +Alternatively, we may choose a reference base of 10 (decimal). While this may have some "human friendliness", in practice, the benefit is minimal. As shown in the table below, in base10, baseScale of interest is around -6, where a log10 bucket is sub-divided into 64 sub-buckets. We get a bound like 10, 100, or 1000 only every 64 buckets. The "human friendliness" value is minimal. + +``` +scale #subBuckets base relative error +-5 32 1.074607828 3.66% +-6 64 1.036632928 1.82% +-7 128 1.018151722 0.90% +``` + +Let's further consider the following use cases of histograms to compare base2 and base10: + +1. Displaying histogram charts. With a typical base around 1.04 (around 2% relative error), there will be hundreds of buckets for a typical range over 100x. This many points can produce a reasonably smooth curve, regardless of the raw data being base2 or base10. +2. Calculating percentiles or quantiles. This is often used in SLO monitoring. Example SLO: "99% percentile of response time need to be no more than 100ms". To minimize relative error, percentile calculation usually returns log scale mid point of a bucket. So returned percentile values won't be on 10, 100, 1000, etc., even if the histogram is base10. +3. Answering question like "what percentage of values fall below 100". When the threshold is on 10, 100, 1000, etc, base10 histograms do give exact answer. But even in the decimal world, power of 10 numbers is a small population. For thresholds like 200, 500, etc. base10 histograms have no advantage over base2. If an exact answer is required in these cases, a user should create explicit buckets at these bounds, instead of using exponential buckets. + +So the "human friendliness" of base10 exponential histograms is largely an illusion. To some extent, the base2 vs. base10 question has been answered long ago: computers convert input from base10 to base2, do processing in base2, then convert final output from base2 to base10. In the histogram case, we take input in "double", which is already in [binary float point format](https://en.wikipedia.org/wiki/Double-precision_floating-point_format). Bucketing these numbers in base10 bounds effectively switches base during processing. It just adds complexity and computational cost. 
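+
+As a concrete illustration of the scaled base-2 series above, the following Go sketch computes the base and the worst-case relative error (sqrt(base) - 1, assuming percentile queries return the log-scale midpoint of a bucket) for a given baseScale, and maps a positive value to a bucket exponent under one possible indexing convention (bucket i covers (base^(i-1), base^i]):
+
+```go
+package main
+
+import (
+	"fmt"
+	"math"
+)
+
+// base returns the bucket base for a given baseScale: base = 2 ^ (2 ^ baseScale).
+func base(baseScale int) float64 {
+	return math.Pow(2, math.Pow(2, float64(baseScale)))
+}
+
+// bucketExponent returns the exponent i such that base^(i-1) < value <= base^i,
+// i.e. the bucket a positive value falls into under this convention.
+func bucketExponent(value float64, baseScale int) int {
+	return int(math.Ceil(math.Log(value) / math.Log(base(baseScale))))
+}
+
+// relativeError is the worst-case relative error when reporting the
+// log-scale midpoint of a bucket: sqrt(base) - 1.
+func relativeError(baseScale int) float64 {
+	return math.Sqrt(base(baseScale)) - 1
+}
+
+func main() {
+	// Reproduces the scale/base/relative_error rows shown in the table above.
+	for _, scale := range []int{-3, -4, -5} {
+		fmt.Printf("scale=%d base=%.9f relative_error=%.2f%%\n",
+			scale, base(scale), 100*relativeError(scale))
+	}
+	// Bucket exponent for a 0.250 s measurement at scale -4 (base ~1.0443).
+	fmt.Println("bucket exponent:", bucketExponent(0.250, -4))
+}
+```
+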
+ +### Protocol support for universally mergeable histograms + +Now the question is if and when we add support for universally mergeable histograms. There are some options: + +1. No special support in protocol. Receiver derives baseScale and referenceBase if the base is close enough to a base on a referenceBase series. Implementations will decide how close is "close enough". +2. Allow the protocol to explicitly state baseScale, with referenceBase hardwired at 2. +3. Allow the protocol to explicitly state baseScale and an arbitrary referenceBase. + +This proposal is currently written with option 1, with a path to extend to option 2 later: + +``` +// Current. Option 1 +double base = 1; + +// Future. Option 2. Changing a single value into a member of a new oneof is safe and binary compatible +oneof base_spec { + double base = 1; + sint32 base_scale = 99; // base = 2 ^ (2^base_scale). base_scale may be negative. +} +``` + +**Or should we just do option 2 right now?** + +## Future possibilities + +What are some future changes that this proposal would enable? + +* Support universally mergeable histograms +* Additional histogram types may be added diff --git a/oteps/0152-telemetry-schemas.md b/oteps/0152-telemetry-schemas.md new file mode 100644 index 00000000000..344ad4bb255 --- /dev/null +++ b/oteps/0152-telemetry-schemas.md @@ -0,0 +1,1178 @@ +# Telemetry Schemas + +* [Motivation](#motivation) +* [Solution Summary](#solution-summary) +* [What is Out of Scope](#what-is-out-of-scope) +* [Use Cases](#use-cases) + * [Full Schema-Aware](#full-schema-aware) + * [Collector-Assisted Schema Transformation](#collector-assisted-schema-transformation) +* [Schema URL](#schema-url) +* [Schema Version Number](#schema-version-number) +* [Schema File](#schema-file) + * [all Section](#all-section) + * [resources Section](#resources-section) + * [spans Section](#spans-section) + * [rename_attributes Transformation](#rename_attributes-transformation) + * [span_events Section](#span_events-section) + * [rename_events Transformation](#rename_events-transformation) + * [rename_attributes Transformation](#rename_attributes-transformation-1) + * [metrics Section](#metrics-section) + * [rename_metrics Transformation](#rename_metrics-transformation) + * [rename_attributes Transformation](#rename_attributes-transformation) + * [logs Section](#logs-section) + * [rename_attributes Transformation](#rename_attributes-transformation-2) + * [Order of Transformations](#order-of-transformations) + * [Schema File Format Number](#schema-file-format-number) +* [OTLP Changes](#otlp-changes) +* [API and SDK Changes](#api-and-sdk-changes) +* [OpenTelemetry Schema](#opentelemetry-schema) +* [Performance Impact](#performance-impact) +* [Open Questions](#open-questions) +* [Future Possibilities](#future-possibilities) + * [Parent Schema](#parent-schema) + * [Current State in Schema](#current-state-in-schema) + * [Other Transformation Types](#other-transformation-types) + * [Version Convertability](#version-convertability) +* [Alternates Considered](#alternates-considered) + * [Name Aliases](#name-aliases) + * [Schema Negotiation](#schema-negotiation) +* [Prior Art](#prior-art) +* [Appendix A. Example Schema File](#appendix-a-example-schema-file) + +## Motivation + +Telemetry sources such as instrumented applications and consumers of telemetry +such as observability backends sometimes make implicit assumptions about the +emitted telemetry. 
They assume that the telemetry will contain certain +attributes or otherwise have a certain shape and composition of data (this is +referred to as "telemetry schema" throughout this document). + +This makes it difficult or impossible to change the composition of the emitted +telemetry data without breaking the consumers. For example changing the name of +an attribute of a span created by an instrumentation library can break the +backend if the backend expects to find that attribute by its name. + +Semantic conventions are an important part of this problem. These conventions +define what names and values to use for span attributes, metric names and other +fields. If semantic conventions are changed the existing implementations +(telemetry source or consumers) need to be also changed correspondingly. +Furthermore, to make things worse, the implementations of telemetry sources and +implementations of telemetry consumers that work together and that depend on the +changed semantic convention need to be changed simultaneously, otherwise such +implementations will no longer work correctly together. + +Essentially there is a coupling between 3 parties: 1) OpenTelemetry semantic +conventions, 2) telemetry sources and 3) telemetry consumers. The coupling +complicates the independent evolution of these 3 parties. + +We recognize the following needs: + +- OpenTelemetry semantic conventions need to evolve over time. When conventions + are first defined, mistakes are possible and we may want to fix the mistakes + over time. We may also want to change conventions to re-group the attributes + into different namespaces as our understanding of the attribute taxonomy + improves. + +- Telemetry sources over time may want to change the schema of the telemetry + they emit. This may be because for example the semantic conventions evolved + and we want to make our telemetry match the newly introduced conventions. + +- In an observability system there may simultaneously exist telemetry sources + that produce data that conforms to different telemetry schemas because + different sources evolve at a different pace and are implemented and + controlled by different entities. + +- Telemetry consumers have a need to understand what schema a particular piece + of received telemetry confirms to. The consumers also need a way to be able to + interpret the telemetry data that uses different telemetry schemas. + +This document proposes a solution to these needs. + +## Solution Summary + +We believe that the 3 parties described above should be able to evolve +independently over time, while continuously retaining the ability to correctly +work together. + +Telemetry Schemas are central to how we make this possible. Here is a summary of +the proposal: + +- We introduce a file format for defining telemetry schemas. + +- Telemetry schemas are versioned. Over time the schema may evolve and telemetry + sources may emit data confirming to newer versions of the schema. + +- Telemetry schemas explicitly define transformations that are necessary to + convert telemetry data between different versions of the schema, provided that + such conversions are possible. When conversions are not possible it + constitutes a breaking change between versions. + +- Telemetry schemas are identified by Schema URLs, that are unique for each + schema version. + +- Telemetry sources (e.g. instrumentation libraries) should include a schema URL + in the emitted telemetry. 
+ +- Telemetry consumers should pay attention to the schema of the received + telemetry. If necessary, telemetry consumers may transform the telemetry data + from the received schema version to the target schema version as expected at + the point of use (e.g. a dashboard may define which schema version it + expects). + +- OpenTelemetry will publish a telemetry schema as part of the specification. + The schema will contain the list of transformations that semantic conventions + undergo. The schema will be available to be referred and downloaded at well + known URLs. + +- OpenTelemetry instrumentation libraries will include the OpenTelemetry Schema + URL in all emitted telemetry. + +- OTLP will be modified to allow inclusion of a schema URL in the emitted + telemetry. + +- Third-party libraries, instrumentation or applications will be advised to + define and publish their own telemetry schema if it is completely different + from OpenTelemetry schema (or use OpenTelemetry schema) and include the schema + URL in the emitted telemetry. + +## What is Out of Scope + +- The concept of schema defined in this proposal does not attempt to fully + describe the shape of telemetry. The schema for example does not define all + possible valid values for attributes or expected data types for metrics, etc. + It is not a goal. Our goal is narrowly defined to solve the following problem + only: to allow OpenTelemetry Semantic Conventions to evolve over time. For + that reason this document is concerned with _changes_ to the schema as opposed + to the _full state_ of the schema. We do not preclude this though: the schema + file format is extensible and in the future may allow defining the full state + of the schema, see [Current State in Schema](#current-state-in-schema) in + Future Possibilities section). + +- We intentionally limit the types of transformations of schemas to the bare + minimum that is necessary to handle the most common changes that we believe + OpenTelemetry Semantic Conventions will require in the near future. More types + of transformations [may be proposed](#other-transformation-types) in the + future. This proposal does not attempt to support a comprehensive set of + possible transformation types that can handle all possible changes to schemas + that we can imagine. That would be too complicated and very likely + superfluous. Any new transformation types should be proposed and added in the + future to the schema file format when there is an evidence that they are + necessary for the evolution of OpenTelemetry. + +## Use Cases + +This section shows a couple interesting use-cases for the telemetry schemas +(other uses-cases are also possible, this is not an exhaustive list). + +### Full Schema-Aware + +Here is an example on a schema-aware observability system: + +![Full Schema-Aware](img/0152-otel-schema.png) + +Let's have a closer look at what happens with the Telemetry Source and Backend +pair as the telemetry data is emitted, delivered and stored: + +![Source and Backend](img/0152-source-and-backend.png) + +In this example the telemetry source produces spans that comply with version +1.2.0 of OpenTelemetry schema, where "deployment.environment" attribute is used +to record that the span is coming from production. + +The telemetry consumer desires to store the telemetry in version 1.1.0 of +OpenTelemetry schema. The schema translator compares the schema_url in the +received span with the desired schema and sees that a version conversion is +needed. 
It then applies the change that is described in the schema file and
+renames the attribute from "deployment.environment" to "environment" before
+storing the span.
+
+And here is, for example, how the schemas can be used to query stored data:
+
+![Query Translate](img/0152-query-translate.png)
+
+### Collector-Assisted Schema Transformation
+
+Here is a somewhat different use case, where the backend is not aware of schemas
+and we rely on the OpenTelemetry Collector to translate the telemetry to a schema
+that the backend expects to receive. The "Schema Translate Processor" is
+configured, the target schema_url is specified and all telemetry data that
+passes through the Collector is converted to that target schema:
+
+![Collector](img/0152-collector.png)
+
+## Schema URL
+
+Schema URL is an identifier of a Schema. The URL specifies a location of a
+[Schema File](#schema-file) that can be retrieved (so it is a URL and not just a
+URI) using the HTTP or HTTPS protocol.
+
+Fetching the specified URL may return an HTTP redirect status code. The fetcher
+MUST follow the HTTP standard, honour the redirect response and fetch the
+file from the redirected URL.
+
+The last part of the URL path is the version number of the schema.
+
+```
+http[s]://server[:port]/path/<version>
+```
+
+The part of the URL preceding the `<version>` is called the Schema Family
+identifier. All schemas in one Schema Family have identical Schema Family
+identifiers.
+
+To create a new version of the schema, copy the schema file for the last version
+in the schema family and add the definition of the new version. The schema file
+that corresponds to the new version must be retrievable at a new URL.
+
+Important: schema files are immutable once they are published. Once the schema
+file is retrieved it is recommended to be cached permanently. Schema files may
+also be packaged at build time with the software that anticipates it may need
+the schema (e.g. the latest OpenTelemetry schema file can be packaged at build
+time with OpenTelemetry Collector's schema translation processor).
+
+## Schema Version Number
+
+The version number follows the MAJOR.MINOR.PATCH format, similar to semver 2.0.
+
+Version numbers use the [ordering rules](https://semver.org/#spec-item-11)
+defined by the semver 2.0 specification. See how ordering is used in
+[Order of Transformations](#order-of-transformations). Other than the ordering
+rules the schema version numbers do not carry any other semantic meaning.
+
+OpenTelemetry schema version numbers match OpenTelemetry specification version
+numbers, see more details [here](#opentelemetry-schema).
+
+## Schema File
+
+A Schema File is a YAML file that describes the schema of a particular version.
+It defines the transformations that can be used to convert telemetry data
+represented in any older compatible version of the same schema family to
+this schema version.
+
+Here is the structure of the Schema File:
+
+```yaml
+# Defines the file format. MUST be set to 1.0.0.
+file_format: 1.0.0
+
+# The Schema URL that this file is published at. The version number in the URL
+# MUST match the highest version number in the "versions" section below.
+# Note: the schema version number in the URL is not related in any way to
+# the file_format setting above.
+schema_url: https://opentelemetry.io/schemas/1.2.0
+
+# Definitions for each schema version in this family.
+# Note: the ordering of versions is defined according to semver
+# version number ordering rules.
+versions:
+  <version_number_last>:
+    # definitions for this version. See details below.
+ + : + # definitions for previous version + ... + : + # Defines the first version. +``` + +Each `` section has the following structure: + +```yaml + : + all: + changes: + # sequence of transformations. + + resources: + changes: + # sequence of transformations. + + spans: + changes: + # sequence of transformations. + + span_events: + changes: + # sequence of transformations. + + metrics: + changes: + # sequence of transformations. + + logs: + changes: + # sequence of transformations. +``` + +There are 6 sub-sections under each version definition: "all", "resources", +"spans", "span_events", "metrics", "logs". The last 5 sub-sections in this list +contain definitions that apply only to the corresponding telemetry data type. +Section "all" contains definitions that apply to all types of telemetry data. + +Below we describe each section in detail. + +### all Section + +"all" section in the schema file defines transformations. It must contain a +sub-section named "changes" that defines how attributes were renamed from the +previous version to this version. + +The "changes" section is a sequence of transformations. Only one transformation +is supported for section "all": "rename_attributes" transformation. + +"rename_attributes" transformation requires a map of key/value pairs, where the +key is the old name of the attribute used in the previous version, the value is +the new name of the attribute starting from this version. Here is the structure: + +```yaml + all: + changes: + - rename_attributes: + attribute_map: + # map of key/values. +``` + +The transformations in section "all" apply to the following telemetry data: +resource attributes, span attributes, span event attributes, log attributes, +metric attributes. + +Important: when converting from the previous version to the current version the +transformation sequence in section "all" is performed first. After that the +transformations in the specific section ("resources", "spans", "span_events", +"metrics" or "logs") that correspond to the data type that is being converted +are applied. + +Note that "rename_attributes" transformation in most cases is reversible. It is +possible to apply it backwards, so that telemetry data is converted from this +version to the previous version. The only exception is when 2 or more different +attributes in the previous version are renamed to the same attribute in the new +version. In that case the reverse transformation is not possible since it would +be ambiguous. When the reverse transformation is not possible it is considered +an incompatible change. In this case the MAJOR version number of the schema +SHOULD be increased in the new version. + +### resources Section + +"resources" section is very similar in its structure to "all". Like section +"all" the transformations in section "resources" may contain only +"rename_attributes" transformation. + +The only difference from section "all" is that this transformation is only +applicable to Resource data type. + +Here is the structure: + +```yaml + resources: + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # the previous version, the values are the new attribute name + # starting from this version. +``` + +### spans Section + +"spans" section in the schema file defines transformations that are applicable +only to Span data type. It must contain a sub-section named "changes" that +defines a sequence of actions to be applied to convert Spans from the previous +version to this version. 
+ +One transformation is supported for section "span": "rename_attributes". + +#### rename_attributes Transformation + +This is similar to the "rename_attributes" transformation supported in "all" and +"resource" sections. In addition it is also possible to optionally specify spans +that the transformation should apply to. Here is the structure: + +```yaml + spans: + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # in the previous version, the values are the new attribute name + # starting from this version. +``` + +### span_events Section + +"spans_events" section in the schema file defines transformations that are +applicable only to Span's Event data type. It must contain a sub-section named +"changes" that defines a sequence of actions to be applied to convert events +from the previous version to this version. + +Two transformations are supported for section "spans_events": "rename_events" +and "rename_attributes". + +#### rename_events Transformation + +This transformation allows to change event names. It is applied to all events or +only to events of spans that have the specified name. Here is the structure: + +```yaml + span_events: + changes: + - rename_events: + name_map: + # The keys are old event name used in the previous version, the + # values are the new event name starting from this version. +``` + +#### rename_attributes Transformation + +This is similar to the "rename_attributes" transformation supported in "all" and +"resource" sections. In addition it is also possible to optionally specify spans +and events that the transformation should apply to (both optional conditions +must match, if specified, for transformation to be applicable). Here is the +structure: + +```yaml + span_events: + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # in the previous version, the values are the new attribute name + # starting from this version. + + apply_to_spans: + # Optional span names to apply to. If empty applies to all spans. + + apply_to_events: + # Optional event names to apply to. If empty applies to all events. +``` + +### metrics Section + +"metrics" section in the schema file defines transformations that are applicable +only to Metric data type. It must contain a sub-section named "changes" that +defines a sequence of actions to be applied to convert metrics from the previous +version to this version. + +Two transformations are supported for section "metrics": "rename_metrics" and +"rename_attributes". + +#### rename_metrics Transformation + +This transformation allows to change metric names. It is applied to all metrics. +Here is the structure: + +```yaml + metrics: + changes: + - rename_metrics: + # map of key/values. The keys are the old metric name used + # in the previous version, the values are the new metric name + # starting from this version. +``` + +#### rename_attributes Transformation + +This is similar to the "rename_attributes" transformation supported in "span" +sections. Here is the structure: + +```yaml + metrics: + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # in the previous version, the values are the new attribute name + # starting from this version. + + apply_to_metrics: + # Optional. If it is missing the transformation is applied + # to all metrics. 
If it is present the transformation is applied + # only to the metrics with the name that is found in the sequence + # specified below. +``` + +### logs Section + +"logs" section in the schema file defines transformations that are applicable +only to the Log Record data type. It must contain a sub-section named "changes" +that defines a sequence of actions to be applied to convert logs from the +previous version to this version. + +One transformation is supported for section "logs": "rename_attributes". + +#### rename_attributes Transformation + +This is similar to the "rename_attributes" transformation supported in "spans" +section. Here is the structure: + +```yaml + logs: + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # the previous version, the values are the new attribute name + # starting from this version. +``` + +### Order of Transformations + +When converting from older version X to newer version Y of the schema (both +belonging to the same schema family) the transformations specified in each +version in the range [X..Y] are applied one by one, i.e. first we convert from X +to X+1, then from X+1 to X+2, ..., Y-2 to Y-1, Y-1 to Y. (Note, version numbers +are not a continuum of integer numbers. The notion of adding a natural number 1 +to the version number is a placeholder for the phrase "next newer version number +that is defined for this schema family".) + +The transformations that are listed for a particular version X describe changes +that happened since the schema version that precedes version X and belongs to +the same schema family. These transformations are listed in 6 sections: "all", +"resources", "spans", "span_events", "metrics", "logs". Here is the order in +which the transformations are applied: + +- Transformations in section "all" always are applied first, before any of the + transformations in the other 5 sections. + +- Transformations in section "spans" are applied before transformations in + section "span_events". + +- The order in which the transformations in remaining sections ("resources", + "metrics", logs") are applied relative to each other or relative to "spans" + section is undefined (since there are not-interdependencies, the order does + not matter). + +In the "changes" subsection of each particular one of these 6 sections the +sequence of transformations is applied in the order it is listed in the schema +file, from top to bottom. + +When converting in the opposite direction, from newer version Y to older version +X the order of transformation listed above is exactly the reverse, with each +individual transformation also performing the reverse conversion. + +### Schema File Format Number + +The "file_format" setting in the schema file specifies the format version of the +file. The format version follows the MAJOR.MINOR.PATCH format, similar to semver +2.0. + +The "file_format" setting is used by consumers of the file to know if they are +capable of interpreting the content of the file. + +The current value for this setting is "1.0.0" and it will be published in +OpenTelemetry specification once this OTEP is accepted. Any change to this +number MUST follow the OTEP process and be published in the specification. + +The current schema file format allows representing a limited set of +transformations of telemetry data. 
We anticipate that in the future more types +of transformations may be desirable to support or other, additional information +may be desirable to record in the schema file (see +[Future Possibilities](#future-possibilities)). + +As the schema file format evolves over time the format version number SHOULD +change according to the following rules: + +- PATCH number SHOULD be increased when the file format changes in a way that + does not affect the existing consumers of the file. For example addition of a + completely new section in the schema file that has no effect on existing + sections and has no effect on any existing schema functionality may be done + via incrementing the PATCH number only. This approach is only valid if the new + setting in the file is completely and safely ignorable by all existing + processing logic. + + For example adding a completely new section that describes the full state of + the schema has no effect on existing consumers which only care about "changes" + section (unless we explicitly define the semantics of the new section such + that it _needs_ to be taken into account when processing schema changes). So, + adding such a new section can be done using a PATCH number increase. + +- MINOR number SHOULD be increased if a new setting is added to the file format + in a backward compatible manner. "Backward compatible" in this context means + that consumers that are aware of the new MINOR number can consume the file of + a particular MINOR version number or of any MINOR version number lower than + that, provided that MAJOR version numbers are the same. Typically, this means + that the added setting in file format is optional and the default value of the + setting matches the behavior of the previous file format version. + + Note: there is no "forward compatibility" based on MINOR version number. + Consumers which support reading up to a particular MINOR version number SHOULD + NOT attempt to consume files with higher MINOR version numbers. + +- MAJOR number SHOULD be increased if the file format is changed in an + incompatible way. For example adding a new transformation type in the + "changes" section is an incompatible change because it cannot be ignored by + existing schema conversion logic, so such a change will require a new MAJOR + number. + +Correspondingly: + +- Consumers of the schema file SHOULD NOT attempt to interpret the schema file + if the MAJOR version number is different (higher or lower) than what the + consumer supports. + +- Consumers of the schema file SHOULD NOT attempt to interpret the schema file + if the MINOR version number is higher than what the consumer supports. + +- Consumers MAY ignore the PATCH number. + +To illustrate this with some examples: + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| File Format Version | Consumer's Expected Version | Consumer Can Read?         |
+|---------------------|-----------------------------|----------------------------|
+| 1.0.0               | 1.0.0                       | yes                        |
+| 1.0.x               | 1.0.y                       | yes, for any x and y       |
+| 1.a.x               | 1.b.x                       | yes if a<=b, otherwise no  |
+| 2.0.0               | 1.x.y                       | no                         |
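+
+The rules above can be condensed into a short compatibility check. The following Go sketch is illustrative only (the function names are hypothetical and not part of any schema tooling):
+
+```go
+package schema
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+// CanReadFileFormat reports whether a consumer that supports file_format
+// version "supported" can interpret a schema file declaring "declared".
+// Per the rules above: MAJOR must match exactly, the file's MINOR must not
+// exceed the consumer's MINOR, and PATCH is ignored.
+func CanReadFileFormat(declared, supported string) (bool, error) {
+	dMaj, dMin, err := parseMajorMinor(declared)
+	if err != nil {
+		return false, err
+	}
+	sMaj, sMin, err := parseMajorMinor(supported)
+	if err != nil {
+		return false, err
+	}
+	return dMaj == sMaj && dMin <= sMin, nil
+}
+
+// parseMajorMinor extracts the MAJOR and MINOR components of a
+// MAJOR.MINOR.PATCH version string; the PATCH component is ignored.
+func parseMajorMinor(version string) (major, minor int, err error) {
+	parts := strings.Split(version, ".")
+	if len(parts) != 3 {
+		return 0, 0, fmt.Errorf("invalid file_format version %q", version)
+	}
+	if major, err = strconv.Atoi(parts[0]); err != nil {
+		return 0, 0, err
+	}
+	if minor, err = strconv.Atoi(parts[1]); err != nil {
+		return 0, 0, err
+	}
+	return major, minor, nil
+}
+```
+
+For example, `CanReadFileFormat("1.1.0", "1.0.0")` returns false (higher MINOR than supported), while `CanReadFileFormat("1.0.7", "1.1.0")` returns true (PATCH is ignored).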
+ +## OTLP Changes + +To allow carrying the Schema URL in emitted telemetry it is necessary to add a +schema_url field to OTLP messages. + +We add schema_url fields to the following messages: + +```protobuf +message ResourceSpans { + ... + // This schema_url applies to the "resource" field and to all spans and span events + // in the "instrumentation_library_spans" except the spans and span events which + // have a schema_url specified in the nested InstrumentationLibrarySpans message. + string schema_url = 3; +} +message InstrumentationLibrarySpans { + ... + // This schema_url applies to all spans in the "spans" field regardless of the + // value of the schema_url field in the outer ResourceSpans message. + string schema_url = 3; +} + +message ResourceMetrics { + ... + // This schema_url applies to the "resource" field and to all metrics in the + // "instrumentation_library_metrics" except the metrics which have a schema_url + // specified in the nested InstrumentationLibraryMetrics message. + string schema_url = 3; +} +message InstrumentationLibraryMetrics { + ... + // This schema_url applies to all metrics in the "metrics" field regardless of the + // value of the schema_url field in the outer ResourceMetrics message. + string schema_url = 3; +} + +message ResourceLogs { + ... + // This schema_url applies to the "resource" field and to all logs in the + // "instrumentation_library_logs" except the logs which have a schema_url + // specified in the nested InstrumentationLibraryLogs message. + string schema_url = 3; +} +message InstrumentationLibraryLogs { + ... + // This schema_url applies to all logs in the "logs" field regardless of the + // value of the schema_url field in the outer ResourceLogs message. + string schema_url = 3; +} +``` + +The schema_url field in the ResourceSpans, ResourceMetrics, ResourceLogs +messages applies to the contained Resource, Span, SpanEvent, Metric, LogRecord +messages. + +The schema_url field in the InstrumentationLibrarySpans message applies to the +contained Span and SpanEvent messages. + +The schema_url field in the InstrumentationLibraryMetrics message applies to the +contained Metric messages. + +The schema_url field in the InstrumentationLibraryLogs message applies to the +contained LogRecord messages. + +If schema_url field is non-empty both in Resource message and in the contained +InstrumentationLibrary message then the value in InstrumentationLibrary +message takes the precedence. + +## API and SDK Changes + +### Instrumentation Library Schema URL + +The OpenTelemetry API needs to be changed to allow getting a +Tracer/Meter/LogEmitter that is associated with a Schema URL (in addition to the +association with instrumentation library name and version that is already +supported). + +This change needs to be done such that we do not break APIs that are already +declared stable (particularly the Get Tracer API). + +Depending on the language the following approaches are possible: + +- Add a third, optional parameter `schema_url` to Get Tracer/Get Meter/Get + LogEmitter methods of corresponding providers. This may not be the right + approach for languages where ABI stability is part of our guarantees since it + likely breaks the ABI. + +- Add a method overload that allows passing 3 parameters (instrumentation + library name, version and schema url) to obtain a Tracer/Meter/LogEmitter. + This is likely the preferred approach for languages where method overloads are + possible. 
+ +- If neither of the above 2 approaches are possible to do in non-breaking manner + then the API may introduce a `SetSchema(schema_url)` method to + Tracer/Meter/LogEmitter instance. The method MUST be called only once and + MUST be called before any telemetry is emitted using the instance. + +There may be other ways to modify the API to allow the association with a Schema +URL. Language maintainers SHOULD choose the idiomatic way for their language. + +The effect of associating a Schema URL with a Tracer/Meter/LogEmitter SHOULD be +that the schema_url in the InstrumentationLibrarySpans, +InstrumentationLibraryMetrics, InstrumentationLibraryLogs message for all the +telemetry emitted with the associated Tracer/Meter/LogEmitter is populated with +the provided Schema URL value. + +If the Tracer/Meter/LogEmitter is not associated with a Schema URL then the +exporter MUST leave the schema_url field in OTLP messages unset, in which case +the application-wide Schema URL [will apply](#application-wide-schema-url). + +Open Question: how to make it easy for instrumentation libraries to refer to a +particular OpenTelemetry schema version and also make sure any semantic +convention helpers the library uses (e.g. constants that define the semantic +conventions) match exactly that same schema version? One possible solution is to +introduce helper packages per schema version that the libraries can use, e.g. +constants that define the semantic conventions and the corresponding schema +version url. This should be likely the topic for a follow-up OTEP. + +### Application-wide Schema URL + +The SDK interface MUST provide a way for the user to optionally set an +application-wide Schema URL. This Schema URL will be populated in all +ResourceSpans, ResourceMetrics and ResourceLogs messages emitted by OTLP +Exporter. + +If the user does not set an application-wide Schema URL then the current Schema +URL MUST be populated by OTLP Exporter in the messages, where "current" means +the version of OpenTelemetry Schema against which the SDK is coded. + +Note that if there is a schema url associated with instrumentation library it +overrides the application-wide schema url as described [here](#otlp-changes). + +## OpenTelemetry Schema + +OpenTelemetry publishes it own schema at +`https://opentelemetry.io/schemas/`. The version number of the schema +is the same as the specification version number which publishes the schema. +Every time a new specification version is released a corresponding new schema +MUST be released simultaneously. If the specification release did not introduce +any change the "changes" section of the corresponding version in the schema file +will be empty. + +As of the time of this proposal the specification is at version 1.2.0 and the +corresponding schema file if it was published with the specification would look +like this: + +```yaml +file_format: 1.0.0 +schema_url: https://opentelemetry.io/schemas/1.2.0 +versions: + 1.2.0: +``` + +Since 1.2.0 is the first published version of OpenTelemetry schema there are no +"changes" section and we omitted all previous versions from the file since there +is nothing to record for earlier versions. + +All OpenTelemetry instrumentation solutions will follow this schema. + +## Performance Impact + +Performance impact of the changes to OTLP protocol are negligible provided that +schema conversion is not necessary. 
The cost of recording the schema_url by +telemetry sources and the cost of checking the schema_url field in telemetry +consumers is negligible compared to other costs. When the telemetry schema +matches the expected schema there is no additional work involved at all. + +When the schema version does not match and it is necessary to perform conversion +the performance impact of the schema conversions depend on the volume and type +of transformations can be significant. We +[benchmarked](https://github.com/tigrannajaryan/telemetry-schema/blob/main/schema/perf_test.go) +one use-case where the telemetry schema conversion of spans and metrics from one +version to another is performed using a Go implementation. + +The benchmark does the following: + +- Compares the CPU time necessary for schema conversion to the time necessary + for Protobuf decoding. This is a useful comparison because telemetry consumers + have to perform Protobuf decoding so it is the minimum baseline work against + which we can measure the impact of the additional conversion work. + +- Uses a hypothetical schema version change where 20 attributes in semantic + conventions were renamed. + +- Uses data composed of batches of 100 spans each with 10 attributes or 100 + metric data points (Int64 Gauge type) with 2 attributes per data point. Each batch + is associated with one resource that has 20 attributes. + +Here are the benchmark results: + +``` +BenchmarkDecode/Trace/Attribs-8 6121 919271 ns/op +BenchmarkDecode/Metric/Int64-8 9516 635418 ns/op +BenchmarkDecodeAndConvertSchema/Trace/Attribs-8 5988 943158 ns/op +BenchmarkDecodeAndConvertSchema/Metric/Int64-8 8588 653266 ns/op +``` + +BenchmarkDecode is decoding time only. BenchmarkDecodeAndConvertSchema is +decoding plus schema conversion time. We see that the processing time to do the +conversion is about 3% of decoding time. + +The benchmarking done is for illustration purposes only. Real-world results will +depend on the composition of the data, the volume of transformations done during +conversion, programming language used, etc. However, we feel that this one data +point is still useful for understanding that the potential impact is acceptable +(and likely will be a very small percentage of total processing most telemetry +consumers do). + +## Open Questions + +- Do we need to support a concept of "pre-release" semantic conventions that can + be broken freely and are not part of strict schema checks? Supposedly, this is + already possible simply by avoiding introducing a new schema version when such + semantic conventions change. + +- How to make it easy for instrumentation libraries to refer to a particular + OpenTelemetry schema version and also make sure any semantic convention + helpers the library uses (e.g. constants that define the semantic conventions) + match exactly that same schema version? One possible solution is to introduce + helper packages per schema version that the libraries can use, e.g. constants + that define the semantic conventions and the corresponding schema version url. + This should be likely the topic for a follow-up OTEP. + +- Should we make it possible to include the entire Schema File in OTLP requests + in addition to the Schema URL so that recipients of the OTLP request do not + have to fetch the Schema File (which may potentially be impossible if there + are network issues)? 
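+
+As a sketch of the helper-package idea mentioned in the open questions above, a per-schema-version package could bundle the schema URL together with the semantic convention constants, so instrumentation cannot accidentally mix constants from one version with the schema URL of another. The package name and contents below are purely illustrative, not an existing OpenTelemetry package:
+
+```go
+// Package semconv is a hypothetical helper package that would be published
+// once per schema version (e.g. as ".../semconv/v1.2.0"); the names and
+// constants here are illustrative only.
+package semconv
+
+// SchemaURL identifies the schema version that the constants in this
+// package correspond to.
+const SchemaURL = "https://opentelemetry.io/schemas/1.2.0"
+
+// Attribute names as defined by this version of the semantic conventions.
+const (
+	AttributeServiceName           = "service.name"
+	AttributeDeploymentEnvironment = "deployment.environment"
+)
+```
+
+An instrumentation library importing such a package would pass `semconv.SchemaURL` when obtaining its Tracer/Meter/LogEmitter and use the attribute constants from the same package, keeping the emitted telemetry and the declared schema version in lockstep.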
+ +## Future Possibilities + +### Parent Schema + +A schema can optionally have a _parent_ schema (so for example a custom schema +can be based on OpenTelemetry schema, which will be the parent in that case). A +schema that does not have a parent is called a _root_ schema (so OpenTelemetry +schema will be a root schema). We may have more than one root schema (which +opens up interesting possibilities, but this discussion is out of scope of this +proposal). + +All schemas therefore form a set of rooted trees, where nodes represent schema +versions and pairs of nodes are connected via edges of two types: 1) +parent-child relationship, 2) consecutive versions in the same schema family. +Each edge in the tree represents a schema transformation. + +Once we have the tree of schemas it is possible to then convert from a node in a +tree to another node in a tree given that the nodes are connected via a path +through edges. + +Further research is needed in this area if we find that we need this concept of +parent schemas. + +### Current State in Schema + +The schema file in this OTEP describes changes from version to version. This is +very useful for performing conversions between versions but it does not capture +the current state of a particular version. + +We can add an ability to specify the full current schema of each particular +version in the schema file (in addition to the "changes" section). This will +open up a few interesting possibilities: + +- Automatic validation that emitted telemetry confirms to the declared schema. + +- OpenTelemetry semantic conventions document can be automatically generated + from this formal schema file. This will remove the need to have semantic + conventions codified and generated from yaml files in the specification. + +- Consumers of telemetry can use this information to interpret the received + telemetry. + +- It may be possible to auto-generate the "changes" section in schema files as a + delta between full states of versions. + +An example of what can be recorded as the "current state" is the encoding of the +Body of the Log Records. It was described as a concept in +[this earlier proposal](https://docs.google.com/document/d/1ZExye1lW43owwaxcbjOvl0P2qER-UaYd_MyxItc2h0k/edit#). +The suggestion was to record the encoding of the Body field of Log Records, +which can be also implemented as a new setting in the "current state" of the +"logs" section in the schema file. + +### Other Transformation Types + +This OTEP introduces a limited set of transformations, while deliberately +keeping the type of transformations to a minimum. It is easy to see that +telemetry can evolve in other, more sophisticated ways over time (see e.g. +[this list](https://github.com/open-telemetry/opentelemetry-specification/issues/1324)). + +More transformation types can be added in the future when there is a need to +represent a particular change in the telemetry schema that is not possible with +the current set of transformations. However, care should be taken to avoid +introducing unnecessary or overly complicated transformation types, especially +ones that cannot be applied on a local portion of telemetry data and requires a +full state (e.g. transformations, such as aggregations, etc), since such +transformations can place significant implementation burden on telemetry sources +that wish to support the notion of telemetry schemas. 
+ +### Version Convertability + +Depending on the changes that happened it may or may not be possible to convert +telemetry from one version to another in a lossless and unambiguous way. When +such conversion is possible we say that the schema is "convertible" from one +particular version to another. + +Generally, given an set of possible transformations and a pair of versions X and +Y, it may be the converting telemetry from X to Y is possible, while the +opposite direction - converting from Y to X is not possible. + +The transformations defined in this proposal make all conversions from older +schema versions to new versions possible. The opposite direction in some case +may not be possible (see for example the explanation about reversible +transformation [all Section](#all-section)). + +In the future we may also want to add ability to explicitly declare schema +versions as non convertible. This may be necessary to express the fact that the +schema has changed in a way that makes it incompatible, but schema file +transformations alone are not expressive enough to describe that fact. + +## Alternates Considered + +### Freeze Schema + +Instead of introducing formal schemas, schema files and version we can require +that the instrumentation once it is created never changes. Attributes are never +renamed, the emitted telemetry always remains the same, with the exception of +changes that cannot affect existing telemetry consumers. + +This was considered but does not seem to be an acceptable proposal. Mistakes +happen, semantic conventions may need to be changed because they were +incorrectly defined, there may be bugs in instrumentation that need to be fixed, +etc. We believe that attempting to freeze the schema and only allow fully +backward compatible changes in emitted telemetry is too restrictive and very +difficult to follow long term. + +### Name Aliases + +This approach solves a smaller subset of schema evolution problems: change of +names of attributes, metrics, etc. + +When such changes happen the telemetry sources can continue producing telemetry +using old attribute names, and also add attributes that have the new name and +exact same value as the old attribute. So, the telemetry data at the same time +has aliases of attributes recorded. + +Because both old and new names are present in any emitted telemetry the +consumers of the telemetry can just continue assuming that the attributes they +are interested in are present. There is no formal schema management in this +approach. We simply publish the same telemetry data in a way that both old and +new consumers can find the bits they are looking for. + +The benefit of this approach is that it is much simpler than proposed in this +document. + +The downsides of this approach are: + +- It can handle limited types of schema changes. It is easy to demonstrate that + some changes that can be easily handled via more formal schemas concept will + fail if we use the aliases approach (e.g. swapping of attribute names X and Y + cannot be handled by aliases). + +- Over time more and more data should be published increasing the volume of the + published data. Depending on how exactly the aliases handled we may need to + duplicate a lot (e.g. if metric name has change we may have to produce the + entire same metric twice, essentially duplicating the traffic) or may require + breaking changes in the OpenTelemetry API or protocol to allow aliases + natively. 
+ +### Schema Negotiation + +Instead of performing transformations of the schema let telemetry consumers +(backends) negotiate the version they expect to receive from the telemetry +source. + +The benefit of this approach is that the backend can support just one schema, +there is no need to define transformation rules or do the transformations at +all. + +The downsides are: + +- Requires telemetry sources to be able to emit telemetry in multiple different + schema versions, which is a significant burden on the telemetry sources. This + is likely a dealbreaker. + +- Requires a communication channel from telemetry consumers to telemetry sources + which currently does not exist (all communication is currently + one-directional, from sources to consumers). + +## Prior Art + +- [OpenTelemetry Log Data Model: Body Meta-Data](https://docs.google.com/document/d/1ZExye1lW43owwaxcbjOvl0P2qER-UaYd_MyxItc2h0k/edit#) + by Christian Beedgen. + +- [Generic event encoding schemas](https://docs.google.com/document/d/11ccT_zBbiCfwKyi6TMuy2sA3nUdElNQsDJOQ79icVKs/edit#) + by Josh MacDonald. + +- [Structured Logging Payloads](https://docs.google.com/document/d/1Xu2tCU5vjw8RNqzqFD6y9ZwBqNtg_RKW_HzB72bd4KQ/edit#) + by David Poncelow. + +- [Versioning of Attributes](https://github.com/cloudevents/spec/blob/v1.0/primer.md#versioning-of-attributes) + and + [dataschema](https://github.com/cloudevents/spec/blob/v1.0/spec.md#dataschema) + field in CloudEvents specification. + +- Splunk + [sourcetype](https://docs.splunk.com/Documentation/Splunk/8.1.3/Data/Whysourcetypesmatter) + field. + +## Appendix A. Example Schema File + +```yaml +# Defines the file format. MUST be set to 1.0.0. +file_format: 1.0.0 + +# The Schema URL that this file is published at. The version number in the URL +# MUST match the highest version number in the "versions" section below. +# Note: the schema version number in the URL is not related in any way to +# the file_format setting above. +schema_url: https://opentelemetry.io/schemas/1.1.0 + +# Definitions for each schema version in this family. +# Note: the ordering of versions is defined according to semver +# version number ordering rules. +versions: + 1.1.0: + # Definitions for version 1.1.0. + all: + # Definitions that apply to all data types. + changes: + # Transformations to apply when converting from version 1.0.0 to 1.1.0. + - rename_attributes: + # map of key/values. The keys are the old attribute name used + # the previous version, the values are the new attribute name + # starting from this version. + # Rename k8s.* to kubernetes.* + k8s.cluster.name: kubernetes.cluster.name + k8s.namespace.name: kubernetes.namespace.name + k8s.node.name: kubernetes.node.name + k8s.node.uid: kubernetes.node.uid + k8s.pod.name: kubernetes.pod.name + k8s.pod.uid: kubernetes.pod.uid + k8s.container.name: kubernetes.container.name + k8s.replicaset.name: kubernetes.replicaset.name + k8s.replicaset.uid: kubernetes.replicaset.uid + k8s.cronjob.name: kubernetes.cronjob.name + k8s.cronjob.uid: kubernetes.cronjob.uid + k8s.job.name: kubernetes.job.name + k8s.job.uid: kubernetes.job.uid + k8s.statefulset.name: kubernetes.statefulset.name + k8s.statefulset.uid: kubernetes.statefulset.uid + k8s.daemonset.name: kubernetes.daemonset.name + k8s.daemonset.uid: kubernetes.daemonset.uid + k8s.deployment.name: kubernetes.deployment.name + k8s.deployment.uid: kubernetes.deployment.uid + + resources: + # Definitions that apply to Resource data type. 
+ changes: + - rename_attributes: + telemetry.auto.version: telemetry.auto_instr.version + + spans: + # Definitions that apply to Span data type. + changes: + - rename_attributes: + attribute_map: + # map of key/values. The keys are the old attribute name used + # in the previous version, the values are the new attribute name + # starting from this version. + peer.service: peer.service.name + + span_events: + # Definitions that apply to Span Event data type. + changes: + - rename_events: + # The keys are old event name used in the previous version, the + # values are the new event name starting from this version. + name_map: {stacktrace: stack_trace} + + - rename_attributes: + attribute_map: + peer.service: peer.service.name + apply_to_events: + # Optional event names to apply to. If empty applies to all events. + - exception.stack_trace + + metrics: + # Definitions that apply to Metric data type. + changes: + - rename_metrics: + # map of key/values. The keys are the old metric name used + # in the previous version, the values are the new metric name + # starting from this version. + container.cpu.usage.total: cpu.usage.total + container.memory.usage.max: memory.usage.max + + - rename_attributes: + attribute_map: + status: state + apply_to_metrics: + # Optional. If it is missing the transformation is applied + # to all metrics. If it is present the transformation is applied + # only to the metrics with the name that is found in the sequence + # specified below. + - system.cpu.utilization + - system.memory.usage + - system.memory.utilization + - system.paging.usage + + logs: + # Definitions that apply to LogRecord data type. + changes: + - rename_attributes: + attribute_map: + process.executable_name: process.executable.name + + 1.0.0: + # First version of this schema family. +``` diff --git a/oteps/0155-external-modules.md b/oteps/0155-external-modules.md new file mode 100644 index 00000000000..1a045d892b0 --- /dev/null +++ b/oteps/0155-external-modules.md @@ -0,0 +1,132 @@ +# Ecosystem Management + +Proposal how to leverage wider community contributing instrumentations and other packages. + +## Motivation + +For OpenTelemetry to become a de-facto standard in observability there must exist a vast ecosystem of OpenTelemetry components, +including integrations with various libraries and frameworks in all languages supported by OpenTelemetry. +We cannot possibly expect that all these integrations will be provided by the core maintainers of OpenTelemetry. +We hope that wider community will integrate their projects with OpenTelemetry. +We have to encourage that by providing great documentation, examples and tooling to integration authors, +while still providing our end-users with some way to discover all available OpenTelemetry components together with some visibility into their quality. + +## Explanation + +The [OpenTelemetry Registry](https://opentelemetry.io/registry/) serves as a central catalogue of all known OpenTelemetry components, +both provided by core maintainers of the project and any third party. + +In order for a component to be included into Registry its authors have to fill a [self-assessment form](#registry-self-assessment-form). + +The registry should allow a clear visibility of component's ownership, quality and compatibility with other OpenTelemetry components. + +A component can be removed from the Registry if any declaration from the self-assessment form is violated and not remedied +in a timely manner, provisionally three or four weeks. 
+ +## Internal details + +We distinguish the following sources of OpenTelemetry components and integrations. + +### Native or built-in instrumentations + +Any library or framework can use the OpenTelemetry API to natively produce telemetry. +We encourage library authors to submit their library for inclusion into the OpenTelemetry Registry. + +### Core components + +OpenTelemetry SIGs may provide instrumentation for any libraries that are deemed important enough by the SIG’s maintainers. +By doing this the SIG maintainers commit to support this instrumentation (including future versions of the library), +provide updates/fixes, including security patches and guarantee that they comply with OpenTelemetry semantic conventions and best practices. + +Depending on the SIG, these instances of core instrumentation may share the repository, infrastructure and maintainers with +the OpenTelemetry API/SDK implementation for this language or be separate. + +All instances of core instrumentation must be included into the OpenTelemetry Registry following the usual process described above. + +### Contrib components + +Any language SIG may have one or more “contrib” repos containing components contributed by developers with an interest in specific parts of the instrumentation ecosystem. +These repositories may have a separate set of approvers/maintainers than the core API/SDK repo. +Contrib repositories are as important for the project success as core repository, but may not require the same level of expertise. +In fact, these repositories often calls for other set of skills and customer's understanding. +On contrib repository creation, new set of approvers and maintainers can be added as we do for any new repository, without time/contribution requirements. +Repository maintainers are also encouraged to promote contributors to approver/maintainer role in this repository +based on targeted contributions and expertise of the contrib repository rather than overall SIG scope. +It is important to keep the process fair and inclusive by following the formal guidance published [here](https://github.com/open-telemetry/community/blob/main/community-membership.md#maintainer). +A contrib repository may leverage the CODEOWNERS functionality of GitHub to assign maintainers to individual packages +even if this means granting write permissions to the whole repo. +The goal should be to distribute the load of reviewing PRs and accepting changes as much as possible, while keeping reliability and overall quality of components and fair governance. + +All components in a contrib repository are expected to be included into the OpenTelemetry Registry following the usual process described above. + +We should welcome all contributions and make the inclusion process (including following all our requirements) as easy as possible. +The goal is to encourage all contributors to include their components into a contrib repo as opposed to hosting them separately. +This way they can reuse existing infrastructure for testing, publishing, security scanning etc. +This will also greatly simplify responsibility transfer between different maintainers if their priorities change. +It also promotes the development and maintenance of a single instrumentation package for each instrumentation source, +so that work isn't spread amongst multiple parallel solutions. 
+ +Language SIGs are encouraged to provide a testing harness to verify that component adheres to OpenTelemetry semantic conventions +and recommendations for OpenTelemetry instrumentations design when OpenTelemetry starts publishing them. + +A high volume of contrib contributions presents a burden for language maintainers. There are two suggestions for tackling this: + +- Create an "experimental" folder within contrib. The contents of this folder are not reviewed or maintained by language repository maintainers, but +many of the other benefits of being within a contrib repository remain. +Experimental contributions should be marked as such in the Registry. +- Add more approvers and maintainers, perhaps some who exclusively focus on submissions to contrib. + +### External components + +If component authors, for whatever reason, want to host their contribution outside an OpenTelemetry contrib repository +they are free to do so (though we encourage all contributions to go into contrib or core language repositories). +Their submission for inclusion into OpenTelemetry Registry is still welcomed, subject to the same process described above. + +### Distribution + +Whenever OpenTelemetry components are published to any repository other than the OpenTelemetry Registry (such as npm registry or Maven Central), +only core and contrib components can be published under "opentelemetry" namespace. +Native and external components are to be published under their own namespace. + +In case the OpenTelemetry SIG provides any kind of "all-in-one" instrumentation distribution (e.g. as Java and .NET do) +their should be an option to get a distribution with only core and contrib packages included. +The OpenTelemetry Registry should provide a way to easily obtain a list of these components. +The SIG may provide other distributions as well. +If possible, SIGs should provide a mechanism to include any external component during target application's build- or runtime. +This may mean a separate language-specific component API that all components are encouraged to implement. + +## Trade-offs and mitigations + +* How easy it is to get merge permission in contrib repo? +The harder it is, the larger is the maintenance burden on the core team. +The easier it is, the more uncertainty there is about the quality of contributions. +Can every language SIG decide this for themselves or should we decide together? + +## Open questions + +### Registry self-assessment form + +The exact list should be developed separately, but at least the component's author should declare that + +* It uses permissive OSS license approved by CNCF +* It does not have any known security vulnerabilities +* It produces telemetry which adheres to OpenTelemetry semantic conventions +* If the OpenTelemetry/SIG provides a testing harness to verify produced telemetry, that tests were used and passed +* Authors commit a reasonable effort into future maintenance of this component + +### Component information in the Registry + +The exact UI of component's page in the Registry is outside of this OTEP's scope, but some suggestions are: + +* Short description of the component +* Link to the original repository of this component +* Clear indication of component's authors/maintainers and its OpenTelemetry status (native/core/contrib/external) +* License used +* Language and targeted library +* Links to a list of current issues and known security vulnerabilities +* Some way of component's quality as perceived by its users, e.g. 
rating or stars or thumbs up/down +* OpenTelemetry API/SDK and library versions that this component was tested against +* If OpenTelemetry verification harness is used by this component +* A filled self-assessment form +* Any quality indicators we may want to standardize on (e.g. test coverage) +* Dev stats (date of creation, date of last release, date of last commit) diff --git a/oteps/0156-columnar-encoding.md b/oteps/0156-columnar-encoding.md new file mode 100644 index 00000000000..e42c1c8b60f --- /dev/null +++ b/oteps/0156-columnar-encoding.md @@ -0,0 +1,1018 @@ +# OTel Arrow Protocol Specification + +**Author**: Laurent Querel, F5 Inc. + +**Keywords**: OTLP, Arrow Columnar Format, Bandwidth Reduction, Multivariate Time-series, Logs, Traces. + +**Abstract**: This OTEP describes a new protocol, the OTelArrow protocol, which is based on a **generic columnar representation +for metrics, logs and traces**. This protocol significantly improves efficiency in scenarios involving the transmission +of large batches of metrics, logs, traces. Moreover, it provides a better representation for [multivariate time-series](#multivariate-time-series). +The OTelArrow protocol also supports a fallback mechanism to the [OpenTelemetry protocol (OTEP 0035)](https://github.com/open-telemetry/oteps/blob/main/text/0035-opentelemetry-protocol.md) +in instances when one of the endpoints does not support the OTelArrow protocol. + +**Reference implementation**: The [OTel Arrow Adapter](https://github.com/f5/otel-arrow-adapter) Go library specifies +the protobuf spec, and implements the OTel Arrow Encoder/Decoder (main contributor [Laurent Querel](https://github.com/lquerel)). +An [experimental OTel Collector](https://github.com/open-telemetry/experimental-arrow-collector) has been implemented to +expose the new gRPC endpoint and to provide OTel Arrow support via the previous library (main contributor [Joshua MacDonald](https://github.com/jmacd)). 
+ +## Table of contents + +* [Introduction](#introduction) + * [Motivation](#motivation) + * [Validation](#validation) + * [Why Apache Arrow and How to Use It?](#why-apache-arrow-and-how-to-use-it) + * [Integration Strategy and Phasing](#integration-strategy-and-phasing) +* [Protocol Details](#protocol-details) + * [ArrowStreamService](#arrowstreamservice) + * [Mapping OTel Entities to Arrow Records](#mapping-otel-entities-to-arrow-records) + * [Logs Arrow Mapping](#logs-arrow-mapping) + * [Spans Arrow Mapping](#spans-arrow-mapping) + * [Metrics Arrow Mapping](#metrics-arrow-mapping) +* [Implementation Recommendations](#implementation-recommendations) + * [Protocol Extension and Fallback Mechanism](#protocol-extension-and-fallback-mechanism) + * [Batch ID Generation](#batch-id-generation) + * [Schema ID Generation](#schema-id-generation) + * [Traffic Balancing Optimization](#traffic-balancing-optimization) + * [Throttling](#throttling) + * [Delivery Guarantee](#delivery-guarantee) +* [Risks and Mitigation](#risks-and-mitigations) +* [Trade-offs and Mitigations](#trade-offs-and-mitigations) + * [Duplicate Data](#duplicate-data) + * [Incompatible Backends](#incompatible-backends) + * [Small Devices/Small Telemetry Data Stream](#small-devicessmall-telemetry-data-stream) +* [Future Versions and Interoperability](#future-versions-and-interoperability) +* [Prior Art and Alternatives](#prior-art-and-alternatives) +* [Open Questions](#open-questions) +* [Future Possibilities](#future-possibilities) +* [Appendix A - Protocol Buffer Definitions](#appendix-a---protocol-buffer-definitions) +* [Glossary](#glossary) +* [Acknowledgements](#acknowledgements) + +## Introduction + +### Motivation + +As telemetry data becomes more widely available and volumes increase, new uses and needs are emerging for the OTLP +ecosystem: cost-effectiveness, advanced data processing, data minimization. This OTEP aims to improve the OTLP +protocol to better address them while maintaining compatibility with the existing ecosystem. + +Currently, the OTLP protocol uses a "row-oriented" format to represent all the OTel entities. This representation works +well for small batches (<50 entries) but, as the analytical database industry has shown, a "column-oriented" +representation is more optimal for the transfer and processing of *large batches* of entities. The term "row-oriented" +is used when data is organized into a series of records, keeping all data associated with a record next to each other in +memory. A "column-oriented" system organizes data by fields, grouping all the data associated with a field next to each +other in memory. The main benefits of a columnar approach are: + +* **better data compression rate** (arrays of similar data generally compress better), +* **faster data processing** (see diagram below), +* **faster serialization and deserialization** (few arrays vs many in-memory objects to serialize/deserialize), +* **better IO efficiency** (less data to transmit). + +![row vs column-oriented](img/0156_OTEL%20-%20Row%20vs%20Column.png) + +This OTEP proposes to improve the [OpenTelemetry protocol (OTEP 0035)](https://github.com/open-telemetry/oteps/blob/main/text/0035-opentelemetry-protocol.md) +with a **generic columnar representation for metrics, logs and traces based on Apache Arrow**. Compared to the existing +OpenTelemetry protocol this compatible extension has the following improvements: + +* **Reduce the bandwidth requirements** of the protocol. 
The two main levers are: 1) a better representation of the
+  telemetry data based on a columnar representation, 2) a stream-oriented gRPC endpoint that is more efficient for
+  transmitting batches of OTLP entities.
+* **Provide a more optimal representation for multivariate time-series data**.
+  Multivariate time-series are currently not well compressed in the existing protocol (multivariate = related metrics
+  sharing the same attributes and timestamp). The OTel Arrow protocol provides a much better compression rate for this
+  type of data by leveraging the columnar representation.
+* **Provide more advanced and efficient telemetry data processing capabilities**. Increasing data volume, cost
+  efficiency, and data minimization require additional data processing capabilities such as data projection,
+  aggregation, and filtering.
+
+These improvements not only address the aforementioned needs but also answer the [open questions](https://github.com/open-telemetry/oteps/blob/main/text/0035-opentelemetry-protocol.md#open-questions)
+cited in OTEP 0035 (i.e. CPU usage, memory pressure, compression optimization).
+
+**It is important to understand that this proposal is complementary to the existing protocol. The row-oriented version
+is still suitable for some scenarios. Telemetry sources that generate a small amount of telemetry data should continue
+to use it. On the other side of the spectrum, sources and collectors generating or aggregating a large amount of
+telemetry data will benefit from adopting this extension to optimize the resources involved in the transfer and
+processing of this data. This adoption can be done incrementally.**
+
+Before detailing the specifications of the OTel Arrow protocol, the following two sections present: 1) a validation of
+the value of a columnar approach based on a set of benchmarks, 2) a discussion of the value of using Apache Arrow as a
+basis for columnar support in OTLP.
+
+### Validation
+
+A series of tests were conducted to compare compression ratios between OTLP and a columnar version of OTLP called OTel
+Arrow. The key results are:
+
+* For univariate time series, OTel Arrow is **2 to 2.5 times better in terms of bandwidth reduction while having an
+  end-to-end speed (including conversion to/from OTLP) 1.5 to 2 times slower in phase 1**. In **phase 2** the conversion
+  OTLP to/from Arrow is gone and the end-to-end speed is **3.1 to 11.2 times faster by our estimates**.
+* For multivariate time series, OTel Arrow is **3 to 7 times better in terms of bandwidth reduction while having an
+  end-to-end speed (including conversion to/from OTLP) similar to the univariate time series scenario in phase 1**. Phase 2
+  has not yet been estimated, but similar results are expected.
+* For logs, OTel Arrow is **1.6 to 2 times better in terms of bandwidth reduction while having an end-to-end speed
+  (including conversion to/from OTLP) 2.0 to 3.5 times slower in phase 1**. In **phase 2** the conversion
+  OTLP to/from Arrow is gone and the end-to-end speed is **2.3 to 4.86 times faster** by our estimates.
+* For traces, OTel Arrow is **1.7 to 2.8 times better in terms of bandwidth reduction while having an end-to-end speed
+  (including conversion to/from OTLP) 1.5 to 2.1 times slower in phase 1**. In **phase 2** the conversion
+  OTLP to/from Arrow is gone and the end-to-end speed is **3.37 to 6.16 times faster** by our estimates.
+
+The following three-column charts show the results of the benchmarks for univariate metrics, logs, and traces.
For both +protocols, the baseline is the size of the uncompressed OTLP messages. The reduction factor is the ratio between this +baseline and the compressed message size for each protocol. The compression algorithm used is ZSTD for OTLP and OTel +Arrow. + +![Summary (standard metrics)](img/0156_compression_ratio_summary_std_metrics.png) + +In the following 3-columns charts, the only difference with the previous ones is that the metrics are multivariate. The +benchmarks show that the compression ratio is much better for OTel Arrow than for OTLP. This is due to the fact that +OTel Arrow is able to leverage the columnar representation to compress the data more efficiently in a multivariate +scenario. + +![Summary (multivariate metrics)](img/0156_compression_ratio_summary_multivariate_metrics.png) + +The following stacked bar graphs compare side-by-side the distribution of time spent for each step and for each +version of the protocol. + +![Summary of the time spent](img/0156_summary_time_spent.png) +[Zoom on the chart](https://raw.githubusercontent.com/lquerel/oteps/main/text/img/0156_summary_time_spent.png) + +> In conclusion, these benchmarks demonstrate the interest of integrating a column-oriented telemetry data protocol to +> optimize bandwidth and processing speed in a batch processing context. + +### Why Apache Arrow and How to Use It? + +[Apache Arrow](https://arrow.apache.org/) is a versatile columnar format for flat and hierarchical data, well +established in the industry. Arrow is optimized for: + +* column-oriented data exchange based on an in-memory format common among implementations, regardless of the language. +The use of a serialization and deserialization mechanism is thus eliminated, allowing zero-copy. +* in-memory analytic operations using modern hardware optimizations (e.g. SIMD) +* integration with a large ecosystem (e.g. data pipelines, databases, stream processing, etc.) +* language-independent + +All these properties make Arrow a great choice for a general-purpose telemetry protocol. Efficient implementations of +Apache Arrow exist for most of the languages (Java, Go, C++, Rust, ...). Connectors with Apache Arrow buffer exist for +well-known file format (e.g. Parquet) and for well-known backend (e.g. BigQuery). +By reusing this existing infrastructure (see [Arrow ecosystem](https://arrow.apache.org/powered_by/)), we are accelerating the development and +integration of the OpenTelemetry protocol while expanding its scope of application. + +Adapting the OTLP data format to the Arrow world (see below) is only part of the problem this proposal aims to describe. +Many other design choices and trade-offs have been made, such as: + +- the organization of the data (i.e. schema and sorting) and the selection of the compression algorithm to optimize the compression ratio. +- the way to serialize the Arrow data, the metadata, the dictionaries, and the selection of the transfer mode (reply/reply vs. bi-dir stream). +- optimization of many parameters introduced in the system. + +![In-memory Apache Arrow RecordBatch](img/0156_OTEL%20-%20HowToUseArrow.png) + +### Integration Strategy and Phasing + +This OTEP enhances the existing OTel eco-system with an additional representation of telemetry data in columnar form to +better support certain scenarios (e.g. cost-effectiveness to transmit large batch, multivariate time-series, advanced +data processing, data minimization). All existing components will continue to be compatible and operational. 
+ +A two-phase integration is proposed to allow incremental benefits. + +#### Phase 1 + +This proposal is designed as a new protocol compatible with the OTLP protocol. As illustrated in the +following diagram, a new OTel Arrow receiver will be responsible for translating this new protocol to the +OTLP protocol. Similarly, a new exporter will be responsible for translating the OTLP messages into this new Arrow-based +format. + +![OTel Collector](img/0156_collector_internal_overview.png) + +This first step is intended to address the specific use cases of **traffic reduction** and native support of +**multivariate time-series**. Based on community feedback, many companies want to reduce the cost of transferring +telemetry data over the Internet. By adding a collector that acts as a point of integration and traffic conversion at +the edge of a client environment, we can take advantage of the columnar format to eliminate redundant data and optimize +the compression ratio. This is illustrated in the following diagram. + +![Traffic reduction](img/0156_traffic_reduction_use_case.png) + +> Note 1: A fallback mechanism can be used to handle the case where the new protocol is not supported by the target. +> More on this mechanism in this [section](#protocol-extension-and-fallback-mechanism). + +#### Phase 2 + +Phase 2 aims to extend the support of Apache Arrow end-to-end and more specifically inside the collector to better +support the following scenarios: cost efficiency, advanced data processing, data minimization. New receivers, processors, +and exporters supporting Apache Arrow natively will be developed. A bidirectional adaptation layer OTLP / OTel Arrow +will be developed within the collector to continue supporting the existing ecosystem. The following diagram is an +overview of a collector supporting both OTLP and an end-to-end OTel Arrow pipeline. + +![OTel Arrow Collector](img/0156_collector_phase_2.png) + +Implementing an end-to-end column-oriented pipeline will provide many benefits such as: + +- **Accelerate stream processing**, +- **Reduce CPU and Memory usage**, +- **Improve compression ratio end-to-end**, +- **Access to the Apache Arrow ecosystem** (query engine, parquet support, ...). + +## Protocol Details + +The protocol specifications are composed of two parts. The first section describes the new gRPC services supporting +column-oriented telemetry data. The second section presents the mapping between the OTLP entities and their Apache +Arrow counterpart. + +### ArrowStreamService + +OTel Arrow defines the columnar encoding of telemetry data and the gRPC-based protocol used to exchange data between +the client and the server. OTel Arrow is a bi-directional stream oriented protocol leveraging Apache Arrow for the +encoding of telemetry data. + +OTLP and OTel Arrow protocols can be used together and can use the same TCP port. To do so, in addition to the 3 +existing +services (`MetricsService`, `LogsService` and `TraceService`), we introduce the service `ArrowStreamService` +(see [this protobuf specification](#appendix-a---protocol-buffer-definitions) for more details) exposing a single API +endpoint named `ArrowStream`. This endpoint is based on a bidirectional streaming protocol. The client message is a +`BatchArrowRecords` stream encoding a batch of Apache Arrow buffers (more specifically [Arrow IPC format](#arrow-ipc-format)). +The server message side is a `BatchStatus` stream reporting asynchronously the status of each `BatchArrowRecords` +previously sent. 
In addition to this endpoint, the OTel Arrow protocol offers three additional services to facilitate
+intricate load-balancing routing rules, tailored to the specific nature of the OTLP entities - namely Metrics, Logs,
+and Traces.
+
+After establishing the underlying transport, the client starts sending telemetry data using the `ArrowStream` service.
+The client continuously sends `BatchArrowRecords` messages over the opened stream to the server and expects to
+continuously receive `BatchStatus` messages from the server, as illustrated by the following sequence diagram:
+
+![Sequence diagram](img/0156_OTEL%20-%20ProtocolSeqDiagram.png)
+
+> Multiple streams can be simultaneously opened between a client and a server to increase the maximum achievable
+throughput.
+
+If the client is shutting down (e.g. when the containing process wants to exit), the client will wait until
+all pending acknowledgements are received or until an implementation-specific timeout expires. This ensures reliable
+delivery of telemetry data.
+
+The protobuf definition of this service is:
+
+```protobuf
+// Service that can be used to send `BatchArrowRecords` between an application instrumented with OpenTelemetry and a
+// collector, or between collectors.
+service ArrowStreamService {
+  // The ArrowStream endpoint is a bi-directional stream used to send batches of `BatchArrowRecords` from the exporter
+  // to the collector. The collector returns `BatchStatus` messages to acknowledge the `BatchArrowRecords`
+  // messages received.
+  rpc ArrowStream(stream BatchArrowRecords) returns (stream BatchStatus) {}
+}
+
+// ArrowTracesService is a traces-only Arrow stream.
+service ArrowTracesService {
+  rpc ArrowTraces(stream BatchArrowRecords) returns (stream BatchStatus) {}
+}
+
+// ArrowLogsService is a logs-only Arrow stream.
+service ArrowLogsService {
+  rpc ArrowLogs(stream BatchArrowRecords) returns (stream BatchStatus) {}
+}
+
+// ArrowMetricsService is a metrics-only Arrow stream.
+service ArrowMetricsService {
+  rpc ArrowMetrics(stream BatchArrowRecords) returns (stream BatchStatus) {}
+}
+```
+
+> **Unary RPC vs Stream RPC**: We use a stream-oriented protocol **to get rid of the overhead of specifying the schema
+> and dictionaries for each batch.** State will be maintained on the receiver side to keep track of the schemas and
+> dictionaries. The [Arrow IPC format](#arrow-ipc-format) has been designed to follow this pattern and also allows the
+> dictionaries to be sent incrementally. Similarly, ZSTD dictionaries can also be transferred over the RPC stream to
+> optimize the transfer of small batches. To mitigate the usual pitfalls of a stream-oriented protocol (e.g. unbalanced
+> connections in a load balancer deployment) please see this [paragraph](#traffic-balancing-optimization) in the
+> implementation recommendations section.
+
+A `BatchArrowRecords` message is composed of 3 attributes. The protobuf definition is:
+
+```protobuf
+// A message sent by an exporter to a collector containing a batch of Arrow
+// records.
+message BatchArrowRecords {
+  // [mandatory] Batch ID. Must be unique in the context of the stream.
+  int64 batch_id = 1;
+
+  // [mandatory] A collection of payloads containing the data of the batch.
+  repeated ArrowPayload arrow_payloads = 2;
+
+  // [optional] Headers associated with this batch, encoded using hpack.
+ bytes headers = 3; +} +``` + +The `batch_id` attribute is a unique identifier for the batch inside the scope of the current stream. It is used to +uniquely identify the batch in the server message `BatchStatus` stream. See the [Batch Id generation](#batch-id-generation) +section for more information on the implementation of this identifier. + +The `arrow_payloads` attribute is a list of `ArrowPayload` messages. Each `ArrowPayload` message represents +a table of data encoded in a columnar format (e.g. metrics, logs, traces, attributes, events, links, exemplars, ...). +Several correlated IPC Arrow messages of different nature and with different schemas can be sent in the same OTelArrow batch +identified by `batch_id` and thus be processed as one unit without complex logic in the collector or any other processing systems. +More details on the `ArrowPayload` columns in the section [Mapping OTel entities to Arrow records](#mapping-otel-entities-to-arrow-records). + +The `headers` attribute is optional and used to send additional HTTP headers associated with the batch and encoded with +hpack. + +More specifically, an `ArrowPayload` protobuf message is defined as: + +```protobuf +// Enumeration of all the OTel Arrow payload types currently supported by the +// OTel Arrow protocol. +enum ArrowPayloadType { + UNKNOWN = 0; + + // A payload representing a collection of resource attributes. + RESOURCE_ATTRS = 1; + // A payload representing a collection of scope attributes. + SCOPE_ATTRS = 2; + + // A set of payloads representing a collection of metrics. + METRICS = 10; // Main metric payload + NUMBER_DATA_POINTS = 11; + SUMMARY_DATA_POINTS = 12; + HISTOGRAM_DATA_POINTS = 13; + EXP_HISTOGRAM_DATA_POINTS = 14; + NUMBER_DP_ATTRS = 15; + SUMMARY_DP_ATTRS = 16; + HISTOGRAM_DP_ATTRS = 17; + EXP_HISTOGRAM_DP_ATTRS = 18; + NUMBER_DP_EXEMPLARS = 19; + HISTOGRAM_DP_EXEMPLARS = 20; + EXP_HISTOGRAM_DP_EXEMPLARS = 21; + NUMBER_DP_EXEMPLAR_ATTRS = 22; + HISTOGRAM_DP_EXEMPLAR_ATTRS = 23; + EXP_HISTOGRAM_DP_EXEMPLAR_ATTRS = 24; + + // A set of payloads representing a collection of logs. + LOGS = 30; + LOG_ATTRS = 31; + + // A set of payloads representing a collection of traces. + SPANS = 40; + SPAN_ATTRS = 41; + SPAN_EVENTS = 42; + SPAN_LINKS = 43; + SPAN_EVENT_ATTRS = 44; + SPAN_LINK_ATTRS = 45; +} + +// Represents a batch of OTel Arrow entities. +message ArrowPayload { + // [mandatory] A canonical ID representing the schema of the Arrow Record. + // This ID is used on the consumer side to determine the IPC reader to use + // for interpreting the corresponding record. For any NEW `schema_id`, the + // consumer must: + // 1) close the current IPC reader, + // 2) create a new IPC reader in order to interpret the new schema, + // dictionaries, and corresponding data. + string schema_id = 1; + + // [mandatory] Type of the OTel Arrow payload. + ArrowPayloadType type = 2; + + // [mandatory] Serialized Arrow Record Batch + // For a description of the Arrow IPC format see: + // https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc + bytes record = 3; +} + +``` + +The `schema_id` attribute is a unique identifier representing the schema of the Arrow Record present in the +`ArrowPayload`. This id will be used receiver side to keep track of the schema and dictionaries for a +specific type of Arrow Records. See the [Schema Id generation](#schema-id-generation) section for more information +on the implementation of this identifier. 
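+
+To make this concrete, the following sketch (assuming the Apache Arrow Go library; the function and module names are
+illustrative, not part of the proposal) shows one way such a canonical id could be derived from an Arrow schema, in the
+spirit of the [Schema ID Generation](#schema-id-generation) recommendation: sort the (column name, column type) pairs
+and concatenate them.
+
+```go
+package schemaid
+
+import (
+	"sort"
+	"strings"
+
+	"github.com/apache/arrow/go/v12/arrow"
+)
+
+// SchemaID returns a deterministic identifier for an Arrow schema by sorting
+// the "name:type" pairs of its columns and joining them with a separator.
+// Field metadata and the equivalence table mentioned later are omitted here.
+func SchemaID(schema *arrow.Schema) string {
+	cols := make([]string, 0, len(schema.Fields()))
+	for _, f := range schema.Fields() {
+		cols = append(cols, f.Name+":"+f.Type.String())
+	}
+	sort.Strings(cols) // lexicographic order makes the id independent of column order
+	return strings.Join(cols, ",")
+}
+```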
+ +The `ArrowPayloadType` enum specifies the `type` of the payload. + +The `record` attribute is a binary representation of the Arrow RecordBatch. + +By storing Arrow buffers in a protobuf field of type 'bytes' we can leverage the zero-copy capability of some +protobuf implementations (e.g. C++, Java, Rust) in order to get the most out of Arrow (relying on zero-copy ser/deser +framework). + +> Note: By default, ZSTD compression is enabled at the Arrow IPC level in order to benefit from the best compression +> ratio regardless of the collector configuration. However, this compression can be disabled to enable it at the global +> gRPC level if it makes more sense for a particular configuration. + +On the server message stream, a `BatchStatus` message is a collection of `StatusMessage`. A `StatusMessage` is composed of 5 +attributes. The protobuf definition is: + +```protobuf +// A message sent by a Collector to the exporter that opened the data stream. +message BatchStatus { + repeated StatusMessage statuses = 1; +} + +message StatusMessage { + int64 batch_id = 1; + StatusCode status_code = 2; + ErrorCode error_code = 3; + string error_message = 4; + RetryInfo retry_info = 5; +} + +enum StatusCode { + OK = 0; + ERROR = 1; +} + +enum ErrorCode { + UNAVAILABLE = 0; + INVALID_ARGUMENT = 1; +} + +message RetryInfo { + int64 retry_delay = 1; +} +``` + +The `BatchStatus` message definition is relatively simple and essentially self-explanatory. + +The server may respond with either a success ('OK') or an error ('ERROR') status. Receiving an `OK` means that the +message received by the collector has been processed by the collector. If the server receives an empty `BatchEvent` +the server should respond with success. + +When an error is returned by the server it falls into 2 broad categories: retryable and not-retryable: + +* Retryable errors indicate that processing of telemetry data failed and the client should record the error and may + retry exporting the same data. This can happen when the server is temporarily unable to process the data. +* Not-retryable errors indicate that processing of telemetry data failed and the client must not retry sending the same + telemetry data. The telemetry data must be dropped. This can happen, for example, when the request contains bad data + and + cannot be deserialized or otherwise processed by the server. The client should maintain a counter of such dropped + data. + +The server should indicate retryable errors using code UNAVAILABLE and may supply additional details via `error_message` +and `retry_info`. + +To indicate not-retryable errors the server is recommended to use code INVALID_ARGUMENT and may supply additional +details +via `error_message`. + +> Note: [Appendix A](#appendix-a---protocol-buffer-definitions) contains the full protobuf definition. + +### Mapping OTel Entities to Arrow Records + +OTel entities are batched into multiple Apache Arrow `RecordBatch`. An Apache Arrow RecordBatch is a combination of two things: +a schema and a collection of Arrow Arrays. Individual Arrow Arrays or their nested children may be dictionary encoded, +in which case the Array that is dictionary encoded contains a reference to its dictionary. The Arrow IPC +implementations, in general, will recognize when one dictionary is referenced by multiple Arrays and only send it +across the wire once, allowing the receiving end to maintain the memory usage benefits of reusing a dictionary. 
In this +proposal dictionary encoded arrays are used to encode string (or binary) columns that have low cardinality. The +stream-oriented API is leveraged to amortize the schema and dictionary overheads across multiple batches. + +An Apache Arrow schema can define columns of different [types](https://arrow.apache.org/docs/python/api/datatypes.html) +and with or without nullability property. For more details on the Arrow Memory Layout see this +[document](https://arrow.apache.org/docs/format/Columnar.html). + +A set of specific and well-defined Arrow Schemas is used for each OTel entity type (metrics, logs, traces). + +The current OTel metric model can be summarized by this UML diagram: + +![OTel Metrics Model](img/0156_OTEL-Metric-Model.png) + +The leaf nodes (in green in this diagram) are where the data are actually defined as list of attributes and metrics. +Basically the relationship between the metric and resource nodes is a many-to-one relationship. Similarly, the +relationship between the metric and instrumentation scope nodes is also a many-to-one relationship. + +The approach chosen for this proposal involves dividing the OTel entities into multiple Arrow RecordBatches. Each of +these RecordBatches will possess a specific schema and will be linked to other RecordBatches through a combination of +primary and foreign keys. This methodology offers an optimal balance between compression ratio, queryability, and ease +of integration with existing Arrow-based tools. + +To maximize the benefits of this columnar representation, OTel Arrow sorts a subset of columns to enhance the locality +of identical data, thereby amplifying the compression ratio. + +Finally, to mitigate the overhead of defining schemas and dictionaries, we use the Arrow IPC format. RecordBatches sharing the +same schema are grouped in a homogeneous stream. The first message sent contains in addition to the columns data, +the schema definition and the dictionaries. The following messages will not need to define the schema anymore. +The dictionaries will only be sent again when their content change. The following diagram illustrates this process. + +> Note: The approach of using a single Arrow record per OTel entity, which employs list, struct, and union Arrow data +> types, was not adopted mainly due to the inability to sort each level of the OTel hierarchy independently. The mapping +> delineated in this document, on average, provides a superior compression ratio. + +![Arrow IPC](img/0156_OTEL%20-%20Arrow%20IPC.png) + +The next sections describe the schema of each type of `ArrowPayload`. The mapping of OTLP entities +to `ArrowPayload` has been designed to be reversible in order to be able to implement an OTel Arrow -> OTLP +receiver. + +#### Logs Arrow Mapping + +We begin with the logs payload as it is the most straightforward to map onto Arrow. The following Entity Relationship +Diagram succinctly describes the schemas of the four Arrow record utilized to represent a batch of OTLP logs. + +The `LOGS` entity contains a flattened representation of the `LogRecord`, merged with `ResourceLogs` and `ScopeLogs`. +The `id` column functions as a primary key, linking the `LogRecord` with their corresponding attributes, which are +stored in the `LOG_ATTRS` entity. The `resource_id` column serves as a key, associating each `ResourceLogs` instance +with their respective attributes stored in the `RESOURCE_ATTRS` entity. 
Likewise, the `scope_id` column acts as a key +to link each `ScopeLogs` instance with their corresponding attributes found in the `SCOPE_ATTRS` entity. + +![Logs Arrow Schema](img/0156_logs_schema.png) + +Each of these Arrow records is sorted by specific columns to optimize the compression ratio. The `id`, `resource_id`, +and `scope_id` are stored with delta encoding to minimize their size post-compression. `parent_id` is also stored with a +variant of delta encoding, known as "delta group encoding" (more details will follow). + +Attributes are represented as a triad of columns: `key`, `type`, and one of the following columns: `str`, `int`, +`double`, `bool`, `bytes`, `ser`. The `key` column is a string dictionary, the `type` column is an enum with six +variants, and the value column depends on the type of the attribute. The `ser` column is a binary column containing the +CBOR encoding of the attribute value when the attribute type is complex (e.g., map, or array). Unused value columns are +filled with null values. + +The `body` is represented with the tuple `body_type` and one of the following columns: `body_str`, `body_int`, +`body_double`, `body_bool`, `body_bytes`, `body_ser`. + +This representation offers several advantages: + +- Each record can be sorted independently to better arrange the data for compression. +- Primary keys and foreign keys can be used to connect the different Arrow records, and they easily integrate with SQL +engines. +- The avoidance of complex Arrow data types (like union, list of struct) optimizes compatibility with the Arrow +ecosystem. + +> Note: Complex attribute values could also be encoded in protobuf once the `pdata` library provides support for it. + +#### Spans Arrow Mapping + +The approach for OTLP traces is similar to that used for logs. The primary `SPANS` entity (i.e., Arrow record) +encompasses a flattened representation of `ResourceSpans`, `ScopeSpans`, and `Spans`. Beyond the standard set of +attributes (i.e., resource, scope, and span attributes), this mapping represents span events and span links as distinct +entities (`SPAN_EVENTS` and `SPAN_LINKS` respectively). These have a 1-to-many relationship with the `SPANS` entity. +Each of these entities is also associated with dedicated attribute entities (i.e. `SPAN_EVENT_ATTRS` and +`SPAN_LINK_ATTRS`). + +![Traces Arrow Schema](img/0156_traces_schema.png) + +Similarly, each of the Arrow records is sorted by specific columns to optimize the compression ratio. + +The `end_time_unix_nano` is represented as a duration (`end_time_unix_nano` - `start_time_unix_nano`) to reduce the +number of bits required to represent the timestamp. + +#### Metrics Arrow Mapping + +The mapping for metrics, while being the most complex, fundamentally follows the same logic as applied to logs and +spans. The primary 'METRICS' entity encapsulates a flattened representation of `ResourceMetrics`, `ScopeMetrics`, and +`Metrics`. All common columns among the different metric types are consolidated in this main entity (i.e., `metric_type`, +`name`, `description`, `unit`, `aggregation_temporality`, and `is_monotonic`). Furthermore, a dedicated entity is +crafted to represent data points for each type of metrics, with their columns being specific to the respective metric +type. For instance, the `SUMMARY_DATA_POINTS` entity includes columns `id`, `parent_id`, `start_time_unix_nano`, +`time_unix_nano`, `count`, `sum`, and `flags`. 
Each of these "data points" entities is linked to: + +- A set of data point attributes (following a one-to-many relationship). +- A set of data points exemplars (also adhering to a one-to-many relationship). + +Exemplar entities, in turn, are connected to their dedicated set of attributes. + +Technically speaking, the `quantile` entity isn't encoded as an independent entity but rather as a list of struct within +the `SUMMARY_DATA_POINTS entity`. + +![Metrics Arrow Schema](img/0156_metrics_schema.png) + +Gauge and Sum are identified by the `metric_type` column in the `METRICS` entity and they share the same Arrow record +for the data points, i.e. `NUMBER_DATA_POINTS`. + +`span_id` and `trace_id` are represented as fixed size binary dictionaries by default but can evolve to non-dictionary +form when their cardinality exceeds a certain threshold (usually 2^16). + +As usual, each of these Arrow records is sorted by specific columns to optimize the compression ratio. With this mapping +batch of metrics containing a large number of data points sharing the same attributes and timestamp will be highly +compressible (multivariate time-series scenario). + +> Note: every OTLP timestamps are represented as Arrow timestamps as Epoch timestamps with nanosecond precision. This representation will +> simplify the integration with the rest of the Arrow ecosystem (numerous time/date functions are supported in +> DataFusion for example). +> Note: aggregation_temporality is represented as an Arrow dictionary with a dictionary index of type int8. This OTLP +> enum has currently 3 variants, and we don't expect to have in the future more than 2^8 variants. + +## Implementation Recommendations + +### Protocol Extension and Fallback Mechanism + +The support of this new protocol can only be progressive, so implementers are advised to follow the following +implementation recommendations in phase 1: + +* `OTelArrow Receiver`: Listen on a single TCP port for both OTLP and OTel Arrow protocols. The goal is to make the + support of this protocol extension transparent and automatic. This can be achieved by adding the `ArrowStreamService` + to the same gRPC listener. A configuration parameter will be added to the OTelArrow receiver to disable this default + behavior to support specific uses. +* `OTelArrow Exporter`: By default the OTelArrow exporter should initiate a connection to the `ArrowStreamService` + endpoint of the target receiver. If this connection fails because the `ArrowStreamService` is not implemented by the + target, the exporter must automatically fall back on the behavior of the OTLP protocol. A configuration parameter + could be added to disable this default behavior. + +The implementation of these two rules should allow a seamless and +adaptive integration of OTel Arrow into the current ecosystem +generally. + +For the prototype specifically, which is a fork of the OpenTelemetry +collector codebase, we have derived the OTelArrow exporter and +receiver as set of changes directly to the `receiver/otelarrowreceiver` and +`exporter/otelarrowexporter` components, with new `internal/arrow` packages +in both. With every collector release we merge the OTel Arrow changes +with the mainline components to maintain this promise of +compatibility. + +OTel Arrow supports conveying the gRPC metadata (i.e., http2 headers) using a dedicated `bytes` field. Metadata is +encoded using [hpack](https://datatracker.ietf.org/doc/rfc7541/) like a typical unary gRPC request. 
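+
+As an illustration, a minimal sketch of this header handling in Go (using the `golang.org/x/net/http2/hpack` package;
+the helper names and the 4096-byte dynamic table size are assumptions, not part of the specification) could look like:
+
+```go
+package headers
+
+import (
+	"bytes"
+
+	"golang.org/x/net/http2/hpack"
+)
+
+// encodeHeaders hpack-encodes metadata entries into the bytes carried by the
+// `headers` field of a BatchArrowRecords message.
+func encodeHeaders(md map[string]string) ([]byte, error) {
+	var buf bytes.Buffer
+	enc := hpack.NewEncoder(&buf)
+	for k, v := range md {
+		if err := enc.WriteField(hpack.HeaderField{Name: k, Value: v}); err != nil {
+			return nil, err
+		}
+	}
+	return buf.Bytes(), nil
+}
+
+// decodeHeaders reverses the encoding, invoking emit for each decoded field.
+func decodeHeaders(wire []byte, emit func(name, value string)) error {
+	dec := hpack.NewDecoder(4096, func(f hpack.HeaderField) { emit(f.Name, f.Value) })
+	_, err := dec.Write(wire)
+	return err
+}
+```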
+
+Specifically:
+
+#### OTelArrow/gRPC Receiver
+
+When Arrow is enabled, the OTelArrow receiver listens for both the standard unary OTLP gRPC services and the OTel Arrow
+stream services. Each stream uses an instance of the OTel-Arrow-Adapter's
+[Consumer](https://pkg.go.dev/github.com/f5/otel-arrow-adapter/pkg/otel/arrow_record#Consumer) and sets
+`client.Metadata` in the Context.
+
+#### OTelArrow/gRPC Exporter
+
+When Arrow is enabled, the OTelArrow exporter starts a fixed number of streams and repeatedly sends one `plog.Logs`,
+`ptrace.Traces`, or `pmetric.Metrics` item per stream request. The `exporterhelper` callback first tries to get an
+available stream, blocking when none are available (or until the connection is downgraded), and then falls back to the
+standard unary gRPC path. The stream-sending mechanism internally handles retries when failures are caused by streams
+restarting, while honoring the caller's context deadline, to avoid delays introduced by allowing these retries to go
+through the `exporterhelper` mechanism.
+
+Each stream uses an instance of the OTel-Arrow-Adapter's
+[Producer](https://pkg.go.dev/github.com/f5/otel-arrow-adapter/pkg/otel/arrow_record#Producer).
+
+When a stream fails specifically because the server does not recognize the Arrow service, it will not restart. When all
+streams have failed in this manner, the connection downgrades by closing a channel, at which point the exporter behaves
+exactly as the OTLP exporter.
+
+The mechanism as described is vulnerable to partial failure scenarios. When some of the streams are succeeding but some
+have failed with Arrow unsupported, the collector performance will be degraded because callers are blocked waiting for
+available streams. The exact signal used to detect that Arrow is unsupported, and the downgrade mechanism itself, are
+seen as areas for future development. [See the prototype's test for whether to downgrade.](https://github.com/open-telemetry/experimental-arrow-collector/blob/30e0ffb230d3d2f1ad9645ec54a90bbb7b9878c2/exporter/otlpexporter/internal/arrow/stream.go#L152)
+
+### Batch ID Generation
+
+The `batch_id` attribute is used by the message delivery mechanism. Each `BatchArrowRecords` issued must be associated
+with a unique `batch_id`. Uniqueness must be ensured within the scope of the stream opened by the call to the
+`ArrowStreamService`. This `batch_id` will be used in the `BatchStatus` object to acknowledge receipt and processing of
+the corresponding batch. A numeric counter is used to implement this `batch_id`, the goal being to use the most concise
+id possible.
+
+### Schema ID Generation
+
+Within the collector, operations such as batching, filtering, and exporting require grouping the Arrow Records that
+share a compatible schema. A synthetic identifier (or `schema_id`) must be computed for each `ArrowPayload` to perform
+this grouping.
+
+We recommend calculating the schema id in the following way:
+
+* for each Arrow schema, create a list of (name, type, metadata) triples, one per column,
+* sort these triples in lexicographic order,
+* concatenate the sorted triples with a separator and use the resulting string as the `schema_id` (or a shorter version
+  of it via an equivalence table).
+
+### Traffic Balancing Optimization
+
+To mitigate the usual pitfalls of a stream-oriented protocol, protocol implementers are advised to:
+
+* client side: create several streams in parallel (e.g. create a new stream every 10 event types),
+* server side: close streams that have been open for a long time (e.g. close each stream after 1 hour).
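+
+A hypothetical sketch of these knobs as a Go configuration struct (the names are illustrative only and do not
+correspond to an existing collector component) could look like:
+
+```go
+package arrowstream
+
+import "time"
+
+// StreamSettings groups the client- and server-side stream management knobs
+// suggested above.
+type StreamSettings struct {
+	// NumStreams is the number of parallel streams a client opens to spread
+	// load across backend instances.
+	NumStreams int `mapstructure:"num_streams"`
+
+	// MaxStreamLifetime is how long a server keeps a stream open before
+	// closing it, forcing clients to reconnect and re-balance.
+	MaxStreamLifetime time.Duration `mapstructure:"max_stream_lifetime"`
+}
+```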
+ +These parameters must be exposed in a configuration file and be tuned according to the application. + +### Throttling + +OTel Arrow allows backpressure signaling. If the server is unable to keep up with the pace of data it receives from the +client then it should signal that fact to the client. The client must then throttle itself to avoid overwhelming the +server. + +To signal backpressure when using OTel Arrow, the server should return an error with code UNAVAILABLE and may supply +additional details via the `retry_info` attribute. + +When the client receives this signal it should follow the recommendations outlined in documentation for RetryInfo: + +``` +// Describes when the clients can retry a failed request. Clients could ignore +// the recommendation here or retry when this information is missing from error +// responses. +// +// It's always recommended that clients should use exponential backoff when +// retrying. +// +// Clients should wait until `retry_delay` amount of time has passed since +// receiving the error response before retrying. If retrying requests also +// fail, clients should use an exponential backoff scheme to gradually increase +// the delay between retries based on `retry_delay`, until either a maximum +// number of retires have been reached or a maximum retry delay cap has been +// reached. +``` + +The value of retry_delay is determined by the server and is implementation dependent. The server should choose a +retry_delay value that is big enough to give the server time to recover, yet is not too big to cause the client to drop +data while it is throttled. + +Throttling is important for implementing reliable multi-hop telemetry data delivery all the way from the source to the +destination via intermediate nodes, each having different processing capacity and thus requiring different data transfer +rates. + +### Delivery Guarantee + +The OTel Arrow protocol adheres to the OpenTelemetry Protocol (OTLP) specification, particularly in terms of delivery guarantee. +The collector ensures that messages received will only receive a positive acknowledgement if they have been properly +processed by the various stages of the collector. + +## Risks and Mitigations + +An authentication mechanism is highly recommended to protect against malicious traffic. Without authentication, an OTel +Arrow receiver can be attacked in multiple ways ranging from DoS, traffic amplification to sending sensitive data. This +specification reuses the authentication mechanisms already in place in the collector. + +## Trade-offs and Mitigations + +### Duplicate Data + +In edge cases (e.g. on reconnections, network interruptions, etc) the client has no way of knowing if recently sent data +was delivered if no acknowledgement was received yet. The client will typically choose to re-send such data to guarantee +delivery, which may result in duplicate data on the server side. This is a deliberate choice and is considered to be the +right tradeoff for telemetry data. This can be mitigated by using an idempotent insertion mechanism at the data backend +level. + +### Incompatible Backends + +Backends that don't support natively multivariate time-series can still automatically transform these events in multiple +univariate time-series and operate as usual. + +### Small Devices/Small Telemetry Data Stream + +A columnar-oriented protocol is not necessarily desirable for all scenarios (e.g. devices that do not have the resources +to accumulate data in batches). 
This protocol extension makes it possible to respond better to these different scenarios by letting
+the client select between the OTLP and OTel Arrow protocols depending on the nature of its telemetry traffic.
+
+## Future Versions and Interoperability
+
+As far as protocol evolution and interoperability mechanisms are concerned, this extension follows the
+[recommendations](https://github.com/open-telemetry/oteps/blob/main/text/0035-opentelemetry-protocol.md#future-versions-and-interoperability)
+outlined in the OTLP spec.
+
+## Prior Art and Alternatives
+
+We considered using a purely protobuf-based columnar encoding for this protocol extension. The realization of a
+prototype and its comparison with [Apache Arrow](https://arrow.apache.org/) dissuaded us from continuing in this direction.
+
+We also considered using [VNG](https://zed.brimdata.io/docs/formats/vng) from the Zed project as a columnar encoding
+technology. Although this format has interesting properties, the project has not yet reached a level of
+maturity comparable to Apache Arrow.
+
+Finally, we also considered the use of Parquet records encapsulated in protobuf messages (similar to the approach described
+in this document). Although a Parquet representation offers some additional encoding modes that can improve the compression
+ratio, Parquet is not designed as an in-memory format optimized for online data processing. Apache Arrow is optimized for
+this type of scenario and offers the best trade-off of compression ratio, processing speed, and serialization/deserialization speed.
+
+## Monitoring OTel-Arrow performance
+
+[OpenTelemetry Collector users would benefit from standard ways to monitor the number of network bytes sent and received](https://github.com/open-telemetry/opentelemetry-collector/issues/6638). [We have proposed the use of dedicated `obsreport` metrics in the Collector](https://github.com/open-telemetry/opentelemetry-collector/pull/6712).
+
+In connection with these proposals, [we also propose corresponding improvements in the OpenTelemetry
+Collector-Contrib's `testbed` framework](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/16835),
+in order to include OTel-Arrow in standard regression testing of the Collector.
+
+## Open Questions
+
+### Extending into other parts of the Arrow ecosystem
+
+SQL support for telemetry data processing remains an open question in the current Go collector. The main Arrow query
+engine, [Datafusion](https://github.com/apache/arrow-datafusion), is implemented in Rust. Several solutions can be
+considered: 1) create a Go wrapper on top of Datafusion, 2) implement a Rust collector dedicated to the end-to-end
+support of OTel Arrow, 3) implement a SQL/Arrow engine in Go (a big project). A proof of concept using Datafusion has
+been implemented in Rust and has shown very good results.
+
+Because the Arrow IPC mechanism and data format are intended for zero-copy use, we believe it is possible
+to use Arrow libraries written in other languages, for example within the Go-based OpenTelemetry Collector.
+
+### Choosing row-oriented transport when it is more efficient
+
+The columnar representation is more efficient for transporting large homogeneous batches. Support for a mixed approach
+that automatically combines column-oriented and row-oriented batches would cover all scenarios. The development of
+a strategy to automatically select the best data representation mode is an open question.
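+
+One possible shape of such a strategy, shown here purely as an illustration (the threshold value and names are
+hypothetical, not part of this proposal), is a simple size-based switch:
+
+```go
+package encodingselect
+
+// Encoding identifies the wire representation chosen for a batch.
+type Encoding int
+
+const (
+	EncodingOTLP      Encoding = iota // row-oriented protobuf
+	EncodingOTelArrow                 // column-oriented Arrow
+)
+
+// chooseEncoding falls back to row-oriented OTLP for small batches, where the
+// schema and dictionary overhead of a columnar payload is not amortized, and
+// switches to OTel Arrow above a configurable threshold. A real strategy could
+// also consider schema stability or whether a stream is already warmed up.
+func chooseEncoding(batchLen, columnarThreshold int) Encoding {
+	if batchLen < columnarThreshold {
+		return EncodingOTLP
+	}
+	return EncodingOTelArrow
+}
+```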
+ +### Unary gRPC OTel Arrow and HTTP OTel Arrow + +The design currently calls for the use of gRPC streams to benefit from OTel Arrow transport. We believe that some of +this benefit can be had even for unary gRPC and HTTP requests with large request batches to amortize sending of +dictionary and schema information. This remains an area for study. + +## Future possibilities + +### Further-integrated compression techniques + +ZSTD offers a training mode, which can be used to tune the algorithm for a selected type of data. The result of this +training is a dictionary that can be used to compress the data. Using this [dictionary](http://facebook.github.io/zstd/#small-data) +can dramatically improve the compression rate for small batches. This future development will build on both the gRPC +stream approach used in this proposal and the ability to send a ZSTD dictionary over the OTel Arrow stateful protocol, +allowing us to train the ZSTD algorithm on the first batches and then update the configuration of the ZSTD +encoder/decoder with an optimized dictionary. + +More advanced lightweight compression algorithms on a per column basis could be integrated to the OTel Arrow +protocol (e.g. [delta delta encoding](https://www.vldb.org/pvldb/vol8/p1816-teller.pdf) for numerical columns) + +## Appendix A - Protocol Buffer Definitions + +Protobuf specification for an Arrow-based OpenTelemetry event. + +```protobuf +// Copyright The OpenTelemetry Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// This protocol specifies the services and messages utilized by the OTel Arrow +// Protocol. OTelArrow represents OTLP entities in a columnar manner using +// Apache Arrow. The primary objective of this new protocol is to optimize +// transport efficiency in terms of compression (phase 1), memory, and CPU usage +// (phase 2). +// +// Note: This protocol is still experimental and subject to change. + +syntax = "proto3"; + +package opentelemetry.proto.experimental.arrow.v1; + +option java_multiple_files = true; +option java_package = "io.opentelemetry.proto.experimental.arrow.v1"; +option java_outer_classname = "ArrowServiceProto"; + +// Note the following is temporary +option go_package = "github.com/f5/otel-arrow-adapter/api/experimental/arrow/v1"; + +// This service can be utilized to transmit `BatchArrowRecords` either from an +// application instrumented with OpenTelemetry to a collector, or between +// multiple collectors. +// +// Note: If your deployment requires to load-balance the telemetry data based on +// the nature of the telemetry data (e.g. traces, metrics, logs), then you should +// use the `ArrowTracesService`, `ArrowMetricsService`, and `ArrowLogsService`. +service ArrowStreamService { + // The ArrowStream endpoint is a bi-directional stream used to send batch of + // `BatchArrowRecords` from the exporter to the collector. The collector + // returns `BatchStatus` messages to acknowledge the `BatchArrowRecords` + // messages received. 
+ rpc ArrowStream(stream BatchArrowRecords) returns (stream BatchStatus) {} +} + +// ArrowTracesService is a traces-only Arrow stream. +service ArrowTracesService { + rpc ArrowTraces(stream BatchArrowRecords) returns (stream BatchStatus) {} +} + +// ArrowTracesService is a logs-only Arrow stream. +service ArrowLogsService { + rpc ArrowLogs(stream BatchArrowRecords) returns (stream BatchStatus) {} +} + +// ArrowTracesService is a metrics-only Arrow stream. +service ArrowMetricsService { + rpc ArrowMetrics(stream BatchArrowRecords) returns (stream BatchStatus) {} +} + +// A message sent by an exporter to a collector containing a batch of Arrow +// records. +message BatchArrowRecords { + // [mandatory] Batch ID. Must be unique in the context of the stream. + int64 batch_id = 1; + + // [mandatory] A collection of payloads containing the data of the batch. + repeated ArrowPayload arrow_payloads = 2; + + // [optional] Headers associated with this batch, encoded using hpack. + bytes headers = 3; +} + +// Enumeration of all the OTel Arrow payload types currently supported by the +// OTel Arrow protocol. +enum ArrowPayloadType { + UNKNOWN = 0; + + // A payload representing a collection of resource attributes. + RESOURCE_ATTRS = 1; + // A payload representing a collection of scope attributes. + SCOPE_ATTRS = 2; + + // A set of payloads representing a collection of metrics. + METRICS = 10; // Main metric payload + NUMBER_DATA_POINTS = 11; + SUMMARY_DATA_POINTS = 12; + HISTOGRAM_DATA_POINTS = 13; + EXP_HISTOGRAM_DATA_POINTS = 14; + NUMBER_DP_ATTRS = 15; + SUMMARY_DP_ATTRS = 16; + HISTOGRAM_DP_ATTRS = 17; + EXP_HISTOGRAM_DP_ATTRS = 18; + NUMBER_DP_EXEMPLARS = 19; + HISTOGRAM_DP_EXEMPLARS = 20; + EXP_HISTOGRAM_DP_EXEMPLARS = 21; + NUMBER_DP_EXEMPLAR_ATTRS = 22; + HISTOGRAM_DP_EXEMPLAR_ATTRS = 23; + EXP_HISTOGRAM_DP_EXEMPLAR_ATTRS = 24; + + // A set of payloads representing a collection of logs. + LOGS = 30; + LOG_ATTRS = 31; + + // A set of payloads representing a collection of traces. + SPANS = 40; + SPAN_ATTRS = 41; + SPAN_EVENTS = 42; + SPAN_LINKS = 43; + SPAN_EVENT_ATTRS = 44; + SPAN_LINK_ATTRS = 45; +} + +// Represents a batch of OTel Arrow entities. +message ArrowPayload { + // [mandatory] A canonical ID representing the schema of the Arrow Record. + // This ID is used on the consumer side to determine the IPC reader to use + // for interpreting the corresponding record. For any NEW `schema_id`, the + // consumer must: + // 1) close the current IPC reader, + // 2) create a new IPC reader in order to interpret the new schema, + // dictionaries, and corresponding data. + string schema_id = 1; + + // [mandatory] Type of the OTel Arrow payload. + ArrowPayloadType type = 2; + + // [mandatory] Serialized Arrow Record Batch + // For a description of the Arrow IPC format see: + // https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc + bytes record = 3; +} + +// A message sent by a Collector to the exporter that opened the data stream. 
+message BatchStatus { + repeated StatusMessage statuses = 1; +} + +message StatusMessage { + int_64 batch_id = 1; + StatusCode status_code = 2; + ErrorCode error_code = 3; + string error_message = 4; + RetryInfo retry_info = 5; +} + +enum StatusCode { + OK = 0; + ERROR = 1; +} + +enum ErrorCode { + UNAVAILABLE = 0; + INVALID_ARGUMENT = 1; +} + +message RetryInfo { + int64 retry_delay = 1; +} +``` + +## Glossary + +### Arrow Dictionary + +Apache Arrow allows to encode a text or binary column as a dictionary (numeric index -> text/binary buffer). When this +encoding is used, the column contains only the index values and the dictionary is attached to the schema for reference. +This type of encoding significantly reduce the space occupied by text or binary columns with low cardinalities +(usually less than 2^16 distinct values). See Apache Arrow [documentation](https://arrow.apache.org/docs/python/data.html#dictionary-arrays) +for more details. + +### Arrow IPC Format + +The [Arrow IPC format](https://arrow.apache.org/docs/python/ipc.html) is used to efficiently send homogenous record +batches in stream mode. The schema is only sent at the beginning of the stream. Dictionaries are only sent when they are +updated. + +### Multivariate Time-series + +A multivariate time series has more than one time-dependent variable. Each variable depends not only on +its past values but also has some dependency on other variables. A 3 axis accelerometer reporting 3 metrics +simultaneously; a mouse move that simultaneously reports the values of x and y, a meteorological weather station +reporting temperature, cloud cover, dew point, humidity and wind speed; an http transaction characterized by many +interrelated metrics sharing the same attributes are all common examples of multivariate time-series. + +## Acknowledgements + +Special thanks to [Joshua MacDonald](https://github.com/jmacd) for his contribution in integrating the reference +implementation into the OTel collector, to [Tigran Najaryan](https://github.com/tigrannajaryan) for helping to define +the integration strategy with the OTLP protocol, and to [Sébastien Soudan](https://github.com/ssoudan) for the +numerous exchanges and advice on the representation of the data charts. + +Thanks to all reviewers who participated in the review and validation of this [PR](https://github.com/open-telemetry/oteps/pull/171). + +Finally, many thanks to [F5](https://www.f5.com/) for supporting this effort. diff --git a/oteps/0178-mapping-to-otlp-anyvalue.md b/oteps/0178-mapping-to-otlp-anyvalue.md new file mode 100644 index 00000000000..705b2fa6008 --- /dev/null +++ b/oteps/0178-mapping-to-otlp-anyvalue.md @@ -0,0 +1,227 @@ +# Mapping Arbitrary Data to OTLP AnyValue + +This document defines how to map (convert) arbitrary data (e.g. in-memory +objects) to OTLP's AnyValue. + +## Motivation + +The mapping is necessary to correctly implement Logging Library SDKs such that +the converted values are unambiguous and consistent across languages and +implementations. + +## Explanation + +[AnyValue](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L27) +is capable of representing primitive and structured data of certain types. + +Implementations that have a source data in any form, such as in-memory objects +or data coming from other formats that needs to be converted to AnyValue SHOULD +follow the rules described below. 
+ +### Primitive Values + +#### Integer Values + +Integer values which are within the range of 64 bit signed numbers +[-2^63..2^63-1] SHOULD be converted to AnyValue's +[int_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L33) +field. + +Integer values which are outside the range of 64 bit signed numbers SHOULD be +converted to AnyValue's +[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31) +field using decimal representation. + +#### Enumerations + +Values, which belong to a limited enumerated set (e.g. a Java +[enum](https://docs.oracle.com/javase/tutorial/java/javaOO/enum.html)) SHOULD be +converted to AnyValue's +[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31) +field with the value of the string set to the symbolic name of the enumeration. + +If the symbolic name of the enumeration is not possible to obtain the +implementation SHOULD map enumeration's value to AnyValue's +[int_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L33) +field set equal to enum's ordinal number when such ordinal number is naturally +obtainable. + +If the ordinal value is also not possible to obtain the enumeration SHOULD be +converted to AnyValue's +[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) +field in any manner that the implementation deems reasonable. + +#### Floating Point Values + +Floating point values which are within the range and precision of IEEE 754 +64-bit floating point numbers (including IEEE 32-bit floating point values) +SHOULD be converted to AnyValue's +[double_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L34) +field. + +Floating point values which are outside the range or precision of IEEE 754 +64-bit floating point numbers (e.g. IEEE 128-bit floating bit values) SHOULD be +converted to AnyValue's +[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31) +field using decimal floating point representation. + +#### String Values + +String values which are valid Unicode sequences SHOULD be converted to +AnyValue's +[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31) +field. + +String values which are not valid Unicode sequences SHOULD be converted to +AnyValue's +[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) +with the bytes representing the string in the original order and format of the +source string. + +#### Bytes Sequences + +Byte sequences (e.g. Go's `[]byte` slice or raw byte content of a file) SHOULD +be converted to AnyValue's +[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) +field. 
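+
+As a non-normative illustration of the primitive rules above, a sketch using the Go bindings generated from
+opentelemetry-proto (the helper names are hypothetical) might look like:
+
+```go
+package anyvaluemap
+
+import (
+	"math/big"
+	"unicode/utf8"
+
+	commonpb "go.opentelemetry.io/proto/otlp/common/v1"
+)
+
+// fromBigInt maps an arbitrary-precision integer: int_value when it fits in
+// the 64-bit signed range, otherwise a decimal string_value.
+func fromBigInt(v *big.Int) *commonpb.AnyValue {
+	if v.IsInt64() {
+		return &commonpb.AnyValue{Value: &commonpb.AnyValue_IntValue{IntValue: v.Int64()}}
+	}
+	return &commonpb.AnyValue{Value: &commonpb.AnyValue_StringValue{StringValue: v.String()}}
+}
+
+// fromString maps valid Unicode to string_value and anything else to
+// bytes_value, preserving the original byte order.
+func fromString(s string) *commonpb.AnyValue {
+	if utf8.ValidString(s) {
+		return &commonpb.AnyValue{Value: &commonpb.AnyValue_StringValue{StringValue: s}}
+	}
+	return &commonpb.AnyValue{Value: &commonpb.AnyValue_BytesValue{BytesValue: []byte(s)}}
+}
+
+// fromBytes maps a byte sequence directly to bytes_value.
+func fromBytes(b []byte) *commonpb.AnyValue {
+	return &commonpb.AnyValue{Value: &commonpb.AnyValue_BytesValue{BytesValue: b}}
+}
+```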
+ +### Composite Values + +#### Array Values + +Values that represent ordered sequences of other values (such as +[arrays](https://docs.oracle.com/javase/specs/jls/se7/html/jls-10.html), +[vectors](https://en.cppreference.com/w/cpp/container/vector), ordered +[lists](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists), +[slices](https://golang.org/ref/spec#Slice_types)) SHOULD be converted to +AnyValue's +[array_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L35) +field. String Values and Byte Sequences are an exception from this rule (see +above). + +#### Associative Arrays With Unique Keys + +Values that represent associative arrays with unique keys (also often known +as maps, dictionaries or key-value stores) SHOULD be converted to AnyValue's +[kvlist_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L36) +field. + +If the keys of the source array are not strings they MUST be converted to +strings by any means available, often via a toString() or stringify functions +available in programming languages. The conversion function MUST be chosen in a +way that ensures that the resulting string keys are unique in the target array. + +The value part of each element of the source array SHOULD be converted to +AnyValue recursively. + +For example a JSON object `{"a": 123, "b": "def"}` SHOULD be converted to + +``` +AnyValue{ + kvlist_value:KeyValueList{ + values:[ + KeyValue{key:"a",value:AnyValue{int_value:123}}, + KeyValue{key:"b",value:AnyValue{string_value:"def"}}, + ] + } +} +``` + +#### Associative Arrays With Non-Unique Keys + +Values that represent an associative arrays with non-unique keys where multiple values may be associated with the same key (also sometimes known +as multimaps, multidicts) SHOULD be converted to AnyValue's +[kvlist_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L36) +field. + +The resulting +[kvlist_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L36) +field MUST list each key only once and the value of each element of +[kvlist_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L36) +field MUST be an array represented using AnyValue's +[array_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L35) +field, each element of the +[array_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L35) +representing one value of source array associated with the given key. 
For example, an associative array shown in the following table:

|Key|Value|
|---|---|
|"abc"|123|
|"def"|"foo"|
|"def"|"bar"|

SHOULD be converted to:

```
AnyValue{
  kvlist_value:KeyValueList{
    values:[
      KeyValue{
        key:"abc",
        value:AnyValue{array_value:ArrayValue{values:[
          AnyValue{int_value:123}
        ]}}
      },
      KeyValue{
        key:"def",
        value:AnyValue{array_value:ArrayValue{values:[
          AnyValue{string_value:"foo"},
          AnyValue{string_value:"bar"}
        ]}}
      },
    ]
  }
}
```

#### Sets

Unordered collections of non-duplicate values (such as
[Java Sets](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Set.html),
[C++ sets](https://en.cppreference.com/w/cpp/container/set),
[Python Sets](https://docs.python.org/3/tutorial/datastructures.html#sets)) SHOULD be
converted to AnyValue's
[array_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L35)
field, where each element of the set becomes an element of the array.

### Other Values

Any other values not listed above SHOULD be converted to AnyValue's
[string_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L31)
field if the source data can be serialized to a string (can be stringified)
using toString() or stringify functions available in programming languages.

If the source data cannot be serialized to a string then the value SHOULD be
converted to AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
field by serializing it into a byte sequence by any means available.

If the source data can be serialized neither to a string nor to a byte
sequence then it SHOULD be converted to an empty AnyValue.

### Empty Values

If the source data has no type associated with it and is empty, null, nil or
otherwise indicates absence of data, it SHOULD be converted to an
[empty](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L29)
AnyValue, where all the fields are unset.

Empty values which have a type associated with them (e.g. an empty associative
array) SHOULD be converted using the corresponding rules defined for the types
above.

## Alternatives

Some of the mappings listed in this document can be designed differently. For
example, multimaps may be represented as arrays of maps, instead of maps with
array values. Such alternative representations were considered but were not
found to be advantageous to the solutions chosen in this document.

## Future possibilities

If AnyValue is extended to support more data types, some rules in this document
may be revised in order to result in a more natural mapping. If this is done,
backwards compatibility should be carefully considered to avoid breaking
applications that rely on the existing mapping rules.

diff --git a/oteps/0182-otlp-remote-parent.md b/oteps/0182-otlp-remote-parent.md new file mode 100644 index 00000000000..5bcfa3b1e50 --- /dev/null +++ b/oteps/0182-otlp-remote-parent.md @@ -0,0 +1,124 @@

# Export SpanContext.IsRemote in OTLP

Update OTLP to indicate whether a span's parent is remote.
+ +## Motivation + +It is sometimes useful to post-process or visualise only entry-point spans: spans which either have no parent (trace roots), or which have a remote parent. +For example, the Elastic APM solution highlights entry-point spans (Elastic APM refers to these as "transactions") and surfaces these as top-level operations +in its user interface. + +The goal is to identify the spans which represent a request that is entering a service, or originating within a service, without having to first assemble the +complete distributed trace as a DAG (Directed Acyclic Graph). It is trivially possible to identify trace roots, but it is not possible to identify spans with +remote parents. + +Here is a contrived example distributed trace, with a border added to the entry-point spans: + +```mermaid +graph TD + subgraph comments_service + POST_comments(POST /comment) + POST_comments --> comments_send(comments send) + end + + subgraph auth_service + POST_comments --> POST_auth(POST /auth) + POST_auth --> LDAP + end + + subgraph user_details_service + POST_comments --> GET_user_details(GET /user_details) + GET_user_details --> SELECT_users(SELECT FROM users) + end + + subgraph comments_inserter + comments_send --> comments_receive(comments receive) + comments_receive --> comments_process(comments process) + comments_process --> INSERT_comments(INSERT INTO comments) + end + + style POST_comments stroke-width:4 + style POST_auth stroke-width:4 + style GET_user_details stroke-width:4 + style comments_receive stroke-width:4 +``` + +## Explanation + +The OTLP encoding for spans has a boolean `parent_span_is_remote` field for identifying whether a span's parent is remote or not. +All OpenTelemetry SDKs populate this field, and backends may use it to identify a span as being an entry-point span. +A span can be considered an entry-point span if it has no parent (`parent_span_id` is empty), or if `parent_span_is_remote` is true. + +## Internal details + +The first part would be to update the trace protobuf, adding a `boolean parent_span_is_remote` field to the +[`Span` message](https://github.com/open-telemetry/opentelemetry-proto/blob/b43e9b18b76abf3ee040164b55b9c355217151f3/opentelemetry/proto/trace/v1/trace.proto#L84). + +[`SpanContext.IsRemote`](../specification/trace/api.md#isremote) identifies whether span context has been propagated from a remote parent. +The OTLP exporter in each SDK would need to be updated to record this in the new `parent_span_is_remote` field. + +For backwards compatibility with older OTLP versions, the protobuf field should be `nullable` (`true`, `false`, or unspecified) +and the opentelemetry-collector protogen code should provide an API that enables backend exporters to identify whether the field is set. + +```go +package pdata + +// ParentSpanIsRemote indicates whether ms's parent span is remote, if known. +// If the parent span remoteness property is known then the "ok" result will be true, +// and false otherwise. +func (ms Span) ParentSpanIsRemote() (remote bool, ok bool) +``` + +## Trade-offs and mitigations + +None identified. + +## Prior art and alternatives + +### Alternative 1: include entry-point span ID in other spans + +As an alternative to identifying whether the parent span is remote, we could instead encode and propagate the ID of the entry-point span in all non entry-point spans. +Thus we can identify entry-point spans by lack of this field. 
The entry-point span ID would be captured when starting a span with a remote parent, and propagated through `SpanContext`. We would introduce a new `entry_span_id` field to
the `Span` protobuf message definition, and set it in OTLP exporters.

This was originally [proposed in OpenCensus](https://github.com/census-instrumentation/opencensus-specs/issues/229) with no resolution.

The drawbacks of this alternative are:

- `SpanContext` would need to be extended to include the entry-point span ID; SDKs would need to be updated to capture and propagate it
- The additional protobuf field would be an additional 8 bytes, vs 1 byte for the boolean field

The main benefit of this approach is that it additionally enables backends to group spans by their process subgraph.

### Alternative 2: introduce a semantic convention attribute to identify entry-point spans

As an alternative to adding a new field to spans, a new semantic convention attribute could be added to only entry-point spans.

This approach would avoid increasing the memory footprint of all spans, but would have a greater memory footprint for entry-point spans.
The benefit of this approach would therefore depend on the ratio of entry-point to internal spans, and may even be more expensive.

### Alternative 3: extend SpanKind values

Another alternative is to extend the SpanKind values to unambiguously define when a CONSUMER span has a remote parent or a local parent (e.g. with the message polling use case).

For example, a new SpanKind (e.g. `AMBIENT_CONSUMER`) could be introduced that has a clear `no` on the `Remote-Incoming` property of the SpanKind, while `REMOTE_CONSUMER` would have a clear `yes` on that property. The downside of this approach is that it is a breaking change to the semantics of `CONSUMER` spans.

## Open questions

### Relation between `parent_span_is_remote` and `SpanKind`

The specification for `SpanKind` describes the following:

```
The first property described by SpanKind reflects whether the Span is a "logical" remote child or parent ...
```

However, the specification stays ambiguous for the `CONSUMER` span kind with respect to the property of the "logical" remote parent.
Nevertheless, the proposed field `parent_span_is_remote` has some overlap with that `SpanKind` property.
The specification would require some clarification on `SpanKind` and its relation to `parent_span_is_remote`.

## Future possibilities

No other future changes identified.

diff --git a/oteps/0199-support-elastic-common-schema-in-opentelemetry.md b/oteps/0199-support-elastic-common-schema-in-opentelemetry.md new file mode 100644 index 00000000000..a48cfcae655 --- /dev/null +++ b/oteps/0199-support-elastic-common-schema-in-opentelemetry.md @@ -0,0 +1,294 @@

# Merge Elastic Common Schema with OpenTelemetry Semantic Conventions

## Introduction

This proposal is to merge the Elastic Common Schema (ECS) with the OpenTelemetry Semantic Conventions (SemConv) and provide full interoperability in OpenTelemetry component implementations. We propose to implement this by aligning the OpenTelemetry Semantic Conventions with [ECS FieldSets](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html#ecs-fieldsets) and vice versa where feasible. The long-term goal is to achieve convergence of ECS and OTel Semantic Conventions into a single open schema so that OpenTelemetry Semantic Conventions truly is a successor of the Elastic Common Schema.
+ +## The Goal + +- Long-term, ECS and OTel SemConv will converge into one open standard that is maintained by OpenTelemetry. To kick off this effort, Elastic will nominate several domain experts to join the OpenTelemetry Semantic Convention Approvers to help with maintaining the new standard. +- OTel SemConv will adopt ECS in its full scope (except for individual adjustments in detail where inevitable), including the logging, observability and security domain fields, to make the new schema a true successor of ECS and OTel SemConv. +- Elastic and OpenTelemetry will coordinate and officially announce the direction of the merger (e.g. through official websites, blog posts, etc.) +- Migrate ECS and OTel SemConv users to the new common schema over time and provide utilities to allow the migration to be as easy as possible. + +## Scope and Overlap of ECS and OTel SemConv + +ECS and OTel SemConv have some overlap today, but also significant areas of mutually enriching fields. The following diagram illustrates the different areas: + +

*Figure: overlap between ECS fields and OTel SemConv attributes, showing the areas `A`, `B`, `C` and `D` described below.*
+ +Adding the coverage of ECS to OTel would provide guidance to authors of OpenTelemetry Collector Logs Receivers and help establish the OTel Collector as a de facto standard log collector with a well-defined schema to allow for richer data definition. + +In addition to the use case of structured logs, the maturity of ECS for SIEM (Security Information and Event Management) is a great opportunity for OpenTelemetry to expand its scope to the security use cases. + +Another significant use case is providing first-class support for Kubernetes application logs, system logs, and application introspection events. We would also like to see support for structured events (e.g. [k8seventsreceiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8seventsreceiver)) and using 'content-type' to identify event types. + +We'd like to see different categories of structured logs being well-supported in the [OTel Log Data Model](../specification/logs/data-model.md), presumably through [semantic conventions for log attributes](../specification/logs/data-model.md#field-attributes). For example, NGINX access logs and Apache access logs should be processed the same way as structured logs. This would help in trace and metric correlation with such log data as well as it would help grow the ecosystem of curated UIs provided by observability backends and monitoring dashboards (e.g. one single HTTP access log dashboard benefiting Apache httpd, Nginx, and HAProxy). + +## Customer Motivation + +Adoption of OTel logs will accelerate greatly if ECS is leveraged as the common standard, using this basis for normalization. OTel Logs adoption will be accelerated by this support. For example, ECS can provide the unified structured format for handling vendor-generated logs along with open source logs. + +Customers will benefit from turnkey logs integrations that will be fully recognized by OTel-compatible observability products and services. + +OpenTelemetry logging is today mostly structured when instrumentation libraries are used. However, most of the logs which exist today are generated by software, hardware, and cloud services which the user cannot control. OpenTelemetry provides a limited set of "reference integrations" to structure logs: primarily the [OpenTelemetry Collector Kubernetes Events Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8seventsreceiver) and an example of a regexp based parsing of Tomcat access logs with OpenTelemetry Collector File Receiver ([here](https://github.com/open-telemetry/opentelemetry-log-collection/blob/30807b96b2f0771e7d11452ebf98fe5e211ed6d7/examples/tomcat/config.yaml#L20)). +By expanding the OTel semantic conventions with further namespaces already defined in ECS, a broader coverage of such mappings from different sources can be defined and implemented in the OTel collector. +This, for example, includes logs from network appliances (mapping to the `network` and `interface` namespaces in ECS). + +The semantic conventions of a log are a challenge. What is a specific component defined in a log and how does it relate to other logs which have the same semantic component defined differently? ECS has already done some heavy-lifting on defining a unified set of semantic conventions which can be adopted in OTel. + +OpenTelemetry has the potential to grow exponentially if the data from these other services can be correlated with instrumented code and components. 
In order to do this, industry stakeholders should leverage a common and standard logging data model which allows for the mapping of these different data types. The OpenTelemetry data protocol can provide this interoperable open standard. This unlocks countless use cases, and ensures that OpenTelemetry can work with other technologies which are not OpenTelemetry compliant. + +## Background + +### What is ECS? + +The [Elastic Common Schema (ECS)](https://github.com/elastic/ecs) is an open source specification, developed with support from Elastic's user community. ECS defines a common set of fields to be used when storing data in Elasticsearch, such as logs, metrics, and security and audit events. The goal of ECS is to enable and encourage users of Elasticsearch to normalize their event data, so that they can better analyze, visualize, and correlate the data represented in their events. Learn more at: [https://www.elastic.co/guide/en/ecs/current/ecs-reference.html](https://www.elastic.co/guide/en/ecs/current/ecs-reference.html) + +The coverage of ECS is very broad including in depth support for logs, security, and network events such as "[logs.* fields](https://www.elastic.co/guide/en/ecs/current/ecs-log.html)" , "[geo.* fields](https://www.elastic.co/guide/en/ecs/current/ecs-geo.html)", "[tls.* fields](https://www.elastic.co/guide/en/ecs/current/ecs-tls.html)", "[dns.* fields](https://www.elastic.co/guide/en/ecs/current/ecs-dns.html)", or "[vulnerability.* fields](https://www.elastic.co/guide/en/ecs/current/ecs-vulnerability.html)". + +ECS has the following guiding principles: + +* ECS favors human readability in order to enable broader adoption as many fields can be understood without having to read up their meaning in the reference, +* ECS events include metadata to enable correlations across any dimension (host, data center, docker image, ip address...), + * ECS does not differentiate the metadata fields that are specific to each event of the event source and the metadata that is shared by all the events of the source in the way OTel does, which differentiates between Resource Attributes and Log/Span/Metrics Attributes, +* ECS groups fields in namespaces in order to: + * Offer consistency and readability, + * Enable reusability of namespaces in different contexts, + * For example, the "geo" namespace is nested in the "client.geo", "destination.geo", "host.geo" or "threat.indicator.geo" namespaces + * Enable extensibility by adding fields to namespaces and adding new namespaces, + * Prevent field name conflicts +* ECS covers a broad spectrum of events with 40+ namespaces including detailed coverage of security and network events. It's much broader than simple logging use cases. 
+ +### Example of a log message structured with ECS: NGINX access logs + +Example of a Nginx Access Log entry structured with ECS + +```json +{ + "@timestamp":"2020-03-25T09:51:23.000Z", + "client":{ + "ip":"10.42.42.42" + }, + "http":{ + "request":{ + "referrer":"-", + "method":"GET" + }, + "response":{ + "status_code":200, + "body":{ + "bytes":2571 + } + }, + "version":"1.1" + }, + "url":{ + "path":"/blog" + }, + "user_agent":{ + "original":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36", + "os":{ + "name":"Mac OS X", + "version":"10.14.0", + "full":"Mac OS X 10.14.0" + }, + "name":"Chrome", + "device":{ + "name":"Other" + }, + "version":"70.0.3538.102" + }, + "log":{ + "file":{ + "path":"/var/log/nginx/access.log" + }, + "offset":33800 + }, + "host": { + "hostname": "cyrille-laptop.home", + "os": { + "build": "19D76", + "kernel": "19.3.0", + "name": "Mac OS X", + "family": "darwin", + "version": "10.15.3", + "platform": "darwin" + }, + "name": "cyrille-laptop.home", + "id": "04A12D9F-C409-5352-B238-99EA58CAC285", + "architecture": "x86_64" + } +} +``` + +## Comparison between OpenTelemetry Semantic Conventions for logs and ECS + +## Principles + +| Description | [OTel Logs and Event Record](../specification/logs/data-model.md#log-and-event-record-definition) | [Elastic Common Schema (ECS)](https://www.elastic.co/guide/en/ecs/current/ecs-reference.html) | +|-------------|-------------|--------| +| Metadata shared by all the Log Messages / Spans / Metrics of an application instance | Resource Attributes | ECS fields | +| Metadata specific to each Log Message / Span / Metric data point | Attributes | ECS Fields | +| Message of log events | Body | [message field](https://www.elastic.co/guide/en/ecs/current/ecs-base.html#field-message) | +| Naming convention | Dotted names | Dotted names | +| Reusability of namespaces | Namespaces are intended to be composed | Namespaces are intended to be composed | +| Extensibility | Attributes can be extended by either adding a user defined field to an existing namespaces or introducing new namespaces. | Extra attributes can be added in each namespace and users can create their own namespaces | + +## Data Types + +| Category | OTel Logs and Event Record (all or a subset of GRPC data types) | ECS Data Types | +|---|---|---| +| Text | string | text, match_only_text, keyword constant_keyword, wildcard | +| Dates | uint64 nanoseconds since Unix epoch | date, date_nanos | +| Numbers | number | long, double, scaled_float, boolean… | +| Objects | uint32, uint64… | object (JSON object), flattened (An entire JSON object as a single field value) | +| Structured Objects | No complex semantic data type specified for the moment (e.g. string is being used for ip addresses rather than having an "ip" data structure in OTel).
Note that OTel supports arrays and nested objects. | ip, geo_point, geo_shape, version, long_range, date_range, ip_range |
| Binary data | Byte sequence | binary |

## Known Differences

Some differences exist on fields that are both defined in OpenTelemetry Semantic Conventions and in ECS. In this case, it would make sense for overlapping ECS fields to not be integrated in the new specification.

| OTel Logs and Event Record | Elastic Common Schema (ECS) | Description |
|---|---|---|
| Timestamp (uint64 nanoseconds since Unix epoch) | @timestamp (date) | |
| TraceId (byte sequence), SpanId (byte sequence) | trace.id (keyword), span.id (keyword) | |
| N/A | Transaction.id (keyword) | |
| SeverityText (string) | log.syslog.severity.name (keyword), log.level (keyword) | |
| SeverityNumber (number) | log.syslog.severity.code | |
| Body (any) | message (match_only_text) | |
| process.cpu.load (not specified but collected by OTel Collector)<br>process.cpu.time (async counter)<br>system.cpu.utilization | host.cpu.usage (scaled_float) with a slightly different measurement than what OTel metrics measure | Note that most metrics have slightly different names and semantics between ECS and OpenTelemetry |
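As an illustration of the kind of simple rename/transform that can bridge the overlapping fields above (for example in an OpenTelemetry Collector exporter), the following Go sketch maps a few OTLP log record fields to their ECS counterparts from the table; the `OTelLogRecord` type and `toECS` function are hypothetical and only meant to show the shape of such a mapping:

```go
package main

import (
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// OTelLogRecord holds the subset of OTLP log record fields used below.
type OTelLogRecord struct {
	Timestamp      time.Time
	TraceID        [16]byte
	SpanID         [8]byte
	SeverityText   string
	SeverityNumber int
	Body           string
}

// toECS applies the field renames from the table above to produce an
// ECS-shaped document with dotted keys (illustrative only).
func toECS(r OTelLogRecord) map[string]interface{} {
	return map[string]interface{}{
		"@timestamp":               r.Timestamp.UTC().Format(time.RFC3339Nano),
		"trace.id":                 hex.EncodeToString(r.TraceID[:]),
		"span.id":                  hex.EncodeToString(r.SpanID[:]),
		"log.level":                r.SeverityText,
		"log.syslog.severity.code": r.SeverityNumber,
		"message":                  r.Body,
	}
}

func main() {
	doc, _ := json.MarshalIndent(toECS(OTelLogRecord{
		Timestamp:      time.Now(),
		SeverityText:   "INFO",
		SeverityNumber: 9,
		Body:           "GET /blog 200",
	}), "", "  ")
	fmt.Println(string(doc))
}
```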
+ +## How would OpenTelemetry users practically use the new OpenTelemetry Semantic Conventions Attributes brought by ECS + +The concrete usage of ECS-enriched OpenTelemetry Semantic Conventions Attributes depends on the use case and the fieldset. +In general, OpenTelemetry users would transparently upgrade to ECS and benefit from the alignment of attributes for new use cases. +The main goal of this work is to enable producers of OpenTelemetry signals (collectors/exporters) to create enriched uniform signals for existing and new use cases. +The uniformity allows for easier correlation between signals originating from different producers. The richness ensures more options for Root Cause Analysis, correlation and reporting. + +While ECS covers many different use cases and scenarios, in the following, we outline two examples: + +### Example: OpenTelemetry Collector Receiver to collect the access logs of a web server + +The author of the "OTel Collector Access logs file receiver for web server XXX" would find in the OTel Semantic Convention specifications all +the guidance to map the fields of the web server logs, not only the attributes that the OTel Semantic Conventions has specified today for +[HTTP calls](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.9.0/specification/trace/semantic_conventions/http.md), +but also attributes for the [User Agent](https://www.elastic.co/guide/en/ecs/current/ecs-user_agent.html) +or the [Geo Data](https://www.elastic.co/guide/en/ecs/current/ecs-geo.html). + +This completeness of the mapping will help the author of the integration to produce OTel Log messages that will be compatible with access logs +of other web components (web servers, load balancers, L7 firewalls...) allowing turnkey integration with observability solutions +and enabling richer correlations. + +### Other Examples + +- [Logs with sessions (VPN Logs, Network Access Sessions, RUM sessions, etc.)](https://github.com/elastic/ecs/blob/main/rfcs/text/0004-session.md#usage) +- [Logs from systems processing files](https://www.elastic.co/guide/en/ecs/current/ecs-file.html) + +## Alternatives / Discussion + +### Prometheus Naming Conventions + +Prometheus is a de facto standard for observability metrics and OpenTelemetry already provides full interoperability with the Prometheus ecosystem. + +It would be useful to get interoperability between metrics collected by [official Prometheus exporters](https://prometheus.io/docs/instrumenting/exporters/) (e.g. the [Node/system metrics exporter](https://github.com/prometheus/node_exporter) or the [MySQL server exporter](https://github.com/prometheus/mysqld_exporter)) and their equivalent OpenTelemetry Collector receivers (e.g. OTel Collector [Host Metrics Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/hostmetricsreceiver) or [MySQL Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/mysqlreceiver)). + +Note that one of the challenges with Prometheus metrics naming conventions is that these are implicit conventions defined by each integration author which doesn't enable correlation due to the lack of consistency across integrations. For example, this inconsistency increases the complexity that an end-user has to deal with when configuring and monitoring alerts. 
+ +Prometheus' conventions are restricted to the style of the name of the metrics (see [Prometheus Metric and label naming](https://prometheus.io/docs/practices/naming/)) but don't specify unified metric names. + +## Other areas that need to be addressed by OTel (the project) + +Some areas that need to be addressed in the long run as ECS is integrated into OTel include defining the innovation process, +ensuring the OTel specification incorporates the changes to accommodate ECS, and a process for handling breaking changes if any (the proposal +[Define semantic conventions and instrumentation stability #2180](https://github.com/open-telemetry/opentelemetry-specification/pull/2180) +should tackle this point). Also, migration of existing naming (e.g. Prometheus exporter) to standardized convention (see +[Semantic Conventions for System Metrics](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/system/system-metrics.md) , +[Semantic Conventions for OS Process Metrics](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/system/process-metrics.md)). diff --git a/oteps/0201-scope-attributes.md b/oteps/0201-scope-attributes.md new file mode 100644 index 00000000000..d9aebae793b --- /dev/null +++ b/oteps/0201-scope-attributes.md @@ -0,0 +1,209 @@ +# Introduce Scope Attributes + +This OTEP adds attributes to the Scope of a telemetry emitter (e.g. Tracer, Meter, LogEmitter). + +## Motivation + +There are a few reasons why adding Scope attributes is a good idea: + +- There are 2 known use cases where Scope attributes can solve specific problems: + - Add support for [Meter "short_name"](https://github.com/open-telemetry/opentelemetry-specification/pull/2422), + represented as an attribute of Meter's Scope. + - Add support for differentiating the type of data emitted from the scopes that belong + to different data domains, e.g. profiling data emitted as log records or client-side + data emitted as log records needs to be differentiated so that it can be easily + routed and processed differently in the backends. We don't have a good way to handle + this today. The type of the data can be recorded as an attribute Logger's Scope. +- It makes Scope consistent with the other primary data types: Resource, Span, Metric, + LogRecord. + +See additional [discussion here](https://github.com/open-telemetry/opentelemetry-specification/issues/2450). + +## Summary + +The following is the summary of proposed changes: + +- We will extend OpenTelemetry API to allow specifying Scope attributes when obtaining a + Tracer, Meter or LogEmitter. Scope attributes will be optional. +- We will add `attributes` field to the [InstrumentationScope](https://github.com/open-telemetry/opentelemetry-proto/blob/88faab1197a2a105c7da659951e94bc951d37ab9/opentelemetry/proto/common/v1/common.proto#L83) + message of OTLP. +- We will specify that Telemetry emitted via a Scope-ed Tracer, Meter or LogEmitter will + be associated with the Scope's attributes. +- We will specify that OTLP Exporter will record the attributes in the + InstrumentationScope message. +- We will create a section for Scope attributes' semantic conventions in + the specification. + +## Internal details + +### API Changes + +#### Tracer + +`Get a Tracer` API will be extended to add the following parameter: + +``` +- `attributes` (optional): Specifies the instrumentation scope attributes to associate + with emitted telemetry. +``` + +Since the attributes are optional this is a backwards compatible change. 
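For illustration, obtaining a Tracer with scope attributes could look like the following Go sketch (assuming a `TracerOption` such as `trace.WithInstrumentationAttributes`; exact option names may differ per language):

```go
package main

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Obtain a Tracer whose instrumentation scope carries attributes.
	// Telemetry emitted through this Tracer is associated with these
	// scope attributes.
	tracer := otel.Tracer(
		"mylibrary",
		trace.WithInstrumentationVersion("1.0.0"),
		trace.WithInstrumentationAttributes(attribute.Bool("otel.clientside", true)),
	)
	_ = tracer
}
```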
+ +We will modify the following clause: + +``` +It is unspecified whether or under which conditions the same or different +`Tracer` instances are returned from this functions. +``` + +and replace it by: + +``` +The implementation MUST NOT return the same `Tracer` when called repeatedly with +different values of parameters. The only exception to this rule is no-op `Tracer`, the +implementation MAY return the same instance regardless of parameter values. + +It is unspecified whether or under which conditions the same or different +`Tracer` instances are returned from this functions when the same +(name,version,schema_url,attributes) parameters are used. +``` + +Since we are defining more precisely previously undefined behavior this is a +backwards compatible change. + +#### Meter + +`Get a Meter` API will be extended to add the following parameter: + +``` +- `attributes` (optional): Specifies the instrumentation scope attributes to associate + with emitted telemetry. +``` + +We will modify the following clause: + +``` +It is unspecified whether or under which conditions the same or different +`Meter` instances are returned from this functions. +``` + +and replace it by: + +``` +The implementation MUST NOT return the same `Meter` when called repeatedly with +different values of parameters. The only exception to this rule is no-op `Meter`, the +implementation MAY return the same instance regardless of parameter values. + +It is unspecified whether or under which conditions the same or different +`Meter` instances are returned from this functions when the same +(name,version,schema_url,attributes) parameters are used. +``` + +#### LogEmitter + +`Get LogEmitter` SDK call will be altered to the following: + +``` +Accepts the instrumentation scope name and optional version and attributes and +returns a LogEmitter associated with the instrumentation scope. + +The implementation MUST NOT return the same `LogEmitter` when called repeatedly with +different values of parameters. The only exception to this rule is no-op `LogEmitter`, the +implementation MAY return the same instance regardless of parameter values. + +It is unspecified whether or under which conditions the same or different +`LogEmitter` instances are returned from this functions when the same +(name,version,attributes) parameters are used. +``` + +### OTLP Changes + +The InstrumentationScope message in OTLP will be modified to add 2 new fields: +attributes and dropped_attributes_count: + +```protobuf +message InstrumentationScope { + string name = 1; + string version = 2; + repeated KeyValue attributes = 3; + uint32 dropped_attributes_count = 4; +} +``` + +This change is backwards compatible from OTLP's interoperability perspective. Recipients +of old OTLP versions will not see the Scope attributes and will ignore them, which we +consider acceptable from interoperability perspective. This is aligned with our general +stance on what happens when telemetry sources _add_ new data which old recipients +don't understand: we expect the new data to be safely ignored. + +## Attribute Value Precedence + +If the same attribute is specified both at the Span/Metric/LogRecord and at the Scope +then the attribute value at Span/Metric/LogRecord takes precedence. + +This rule applies to non-OTLP exporters in SDKs, to conversions from OTLP to non-OTLP +formats in the Collector and to OTLP recipients of data that need to interpret the +attributes in the received data. 
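A minimal sketch of how an exporter or recipient might apply this precedence rule (illustrative Go; the helper name and plain-map representation are assumptions, not part of this proposal):

```go
package main

import "fmt"

// mergeAttributes applies the precedence rule above: start from the Scope
// attributes and let any attribute set directly on the Span/Metric/LogRecord
// override a Scope attribute with the same key.
func mergeAttributes(scopeAttrs, recordAttrs map[string]interface{}) map[string]interface{} {
	merged := make(map[string]interface{}, len(scopeAttrs)+len(recordAttrs))
	for k, v := range scopeAttrs {
		merged[k] = v
	}
	for k, v := range recordAttrs {
		merged[k] = v // record-level value wins on conflict
	}
	return merged
}

func main() {
	scope := map[string]interface{}{"otel.clientside": true, "short_name": "checkout"}
	record := map[string]interface{}{"short_name": "checkout-v2", "event": "click"}
	fmt.Println(mergeAttributes(scope, record))
}
```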
+ +## Exporting to non-OTLP + +SDK's non-OTLP Exporters and Collector's exporter to formats that don't have a concept +that is equivalent to the Scope will record the attributes at the most suitable place +in their corresponding format, typically at the Span, Metric or LogRecord equivalent. + +## Prior art and alternatives + +The [Meter "short_name" PR](https://github.com/open-telemetry/opentelemetry-specification/pull/2422) +had an alternate approach where the "short_name" was added as the only attribute to the +InstrumentationScope. This OTEP's proposal generalizes this and allows arbitrary +attributes which allows them to be used for use cases. + +Differentiating the type of data emitted from the scopes that belong to different data +domains can be alternatively done by recording attributes on the Span, Metric or LogRecord. +However, this will be less efficient since it will require the same attributes to be +specified repeatedly on the wire. It will be also cumbersome to require the callers +to always specify such attributes when creating a Span, Metric or a LogRecord as +opposed to specifying them once when obtaining the Trace, Meter or LogEmitter. + +## Examples + +### Usage in Code + +The following is an example usage where LogEmitter is used to emit client-side +log records (pseudocode follows): + +``` +// Obtain loggers once, at startup. +appLogger = LogEmitterProvider.GetLogEmitter("mylibrary", "1.0.0") +loggerForUserEvents = LogEmitterProvider.GetLogEmitter("mylibrary", "1.0.0", KeyValue("otel.clientside", true)) + +// Somewhere later in the code when the user clicks a UI element. This should +// export telemetry with otel.clientside=true Scope attribute set. +loggerForUserEvents.emit(LogRecord{Body:"click", Attributes:...}) + +// Somewhere else in the code, not related to user interactions. This should +// export telemetry without any Scope attributes. +appLogger.emit(LogRecord{Body:"Error occurred while processing the file", Attributes:...}) +``` + +### LogRecord Multiplexing + +Here is an example usage where LogRecords are used to represent profiling data, +client-side events and regular logs. The Scope attribute is used for multiplexing +and routing the LogRecords: + +![LogRecord Multiplexing](img/0201-scope-multiplexing.png) + +## Open questions + +- Should we allow/encourage recording Span/Metric/LogRecord attributes at the Scope level? + The alternate is to disallow this and have completely separate set of semantic + conventions that are allowed for Scope attributes only. +- Can all existing APIs in all languages be safely modified to ensure the addition + of the optional attributes is not a breaking change? (It should be safe, since we did + a very similar change when we [introduced the Scope](https://github.com/open-telemetry/opentelemetry-specification/pull/2276)) + +## Future possibilities + +If this OTEP is accepted we need to then introduce the relevant semantic conventions +that will make the 2 use cases [described earlier](#motivation) possible. diff --git a/oteps/0202-events-and-logs-api.md b/oteps/0202-events-and-logs-api.md new file mode 100644 index 00000000000..b41151e7c27 --- /dev/null +++ b/oteps/0202-events-and-logs-api.md @@ -0,0 +1,74 @@ +# Introducing Events and Logs API + +We introduce an Events and Logs API that is based on the OpenTelemetry Log signal, backed by LogRecord data model and LogEmitter SDK. 
## Motivation

From OpenTelemetry's perspective, Log Records and Events are different names for the same concept; however, there is a subtle difference in how they are represented using the underlying data model that is described below. We will describe why the existing Logging APIs are not sufficient for the purpose of creating events. It will then be evident that we will need an API in OpenTelemetry for creating events. Note that the Events here refer to standalone Events and are not to be confused with Span Events, which occur only in the context of a span.

The Logs part of the API introduced here is supposed to be used only by the Log Appenders, and end-users should continue to use the logging APIs available in the languages.

### Subtle differences between Logs and Events

Logs have a mandatory severity level as a first-class parameter that events do not have, and events have a mandatory name that logs do not have. Further, logs typically have messages in string form and events have data in the form of key-value pairs. It is due to this that their API interface requirements are slightly different.

### Who requires Events API

Here are a few situations that require recording of Events; there will be more. Note that the Trace API provides the capability to record Events, but only while a span is in progress. We need a separate API for recording standalone Events.

- RUM events (Client-side instrumentation)
  - Standalone events that occur when there is no span in progress, such as errors, user interaction events and web vitals.
- Recording Kubernetes events
- Collector Entity events [link](https://docs.google.com/document/d/1Tg18sIck3Nakxtd3TFFcIjrmRO_0GLMdHXylVqBQmJA/edit)
- A few other event systems described in [example mappings](../specification/logs/data-model.md#appendix-a-example-mappings) in the data model.

### Can the current Log API interfaces be used for events?

- The log level is fundamental to the Log APIs in almost all the languages; all the methods in the Log interface are named after the log level, and there is usually no generic method to submit a log entry without a log level.
  - In JavaScript for Web, the standard method of logging is to use console.log. Events can be created using [Event/CustomEvent](https://developer.mozilla.org/en-US/docs/Web/Events/Creating_and_triggering_events) interfaces. However, there is no option to define a custom destination for these logs and events. Logs go only to the console, and event listeners are attached to the DOM element that dispatches the event.
  - In Android, android.util.Log has methods Log.v(), Log.d(), Log.i(), Log.w(), and Log.e() to write logs. These methods correspond to the severity level.
  - Swift on iOS has a Logger interface that likewise has methods corresponding to the severity level.
- The current Log APIs do not have a standard way to pass event attributes.
  - It may be possible to use the interpolation string args as the parameter to pass event attributes. However, the logging spec seems to map the human-readable message (which is obtained after replacing the args in the interpolated string) to the Body field of LogRecord.
  - Log4j has an EventLogger interface that can be used to create structured messages with arbitrary key-value pairs, but log4j is not commonly used in Android apps as it is not officially supported on Android as per this [Stack Overflow thread](https://stackoverflow.com/questions/60398799/disable-log4j-jmx-on-android/60407849#60407849) by one of log4j’s maintainers.
  - In Python, logging.LogRecord's extra field is mapped to Otel LogRecord's attributes, but this field is a hidden field and not part of the public interface.
- The current Log APIs have a message parameter which could map to the Body field of LogRecord. However, this is restricted to String messages and does not allow for structured logs.

For the above reasons we can conclude that we will need an API for creating Events.

## Explanation

We propose a structure for Events for the purpose of distinguishing them from Logs and also propose having an API to ensure the structure is followed when creating Events using the `LogRecord` data model.

### Events structure

All Events will have a name and a domain. The name is MANDATORY. The domain will serve as a mechanism to avoid conflicts with event names and is OPTIONAL. With this structure, an event name will be unique only in the context of a domain. It allows two events in different domains to have the same name, yet be unrelated events. When the domain is not present in an Event, no claim is made about the uniqueness of the event name.

### Events and Logs API

We also propose having an API interface for creating Events and Logs. Currently, there is only an SDK called [LoggerProvider](../specification/logs/sdk.md#loggerprovider) for creating `LogRecord`s.

However, there is a question of whether OTel should have an API for logs. A part of the OTel community thinks that we should not have a full-fledged logging API unless there is a language that doesn't already have a plethora of logging libraries and APIs to choose from, where it might make sense to define one. Further, we will not be able to have the [rich set of configuration options](https://logging.apache.org/log4j/2.x/manual/configuration.html) that some popular logging frameworks provide, so the logging API in OTel would only become yet another API. However, it was noted that the Log Appender API is very similar to the API for Events, and so instead of having an API for Events and an API for Log Appenders separately, it was agreed to have one API for Events and Logs, and that the API for Logs is targeted only to Log Appenders. This will also keep it consistent with Traces and Metrics in having one API for each signal.

## Internal Details

The event name and domain will be attributes in the `LogRecord` defined using semantic conventions.

The Events and Logs API will be very similar to the Trace API. There will be LoggerProvider and Logger interfaces analogous to TracerProvider and Tracer. The Logger interface will then be used to create Events and Logs using the LogRecord data model.

## Trade-offs and mitigations

There could be confusion on whether the Logs part of the API is end-user callable. While it can eventually be used in the languages that do not have a popular logging library, it is not recommended to be used in the languages where there are other popular logging libraries and APIs, and this fact must be emphasized in different forums.

## Prior art and alternatives

For client-side instrumentation, it was suggested initially that we use 0-duration spans to represent Events to get the benefit of Spans providing causality. For example, Splunk's RUM SDK for Android implements Events using a [0-duration span](https://github.com/signalfx/splunk-otel-android/blob/main/splunk-otel-android/src/main/java/com/splunk/rum/SplunkRum.java#L213).
However, 0-duration spans are confusing and not consistent with standalone Events in other domains which are represented using `LogRecord`s. Hence, for consistency reasons it will be good to use `LogRecord`s for standalone Events everywhere. To address the requirement of modeling causality between Events, we can create wrapper spans linked to the `LogRecord`s. + +## Open questions + +None. + +## Future possibilities + +1. As noted in the `Trade-offs and mitigation` section, we could allow the API to be used by end-users in the languages that do not have a popular logging library. +2. There is a possibility that we may want to record the Span Events using `LogRecord`s in future. In this case, they will be correlated wth the Spans using the `TraceId` and `SpanId` fields of the `LogRecord`. If this is desired, we may add a configuration option to the `TracerProvider` to create LogRecords for the Span Events. diff --git a/oteps/0225-configuration.md b/oteps/0225-configuration.md new file mode 100644 index 00000000000..c48d49c264d --- /dev/null +++ b/oteps/0225-configuration.md @@ -0,0 +1,375 @@ +# OpenTelemetry Configuration + +A new configuration interface is proposed here in the form of a configuration model, which can be expressed as a file, and validated through a published schema. + +## Motivation + +OpenTelemetry specifies code that can operate in a variety of ways based on the end-user’s desired mode of operation. This requires a configuration interface be provided to the user so they are able to communicate this information. Currently, OpenTelemetry specifies this interface in the form of the API exposed by the SDKs and environment variables. This environment variable interface is limited in the structure of information it can communicate and the primitives it can support. + +### Environment Variable Interface Limitations + +The environment variable interface suffers from the following identified limitations: + +* **Flat**. Structured data is only allowed by using higher-level naming or data encoding schemes. Examples of configuration limited by lack of structured configuration include: + * Configuring multiple span processors, periodic metric readers, or log record processors. + * Configuring views. + * Configuring arguments for parent based sampler (sampler parent is remote and sampled vs. not sampled, sampler when parent is local and sampled vs. not sampled). +* **Runtime dependent**. Different systems expose this interface differently (Linux, BSD, Windows). This usually means unique instructions are required to properly interact with the configuration interface on different systems. +* **Limited values**. Many systems only allow string values to be used, but OpenTelemetry specifies many configuration values other than this type. For example, OTEL_RESOURCE_ATTRIBUTES specifies a list of key value pairs to be used as resource attributes, but there is no way to specify array values, or indicate that the value should be interpreted as non-string type. +* **Limited validation**. Validation can only be performed by the receiver, there is no meta-configuration language to validate input. +* **Lacks versioning**. The lack of versioning support for environment variables prevents evolution over time. + +## Explanation + +Using a configuration model or configuration file, users can configure all options currently available via environment variables. + +### Goals + +* The configuration must be language implementation agnostic. 
It must not contain structure or statements that only can be interpreted in a subset of languages. This does not preclude the possibility that the configuration can have specific extensions included for a subset of languages, but it does mean that the configuration must be interpretable by all implementation languages. +* Broadly supported format. Ideally, the information encoded in the file can be decoded using native tools for all OpenTelemetry implementation languages. However, it must be possible for languages that do not natively support an encoding format to write their own parsers. +* The configuration format must support structured data. At the minimum arrays and associative arrays. +* The format must support at least boolean, string, double precision floating point (IEEE 754-1985), and signed 64 bit integer value types. +* Custom span processors, exporters, samplers, or other user defined extension components can be configured using this format. +* Configure SDK, but also configure instrumentation. +* Must offer stability guarantees while supporting evolution. +* The structure of the configuration can be validated via a schema. +* Support environment variable substitution to give users the option to avoid storing secrets in these files. + +#### Out of scope + +* Embedding additional configuration files within a configuration file through an expression. Additional configuration providers MAY choose to support this use-case in the future. + +## Internal details + +The schema for OpenTelemetry configuration is to be published in a repository to allow language implementations to leverage that definition to automatically generate code and/or validate end-user configuration. This will ensure that all implementations provide a consistent experience for any version of the schema they support. An example of such a proposed schema is available [here](./assets/0225-schema.json). + +The working group proposes the use of [JSON Schema](https://json-schema.org/) as the language to define the schema. It provides: + +* support for client-side validation +* code generation +* broad support across languages + +In order to provide a minimal API surface area, implementations *MUST* support the following: + +### Parse(file) -> config + +An API called `Parse` receives a file object. The method loads the contents of the file, parses it, and validates that the configuration against the schema. At least one of JSON or YAML MUST be supported. If either format can be supported without additional dependencies, that format SHOULD be preferred. If neither or both formats are supported natively, YAML should be the preferred choice. If YAML is not supported due to dependency concerns, there MAY be a way for a user to explicitly enable it by installing their own dependency. + +The method returns a [Configuration model](#configuration-model) that has been validated. 
This API *MAY* return an error or raise an exception, whichever is idiomatic to the implementation for the following reasons: + +* file doesn't exist or is invalid +* configuration parsed is invalid according to schema + +#### Python Parse example + +```python + +filepath = "./config.yaml" + + +try: + cfg = opentelemetry.Parse(filepath) +except Exception as e: + print(e) + +filepath = "./config.json" + +try: + cfg = opentelemetry.Parse(filepath) +except Exception as e: + raise e + +``` + +#### Go Parse example + +```go + +filepath := "./config.yaml" +cfg, err := otel.Parse(filepath) +if err != nil { + return err +} + +filepath := "./config.json" +cfg, err := otel.Parse(filepath) +if err != nil { + return err +} + +``` + +Implementations *MUST* allow users to specify an environment variable to set the configuration file. This gives flexibility to end users of implementations that do not support command line arguments. The proposed name for this variable: + +* `OTEL_CONFIG_FILE` + +The format for the configuration file will be detected using the file extension of this variable. + +### Configurer + +`Configurer` interprets a [Configuration model](#configuration-model) and produces configured SDK components. + +Multiple `Configurer`s can be [created](#createconfig---configurer) with different configurations. It is the caller's responsibility to ensure the [resulting SDK components](#get-tracerprovider-meterprovider-loggerprovider) are correctly wired into the application and instrumentation. + +`Configurer` **MAY** be extended in the future with functionality to apply an updated configuration model to the resulting SDK components. + +#### Create(config) -> Configurer + +Create a `Configurer` from a [configuration model](#configuration-model). + +#### Get TracerProvider, MeterProvider, LoggerProvider + +Interpret the [configuration model](#configuration-model) and return SDK TracerProvider, MeterProvider, LoggerProvider which strictly reflect the configuration object's details and ignores the [opentelemetry environment variable configuration scheme](../specification/configuration/sdk-environment-variables.md). + +### Configuration model + +To allow SDKs and instrumentation libraries to accept configuration without having to implement the parsing logic, a `Configuration` model *MAY* be provided by implementations. This object: + +* has already been parsed from a file or data structure +* is structurally valid (errors may yet occur when SDK or instrumentation interprets the object) + +### Configuration file + +The following demonstrates an example of a configuration file format (full example [here](./assets/0225-config.yaml)): + +```yaml +# include version specification in configuration files to help with parsing and schema evolution. +file_format: 0.1 +sdk: + # Disable the SDK for all signals. + # + # Boolean value. If "true", a no-op SDK implementation will be used for all telemetry + # signals. Any other value or absence of the variable will have no effect and the SDK + # will remain enabled. This setting has no effect on propagators configured through + # the OTEL_PROPAGATORS variable. + # + # Environment variable: OTEL_SDK_DISABLED + disabled: false + # Configure resource attributes and resource detection for all signals. + resource: + # Key-value pairs to be used as resource attributes. 
+ # + # Environment variable: OTEL_RESOURCE_ATTRIBUTES + attributes: + # Sets the value of the `service.name` resource attribute + # + # Environment variable: OTEL_SERVICE_NAME + service.name: !!str "unknown_service" + # Configure context propagators. Each propagator has a name and args used to configure it. None of the propagators here have configurable options so args is not demonstrated. + # + # Environment variable: OTEL_PROPAGATORS + propagators: [tracecontext, baggage] + # Configure the tracer provider. + tracer_provider: + # Span exporters. Each exporter key refers to the type of the exporter. Values configure the exporter. Exporters must be associated with a span processor. + exporters: + # Configure the zipkin exporter. + zipkin: + # Sets the endpoint. + # + # Environment variable: OTEL_EXPORTER_ZIPKIN_ENDPOINT + endpoint: http://localhost:9411/api/v2/spans + # Sets the max time to wait for each export. + # + # Environment variable: OTEL_EXPORTER_ZIPKIN_TIMEOUT + timeout: 10000 + # List of span processors. Each span processor has a name and args used to configure it. + span_processors: + # Add a batch span processor. + # + # Environment variable: OTEL_BSP_*, OTEL_TRACES_EXPORTER + - name: batch + # Configure the batch span processor. + args: + # Sets the delay interval between two consecutive exports. + # + # Environment variable: OTEL_BSP_SCHEDULE_DELAY + schedule_delay: 5000 + # Sets the maximum allowed time to export data. + # + # Environment variable: OTEL_BSP_EXPORT_TIMEOUT + export_timeout: 30000 + # Sets the maximum queue size. + # + # Environment variable: OTEL_BSP_MAX_QUEUE_SIZE + max_queue_size: 2048 + # Sets the maximum batch size. + # + # Environment variable: OTEL_BSP_MAX_EXPORT_BATCH_SIZE + max_export_batch_size: 512 + # Sets the exporter. Exporter must refer to a key in sdk.tracer_provider.exporters. + # + # Environment variable: OTEL_TRACES_EXPORTER + exporter: zipkin + # custom processor + - name: my-custom-processor + args: + foo: bar + baz: qux + + # Configure the meter provider. + ... +``` + +Note that there is no consistent mapping between environment variable names and the keys in the configuration file. + +### Environment variable substitution + +Configuration files *MUST* support environment variable expansion. While this accommodates the scenario in which a configuration file needs to reference sensitive data and is not able to be stored securely, environment variable expansion is not limited to sensitive data. + +As a starting point for development, the syntax for environment variable expansion *MAY* mirror the [collector](https://opentelemetry.io/docs/collector/configuration/#configuration-environment-variables). + +For example, given an environment where `API_KEY=1234`, the configuration file contents: + +```yaml +file_format: 0.1 +sdk: + tracer_provider: + exporters: + otlp: + endpoint: https://example.host:4317/v1/traces + headers: + api-key: ${env:API_KEY} +``` + +Result in the following after substitution: + +```yaml +file_format: 0.1 +sdk: + tracer_provider: + exporters: + otlp: + endpoint: https://example.host:4317/v1/traces + headers: + api-key: 1234 +``` + +Implementations *MUST* perform environment variable substitution before validating and parsing configuration file contents. + +If a configuration file references an environment variable which is undefined, implementations *MUST* return an error or raise an exception. 
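+
+For illustration only, the following is a minimal, non-normative sketch of how an
+implementation might perform this substitution in Python, assuming the
+collector-style `${env:VAR}` syntax shown above. The helper name and exact
+regular expression are illustrative and not part of this proposal.
+
+```python
+import os
+import re
+
+# Matches collector-style ${env:VAR_NAME} references in the raw file contents.
+_ENV_REF = re.compile(r"\$\{env:(?P<name>[A-Za-z0-9_]+)\}")
+
+
+def substitute_env_vars(raw: str) -> str:
+    """Expand ${env:VAR} references before the configuration is parsed.
+
+    Referencing an undefined environment variable is treated as a hard error,
+    matching the requirement above.
+    """
+
+    def _replace(match: re.Match) -> str:
+        name = match.group("name")
+        if name not in os.environ:
+            raise ValueError(f"undefined environment variable referenced in configuration: {name}")
+        return os.environ[name]
+
+    return _ENV_REF.sub(_replace, raw)
+
+
+# Example: with API_KEY=1234 set, "api-key: ${env:API_KEY}" becomes "api-key: 1234".
+```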
+ +#### Handling environment variable & file config overlap + +The behaviour when both configuration file and environment variables are present will be decided in the final design. Here are four options that should be considered: + +1. implementations ignore environment variables in preference of the configuration file +2. implementations give preference to the environment variables over the configuration file +3. an exception arises causing the application to fail to start +4. the behaviour is left unspecified + +The support for environment variable substitution in the configuration file gives users a mechanism for migrating away from environment variables in favour of configuration files. + +### Version guarantees & backwards compatibility + +Each version of the configuration schema carries a major and minor version. Configurations specify the major and minor version they adhere to. Before reaching 1.0, each minor version change is equivalent to major version change. That is, there are no guarantees about compatibility and all changes are permitted. As of 1.0, we provide the following stability guarantees: + +* For major version: No guarantees. +* For minor versions: TBD + +Allowable changes: + +* For major versions: All changes are permitted. +* For minor versions: TBD + +SDKs validating configuration *MUST* fail when they encounter a configuration with an unsupported version. Generally, this means fail when encountering a major version which is not recognized. An SDK might choose to maintain a library of validators / parsers for each major version, and use the configuration version to select and use the correct instance. Differences in minor versions (except pre-1.0 minor versions) *MUST* be acceptable. + +## Trade-offs and mitigations + +### Additional method to configure OpenTelemetry + +If the implementation suggested in this OTEP goes ahead, users will be presented with another mechanism for configuring OpenTelemetry. This may cause confusion for users who are new to the project. It may be possible to mitigate the confusion by providing users with best practices and documentation. + +### Many ways to configure may result in users not knowing what is configured + +As there are multiple mechanisms for configuration, it's possible that the active configuration isn't what was expected. This could happen today, and one way it could be mitigated would be by providing a mechanism to list the active OpenTelemetry configuration. + +### Errors or difficulty in configuration files + +Configuration files provide an opportunity for misconfiguration. A way to mitigate this would be to provide clear messaging and fail quickly when misconfiguration occurs. + +## Prior art and alternatives + +The working group looked to the OpenTelemetry Collector and OpenTelemetry Operator for inspiration and guidance. + +### Alternative schema languages + +In choosing to recommend JSON schema, the working group looked at the following options: + +* [Cue](https://cuelang.org/) - A promising simpler language to define a schema, the working group decided against CUE because: + * Tooling available for validating CUE files in languages outside of Go were limited. + * Familiarity and learning curve would create problems for both users and contributors of OpenTelemetry. +* [Protobuf](https://developers.google.com/protocol-buffers) - With protobuf already used heavily in OpenTelemetry, the format was worth investigating as an option to define the schema. 
The working group decided against Protobuf because:
+  * Validation errors are the result of serialization errors which can be difficult to interpret.
+  * Limitations in the schema definition language result in poor ergonomics if type safety is to be retained.
+
+## Open questions
+
+### How to handle no-code vs programmatic configuration?
+
+How should the SDK be configured when both no-code configuration (either environment variable or file config) and programmatic configuration are present? NOTE: this question exists today with only the environment variable interface available.
+
+* Solution 1: Make it clear that interpretation of the environment shouldn’t be built into components. Instead, SDKs should have a component that explicitly interprets the environment and returns a configured instance of the SDK. This is how the Java SDK works today, and it nicely separates concerns.
+
+### What is the exact configuration file format to use?
+
+Included in this OTEP is an example configuration file format.
+This included format was settled on so the configuration file schemas being proposed here could all be evaluated as viable options.
+It acted as proof-of-concept that *a* configuration file format existed that could describe needed OTel configuration and be described by a schema.
+However, the configuration file format presented here is not meant as the final, nor optimal, design for use by OpenTelemetry.
+
+What that final design will be is left to discussion when this OTEP is implemented in the OpenTelemetry specification.
+It is explicitly something this OTEP is not intended to resolve.
+
+The following are existing questions that will need to be resolved in the final design:
+
+1. Should the trace exporter be at the same level as the span processors?
+2. Should the sampler config be separate from the sampler?
+3. Is the `sdk` key appropriate? Should alternate configuration live under its own key but the SDK's configuration be at the top level?
+4. Should the `disabled` key be renamed as `enabled`?
+
+This list is not intended to be comprehensive.
+There are likely more questions related to the final design that will be discussed when implemented in the OpenTelemetry specification.
+
+## Future possibilities
+
+### Additional configuration providers
+
+Although the initial proposal for configuration support only describes in-code and file representations, it's possible additional sources (remote, opamp, ...) for configuration will be desirable. The implementation of the configuration model and components should be extensible to allow for this.
+
+### Integration with auto-instrumentation
+
+The configuration model could be integrated to work with the existing auto-instrumentation tooling in each language implementation.
+
+#### Java
+
+The Java implementation provides a JAR that supports configuring various parameters via system properties. This implementation could leverage a configuration file by supporting its configuration via a system property:
+
+```bash
+java -javaagent:path/to/opentelemetry-javaagent.jar \
+    -Dotel.config.file=./config.yaml \
+    -jar myapp.jar
+```
+
+#### Python
+
+The Python implementation has a command available that allows users to leverage auto-instrumentation.
The `opentelemetry-instrument` command could use a `--config` flag to pass in a config file: + +```bash +# install the instrumentation package +pip install opentelemetry-instrumentation +# use a --config parameter to pass in the configuration file +# NOTE: this parameter does not currently exist and would need to be added +opentelemetry-instrument --config ./config.yaml ./python/app.py +``` + +#### OpAmp + +The configuration may be used in the future in conjunction with the OpAmp protocol to make remote configuration of SDKs available as a feature supported by OpenTelemetry. + +## Related Spec issues address + +* [https://github.com/open-telemetry/opentelemetry-specification/issues/1773](https://github.com/open-telemetry/opentelemetry-specification/issues/1773) +* [https://github.com/open-telemetry/opentelemetry-specification/issues/2857](https://github.com/open-telemetry/opentelemetry-specification/issues/2857) +* [https://github.com/open-telemetry/opentelemetry-specification/issues/2746](https://github.com/open-telemetry/opentelemetry-specification/issues/2746) +* [https://github.com/open-telemetry/opentelemetry-specification/issues/2860](https://github.com/open-telemetry/opentelemetry-specification/issues/2860) diff --git a/oteps/0227-separate-semantic-conventions.md b/oteps/0227-separate-semantic-conventions.md new file mode 100644 index 00000000000..427c019d0e2 --- /dev/null +++ b/oteps/0227-separate-semantic-conventions.md @@ -0,0 +1,182 @@ +# Separate Semantic Conventions + +Move Semantic Conventions outside of the main Specifications and version them +separately. + +## Motivation + +We need to allow semantic conventions to evolve mostly independent of the +overall OpenTelemetry specification. Today, any breaking change in a semantic +convention would require bumping the version number of the entirety of the +OpenTelemetry specification. + +## Explanation + +A new GitHub repository called `semantic-conventions` would be created in the +OpenTelemetry organization. + +This would *initially* have the following structure: + +- Boilerplate files, e.g. `README.md`, `LICENSE`, `CODEOWNERS`, `CONTRIBUTING.md` +- `Makefile` that allows automatic generation of documentation from model. +- `semantic_conventions/` The set of YAML files that exist in + `{specification}/semantic_conventions` today. +- `docs/` A new directory that contains human readable documentation for how to + create instrumentation compliant with semantic conventions. + - `resource/` - Contents of `{specification}/resource/semantic_conventions` + - `trace/` - Contents of `{specification}/trace/semantic_conventions` + - `metrics/` - Contents of `{specification}/metrics/semantic_conventions` + - `logs/`- Contents of `{specification}/logs/semantic_conventions` + - `schemas/` - A new location for [Telemetry Schemas](https://github.com/open-telemetry/semantic-conventions/tree/main/schemas) + to live. This directory will be hosted at + `https://opentelemetry.io/schemas/` + +Existing semantic conventions in the specification would be marked as +moved, with documentation denoting the move, but preserving previous contents. + +Additionally, if the semantic conventions eventually move to domain-specific +directory structure (e.g. `docs/{domain}/README.md`, with trace, metrics, events +in the same file), then this can be refactored in the new repository, preserving +git history. 
+
+There will also be the following exceptions in the specification:
+
+- Semantic conventions used to implement API/SDK details will be fully specified in the `opentelemetry-specification` repo
+  and will not be allowed to change in the Semantic Convention directory.
+  - Error/Exception handling will remain in the specification.
+  - SDK configuration interaction w/ semantic convention will remain in the
+    specification. Specifically `service.name`.
+- The specification may elevate some semantic conventions as necessary for
+  compatibility requirements, e.g. `service.instance.id` and
+  [Prometheus Compatibility](../specification/compatibility/prometheus_and_openmetrics.md).
+
+These exceptions exist because:
+
+- Stable portions of the specification already rely on these conventions.
+- These conventions are required to implement an SDK today.
+
+As such, the Specification should define the absolute minimum of reserved or
+required attribute names and their interaction with the SDK.
+
+## Internal details
+
+The following process would be used to ensure semantic conventions are
+seamlessly moved to their new location. This process lists steps in order:
+
+- A moratorium will be placed on Semantic Convention PRs to the specification
+  repository. (Caveat that PRs related to this proposal would be allowed).
+- Interactions between Semantic Conventions and the Specification will be
+  extracted such that the Specification can place requirements on Semantic
+  Conventions and *normative* specification language will remain in the
+  core specification directories.
+- A new repository `open-telemetry/semantic-conventions` will be constructed with
+  the proposed format and necessary `Makefile` / tooling.
+  - The new repository would be created by using `git filter-branch` to preserve
+    all existing semantic convention history. *This means all existing
+    semantic conventions will be in the new repository*.
+  - GitHub Actions, `Makefile` tooling and contributing / readmes would be
+    updated for the separate repository.
+  - **Note: At this point, the new location for semantic conventions should
+    be adoptable/usable.**
+- Semantic conventions in the Specification will be marked as moved with
+  links to the new location.
+  - The semconv YAML files in the specification repository *will be deleted*.
+  - All semconv markdown files will be updated such that:
+    - They no longer generate from YAML files.
+    - They include a header denoting deprecation and move to the new repository.
+- Instrumentation authors will update their code generation to pull from the new
+  semconv repository instead of the specification repository.
+
+## Trade-offs and mitigations
+
+This proposal has a few drawbacks:
+
+- The semantic conventions will no longer be easily referenceable from the specification.
+  - This is actually a benefit. We can ensure isolation of convention from
+    specification and require the Specification to use firm language for
+    attributes it requires, like `service.name`.
+  - We will provide links from the existing location to the new location.
+- Semantic Convention version will no longer match the specification version.
+  - Instrumentation authors will need to consume a separate semantic-convention
+    bundle from the Specification bundle. What used to be ONE upgrade effort will
+    now be split into two (hopefully smaller) efforts.
+  - We expect changes from Semantic Conventions and Specification to be
+    orthogonal, so this should not add significant wall-clock time.
+- Existing PRs against semantic conventions will need to be regenerated. + +Initially this repository would have the following ownership: + +- Approvers + - [Christian Neumüller](https://github.com/Oberon00), Dynatrace + - [James Moessis](https://github.com/jamesmoessis), Atlassian + - [Joao Grassi](https://github.com/joaopgrassi), Dynatrace + - [Johannes Tax](https://github.com/pyohannes), Microsoft + - [Liudmila Molkova](https://github.com/lmolkova), Microsoft + - [Sean Marcinak](https://github.com/MovieStoreGuy), Atlassian + - [Ted Young](https://github.com/tedsuo), Lightstep +- Approvers (HTTP Only) + - [Trask Stalnaker](https://github.com/trask) +- Approvers (SchemaUrl Files) + - [Tigran Najaryan](https://github.com/tigrannajaryan) +- Maintainers + - [Josh Suereth](https://github.com/jsuereth) + - [Armin Ruech](https://github.com/arminru) + - [Reiley Yang](https://github.com/reyang) + +That is, Maintenance would initially continue to fall on (a subset of) the +Technical Committee. Approvers would start with existing semconv approvers in +addition to targeted at HTTP semantic convention stability approvers and +expand rapidly as we build momentum on semantic conventions. + +## Prior art and alternatives + +When we evaluate equivalent communities and efforts we see the following: + +- `OpenTracing` - had specification and [semantics](https://github.com/opentracing/specification/blob/master/semantic_conventions.md) + merged. +- `OpenCensus` - had specification and semantics merged. However, OpenCensus + merged with OpenTelemetry prior to mass adoption or stable release of its + specification. +- `Elastic Common Schema` - the schema is its own project / document. +- `Prometheus` - Prometheus does not define rigid guidelines for telemetry, like + semantic conventions, instead relying on naming conventions and + standardization through mass adoption. + +## Open questions + +This OTEP doesn't address the full needs of tooling and codegen that will be +needed for the community to shift to a separate semantic convention directory. +This will require each SIG that uses autogenerated semantic conventions to +adapt to the new location. + +The first version of the new location for semantic conventions may not follow +the latest of the specification. There is reasoning to desire a `2.0` but the +details will be discussed in the new repository location upon execution of this +OTEP. + +## Future possibilities + +This OTEP paves way for the following desirable features: + +- Semantic Conventions can decide to bump major version numbers to accommodate + new signals or hard-to-resolve new domains without breaking the Specification + version number. +- Semantic Conventions can have dedicated maintainers and approvers. +- Semantic Conventions can restructure to better enable subject matter experts + (SMEs) to have approver/CODEOWNER status on relevant directories. +- Semantic Conventions can adopt semantic versioning for itself, helping clearly + denote breaking changes to users. + +There is a desire to move semantic conventions to domain-specific directories +instead of signal-specific. This can occur after the separation of the repository +and will be proposed and discussed separately from this OTEP. 
+ +For example: + +- `docs/` + - `signals/` - Conventions for metrics, traces + logs + - `http/` + - `db/` + - `messaging/` + - `client/` + - `resource/` - We still need resource-specific semantic conventions diff --git a/oteps/0232-maturity-of-otel.md b/oteps/0232-maturity-of-otel.md new file mode 100644 index 00000000000..3ce9726f756 --- /dev/null +++ b/oteps/0232-maturity-of-otel.md @@ -0,0 +1,74 @@ +# Definition of maturity levels to be uniformly used by OpenTelemetry SIGs + +On 08 Mar 2023, the OpenTelemetry GC and TC held an OpenTelemetry Leadership summit, discussing various topics. One of the themes we discussed was establishing standard rules for describing the maturity of the OpenTelemetry project. This OTEP summarizes what was discussed there and is intended to have the wider community provide feedback. + +This OTEP builds on what was previously communicated by the project, especially the [Versioning and stability for OpenTelemetry clients](https://opentelemetry.io/docs/reference/specification/versioning-and-stability). + +The Collector's [stability levels](https://github.com/open-telemetry/opentelemetry-collector#stability-levels) inspired the maturity levels. + +## Motivation + +Quite often, the community is faced with the question of the quality and maturity expectations of its diverse set of components. This OTEP aims to bring clarity by establishing a framework to communicate the maturity for SIG deliverables and components in the name of the project. As the OpenTelemetry project comprises a multitude of SIGs, and each SIG has several components of varying quality, having this framework will help set the right expectations for OpenTelemetry users using a unified nomenclature. + +## Explanation + +### Maturity levels + +Deliverables of a SIG MUST have a declared maturity level, established by SIG maintainers (SIGs), likely with the input of the code owners. While the main deliverable can have a specific maturity level, individual components might have a different one. Examples: + +* the Collector core distribution might declare itself stable and include a receiver that is not stable. In that case, the receiver has to be clearly marked as such +* the Java Agent might be declared stable, while individual instrumentation packages are not + +Components SHOULD NOT be marked as stable if their user-visible interfaces are not stable. For instance, if the Collector's component "otlpreceiver" declares a dependency on the OpenTelemetry Collector API "config" package which is marked with a maturity level of "beta", the "otlpreceiver" should be at most "beta". Maintainers are free to deviate from this recommendation if they believe users are not going to be affected by future changes. + +For the purposes of this document, a breaking change is defined as a change that may require consumers of our components to adapt themselves in order to avoid disruption to their usage of our components. + +#### Development + +Not all pieces of the component are in place yet, and it might not be available for users yet. Bugs and performance issues are expected to be reported. User feedback around the UX of the component is desired, such as for configuration options, component observability, technical implementation details, and planned use-cases for the component. Configuration options might break often depending on how things evolve. The component SHOULD NOT be used in production. The component MAY be removed without prior notice. 
+ +#### Alpha + +This is the default level: any components with no explicit maturity level should be assumed to be "Alpha". The component is ready to be used for limited non-critical production workloads, and the authors of this component welcome user feedback. Bugs and performance problems are encouraged to be reported, but component owners might not work on them immediately. The component's interface and configuration options might often change without backward compatibility guarantees. Components at this stage might be dropped at any time without notice. + +#### Beta + +Same as Alpha, but the interfaces (API, configuration, generated telemetry) are treated as stable whenever possible. While there might be breaking changes between releases, component owners should try to minimize them. A component at this stage is expected to have had exposure to non-critical production workloads already during its Alpha phase, making it suitable for broader usage. + +#### Release Candidate + +The component is feature-complete and ready for broader usage. The component is ready to be declared stable, it might just need to be tested in more production environments before that can happen. Bugs and performance problems are expected to be reported, and there's an expectation that the component owners will work on them. Breaking changes, including configuration options and the component's output, are only allowed under special circumstances. Whenever possible, users should be given prior notice of the breaking changes. + +#### Stable + +The component is ready for general availability. Bugs and performance problems should be reported, and there's an expectation that the component owners will work on them. Breaking changes, including configuration options and the component's output, are only allowed under special circumstances. Whenever possible, users should be given prior notice of the breaking changes. + +#### Deprecated + +Development of this component is halted. No new versions are planned, and the component might be removed from its included distributions. Note that new issues will likely not be worked on except for critical security issues. Components that are included in distributions are expected to exist for at least two minor releases or six months, whichever happens later. They also MUST communicate in which version they will be removed, either in terms of a concrete version number or the date of a release, like: "the first release after 2023-08-01". + +#### Unmaintained + +A component identified as unmaintained does not have an active code owner. Such components may have never been assigned a code owner, or a previously active code owner has not responded to requests for feedback within 6 weeks of being contacted. Issues and pull requests for unmaintained components SHOULD be labeled as such. After 6 months of being unmaintained, these components MAY be deprecated. Unmaintained components are actively seeking contributors to become code owners. + +## Trade-offs and mitigations + +This OTEP allows SIG maintainers to declare the maturity of the SIG's deliverables without declaring which ones are key for OpenTelemetry. When, and if, this is needed, a new OTEP may be created using the maturity levels as a possible framework. + +## Prior art and alternatives + +* The specification status has a ["Component Lifecycle"](https://opentelemetry.io/docs/specs/status/) description, with definitions that might overlap with some of the levels listed in this OTEP. 
+* The same page lists the status of the different parts of the spec +* The ["Versioning and stability for OpenTelemetry clients"](https://opentelemetry.io/docs/specs/otel/versioning-and-stability/#signal-lifecycle) page has a detailed view on the lifecycle of a signal and which general stability guarantees should be expected by OpenTelemetry clients. Notably, it lacks information about maturity of the Collector. This OTEP could be seen as clashing with the last section of that page, "OpenTelemetry GA". But while that page established a point where both OpenTracing and OpenCensus would be considered deprecated, this OTEP here defines the criteria for calling OpenTelemetry "stable" and making that a requirement for a future graduation. This would also make it clear to end-users which parts of the project they can rely on. +* The OpenTelemetry Collector has its own [stability levels](https://github.com/open-telemetry/opentelemetry-collector#stability-levels), which served as inspiration to the ones here. +* [Definition of documentation states](https://opentelemetry.io/docs/specs/otel/document-status/) +* [Telemetry stability](https://opentelemetry.io/docs/specs/otel/telemetry-stability/) (uses unstable instead of experimental) + +## Open questions + +* Should SDKs be required to fully implement the specification before they can be marked as stable? See [open-telemetry/opentelemetry-specification#3673](https://github.com/open-telemetry/opentelemetry-specification/issues/3673) +* Should this OTEP define a file name to be adopted by all repositories to declare their deliverables and their maturity levels? + +## Future possibilities + +Once the maturity levels are widely adopted, the GC/TC might decide to pick and choose components from different SIGs and proceed with a graduation proposal within the CNCF. The decision framework for choosing the components will be defined at a later stage. diff --git a/oteps/0243-app-telemetry-schema-vision-roadmap.md b/oteps/0243-app-telemetry-schema-vision-roadmap.md new file mode 100644 index 00000000000..51fc746227f --- /dev/null +++ b/oteps/0243-app-telemetry-schema-vision-roadmap.md @@ -0,0 +1,386 @@ +# Introducing Application Telemetry Schema in OpenTelemetry - Vision and Roadmap + +---- +**Author**: Laurent Querel, F5 Inc. + +**Keywords**: Schema-First Approach, Telemetry Schema, Semantic Convention, +Discoverability, Interoperability, Type-Safe Client SDKs, Client SDKs Generation, +CI/CD Integration, Data Governance, Data Privacy. + +**Related OTEPs**: [OTEP0152](https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md), [OTEP0202](https://github.com/open-telemetry/oteps/blob/main/text/0202-events-and-logs-api.md). + +---- +_Unlike the traditional data ecosystem (OLTP and OLAP), the world of telemetry +generally does not rely on the concept of a schema. Instrumentation is deeply +embedded in the code of applications and libraries, making it difficult to +discover all the possible telemetry signals an application can emit. This gap +prevents or limits the development of CI/CD tools for checking, reporting, +documenting, and generating artifacts from telemetry signals specific to an +application. This document presents a long-term vision aimed at enabling the +OpenTelemetry project to address this issue and extend its impact to a broader +ecosystem. It proposes extending the initiatives of Telemetry Schema and +Semantic Conventions to include logical concepts of Component Telemetry Schema +and Resolved Telemetry Schema. 
A series of OTEPs and Tools will be proposed in +this overarching document to detail each aspect of this vision._ + +## Current State Overview + +Traditionally, the instrumentation of applications is deeply integrated into the +source code of the applications and their components. The current stack of +OpenTelemetry follows this approach and offers a unified mechanism that allows +this ecosystem to report telemetry data jointly via a generic client interface +and a common protocol through an SDK. Moreover, OpenTelemetry's semantic +conventions establish a vendor-agnostic and standardized vocabulary to ensure +more consistent instrumentation over time. This standardization facilitates the +analysis and utilization of these metadata data across the entire telemetry +pipeline. + +But this approach is not without challenges: + +* **Discoverability and Interoperability**: It is difficult to discover a priori + and automatically what an application as a whole specifically generates in terms + of telemetry and associated metadata. This makes it difficult to integrate with + enterprise data catalogs, compliance procedures, or automated privacy + enforcement in _CI/CD pipelines_. +* **User experience**: Although very flexible, generic clients do not offer the + same benefits as a strongly typed dedicated API. A type-safe API is more + ergonomic, more robust, and more easily maintainable. Modern IDEs are capable + of providing smart autocompletion and contextual documentation based on the API. + Compilers can automatically detect errors in case of refactoring or evolution + of the telemetry schema. +* **Extensibility**: Adding metadata to the basic signal specification is + essential for enabling use cases like data security, privacy enforcement, + metadata-driven data transformation, and knowledge graph enrichment. Currently, + there's no standard way to add metadata separate from an application's or + library's development cycle. These metadata definitions should be distinct from + the signal definitions and could be specified by teams other than development + teams. +* **Performance overheads**: A significant downside of generic telemetry + instrumentation is the various overheads it generally introduces due to inherent + layers of abstraction. For example, the collection of attributes is typically + represented as a list of key/value pairs or as hashmaps, resulting in memory + overhead. A simple struct or a set of well-typed function arguments will be more + efficient and less error-prone for representing this list of attributes. In the + same way, it is possible to use a dictionary encoding for values whose domain is + specified in the form of an enumeration in the schema definition. + +Databases and RPC systems (e.g., Protobuf & gRPC) have already addressed some of +these issues with a schema-first approach. There is nothing to prevent adopting +a similar approach in the context of telemetry. **This document discusses how to +apply a schema-first approach in the OpenTelemetry project and its implications +for the existing Telemetry Schema and Semantic Conventions.** + +## Possible/Desired Future State + +The following diagram provides a conceptual overview of the relationships +between the various components, processes, and artifacts of what could be a +typical schema-driven end-to-end telemetry system in the future. 
+ +![Application Telemetry Schema Overview](./img/0243-otel-weaver-overview.svg) + +The Application Telemetry Schema concept is divided into two key logical parts: the +Component Telemetry Schema and the Resolved Telemetry Schema, as shown in the +previous diagram (more details on these two concepts in the proposal section). +These concepts are central to unlocking a variety of use cases. + +Examples of use cases include: + +* Automatic generation of Telemetry Client SDKs from telemetry schemas, + improving user experience and performance. +* CI/CD pipelines using telemetry schemas to: + * Check compatibility between different schema versions. + * Ensure security and privacy compliance. + * Integrate with enterprise data catalog systems. + * And more. +* Telemetry backends capable of: + * Automatically updating database schemas or dashboards. + * Triggering schema-driven transformations or processing in stream processors. + * And more. + +This recent [paper](https://arxiv.org/pdf/2311.07509.pdf#:~:text=The%20results%20of%20the%20benchmark%20provide%20evidence%20that%20supports%20our,LLM%20without%20a%20Knowledge%20Graph) +from [data.world](https://data.world/home/), along with +the [MetricFlow framework](https://docs.getdbt.com/docs/build/about-metricflow) +which underpins the [dbt Semantic Layer](https://www.getdbt.com/product/semantic-layer), +highlights the significance of adopting a schema-first approach in data +modeling, especially for Generative AI-based question answering systems. Tools +like Observability Query Assistants ( +e.g. [Elastic AI Assistant](https://www.elastic.co/fr/blog/introducing-elastic-ai-assistant) +and [Honeycomb Query Assistant](https://www.honeycomb.io/blog/introducing-query-assistant?utm_source=newswire&utm_medium=link&utm_campaign=query_assistant)) +are likely to become increasingly prevalent and efficient in the near future, +thanks to the adoption of a schema-first approach. + +> **Note: The names and formats of these concepts are still under discussion. A +> detailed analysis of pros and cons will be covered later in the document. The +> final decision will be deferred to future dedicated OTEPs.** + +Another problem this proposal aims to address is the inherent complexity of the +ecosystem where OpenTelemetry is utilized but not fully addressed by existing +solutions. OpenTelemetry has been adopted by enterprises of all sizes. While +offering the possibility to inherit standardized semantic conventions is +beneficial, it often proves insufficient due to the need for customizations in +diverse contexts, such as overriding some properties (e.g., changing the +requirement level from recommended to required). Additionally, the presence of +vendor-specific attributes and metrics in the existing official OpenTelemetry +semantic convention registry does not align with the goal of offering a catalog +of attributes, metrics, and signals that are vendor-agnostic. **These issues are +indicative of a lack of standardized mechanisms for extending, customizing, and +developing an ecosystem of schemas and semantic convention registries.** + +In response to these problems, a hierarchy of telemetry schemas can be defined, +ranging from the most general to one that is specifically refined for an +application. Each child schema inherits the properties of the parent schema and +can, if necessary, override these properties locally. 
Additionally, any +telemetry schema can import one or several semantic convention registries, +allowing for the definition of OpenTelemetry, vendor-specific, and +enterprise-level registries. These two enhancements make OpenTelemetry more +customizable, extensible, and ecosystem-friendly. + +The following section will elaborate on the concepts of the Component Telemetry +Schema, Resolved Telemetry Schema, Semantic Convention Registries, and their +relationship with both the existing OpenTelemetry Schema v1.1 and the +OpenTelemetry Semantic Conventions. + +## Proposal + +### Overview + +Conceptually, this proposal is based on three main concepts: **Component Telemetry +Schema**, **Semantic Convention Registry**, and **Resolved Telemetry Schema**. +The relationships between these entities are described in the following diagram. + +![Telemetry Schema Concepts](./img/0243-otel-weaver-concepts.svg) + +The Component Telemetry Schemas are created by the OpenTelemetry SIG members, +application, or library authors. A Component Telemetry Schema may import any +number of Semantic Convention +Registries as needed. During the schema resolution process, a Resolved Telemetry +Schema is created from a Component Telemetry Schema. This **Resolved Telemetry +Schema is self-contained** and has no external references. Optionally, a Component +Telemetry Schema may extend an existing Telemetry Schema, whether component or +resolved (this aspect is still under discussion). Typically, the official +OpenTelemetry Telemetry Schema is inherited by a Component Telemetry Schema to +include the standard OpenTelemetry Semantic Convention registry. In complex +cases, large enterprises might create their own intermediary telemetry schemas +for custom definitions. + +Each signal definition defined in the Component Telemetry Schema, where possible, +reuses the existing syntax and semantics defined by the semantic conventions. +Each signal definition is also identified by a unique name (or ID), making +schemas addressable, easy to traverse, validate, and diff. + +The key design principles to be followed in the definition of the Resolved +Telemetry Schema are: + +* **Self-contained**: No external references are allowed. This artifact contains + everything required to determine what an application or a library produces in + terms of telemetry. +* **Easy to exchange**: This artifact must be easily accessible to actors + via a URI. This artifact must be small and avoid the repetition of + definitions. +* **Easy to parse**: A widespread and well-defined format should be preferred. + JSON is an example of such a format. +* **Easy to interpret**: The internal structure of this artifact must be + straightforward to avoid any misinterpretation and must be efficient. +* **Platform- and Language-agnostic**: This artifact must be independent of any + platform architectures and programming languages. + +The following diagram describes two main use cases for the Resolved Telemetry +Schema. The key points to remember are: 1) both use cases result in a Resolved +Telemetry Schema, 2) Resolved Telemetry Schemas serve as the mechanism for +distributing Telemetry Schemas throughout the entire ecosystem, and 3) Resolved +Telemetry Schemas would replace/augment existing SchemaURL. 
+ +![Use cases](./img/0243-otel-weaver-use-cases.svg) + +Note: The relationship between Telemetry Schema v1.1 +([OTEP 0152](https://github.com/open-telemetry/oteps/blob/main/text/0152-telemetry-schemas.md)) +and the Component and Resolved Telemetry Schema concepts is still being +discussed. This will be clarified in future OTEPs (refer to the last section). + +The following diagram illustrates a possible instance of a complex hierarchy of +schemas and semantic convention registries. It involves several vendor and +enterprise artifacts, in addition to the standardized OpenTelemetry artifacts. +The schema resolution process will produce a self-contained Resolved Telemetry +Schema that can be easily consumed by various tools and applications, such as a +Client SDK generator, compatibility checker, compliance checker, data catalog +feeder, and more. + +![Example of Telemetry Schema Hierarchy](./img/0243-otel-weaver-hierarchy.svg) + +This hierarchy of telemetry schemas helps large organizations in +collaborating on the Component Telemetry Schema. It enables different +aspects of a Component Telemetry Schema to be managed by various teams. + +For all the elements that make up the Component Telemetry Schema, a +general mechanism of annotation or tagging will be integrated in order to +attach additional traits, characteristics, or constraints, allowing vendors +and companies to extend the definition of concepts defined by OpenTelemetry. +This annotation mechanism will be included as part of the Component Telemetry +Schema definition. + +Annotations and Tags can also be employed to modify schemas for +diverse audiences. For example, the public version of a schema can exclude all +signals or other metadata labeled as private. Similarly, elements can be +designated as exclusively available for beta testers. These annotations can +also identify attributes as PII (Personally Identifiable Information), and +privacy policy enforcement can be implemented at various levels (e.g., in the +generated client SDK or in a proxy). + +For each important component, the following diagram defines the responsibilities +and key design properties that will be considered in future OTEPs defining the +Component and Resolved Telemetry Schemas. + +![Telemetry Schema - Levels of responsibilities & Key design properties](./img/0243-otel-weaver-responsibilities-properties.svg) + +This design enables the definition of semantic conventions in a distributed +manner. OpenTelemetry, vendors, and enterprises can define their own semantic +conventions in different registries simplifying the existing process. + +> Although there is no direct lineage between these systems, a similar approach +> was designed and deployed by Facebook to address the same type of problem but in +> a proprietary context (refer to +> this [positional paper](https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry/) +> for more information). + +### Development Strategies + +Two development strategies coexist and must be supported. The first strategy, a +monorepo type (single source of truth), has complete control over the +applications, their dependencies, and the associated telemetry schemas. The +second strategy is more heterogeneous, composed of multiple repositories, where +the build process of external dependencies is out of the control of the entity +owning the applications or services. 
+ +In the first model, each build process can independently apply telemetry schema +policies in parallel, knowing that the policies are shared and the entire +environment is controlled by the same entity. + +In the second model, the application or service environment does not control the +telemetry schema policies of external dependencies. There is a need for a method +to retrieve telemetry schemas from these dependencies. The mechanism for +distributing these schemas is still being discussed (refer to the Open Questions +section). Ultimately, this will enable the CI/CD pipeline of the application or +service to apply its own local policies to these schemas from the dependencies. + +![Development Strategies to Support](./img/0243-otel-weaver-dev-strategies.svg) + +## Open Questions + +During the review of this document, several questions were raised, and some +remain unresolved. We've decided to postpone the answers to these open questions +to future OTEPs. This approach seems more practical as the resolution of these +questions is not critical at this stage and will likely become clearer as we +define and implement future OTEPs. + +### Should we use a single Telemetry Schema or the combination of Component and Resolved Telemetry Schema? + +The debate between adopting a single Telemetry Schema or separate Component and +Resolved Telemetry Schemas remains unresolved. + +Advocates of a single-schema approach see it as a simplification in terms of +schema definition, implementation, and even cognitive overhead. They suggest +using the same schema but disallowing any external references for schemas +intended for publication and reuse by third parties. The schema resolution +process would then remove these references. + +Proponents of a two-schema approach believe that each format is intended for +different users and use cases (app/lib developers vs telemetry tool developers), +and therefore, having two distinct structures and formats would make it easier +to optimize each for its specific use case (in multiple dimensions). +Furthermore, the group of developers using the Component Telemetry Schema will +most likely be much larger than the group of developers who need to understand +the details of the Resolved Telemetry Schema. + +### What should be the schema(s) be named? + +The naming of the schema(s) was also discussed but without consensus. The current proposals are as follows: + +- Single schema approach: Retain the existing Telemetry Schema name, supporting +different formats depending on the use case (e.g., YAML for app and lib +developers, JSON + JSON schema for publication and tool consumption). +- Two schemas approach: + - Component Telemetry Schema alternative names: Telemetry Schema, Weaver Schema. + - Resolved Telemetry Schema alternative names: Compiled Telemetry Schema, Published Telemetry Schema, Weaver Schema. + +### What distribution mechanism should be used for retrieving dependency schemas? + +Currently, two main methods are being considered: + +1) Establishing a public, centralized repository for resolved schemas, where + library authors can register their resolved telemetry schemas as part of their + build process. +2) Embedding the Resolved Telemetry Schema directly within the library artifact + (such as binary libraries, jar files, crates, Go modules, etc.), ensuring + automatic collection during the application's build process. + +At present, the second option is preferred as it is fully decentralized and +eliminates the need for a global schema registry. 
A specific OTEP will be +developed to define this distribution mechanism. + +## Roadmap + +### OTEPs + +To facilitate the review process and progressively deliver value to the +OpenTelemetry community, a series of OTEPs and tools are suggested. + +* **Telemetry Schema(s) - Structure and Formats**: This OTEP will weigh the pros + and cons of a single schema versus a dual-schema approach. It aims to identify + the optimal solution and define the structures and formats for the two + concepts introduced in this OTEP: the Component and Resolved Telemetry Schema. + We plan several sub-OTEPs to progressively introduce and phase the concepts + discussed in this document: + * **Phase 1**: Introduce a registry section with a list of fully + resolved attributes, metrics, and other telemetry signals from the existing + semantic convention registry. Upon implementation, the OpenTelemetry project's + telemetry schema will include a registry of all standardized attributes, + metrics, and signals as defined by the semantic conventions. + * **Phase 2**: Add sections for resource, instrumentation library, + and schema sections to represent telemetry signals emitted by an application + (and its dependencies) or a library. Initially, only metrics, logs, and spans + definitions will be supported. + * **Phase 3**: Enable application or library authors to + override definitions inherited from a parent resolved schema. + * **Phase 4**: Implement support for events in the schema + section, pending the approval of events in OpenTelemetry. + * **Phase 5**: Introduce support for multivariate metrics in + the schema section, relevant only if the Client SDK Generator plans to support + multivariate metrics. +* **Dependency Management and Telemetry Schema Distribution**: This OTEP will +outline the method for collecting resolved telemetry schemas from dependencies. + +### Proof of Concept and Tools + +A proof of concept, OTel Weaver, is under development to test the feasibility of +the proposed approach. It will support the following commands: + +* `resolve registry`: Generates a Resolved Telemetry Schema from an +OpenTelemetry semantic convention registry. +* `resolve schema`: Creates a Resolved Telemetry Schema from a Component +Telemetry Schema. +* `search registry`: Offers search functionality within an OpenTelemetry +semantic convention registry. +* `search schema`: Provides search capabilities within a Component or Resolved +Telemetry Schema. +* `gen-client sdk`: Generates a Client SDK from a Component or Resolved +Telemetry Schema. +* `gen-client api`: Produces a Client API from a Component or Resolved +Telemetry Schema. + +A Plugin System is planned to allow community contributions to the OTel Weaver +tool. Proposed plugins include: + +* An example of plugin for gathering resolved telemetry schemas from external +dependencies. +* An example of compatibility checker plugin to ensure successive versions of +the same Telemetry Schema follow specified compatibility rules. + +Additional plugins and tools are anticipated to be developed by the broader +community, leveraging the OTel Weaver Plugin System and the standardized format +of the Resolved Telemetry Schema. 
+ +## Links + +- [Positional Paper: Schema-First Application Telemetry](https://research.facebook.com/publications/positional-paper-schema-first-application-telemetry/) +- [A benchmark to understand the role of knowledge graphs on Large Language Model's accuracy for question answering on enterprise sql databases](https://arxiv.org/pdf/2311.07509.pdf#:~:text=The%20results%20of%20the%20benchmark%20provide%20evidence%20that%20supports%20our,LLM%20without%20a%20Knowledge%20Graph) +- [MetricFlow framework](https://docs.getdbt.com/docs/build/about-metricflow) diff --git a/oteps/0258-env-context-baggage-carriers.md b/oteps/0258-env-context-baggage-carriers.md new file mode 100644 index 00000000000..50adcd5292e --- /dev/null +++ b/oteps/0258-env-context-baggage-carriers.md @@ -0,0 +1,310 @@ +# Environment Variable Specification for Context and Baggage Propagation + +This is a proposal to add Environment Variables to the OpenTelemetry +specification as carriers for context and baggage propagation between +processes. + +## Table of Contents + +* [Motivation](#motivation) +* [Design](#design) + * [Example Context](#example-context) + * [Distributed Tracing in OpenTofu Prototype Example](#distributed-tracing-in-opentofu-prototype-example) +* [Core Specification Changes](#core-specification-changes) + * [UNIX](#unix-limitations) + * [Windows](#windows-limitations) + * [Allowed Characters](#allowed-characters) +* [Trade-offs and Mitigations](#trade-offs-and-mitigations) + * [Case-sensitivity](#case-sensitivity) + * [Security](#security) +* [Prior Art and Alternatives](#prior-art-and-alternatives) + * [Alternatives and why they were not chosen](#alternatives-and-why-they-were-not-chosen) +* [Open Questions](#open-questions) +* [Future Possibilities](#future-possibilities) + +## Motivation + +The motivation for defining the specification for context and baggage +propagation by using environment variables as carriers stems from the long open +[issue #740][issue-740] on the OpenTelemetry Specification repository. This +issue has been open for such a long time that multiple implementations now +exist using `TRACEPARENT` and `TRACESTATE` environment variables. + +[Issue #740][issue-740] identifies several use cases in systems that do not +communicate across bounds by leveraging network communications such as: + +* ETL +* Batch +* CI/CD systems + +Adding arbitrary [Text Map propagation][tmp] through environment variable carries into +the OpenTelemetry Specification will enable distributed tracing within the +above listed systems. + +There has already been a significant amount of [Prior Art](#prior-art-and-alternatives) built +within the industry and **within OpenTelemetry** to accomplish the immediate needs, +however, OpenTelemetry at this time does not define the specification for this +form of propagation. + +Notably, as we define semantic conventions within the [CI/CD Working Group][cicd-wg], +we'll need the specification defined for the industry to be able to adopt +native tracing within CI/CD systems. 
+ +[cicd-wg]: https://github.com/open-telemetry/community/blob/main/projects/ci-cd.md +[issue-740]: https://github.com/open-telemetry/opentelemetry-specification/issues/740#issue-665588273 +[tmp]: https://opentelemetry.io/docs/specs/otel/context/api-propagators/#textmap-propagator + +## Design + +To propagate context and baggage between parent, sibling, and child processes +in systems where network communication does not occur between processes, a +specification using key-value pairs injected into the environment can be read +and produced by an arbitrary TextMapPropagator. + +### Example Context + +Consider the following diagram in the context of process forking: + +> Note: The diagram is simply an example and simplification of process forking. +> There are other ways to spawn processes which are more performant like +> exec(). + +![Environment Variable Context Propagation](./img/0258-env-context-parent-child-process.png) + +In the above diagram, a parent process is forked to spawn a child process, +inheriting the environment variables from the original parent. The environment +variables defined here, `TRACEPARENT`, `TRACESTATE`, and `BAGGAGE` are used to +propagate context to the child process such that it can be tied to the parent. +Without `TRACEPARENT`, a tracing backend would not be able to connect the child +process spans to the parent span, forming an end-to-end trace. + +> Note: While the below exclusively follows the W3C Specification translated +> into environment variables, this proposal is not exclusive to W3C and is +> instead focused on the mechanism of Text Map Propagation with a potential set +> of well-known environment variable names. See the [Core Specification +> Changes](#core-specification-changes) section for more information. + +Given the above example aligning with the W3C Specification, the following is +a contextual mapping of environment variables to headers defined by W3C. + +The `traceparent` (lowercase) header is defined in the [W3C +Trace-Context][w3c-parent] specification and includes the following valid +fields: + +* `version` +* `trace-id` +* `parent-id` +* `trace-flags` + +This could be set in the environment as follows: + +```bash +export TRACEPARENT=00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 +``` + +> Note: The value of TRACEPARENT is a combination of the above field values as +> unsigned integer values serialized as ASCII strings, delimited by `-`. + +The `tracestate` (lowercase) header is defined in [W3C +Trace-State][w3c-state] and can include any opaque value in a key-value pair +structure. Its goal is to provide additional vendor-specific trace information. + +The `baggage` (lowercase) header is defined in [W3C Baggage][w3c-bag] +and is a set of key-value pairs to propagate context between signals. In +OpenTelemetry, baggage is propagated through the [Baggage API][bag-api]. + +[w3c-parent]: https://www.w3.org/TR/trace-context-2/#traceparent-header-field-values +[w3c-state]: https://www.w3.org/TR/trace-context-2/#tracestate-header +[w3c-bag]: https://www.w3.org/TR/baggage/#baggage-http-header-format + +#### Distributed Tracing in OpenTofu Prototype Example + +Consider this real world example OpenTofu Controller Deployment. + +![OpenTofu Run](./img/0258-env-context-opentofu-tracing.png) + +In this model, the OpenTofu Controller is the start of the trace, containing +the actual trace_id and generating the root span. The OpenTofu controller +deploys a runner which has its own environment and processes to run OpenTofu +commands. 
If one were to trace these processes without a carrier mechanism, then
+they would all show up as unrelated root spans in separate traces. However, by
+leveraging environment variables as carriers, each span can be tied back
+to the root span, creating a single trace as shown in the image of a real
+OpenTofu trace below.
+
+![OpenTofu Trace](./img/0258-env-context-opentofu-trace.png)
+
+Additionally, the `init` span is able to pass baggage to the `plan` and `apply`
+spans. One example of this is module version and repository information. This
+information is only determined and known during the `init` process. Subsequent
+processes only know about the module by name. With `BAGGAGE`, the rest of the
+processes are able to understand a key piece of information which allows
+errors to be tied back to the original module version and source code.
+
+Defining the specification for Environment Variables as carriers will have a
+wide impact on the industry, enabling better observability for systems outside
+of the typical HTTP microservice architecture.
+
+[bag-api]: https://opentelemetry.io/docs/specs/otel/baggage/api/
+
+The above prototype example came from the resources mentioned in [this
+comment][otcom] on the [OpenTofu Tracing RFC][otrfc].
+
+[otcom]: https://github.com/opentofu/opentofu/pull/2028#issuecomment-2411588695
+[otrfc]: https://github.com/opentofu/opentofu/pull/2028
+
+## Core Specification Changes
+
+The OpenTelemetry Specification should be updated with the definitions for
+extending context propagation into the environment through Text Map
+propagators.
+
+This update should include:
+
+* A common set of environment variables like `TRACEPARENT`, `TRACESTATE`, and
+  `BAGGAGE` that can be used to propagate context between processes. These
+  environment variable names should be overridable for legacy support reasons
+  (like using B3), but the default standard should align with the W3C
+  specification.
+* A specification for allowed environment variable names and values given
+  operating system limitations.
+* A specification for how implementers can inject and extract context from the
+  environment through a TextMapPropagator.
+* A specification for how processes should update environment variables before
+  spawning new processes.
+
+Defining the specification for Environment Variables as carriers for context
+will enable SDKs and other tools to implement getters and setters of context
+in a standard, observable way. Therefore, current OpenTelemetry language
+maintainers will need to develop language-specific implementations that adhere
+to the specification.
+
+Two implementations already exist within OpenTelemetry for environment
+variables through the TextMap Propagator:
+
+* [Python SDK][python-env] - This implementation uses the environment dictionary
+  as the carrier in Python to propagate context from the invoking process to the
+  invoked process. This pull request does not appear to have been merged.
+* [Swift SDK][swift-env] - This implementation uses `TRACEPARENT` and
+  `TRACESTATE` environment variables alongside the W3C Propagator to inject and
+  extract context.
+
+Due to programming conventions, operating system limitations, prior art, and
+the information below, it is recommended to use upper-cased environment
+variable names for the carrier that align with context propagator specifications.
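+
+The following is a non-normative sketch of how an implementation could inject and
+extract context through the environment with a TextMapPropagator, using the
+OpenTelemetry Python API (assuming the `opentelemetry-api` package is available).
+The `run_child` and `extract_parent_context` helpers are illustrative names only
+and are not part of any SDK.
+
+```python
+import os
+import subprocess
+
+from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
+
+propagator = TraceContextTextMapPropagator()
+
+
+def run_child(cmd):
+    """Inject the current context into upper-cased variables and spawn a child."""
+    carrier = {}
+    propagator.inject(carrier)  # writes lower-case W3C keys, e.g. "traceparent"
+    env = dict(os.environ)
+    env.update({key.upper(): value for key, value in carrier.items()})
+    subprocess.run(cmd, env=env, check=True)
+
+
+def extract_parent_context():
+    """In a child process, rebuild the parent context from TRACEPARENT/TRACESTATE."""
+    carrier = {key.lower(): value for key, value in os.environ.items()}
+    return propagator.extract(carrier)
+```
+
+A child process would call `extract_parent_context()` and use the returned context
+as the parent when starting its own spans; the same pattern could be extended to
+the `BAGGAGE` variable through a composite propagator.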
+
+[python-env]: https://github.com/Div95/opentelemetry-python/tree/feature/env_propagator/propagator/opentelemetry-propagator-env
+[swift-env]: https://github.com/open-telemetry/opentelemetry-swift/blob/main/Sources/OpenTelemetrySdk/Trace/Propagation/EnvironmentContextPropagator.swift
+
+### UNIX Limitations
+
+UNIX system utilities use upper-case names for environment variables, while
+lower-case names are reserved for applications. Using upper-case will prevent
+conflicts with internal application variables.
+
+Environment variable names used by the utilities in the Shell and Utilities
+(XCU) specification consist solely of upper-case letters, digits and the "_"
+(underscore) from the characters defined in Portable Character Set. Other
+characters may be permitted by an implementation; applications must tolerate
+the presence of such names. Upper-case and lower-case letters retain their
+unique identities and are not folded together. The name space of environment
+variable names containing lower-case letters is reserved for applications.
+Applications can define any environment variables with names from this name
+space without modifying the behaviour of the standard utilities.
+
+Source: [The Open Group, The Single UNIX® Specification, Version 2, Environment Variables](https://pubs.opengroup.org/onlinepubs/7908799/xbd/envvar.html)
+
+### Windows Limitations
+
+Windows treats environment variable names as case-insensitive. Despite this, the
+recommendation is to use upper-case names across operating systems.
+
+Some languages already do this. This [CPython issue][cpython] discusses how
+Python automatically upper-cases environment variables. The issue was resolved and
+this [documentation][cpython-doc] was added to clarify the behavior.
+
+[cpython]: https://github.com/python/cpython/issues/101754
+[cpython-doc]: https://docs.python.org/3/library/os.html#os.environ
+
+### Allowed characters
+
+To ensure compatibility, the specification for Environment Variables SHOULD adhere
+to the current `TextMapPropagator` specification, where key/value pairs MUST
+consist only of US-ASCII characters that make up valid HTTP header fields as
+per RFC 7230.
+
+Environment variable keys SHOULD NOT conflict with commonly known environment
+variables like those described in [IEEE Std 1003.1-2017][std1003].
+
+One key note is that Windows disallows the use of the `=` character in
+environment variable names. See [MS Env Vars][ms-env] for more information.
+
+Windows also limits the size of an environment variable to 32,767 characters.
+
+[std1003]: https://pubs.opengroup.org/onlinepubs/9799919799/
+
+[ms-env]: https://learn.microsoft.com/en-us/windows/win32/procthread/environment-variables
+
+## Trade-offs and Mitigations
+
+### Case-sensitivity
+
+On Windows, because environment variable keys are case-insensitive, there is a
+chance that automatically instrumented context propagation variables could
+conflict with existing application environment variables. It will be important
+to denote this behavior and document how languages mitigate this issue. One
+possible mitigation is sketched at the end of this section.
+
+### Security
+
+Do not put sensitive information in environment variables. Due to the nature of
+environment variables, an attacker with the right access could obtain
+information they should not be privy to. Additionally, the integrity of the
+environment variables could be compromised.
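+
+As a purely illustrative sketch of the mitigation mentioned in the
+case-sensitivity trade-off above, an injector could refuse to overwrite a
+variable that already exists under a different casing. The `safe_set_env`
+helper below is a hypothetical name and is not part of any specification.
+
+```python
+import os
+
+
+def safe_set_env(env, key, value):
+    """Set an env var, refusing to collide case-insensitively with an existing one."""
+    if any(existing.upper() == key.upper() and existing != key for existing in env):
+        raise ValueError(f"{key!r} collides case-insensitively with an existing variable")
+    env[key] = value
+
+
+env = dict(os.environ)
+safe_set_env(env, "TRACEPARENT",
+             "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
+```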
+
+## Prior Art and Alternatives
+
+There are many users of `TRACEPARENT` and/or `TRACESTATE` environment variables
+mentioned in [opentelemetry-specification #740](https://github.com/open-telemetry/opentelemetry-specification/issues/740):
+
+* [Jenkins OpenTelemetry Plugin](https://github.com/jenkinsci/opentelemetry-plugin)
+* [otel-cli generic wrapper](https://github.com/equinix-labs/otel-cli)
+* [Maven OpenTelemetry Extension](https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/maven-extension)
+* [Ansible OpenTelemetry Plugin](https://github.com/ansible-collections/community.general/pull/3091)
+* [go-test-trace](https://github.com/rakyll/go-test-trace/commit/22493612be320e0a01c174efe9b2252924f6dda9)
+* [Concourse CI](https://github.com/concourse/docs/pull/462)
+* [BuildKite agent](https://github.com/buildkite/agent/pull/1548)
+* [pytest](https://github.com/chrisguidry/pytest-opentelemetry/issues/20)
+* [Kubernetes test-infra Prow](https://github.com/kubernetes/test-infra/issues/30010)
+* [hotel-california](https://github.com/parsonsmatt/hotel-california/issues/3)
+
+Additionally, there was a prototype implementation for environment variables as
+context carriers written in the [Python SDK][python-env-prototype].
+
+[python-env-prototype]: https://github.com/open-telemetry/opentelemetry-specification/issues/740#issuecomment-919657003
+
+## Alternatives and why they were not chosen
+
+### Using a file for the carrier
+
+Using a JSON file that is stored on the filesystem and referenced through an
+environment variable would eliminate the need to work around case-insensitivity
+issues on Windows; however, it would introduce a number of issues:
+
+1. It would introduce an out-of-band file that would need to be created and
+   reliably cleaned up.
+2. Managing permissions on the file might be non-trivial in some circumstances
+   (for example, if `sudo` is used).
+3. It would deviate from significant prior art that currently uses
+   environment variables.
+
+## Open questions
+
+The author has no open questions at this point.
+
+## Future possibilities
+
+1. Enabling distributed tracing in systems that do not communicate over network
+   protocols that allow trace context to be propagated through headers,
+   metadata, or other means.
diff --git a/oteps/0265-event-vision.md b/oteps/0265-event-vision.md
new file mode 100644
index 00000000000..297866c813c
--- /dev/null
+++ b/oteps/0265-event-vision.md
@@ -0,0 +1,79 @@
+# Event Basics
+
+## Motivation
+
+The introduction of Events has been contentious, so we want to document and agree on a few basics.
+
+### What are OpenTelemetry Events?
+
+OpenTelemetry Events are a type of OpenTelemetry Log that requires an event name and follows a specific structure implied by that event name.
+
+They are a core concept in OpenTelemetry Semantic Conventions.
+
+### OTLP
+
+Since OpenTelemetry Events are a type of OpenTelemetry Log, they share the same OTLP log data structure and pipeline.
+
+### API
+
+OpenTelemetry SHOULD provide a (user-facing) Logs API that includes the capability to emit OpenTelemetry Events.
+
+### Interoperability with other logging libraries
+
+OpenTelemetry SHOULD provide a way to send OpenTelemetry Logs from the OpenTelemetry Logs API to other logging libraries (e.g., Log4j).
+This allows users to integrate OpenTelemetry Logs into an existing (non-OpenTelemetry) log stream.
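+
+For illustration only, a bridge of this kind might look roughly like the following
+sketch, which forwards records from an OpenTelemetry-style Logs API onto Python's
+standard-library `logging` stream. The `OTelRecord` shape and the
+`forward_to_stdlib` helper are hypothetical names used for this sketch; they are
+not part of the OpenTelemetry API.
+
+```python
+import logging
+from dataclasses import dataclass, field
+from typing import Any, Optional
+
+
+@dataclass
+class OTelRecord:
+    severity_number: int = 9                # INFO on the OTel 1-24 scale
+    body: Any = None
+    attributes: dict = field(default_factory=dict)
+    event_name: Optional[str] = None        # set => this record is an Event
+
+
+def forward_to_stdlib(record: OTelRecord, logger: logging.Logger) -> None:
+    """Hand an OTel-style record to an existing (non-OTel) log stream."""
+    # Rough mapping of OTel severity numbers (1-24) onto stdlib levels (10-50).
+    level = max(10, min(50, ((record.severity_number - 5) // 4 + 1) * 10))
+    attrs = dict(record.attributes)
+    if record.event_name:
+        attrs["event.name"] = record.event_name
+    logger.log(level, "%s %s", record.body, attrs)
+
+
+forward_to_stdlib(OTelRecord(body="user signed in", event_name="app.login"),
+                  logging.getLogger("app"))
+```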
+ +OpenTelemetry SHOULD provide a way to bypass the OpenTelemetry Logs API entirely and emit OpenTelemetry Logs (including Events) +directly via existing language-specific logging libraries, if that library has the capability to do so. + +OpenTelemetry will recommend that +[instrumentation libraries](../specification/glossary.md#instrumentation-library) +use the OpenTelemetry Logs API to emit OpenTelemetry Events rather than using other logging libraries to emit OpenTelemetry Events. This recommendation aims to provide users with a simple and consistent +onboarding experience that avoids mixing approaches. + +OpenTelemetry will also recommend that application developers use the OpenTelemetry Logs API to emit OpenTelemetry Events instead of using another +logging library, as this helps prevent accidentally emitting logs that lack an event name or are unstructured. + +Recommending the OpenTelemetry Logs API for emitting OpenTelemetry Events, rather than using other logging libraries, contributes to a clearer overall +OpenTelemetry API story. This ensures a unified approach with first-class user-facing APIs for traces, metrics, and events, +all suitable for direct use in native instrumentation. + +### Relationship to Span Events + +Events are intended to replace Span Events in the long term. +Span Events will be deprecated to signal that users should prefer Events. + +Interoperability between Events and Span Events will be defined in the short term. + +### SDK + +This refers to the existing OpenTelemetry Logs SDK. + +## Alternatives + +Many alternatives were considered over the past 2+ years. + +These alternatives primarily boil down to differences in naming (e.g. whether to even use the word Event) +and organization (e.g. whether Event API should be something separate from Logs API). + +The state of this OTEP represents the option that we think will be the least confusing to the most number of users across the wide range of different language ecosystems that are supported. + +## Open questions + +* How to support routing logs from the Logs API to a language-specific logging library + while simultaneously routing logs from the language-specific logging library to an OpenTelemetry Logging Exporter? +* How do log bodies interoperate with other logging libraries? + OpenTelemetry Logs have two places to put structure (attributes and body), while often logging libraries only have one layer of structure, + which makes it non-obvious how to do a two-way mapping between them in this case. +* How do event bodies interoperate with Span Events? +* Should the Logs API have an `Enabled` function based on severity level and event name? +* What kind of capabilities should the OpenTelemetry Logs API have now that it is user-facing? + (Keeping in mind the bundle size constraints of browsers and possibly other client environments.) +* What kind of ergonomic improvements make sense now that the OpenTelemetry Logs API is user-facing? + (Keeping in mind the bundle size constraints of browsers and possibly other client environments.) +* How do OpenTelemetry Events relate to raw metric events? + (e.g. [opentelemetry-specification/617](https://github.com/open-telemetry/opentelemetry-specification/issues/617)). +* How do OpenTelemetry Events relate to raw span events? + (e.g. a streaming SDK). +* Should event name be captured as an attribute or as a top-level field? +* How will Event / Span Event interoperability work in the presence of sampling (e.g. since Span Events are sampled along with Spans)? 
diff --git a/oteps/0266-move-oteps-to-spec.md b/oteps/0266-move-oteps-to-spec.md
new file mode 100644
index 00000000000..2bef2caa34d
--- /dev/null
+++ b/oteps/0266-move-oteps-to-spec.md
@@ -0,0 +1,68 @@
+# Move OTEPs to the Specification repository
+
+Let's move OTEP documentation and PRs back into the [Specification](https://github.com/open-telemetry/opentelemetry-specification) repository.
+
+## Motivation
+
+Moving OTEPs back into the specification solves two main issues:
+
+- Maintaining the OTEP repository's tooling infrastructure (currently woefully out of date)
+- Bringing OTEPs into the existing triage and voting process currently used within the
+  Specification.
+
+## Explanation
+
+Originally, OTEPs were kept in a separate repository in order to keep disjoint/disruptive designs out of the specification itself. There are a few differences between a normal PR and an OTEP:
+
+- OTEPs are expected to be directional and subject to change when actually entered into the specification.
+- OTEPs require more approvals than specification PRs.
+- OTEPs have different PR workflows (whether due to accidental omission or conscious decision), e.g. staleness checks, linting.
+
+As OpenTelemetry is stabilizing, the need for OTEPs to live outside the specification is diminishing, and we face challenges like:
+
+- Keeping OTEP tooling up to date
+- Advertising the repository's existence
+  - New contributors to OpenTelemetry often can't find recorded decisions that exist in OTEPs.
+  - Getting reviews from folks used to checking the Specification repository, but not the less-frequently-worked-on OTEP repository.
+
+To solve these, let's move OTEPs into a directory within the [specification repository](https://github.com/open-telemetry/opentelemetry-specification).
+We would also update all tooling and expected reviews to match existing standards for OTEPs. Given that the maintainers of OTEPs are the same as the
+maintainers of the specification, this should not change the bar for acceptance.
+
+## Internal details
+
+The following changes would occur:
+
+- The following files would be moved to the specification repo:
+  - `text/` directory -> `oteps/text/`
+  - `0000-template.md` -> `oteps/0000-template.md`
+- Update the specification `Makefile` to include linting, spell checking, link checking and TOC-ing the oteps directory.
+- A one-time cleanup of OTEP markdown upon import to the specification repository.
+- Close existing OTEP PRs and ask folks to reopen against the specification repository.
+- New labels within the specification repository to tag OTEPs, including automation to set these on PR open.
+- Updating contributing guidelines to include a section about OTEPs.
+- Add an `oteps/README.md` file outlining that OTEPs are not normative and are part of the enhancement proposal process.
+- Add a disclaimer to the header of every OTEP stating that the contents are not normative and are part of the enhancement proposal process.
+
+## Trade-offs and mitigations
+
+Moving into the specification repository DOES mean that we would have a directory with a different quality bar and, to some extent, a different process than the rest of the repository.
+This can be mitigated through the use of clear, vibrant labels for OTEPs, and by updating process guidelines for the specification repository to retain the important
+aspects of the current OTEP process.
+
+## Prior art and alternatives
+
+OTEPs were originally based on common enhancement proposal processes in other ecosystems, where enhancements live outside core repositories and follow more rigorous criteria and evaluation.
We are finding this problematic for OpenTelemetry for the reasons discussed above.
+Additionally, unlike many other ecosystems where enhancement/design is kept separate from core code, OpenTelemetry *already* keeps its design separate
+from core code via the Specification vs. implementation repositories. Unlike these other OSS projects, our Specification generally requires rigorous discussion, design and prototyping prior to acceptance. Even
+after acceptance into the specification, work is still required for improvements to roll out to the ecosystem. Effectively, the OpenTelemetry specification has no such thing as a "small" change: there are only medium changes that appear small (but would be enhancements in other projects) and large changes that require an OTEP.
+
+## Open questions
+
+What are the important portions of the OTEP process to bring over? Have we missed anything in this description?
+
+## Future possibilities
+
+In the future, we could figure out how to make OTEPs more searchable, discoverable and highlighted within the opentelemetry.io website.
+
+Additionally, we can look at extending staleness deadlines for OTEP-labeled PRs.
diff --git a/oteps/README.md b/oteps/README.md
new file mode 100644
index 00000000000..c6a4665ebcb
--- /dev/null
+++ b/oteps/README.md
@@ -0,0 +1,88 @@
+# OpenTelemetry Enhancement Proposal (OTEP)
+
+## Evolving OpenTelemetry at the speed of Markdown
+
+OpenTelemetry uses an "OTEP" (similar to an RFC) process for proposing changes to the OpenTelemetry Specification.
+
+### Table of Contents
+
+- [OpenTelemetry Enhancement Proposal (OTEP)](#opentelemetry-enhancement-proposal-otep)
+  - [Evolving OpenTelemetry at the speed of Markdown](#evolving-opentelemetry-at-the-speed-of-markdown)
+    - [Table of Contents](#table-of-contents)
+    - [What changes require an OTEP](#what-changes-require-an-otep)
+      - [Extrapolating cross-cutting changes](#extrapolating-cross-cutting-changes)
+    - [OTEP scope](#otep-scope)
+    - [Writing an OTEP](#writing-an-otep)
+    - [Submitting the OTEP](#submitting-the-otep)
+    - [Integrating the OTEP into the Spec](#integrating-the-otep-into-the-spec)
+    - [Implementing the OTEP](#implementing-the-otep)
+  - [Changes to the OTEP process](#changes-to-the-otep-process)
+  - [Background on the OpenTelemetry OTEP process](#background-on-the-opentelemetry-otep-process)
+
+### What changes require an OTEP
+
+The OpenTelemetry OTEP process is intended for changes that are **cross-cutting** - that is, applicable across *languages* and *implementations* - and either **introduce new behaviour**, **change desired behaviour**, or otherwise **modify requirements**.
+
+In practice, this means that OTEPs should be used for such changes as:
+
+- New tracer configuration options
+- Additions to span data
+- New metric types
+- Modifications to extensibility requirements
+
+On the other hand, they do not need to be used for such changes as:
+
+- Bug fixes
+- Rephrasing, grammatical fixes, typos, etc.
+- Refactoring
+- Things that affect only a single language or implementation
+
+**Note:** The above lists are intended only as examples and are not meant to be exhaustive. If you don't know whether a change requires an OTEP, please feel free to ask!
+
+#### Extrapolating cross-cutting changes
+
+Sometimes, a change that is only immediately relevant within a single language or implementation may be indicative of a problem upstream in the specification. We encourage you to add an OTEP if and when you notice such cases.
+
+### OTEP scope
+
+While OTEPs are intended for "significant" changes, we recommend trying to keep each OTEP's scope as small as makes sense. A general rule of thumb is that if the core functionality proposed could still provide value without a particular piece, then that piece should be removed from the proposal and used instead as an *example* (and, ideally, given its own OTEP!).
+
+For example, an OTEP proposing configurable sampling *and* various samplers should instead be split into one OTEP proposing configurable sampling as well as an OTEP per sampler.
+
+### Writing an OTEP
+
+- First, [fork](https://help.github.com/en/articles/fork-a-repo) this repo.
+- Copy [`0000-template.md`](./0000-template.md) to `0000-my-OTEP.md`, where `my-OTEP` is a title relevant to your proposal, and `0000` is the OTEP ID.
+- Fill in the template. Put care into the details: It is important to present convincing motivation, demonstrate an understanding of the design's impact, and honestly assess the drawbacks and potential alternatives.
+
+### Submitting the OTEP
+
+- An OTEP is `proposed` by posting it as a PR.
+- An OTEP is `approved` when four reviewers github-approve the PR. The OTEP is then merged.
+- If an OTEP is `rejected` or `withdrawn`, the PR is closed. Note that these OTEP submissions are still recorded, as GitHub retains both the discussion and the proposal, even if the branch is later deleted.
+- If an OTEP discussion becomes long, and the OTEP then goes through a major revision, the next version of the OTEP can be posted as a new PR, which references the old PR. The old PR is then closed. This makes OTEP review easier to follow and participate in.
+
+### Integrating the OTEP into the Spec
+
+- Once an OTEP is `approved`, an issue is created in this repo to integrate the OTEP into the actual spec.
+- When reviewing the spec PR for the OTEP, focus on whether the spec is written clearly, and reflects the changes approved in the OTEP. Please abstain from relitigating the approved OTEP changes at this stage.
+- An OTEP is `integrated` when four reviewers github-approve the spec PR. The PR is then merged, and the spec is versioned.
+
+### Implementing the OTEP
+
+- Once an OTEP is `integrated` into the spec, an issue is created in the backlog of every relevant OpenTelemetry implementation.
+- PRs are made until all the requested changes are implemented.
+- The status of the OpenTelemetry implementation is updated to reflect that it is implementing a new version of the spec.
+
+## Changes to the OTEP process
+
+The hope and expectation is that the OTEP process will **evolve** with OpenTelemetry. The process is by no means fixed.
+
+Have suggestions? Concerns? Questions? **Please** raise an issue or raise the matter on our [community](https://github.com/open-telemetry/community) chat.
+
+## Background on the OpenTelemetry OTEP process
+
+Our OTEP process borrows from the [Rust RFC](https://github.com/rust-lang/rfcs) and [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements) processes, the former also being [very influential](https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/0000-kep-process#prior-art) on the latter, as well as the [OpenTracing OTEP](https://github.com/opentracing/specification/tree/master/rfc_process.md) process.
Massive kudos and thanks to the respective authors and communities for providing excellent prior art 💖 + +[slack-image]: https://img.shields.io/badge/Slack-4A154B?style=for-the-badge&logo=slack&logoColor=white +[slack-url]: https://cloud-native.slack.com/archives/C01N7PP1THC diff --git a/oteps/assets/0225-config.yaml b/oteps/assets/0225-config.yaml new file mode 100644 index 00000000000..febfd3ee861 --- /dev/null +++ b/oteps/assets/0225-config.yaml @@ -0,0 +1,420 @@ +# include version specification in configuration files to help with parsing and schema evolution. +scheme_version: 0.1 +sdk: + # Disable the SDK for all signals. + # + # Boolean value. If "true", a no-op SDK implementation will be used for all telemetry + # signals. Any other value or absence of the variable will have no effect and the SDK + # will remain enabled. This setting has no effect on propagators configured through + # the OTEL_PROPAGATORS variable. + # + # Environment variable: OTEL_SDK_DISABLED + disabled: false + # Configure resource attributes and resource detection for all signals. + resource: + # Key-value pairs to be used as resource attributes. + # + # Environment variable: OTEL_RESOURCE_ATTRIBUTES + attributes: + # Sets the value of the `service.name` resource attribute + # + # Environment variable: OTEL_SERVICE_NAME + service.name: !!str "unknown_service" + # Configure context propagators. Each propagator has a name used to configure it. + # + # Environment variable: OTEL_PROPAGATORS + propagators: [tracecontext, baggage, b3multi] + # Configure general attribute limits. See also sdk.tracer_provider.span_limits, sdk.logger_provider.log_record_limits. + attribute_limits: + # Set the max attribute value size. + # + # Environment variable: OTEL_ATTRIBUTE_VALUE_LENGTH_LIMIT + attribute_value_length_limit: 4096 + # Set the max attribute count. + # + # Environment variable: OTEL_ATTRIBUTE_COUNT_LIMIT + attribute_count_limit: 128 + # Configure the tracer provider. + tracer_provider: + # Span exporters. Each exporter key refers to the type of the exporter. Values configure the exporter. Exporters must be associated with a span processor. + exporters: + # Configure the otlp exporter. + otlp: + # Sets the protocol. + # + # Environment variable: OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_EXPORTER_OTLP_TRACES_PROTOCOL + protocol: http/protobuf + # Sets the endpoint. + # + # Environment variable: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT + endpoint: http://localhost:4318/v1/traces + # Sets the certificate. + # + # Environment variable: OTEL_EXPORTER_OTLP_CERTIFICATE, OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE + certificate: /app/cert.pem + # Sets the mTLS private client key. + # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_KEY, OTEL_EXPORTER_OTLP_TRACES_CLIENT_KEY + client_key: /app/cert.pem + # Sets the mTLS client certificate. + # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE, OTEL_EXPORTER_OTLP_TRACES_CLIENT_CERTIFICATE + client_certificate: /app/cert.pem + # Sets the headers. + # + # Environment variable: OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_TRACES_HEADERS + headers: + api-key: 1234 + # Sets the compression. + # + # Environment variable: OTEL_EXPORTER_OTLP_COMPRESSION, OTEL_EXPORTER_OTLP_TRACES_COMPRESSION + compression: gzip + # Sets the max time to wait for each export. + # + # Environment variable: OTEL_EXPORTER_OTLP_TIMEOUT, OTEL_EXPORTER_OTLP_TRACES_TIMEOUT + timeout: 10000 + # Configure the zipkin exporter. + zipkin: + # Sets the endpoint. 
+ # + # Environment variable: OTEL_EXPORTER_ZIPKIN_ENDPOINT + endpoint: http://localhost:9411/api/v2/spans + # Sets the max time to wait for each export. + # + # Environment variable: OTEL_EXPORTER_ZIPKIN_TIMEOUT + timeout: 10000 + # Configure the jaeger exporter. + jaeger: + # Sets the protocol. + # + # Environment variable: OTEL_EXPORTER_JAEGER_PROTOCOL + protocol: http/thrift.binary + # Sets the endpoint. Applicable when protocol is http/thrift.binary or grpc. + # + # Environment variable: OTEL_EXPORTER_JAEGER_ENDPOINT + endpoint: http://localhost:14268/api/traces + # Sets the max time to wait for each export. Applicable when protocol is http/thrift.binary or grpc. + # + # Environment variable: OTEL_EXPORTER_JAEGER_TIMEOUT + timeout: 10000 + # Sets the username for HTTP basic authentication. Applicable when protocol is http/thrift.binary or grpc. + # + # Environment variable: OTEL_EXPORTER_JAEGER_USER + user: user + # Sets the password for HTTP basic authentication. Applicable when protocol is http/thrift.binary or grpc. + # + # Environment variable: OTEL_EXPORTER_JAEGER_PASSWORD + password: password + # Sets the hostname of the Jaeger agent. Applicable when protocol is udp/thrift.compact or udp/thrift.binary. + # + # Environment variable: OTEL_EXPORTER_JAEGER_AGENT_PORT + agent_host: localhost + # Sets the port of the Jaeger agent. Applicable when protocol is udp/thrift.compact or udp/thrift.binary. + # + # Environment variable: OTEL_EXPORTER_JAEGER_AGENT_HOST + agent_port: 6832 + # List of span processors. Each span processor has a type and args used to configure it. + span_processors: + # Add a batch span processor. + # + # Environment variable: OTEL_BSP_*, OTEL_TRACES_EXPORTER + - type: batch + # Configure the batch span processor. + args: + # Sets the delay interval between two consecutive exports. + # + # Environment variable: OTEL_BSP_SCHEDULE_DELAY + schedule_delay: 5000 + # Sets the maximum allowed time to export data. + # + # Environment variable: OTEL_BSP_EXPORT_TIMEOUT + export_timeout: 30000 + # Sets the maximum queue size. + # + # Environment variable: OTEL_BSP_MAX_QUEUE_SIZE + max_queue_size: 2048 + # Sets the maximum batch size. + # + # Environment variable: OTEL_BSP_MAX_EXPORT_BATCH_SIZE + max_export_batch_size: 512 + # Sets the exporter. Exporter must refer to a key in sdk.tracer_provider.exporters. + # + # Environment variable: OTEL_TRACES_EXPORTER + exporter: otlp + # Add a batch span processor configured with zipkin exporter. For full description of options see sdk.tracer_provider.span_processors[0]. + - type: batch + args: + exporter: zipkin + # Add a batch span processor configured with jaeger exporter. For full description of options see sdk.tracer_provider.span_processors[0]. + - type: batch + args: + exporter: jaeger + # Configure the span limits. See also sdk.attribute_limits. + span_limits: + # Set the max span attribute value size. Overrides sdk.attribute_limits.attribute_value_length_limit. + # + # Environment variable: OTEL_SPAN_ATTRIBUTE_VALUE_LENGTH_LIMIT + attribute_value_length_limit: 4096 + # Set the max span attribute count. Overrides sdk.attribute_limits.attribute_count_limit. + # + # Environment variable: OTEL_SPAN_ATTRIBUTE_COUNT_LIMIT + attribute_count_limit: 128 + # Set the max span event count. + # + # Environment variable: OTEL_SPAN_EVENT_COUNT_LIMIT + event_count_limit: 128 + # Set the max span link count. + # + # Environment variable: OTEL_SPAN_LINK_COUNT_LIMIT + link_count_limit: 128 + # Set the max attributes per span event. 
+ # + # Environment variable: OTEL_EVENT_ATTRIBUTE_COUNT_LIMIT + event_attribute_count_limit: 128 + # Set the max attributes per span link. + # + # Environment variable: OTEL_LINK_ATTRIBUTE_COUNT_LIMIT + link_attribute_count_limit: 128 + # Configuration for samplers. Each key refers to the type of sampler. Values configure the sampler. One key must be referenced in sdk.tracer_provider.sampler. + sampler_config: + # Configure the always_on sampler. + # + # Environment variable: OTEL_TRACES_SAMPLER=always_on + always_on: + # Configure the always_off sampler. + # + # Environment variable: OTEL_TRACES_SAMPLER=always_off + always_off: + # Configure the trace_id_ratio_based sampler. + # + # Environment variable: OTEL_TRACES_SAMPLER=traceidratio + trace_id_ratio_based: + # Set the sampling ratio. + # + # Environment variable: OTEL_TRACES_SAMPLER=traceidratio, OTEL_TRACES_SAMPLER_ARG=0.0001 + ratio: 0.0001 + # Configure the parent_based sampler. + # + # Environment variable: OTEL_TRACES_SAMPLER=parentbased_* + parent_based: + # Set root sampler. Must refer a key in sdk.tracer_provider.sampler_config. + # + # Environment variable: OTEL_TRACES_SAMPLER=parentbased_* + root: trace_id_ratio_based + # Set the sampler used when the parent is remote and is sampled. Must refer a key in sdk.tracer_provider.sampler_config. + remote_parent_sampled: always_on + # Set the sampler used when the parent is remote and is not sampled. Must refer a key in sdk.tracer_provider.sampler_config. + remote_parent_not_sampled: always_off + # Set the sampler used when the parent is local and is sampled. Must refer a key in sdk.tracer_provider.sampler_config. + local_parent_sampled: always_on + # Set the sampler used when the parent is local and is not sampled. Must refer a key in sdk.tracer_provider.sampler_config. + local_parent_not_sampled: always_off + # Configure the jaeger_remote sampler. + # + # Environment variable: OTEL_TRACES_SAMPLER=jaeger_remote + jaeger_remote: + # Set the endpoint. + # + # Environment variable: OTEL_TRACES_SAMPLER=jaeger_remote, OTEL_TRACES_SAMPLER_ARG=endpoint=http://localhost:14250 + endpoint: http://localhost:14250 + # Set the polling interval. + # + # Environment variable: OTEL_TRACES_SAMPLER=jaeger_remote, OTEL_TRACES_SAMPLER_ARG=pollingINtervalMs=5000 + polling_interval: 5000 + # Set the initial sampling rate. + # + # Environment variable: OTEL_TRACES_SAMPLER=jaeger_remote, OTEL_TRACES_SAMPLER_ARG=initialSamplingRate=0.25 + initial_sampling_rate: 0.25 + # Set the sampler. Sampler must refer to a key in sdk.tracer_provider.sampler_config. + sampler: parent_based + # Configure the meter provider. + meter_provider: + # Metric exporters. Each exporter key refers to the type of the exporter. Values configure the exporter. Exporters must be associated with a metric reader. + exporters: + # Configure the otlp exporter. + otlp: + # Sets the protocol. + # + # Environment variable: OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_EXPORTER_OTLP_METRICS_PROTOCOL + protocol: http/protobuf + # Sets the endpoint. + # + # Environment variable: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_METRICS_ENDPOINT + endpoint: http://localhost:4318/v1/metrics + # Sets the certificate. + # + # Environment variable: OTEL_EXPORTER_OTLP_CERTIFICATE, OTEL_EXPORTER_OTLP_METRICS_CERTIFICATE + certificate: /app/cert.pem + # Sets the mTLS private client key. + # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_KEY, OTEL_EXPORTER_OTLP_METRICS_CLIENT_KEY + client_key: /app/cert.pem + # Sets the mTLS client certificate. 
+ # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE, OTEL_EXPORTER_OTLP_METRICS_CLIENT_CERTIFICATE + client_certificate: /app/cert.pem + # Sets the headers. + # + # Environment variable: OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_METRICS_HEADERS + headers: + api-key: 1234 + # Sets the compression. + # + # Environment variable: OTEL_EXPORTER_OTLP_COMPRESSION, OTEL_EXPORTER_OTLP_METRICS_COMPRESSION + compression: gzip + # Sets the max time to wait for each export. + # + # Environment variable: OTEL_EXPORTER_OTLP_TIMEOUT, OTEL_EXPORTER_OTLP_METRICS_TIMEOUT + timeout: 10000 + # Sets the temporality preference. + # + # Environment variable: OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE + temporality_preference: delta + # Sets the default histogram aggregation. + # + # Environment variable: OTEL_EXPORTER_OTLP_METRICS_DEFAULT_HISTOGRAM_AGGREGATION + default_histogram_aggregation: exponential_bucket_histogram + # List of metric readers. Each metric reader has a type and args used to configure it. + metric_readers: + # Add a periodic metric reader. + # + # Environment variable: OTEL_METRICS_EXPORT_*, OTEL_METRICS_EXPORTER + - type: periodic + args: + # Sets delay interval between the start of two consecutive export attempts. + # + # Environment variable: OTEL_METRIC_EXPORT_INTERVAL + interval: 5000 + # Sets the maximum allowed time to export data. + # + # Environment variable: OTEL_METRIC_EXPORT_TIMEOUT + timeout: 30000 + # Sets the exporter. Exporter must refer to a key in sdk.meter_provider.exporters. + # + # Environment variable: OTEL_METRICS_EXPORTER + exporter: otlp + # Add a prometheus metric reader. Some languages SDKs may implement this as a metric exporter. + # + # Environment variable: OTEL_METRICS_EXPORTER=prometheus + - type: prometheus + args: + # Set the host used to serve metrics in the prometheus format. + # + # Environment variable: OTEL_EXPORTER_PROMETHEUS_HOST + host: localhost + # Set the port used to serve metrics in the prometheus format. + # + # Environment variable: OTEL_EXPORTER_PROMETHEUS_PORT + port: 9464 + # List of views. Each view has a selector which determines the instrument(s) it applies to, and a view which configures resulting metric(s). + views: + # Add a view. The selection criteria and view aim to demonstrate the configuration surface area and are not representative of what a user would be expected to do. + - selector: + # Select instrument(s) by name. + instrument_name: my-instrument + # Select instrument(s) by type. + instrument_type: histogram + # Select instrument(s) by meter name. + meter_name: my-meter + # Select instrument(s) by meter version. + meter_version: 1.0.0 + # Select instrument(s) by meter schema URL. + meter_schema_url: https://opentelemetry.io/schemas/1.16.0 + view: + # Set the name of resulting metric(s). + name: new_instrument_name + # Set the description of resulting metric(s). + description: new_description + # Set the aggregation of resulting metric(s). Aggregation has a type an args used to configure it. + aggregation: + # Set the aggregation type. Options include: default, drop, sum, last_value, explicit_bucket_histogram, exponential_bucket_histogram. + type: explicit_bucket_histogram + # Configure the aggregation. + args: + # Set the bucket boundaries. Applicable when aggregation is explicit_bucket_histogram. + boundaries: [1.0, 2.0, 5.0] + # Set whether min and max are recorded. Applicable when aggregation is explicit_bucket_histogram or exponential_bucket_histogram. 
+ record_min_max: true + # Sets the max number of buckets in each of the positive and negative ranges. Applicable when aggregation is exponential_bucket_histogram. + max_size: 160 + # Set the attribute keys to retain. + attribute_keys: + - key1 + - key2 + # Configure the logger provider. + logger_provider: + # Log record exporters. Each exporter key refers to the type of the exporter. Values configure the exporter. Exporters must be associated with a log record processor. + exporters: + # Configure the otlp exporter. + otlp: + # Sets the protocol. + # + # Environment variable: OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_EXPORTER_OTLP_LOGS_PROTOCOL + protocol: http/protobuf + # Sets the endpoint. + # + # Environment variable: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_LOGS_ENDPOINT + endpoint: http://localhost:4318/v1/logs + # Sets the certificate. + # + # Environment variable: OTEL_EXPORTER_OTLP_CERTIFICATE, OTEL_EXPORTER_OTLP_LOGS_CERTIFICATE + certificate: /app/cert.pem + # Sets the mTLS private client key. + # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_KEY, OTEL_EXPORTER_OTLP_LOGS_CLIENT_KEY + client_key: /app/cert.pem + # Sets the mTLS client certificate. + # + # Environment variable: OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE, OTEL_EXPORTER_OTLP_LOGS_CLIENT_CERTIFICATE + client_certificate: /app/cert.pem + # Sets the headers. + # + # Environment variable: OTEL_EXPORTER_OTLP_HEADERS, OTEL_EXPORTER_OTLP_LOGS_HEADERS + headers: + api-key: 1234 + # Sets the compression. + # + # Environment variable: OTEL_EXPORTER_OTLP_COMPRESSION, OTEL_EXPORTER_OTLP_LOGS_COMPRESSION + compression: gzip + # Sets the max time to wait for each export. + # + # Environment variable: OTEL_EXPORTER_OTLP_TIMEOUT, OTEL_EXPORTER_OTLP_LOGS_TIMEOUT + timeout: 10000 + # List of log record processors. Each log record processor has a type and args used to configure it. + log_record_processors: + # Add a batch log record processor. + # + # Environment variable: OTEL_BLRP_*, OTEL_LOGS_EXPORTER + - type: batch + # Configure the batch log record processor. + args: + # Sets the delay interval between two consecutive exports. + # + # Environment variable: OTEL_BLRP_SCHEDULE_DELAY + schedule_delay: 5000 + # Sets the maximum allowed time to export data. + # + # Environment variable: OTEL_BLRP_EXPORT_TIMEOUT + export_timeout: 30000 + # Sets the maximum queue size. + # + # Environment variable: OTEL_BLRP_MAX_QUEUE_SIZE + max_queue_size: 2048 + # Sets the maximum batch size. + # + # Environment variable: OTEL_BLRP_MAX_EXPORT_BATCH_SIZE + max_export_batch_size: 512 + # Sets the exporter. Exporter must refer to a key in sdk.loger_provider.exporters. + # + # Environment variable: OTEL_LOGS_EXPORTER + exporter: otlp + # Configure the log record limits. See also sdk.attribute_limits. + log_record_limits: + # Set the max log record attribute value size. Overrides sdk.attribute_limits.attribute_value_length_limit. + # + # Environment variable: OTEL_LOGRECORD_ATTRIBUTE_VALUE_LENGTH_LIMIT + attribute_value_length_limit: 4096 + # Set the max log record attribute count. Overrides sdk.attribute_limits.attribute_count_limit. 
+ # + # Environment variable: OTEL_LOGRECORD_ATTRIBUTE_COUNT_LIMIT + attribute_count_limit: 128 diff --git a/oteps/assets/0225-schema.json b/oteps/assets/0225-schema.json new file mode 100644 index 00000000000..76af51b434e --- /dev/null +++ b/oteps/assets/0225-schema.json @@ -0,0 +1,656 @@ +{ + "$schema": "https://json-schema.org/draft-06/schema", + "$ref": "#/definitions/OpenTelemetryConfiguration", + "definitions": { + "OpenTelemetryConfiguration": { + "type": "object", + "additionalProperties": false, + "properties": { + "scheme_version": { + "type": "number" + }, + "sdk": { + "$ref": "#/definitions/SDK" + } + }, + "required": [ + "scheme_version", + "sdk" + ], + "title": "OpenTelemetryConfiguration" + }, + "SDK": { + "type": "object", + "additionalProperties": false, + "properties": { + "disabled": { + "type": "boolean" + }, + "resource": { + "$ref": "#/definitions/Resource" + }, + "propagators": { + "type": "array", + "items": { + "type": "string" + } + }, + "attribute_limits": { + "$ref": "#/definitions/Limits" + }, + "tracer_provider": { + "$ref": "#/definitions/TracerProvider" + }, + "meter_provider": { + "$ref": "#/definitions/MeterProvider" + }, + "logger_provider": { + "$ref": "#/definitions/LoggerProvider" + } + }, + "required": [ + "disabled" + ], + "title": "SDK" + }, + "Limits": { + "type": "object", + "additionalProperties": false, + "properties": { + "attribute_value_length_limit": { + "type": "integer" + }, + "attribute_count_limit": { + "type": "integer" + } + }, + "required": [ + "attribute_count_limit", + "attribute_value_length_limit" + ], + "title": "Limits" + }, + "LoggerProvider": { + "type": "object", + "additionalProperties": false, + "properties": { + "exporters": { + "$ref": "#/definitions/LoggerProviderExporters" + }, + "log_record_processors": { + "type": "array", + "items": { + "$ref": "#/definitions/Processor" + } + }, + "log_record_limits": { + "$ref": "#/definitions/Limits" + } + }, + "required": [ + "exporters", + "log_record_limits", + "log_record_processors" + ], + "title": "LoggerProvider" + }, + "LoggerProviderExporters": { + "type": "object", + "additionalProperties": false, + "properties": { + "otlp": { + "$ref": "#/definitions/Otlp" + } + }, + "required": [ + "otlp" + ], + "title": "LoggerProviderExporters" + }, + "Otlp": { + "type": "object", + "additionalProperties": false, + "properties": { + "protocol": { + "type": "string", + "pattern": "^(http|grpc)\\/(protobuf|json)" + }, + "endpoint": { + "type": "string", + "format": "uri", + "qt-uri-protocols": [ + "http" + ] + }, + "certificate": { + "type": "string" + }, + "client_key": { + "type": "string" + }, + "client_certificate": { + "type": "string" + }, + "headers": { + "$ref": "#/definitions/Headers" + }, + "compression": { + "type": "string" + }, + "timeout": { + "type": "integer" + }, + "temporality_preference": { + "type": "string" + }, + "default_histogram_aggregation": { + "type": "string" + } + }, + "required": [ + "endpoint", + "protocol" + ], + "title": "Otlp" + }, + "Headers": { + "type": "object", + "additionalProperties": true, + "title": "Headers" + }, + "Processor": { + "type": "object", + "additionalProperties": false, + "properties": { + "type": { + "type": "string" + }, + "args": { + "$ref": "#/definitions/LogRecordProcessorArgs" + } + }, + "required": [ + "args", + "type" + ], + "title": "Processor" + }, + "LogRecordProcessorArgs": { + "type": "object", + "additionalProperties": false, + "properties": { + "schedule_delay": { + "type": "integer" + }, + "export_timeout": { + 
"type": "integer" + }, + "max_queue_size": { + "type": "integer" + }, + "max_export_batch_size": { + "type": "integer" + }, + "exporter": { + "type": "string" + } + }, + "required": [ + "exporter" + ], + "title": "LogRecordProcessorArgs" + }, + "MeterProvider": { + "type": "object", + "additionalProperties": false, + "properties": { + "exporters": { + "$ref": "#/definitions/LoggerProviderExporters" + }, + "metric_readers": { + "type": "array", + "items": { + "$ref": "#/definitions/MetricReader" + } + }, + "views": { + "type": "array", + "items": { + "$ref": "#/definitions/ViewElement" + } + } + }, + "required": [ + "exporters", + "metric_readers", + "views" + ], + "title": "MeterProvider" + }, + "MetricReader": { + "type": "object", + "additionalProperties": false, + "properties": { + "type": { + "type": "string" + }, + "args": { + "$ref": "#/definitions/MetricReaderArgs" + } + }, + "required": [ + "args", + "type" + ], + "title": "MetricReader" + }, + "MetricReaderArgs": { + "type": "object", + "additionalProperties": false, + "properties": { + "interval": { + "type": "integer" + }, + "timeout": { + "type": "integer" + }, + "exporter": { + "type": "string" + }, + "host": { + "type": "string" + }, + "port": { + "type": "integer" + } + }, + "required": [], + "title": "MetricReaderArgs" + }, + "ViewElement": { + "type": "object", + "additionalProperties": false, + "properties": { + "selector": { + "$ref": "#/definitions/Selector" + }, + "view": { + "$ref": "#/definitions/ViewView" + } + }, + "required": [ + "selector", + "view" + ], + "title": "ViewElement" + }, + "Selector": { + "type": "object", + "additionalProperties": false, + "properties": { + "instrument_name": { + "type": "string" + }, + "instrument_type": { + "type": "string" + }, + "meter_name": { + "type": "string" + }, + "meter_version": { + "type": "string" + }, + "meter_schema_url": { + "type": "string", + "format": "uri", + "qt-uri-protocols": [ + "https" + ], + "qt-uri-extensions": [ + ".0" + ] + } + }, + "required": [ + "instrument_name", + "instrument_type", + "meter_name", + "meter_schema_url", + "meter_version" + ], + "title": "Selector" + }, + "ViewView": { + "type": "object", + "additionalProperties": false, + "properties": { + "name": { + "type": "string" + }, + "description": { + "type": "string" + }, + "aggregation": { + "$ref": "#/definitions/Aggregation" + }, + "attribute_keys": { + "type": "array", + "items": { + "type": "string" + } + } + }, + "required": [ + "aggregation", + "attribute_keys", + "description", + "name" + ], + "title": "ViewView" + }, + "Aggregation": { + "type": "object", + "additionalProperties": false, + "properties": { + "type": { + "type": "string" + }, + "args": { + "$ref": "#/definitions/AggregationArgs" + } + }, + "required": [ + "args", + "type" + ], + "title": "Aggregation" + }, + "AggregationArgs": { + "type": "object", + "additionalProperties": false, + "properties": { + "boundaries": { + "type": "array", + "items": { + "type": "number" + } + }, + "record_min_max": { + "type": "boolean" + }, + "max_size": { + "type": "integer" + } + }, + "required": [ + "boundaries", + "max_size", + "record_min_max" + ], + "title": "AggregationArgs" + }, + "Resource": { + "type": "object", + "additionalProperties": false, + "properties": { + "attributes": { + "$ref": "#/definitions/Attributes" + } + }, + "required": [ + "attributes" + ], + "title": "Resource" + }, + "Attributes": { + "type": "object", + "additionalProperties": true, + "properties": { + "service.name": { + "type": "string" + } + }, + 
"required": [ + "service.name" + ], + "title": "Attributes" + }, + "TracerProvider": { + "type": "object", + "additionalProperties": false, + "properties": { + "exporters": { + "$ref": "#/definitions/TracerProviderExporters" + }, + "span_processors": { + "type": "array", + "items": { + "$ref": "#/definitions/Processor" + } + }, + "span_limits": { + "$ref": "#/definitions/SpanLimits" + }, + "sampler_config": { + "$ref": "#/definitions/SamplerConfig" + }, + "sampler": { + "type": "string" + } + }, + "required": [ + "exporters", + "span_processors" + ], + "title": "TracerProvider" + }, + "TracerProviderExporters": { + "type": "object", + "additionalProperties": true, + "patternProperties": { + "^otlp.*": { + "$ref": "#/definitions/Otlp" + }, + "^zipkin.*": { + "$ref": "#/definitions/Zipkin" + }, + "^jaeger.*": { + "$ref": "#/definitions/Jaeger" + } + }, + "title": "TracerProviderExporters" + }, + "Jaeger": { + "type": "object", + "additionalProperties": false, + "properties": { + "protocol": { + "type": "string", + "pattern": "^http/thrift.binary$" + }, + "endpoint": { + "type": "string", + "format": "uri", + "qt-uri-protocols": [ + "http" + ] + }, + "timeout": { + "type": "integer" + }, + "user": { + "type": "string" + }, + "password": { + "type": "string" + }, + "agent_host": { + "type": "string" + }, + "agent_port": { + "type": "integer" + } + }, + "required": [ + "agent_host", + "agent_port", + "endpoint", + "password", + "protocol", + "timeout", + "user" + ], + "title": "Jaeger" + }, + "Zipkin": { + "type": "object", + "additionalProperties": false, + "properties": { + "endpoint": { + "type": "string", + "format": "uri", + "qt-uri-protocols": [ + "http" + ] + }, + "timeout": { + "type": "integer" + } + }, + "required": [ + "endpoint", + "timeout" + ], + "title": "Zipkin" + }, + "SamplerConfig": { + "type": "object", + "additionalProperties": false, + "properties": { + "always_on": { + "type": "null" + }, + "always_off": { + "type": "null" + }, + "trace_id_ratio_based": { + "$ref": "#/definitions/TraceIDRatioBased" + }, + "parent_based": { + "$ref": "#/definitions/ParentBased" + }, + "jaeger_remote": { + "$ref": "#/definitions/JaegerRemote" + } + }, + "required": [ + "always_off", + "always_on", + "jaeger_remote", + "parent_based", + "trace_id_ratio_based" + ], + "title": "SamplerConfig" + }, + "JaegerRemote": { + "type": "object", + "additionalProperties": false, + "properties": { + "endpoint": { + "type": "string", + "format": "uri", + "qt-uri-protocols": [ + "http" + ] + }, + "polling_interval": { + "type": "integer" + }, + "initial_sampling_rate": { + "type": "number" + } + }, + "required": [ + "endpoint", + "initial_sampling_rate", + "polling_interval" + ], + "title": "JaegerRemote" + }, + "ParentBased": { + "type": "object", + "additionalProperties": false, + "properties": { + "root": { + "type": "string" + }, + "remote_parent_sampled": { + "type": "string" + }, + "remote_parent_not_sampled": { + "type": "string" + }, + "local_parent_sampled": { + "type": "string" + }, + "local_parent_not_sampled": { + "type": "string" + } + }, + "required": [ + "local_parent_not_sampled", + "local_parent_sampled", + "remote_parent_not_sampled", + "remote_parent_sampled", + "root" + ], + "title": "ParentBased" + }, + "TraceIDRatioBased": { + "type": "object", + "additionalProperties": false, + "properties": { + "ratio": { + "type": "number" + } + }, + "required": [ + "ratio" + ], + "title": "TraceIDRatioBased" + }, + "SpanLimits": { + "type": "object", + "additionalProperties": false, + 
"properties": { + "attribute_value_length_limit": { + "type": "integer" + }, + "attribute_count_limit": { + "type": "integer" + }, + "event_count_limit": { + "type": "integer" + }, + "link_count_limit": { + "type": "integer" + }, + "event_attribute_count_limit": { + "type": "integer" + }, + "link_attribute_count_limit": { + "type": "integer" + } + }, + "required": [ + "attribute_count_limit", + "attribute_value_length_limit", + "event_attribute_count_limit", + "event_count_limit", + "link_attribute_count_limit", + "link_count_limit" + ], + "title": "SpanLimits" + } + } +} \ No newline at end of file diff --git a/oteps/entities/0256-entities-data-model.md b/oteps/entities/0256-entities-data-model.md new file mode 100644 index 00000000000..51d5a00faae --- /dev/null +++ b/oteps/entities/0256-entities-data-model.md @@ -0,0 +1,790 @@ +# Entities Data Model, Part 1 + +This is a proposal of a data model to represent entities. The purpose of the data model +is to have a common understanding of what an entity is, what data needs to be recorded, +transferred, stored and interpreted by an entity observability system. + + + +- [Motivation](#motivation) +- [Design Principles](#design-principles) +- [Data Model](#data-model) + * [Minimally Sufficient Id](#minimally-sufficient-id) + * [Examples of Entities](#examples-of-entities) +- [Entity Events](#entity-events) + * [EntityState Event](#entitystate-event) + * [EntityDelete Event](#entitydelete-event) +- [Entity Identification](#entity-identification) + * [LID, GID and IDCONTEXT](#lid-gid-and-idcontext) + * [Semantic Conventions](#semantic-conventions) + * [Examples](#examples) + + [Process in a Host](#process-in-a-host) + + [Process in Kubernetes](#process-in-kubernetes) + + [Host in Cloud Account](#host-in-cloud-account) +- [Prototypes](#prototypes) +- [Prior Art](#prior-art) +- [Alternatives](#alternatives) + * [Different ID Structure](#different-id-structure) + * [No Entity Events](#no-entity-events) + * [Merge Entity Events data into Resource](#merge-entity-events-data-into-resource) + * [Hierarchical ID Field](#hierarchical-id-field) +- [Open questions](#open-questions) + * [Attribute Data Type](#attribute-data-type) + * [Classes of Entity Types](#classes-of-entity-types) + * [Multiple Observers](#multiple-observers) + * [Is Type part of Entity's identity?](#is-type-part-of-entitys-identity) + * [Choosing from Multiple Ids](#choosing-from-multiple-ids) +- [Future Work](#future-work) +- [References](#references) + + + +## Motivation + +This data model sets the foundation for adding entities to OpenTelemetry. The data model +is largely borrowed from +[the initial proposal](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +that was accepted for entities SIG formation. + +This OTEP is step 1 in introducing the entities data model. Follow up OTEPs will add +further data model definitions, including the linking of Resource information +to entities. + +## Design Principles + +- Consistency with the rest of OpenTelemetry is important. We heavily favor solutions + that look and feel like other OpenTelemetry data models. + +- Meaningful (especially human-readable) IDs are more valuable than random-generated IDs. + Long-lived IDs that survive state changes (e.g. entity restarts) are more valuable than + short-lived, ephemeral IDs. + See [the need for navigation](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit#heading=h.fut2c2pec5wa). 
+ +- We cannot make an assumption that the entirety of information that is necessary for + global identification of an entity is available at once, in one place. This knowledge + may be distributed across multiple participants and needs to be combined to form an + identifier that is globally unique. + +- Semantic conventions must bring as much order as possible to telemetry, however they + cannot be too rigid and prevent real-world use cases. + +## Data Model + +We propose a new concept of Entity. + +Entity represents an object of interest associated with produced telemetry: +traces, metrics or logs. + +For example, telemetry produced using OpenTelemetry SDK is normally associated with +a Service entity. Similarly, OpenTelemetry defines system metrics for a host. The Host is the +entity we want to associate metrics with in this case. + +Entities may be also associated with produced telemetry indirectly. +For example a service that produces +telemetry is also related with a process in which the service runs, so we say that +the Service entity is related to the Process entity. The process normally also runs +on a host, so we say that the Process entity is related to the Host entity. + +Note: subsequent OTEPs will define how the entities are associated with +traces, metrics and logs and how relations between entities will be specified. +See [Future Work](#future-work). + +The data model below defines a logical model for an entity (irrespective of the physical +format and encoding of how entity data is recorded). + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Type + string + Defines the type of the entity. MUST not change during the +lifetime of the entity. For example: "service" or "host". This field is +required and MUST not be empty for valid entities. +
Id + map<string, attribute> + Attributes that identify the entity. +

+MUST not change during the lifetime of the entity. The Id must contain +at least one attribute. +

+Follows OpenTelemetry common +attribute definition. SHOULD follow OpenTelemetry semantic +conventions for attributes. +

Attributes + map<string, any> + Descriptive (non-identifying) attributes of the entity. +

+MAY change over the lifetime of the entity. MAY be empty. These +attributes are not part of entity's identity. +

+Follows any +value definition in the OpenTelemetry spec - it can be a scalar value, +byte array, an array or map of values. Arbitrary deep nesting of values +for arrays and maps is allowed. +

+SHOULD follow OpenTelemetry semantic +conventions for attributes. +

+ +### Minimally Sufficient Id + +Commonly, a number of attributes of an entity are readily available for the telemetry +producer to compose an Id from. Of the available attributes the entity Id should +include the minimal set of attributes that is sufficient for uniquely identifying +that entity. For example +a Process on a host can be uniquely identified by (`process.pid`,`process.start_time`) +attributes. Adding for example `process.executable.name` attribute to the Id is +unnecessary and violates the Minimally Sufficient Id rule. + +### Examples of Entities + +_This section is non-normative and is present only for the purposes of demonstrating +the data model._ + +Here are examples of entities, the typical identifying attributes they +have and some examples of non-identifying attributes that may be +associated with the entity. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Entity + Entity Type + Identifying Attributes + Non-identifying Attributes +
Service + "service" + service.name (required) +

+service.instance.id +

+service.namespace +

service.version +
Host + "host" + host.id + host.name +

+host.type +

+host.image.id +

+host.image.name +

K8s Pod + "k8s.pod" + k8s.pod.uid (required) +

+k8s.cluster.name +

Any pod labels +
K8s Pod Container + "container" + k8s.pod.uid (required) +

+k8s.cluster.name +

+container.name +

Any container labels +
+ +See more examples showing nuances of Id field composition in the +[Entity Identification](#entity-identification) section. + +## Entity Events + +Information about Entities can be produced and communicated using 2 +types of entity events: EntityState and EntityDelete. + +### EntityState Event + +The EntityState event stores information about the _state_ of the entity +at a particular moment of time. The data model of the EntityState event +is the same as the entity data model with some extra fields: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Timestamp + nanoseconds + The time since when the entity state is described by this event. +The time is measured by the origin clock. The field is required. +
Interval + milliseconds + Defines the reporting period, i.e. how frequently the +information about this entity is reported via EntityState events even if +the entity does not change. The next expected EntityEvent for this +entity is expected at (Timestamp+Interval) time. Can be used by +receivers to infer that a no longer reported entity is gone, even if the +EntityDelete event was not observed. Optional, if missing the interval +is unknown. +
Type + + See data model. + +
Id + + See data model +
Attributes + + See data model +
+ +We say that an entity mutates (changes) when one or more of its +descriptive attributes changes. A new descriptive attribute may be +added, an existing descriptive attribute may be deleted or a value of an +existing descriptive attribute may be changed. All these changes +represent valid mutations of an entity over time. When these mutations +happen the identity of the entity does not change. + +When the entity's state is changed it is expected that the source will +emit a new EntityState event with a fresh timestamp and full list of +values of all other fields. + +Entity event producers SHOULD periodically emit events even +if the entity does not change. In this case the Type, Id and Attribute +fields will remain the same, but a fresh Timestamp will be recorded in +the event. Producing such events allows the system to be resilient to +event losses. Even if some events are lost eventually the correct state +of the entity is more likely to be delivered to the final destination. +Periodic sending of EntityState events also serves as a liveliness +indicator (see below how it can be used in lieu of EntityDelete event). + +### EntityDelete Event + +EntityDelete event indicates that a particular entity is gone: + + + + + + + + + + + + + + + + + + + + + + +
Field + Type + Description +
Timestamp + nanoseconds + The time when the entity was deleted. The time is measured by +the origin clock. The field is required. +
Type + + See data model +
Id + + See data model +
+ +Note that transmitting EntityDelete is not guaranteed +when the entity is gone. Recipients of entity signals need to be prepared to +handle this situation by expiring entities that are no longer seeing +EntityState events reported (i.e. treat the presence of EntityState +events as a liveliness indicator). + +The expiration mechanism is based on the previously reported `Interval` field of +EntityState event. The recipient can use this value to compute when to expect the next +EntityState event and if the event does not arrive in a timely manner (plus some slack) +it can consider the entity to be gone even if the EntityDelete event was not observed. + +## Entity Identification + +The data model defines the structure of the entity Id field. This section explains +how the Id field is computed. + +### LID, GID and IDCONTEXT + +All entities have a local ID (LID) and a global ID (GID). + +The LID is unique in a particular identification context, but is not necessarily globally +unique. For example a process entity's LID is its PID number and process start time. +The (PID,StartTime) pair is unique only in the context of a host where the process runs +(and the host in this case is the identification context). + +The GID of an entity is globally unique, in the sense that for the entire set of entities +in a particular telemetry store no 2 entities exist that have the same GID value. + +The GID of an entity E is defined as: + +`GID(E) = UNION( LID(E), GID(IDCONTEXT(E)) )` + +Where `IDCONTEXT(E)` is the identification context in which the LID of entity E is unique. +The value of `IDCONTEXT(E)` is an entity itself, and thus we can compute the GID value of it too. + +In other words, the GID of an entity is a union of its LID and the GID of its +identification context. Note: GID(E) is a map of key/value attributes. + +The enrichment process often is responsible for determining the value of `IDCONTEXT(E)` +and for computing the GID according to the formula defined above, although the GID may +also be produced at once by the telemetry source (e.g. by OTel SDK) without requiring +any additional enrichment. + +### Semantic Conventions + +OpenTelemetry semantic conventions will be enhanced to include entity definitions for +well-known entities such as Service, Process, Host, etc. + +For well-known entity types LID(E) is defined in OTel semantic conventions per +entity type. The value of LID is a map of key/value attributes. For example, +for entity of type "process" the semantic conventions define LID as 2 attributes: + +```json5 +{ + "process.pid": $pid, + "process.start_time": $starttime +} +``` + +For custom entity types (not defined in OTel semantic conventions) the end-user is +responsible for defining their custom semantic conventions in a similar way. + +The entity information producer is responsible for determining the identification +context of each entity it is producing information about. + +In certain cases, where only one possible IDCONTEXT definition is meaningful, the +IDCONTEXT may be defined in the semantic conventions. For example Kubernetes nodes +always exist in the identifying context of a Kubernetes cluster. The semantic convention +for "k8s.node" and "k8s.cluster" can prescribe that the IDCONTEXT of entity of type +"k8s.node" is always an entity of type "k8s.cluster". + +Important: semantic conventions are not expected to (and normally won't) prescribe +the complete GID composition. 
+Semantic conventions should prescribe LID and may prescribe IDCONTEXT, but GID +composition, generally speaking, cannot be known statically. + +For example: a host's LID should be a `host.id` attribute. A host running on a cloud +should have an IDCONTEXT of "cloud.account" and the LID of "cloud.account" entity +is (`cloud.provider`, `cloud.account.id`). However semantic conventions cannot prescribe +that the GID of a host is (`host.id`, `cloud.provider`, `cloud.account.id`) because not all +hosts run on cloud. A host that runs on prem in a single data center may have a GID +of just (`host.id`) or if a customer has multiple on prem data centers they may use +data.center.id as its identifier and use (`host.id`, `data.center.id`) as GID of the host. + +### Examples + +_This section is a supplementary guideline and is not part of logical data model._ + +#### Process in a Host + +A locally running host agent (e.g. an OTel Collector) that produces +information about "process" entities has the knowledge that the +processes run in the particular host and thus the "host" is the +identification context for the processes that the agent observes. The +LID of a process can look like this: + +```json5 +{ + "process.pid": 12345, + "process.start_time": 1714491491 +} +``` + +and Collector will use "host" as the IDCONTEXT and add host's LID to it: + +```json5 +{ + // Process LID, unique per host. + "process.pid": 12345, + "process.start_time": 1714491491, + + + // Host LID + "host.id": "fdbf79e8af94cb7f9e8df36789187052" +} +``` + +If we assume that we have only one data center and host ids are globally +unique then the above id is globally unique and is the GID of the +process. If this assumption is not valid in our situation we would +continue applying additional IDCONTEXT's until the GID is globally +unique. See for example the +[Host in Cloud Account](#host-in-cloud-account) example below. + +#### Process in Kubernetes + +An OTel Collector (running in Kubernetes) that produces information about process entities +has the knowledge that the processes run in the particular containers in +the particular pod and thus the container is the identification context +for the process, and the pod is the identification context for the +container. If we begin with the same process LID: + +```json5 +{ + "process.pid": 12345, + "process.start_time": 1714491491 +} +``` + +the Collector will then add the IDCONTEXT of container and +pod to this, resulting in: + +```json5 +{ + // Process LID, unique per container. + "process.pid": 12345, + "process.start_time": 1714491491, + + // Container LID, unique per pod. + "k8s.container.name": "redis", + + + // Pod LID has 2 attributes. + "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7", + "k8s.cluster.name": "dev" +} +``` + +Note that we used 3 different LIDs above to compose the GID. The +attributes that are part of each LID are defined in OTel semantic +conventions. + +In this example we assume this to be a valid GID because Pod is the root +IDCONTEXT, since Pod's LID includes the cluster name, which is expected +to be globally unique. If this assumption about global uniqueness of +cluster names is wrong then another containing IDCONTEXT within which +cluster names are unique will need to be applied and so forth. + +Note also how we used a pair (`k8s.pod.uid`, `k8s.cluster.name`). +Alternatively, we could say that Kubernetes Cluster is a separate entity +we care about. This would mean the Pod's IDCONTEXT is the cluster. 
The +net result for process's GID would be exactly the same, but we would +arrive to it in a different way: + +```json5 +{ + // Process LID, unique per container. + "process.pid": 12345, + "process.start_time": 1714491491, + + // Container LID, unique per pod. + "k8s.container.name": "redis", + + + // Pod LID, unique per cluster. + "k8s.pod.uid": "0c4cbbf8-d4b4-4e84-bc8b-b95f0d537fc7", + + // Cluster LID, also globally unique since cluster is root entity. + "k8s.cluster.name": "dev" +} +``` + +#### Host in Cloud Account + +A host running in a cloud account (e.g. AWS) will have a LID that uses +the host instance id, unique within a single cloud account, e.g.: + +```json5 +{ + // Host LID, unique per cloud account. + "host.id": "fdbf79e8af94cb7f9e8df36789187052" +} +``` + +OTel Collector with +[resourcedetection](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor) +processor with "aws" detector enabled will add the IDCONTEXT of the +cloud account like this: + +```json5 +{ + // Host LID, unique per cloud account. + "host.id": "fdbf79e8af94cb7f9e8df36789187052" + + // Cloud account LID has 2 attributes: + "cloud.provider": "aws", + "cloud.account.id": "1234567890" +} +``` + +## Prototypes + +A set of prototypes that demonstrate this data model has been implemented: + +- [Go SDK Prototype](https://github.com/tigrannajaryan/opentelemetry-go/pull/244) +- [Collector Prototype](https://github.com/tigrannajaryan/opentelemetry-collector/pull/4) +- [Collector Contrib Prototype](https://github.com/tigrannajaryan/opentelemetry-collector-contrib/pull/1/files) +- [OTLP Protocol Buffer changes](https://github.com/tigrannajaryan/opentelemetry-proto/pull/2/files) + +## Prior Art + +An experimental entity data model was implemented in OpenTelemetry Collector as described +in [this document](https://docs.google.com/document/d/1Tg18sIck3Nakxtd3TFFcIjrmRO_0GLMdHXylVqBQmJA/edit). +The Collector's design uses LogRecord as the carrier of entity events, with logical structure +virtually identical to what this OTEP proposes. + +There is also an implementation of this design in the Collector, see +[completed issue to add entity events](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23565) +and [the PR](https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/24419) +that implements entity event emitting for k8scluster receiver in the Collector. + +## Alternatives + +### Different ID Structure + +Alternative proposals were made [here](https://docs.google.com/document/d/1PLPSAnWvFWCsm6meAj6OIVDBvxsk983V51WugF0NgVo/edit) and +[here](https://docs.google.com/document/d/1bLmkQSv35Fi6Wbe4bAqQ-_JS7XWIXWbvuirVmAkz4a4/edit) +to use a different structure for entity Id field. + +We rejected these proposals in favour of the Id field proposed in this OTEP for the +following reasons: + +- The map of key/value attributes is widely used elsewhere in OpenTelemetry as + Resource attributes, as Scope attributes, as Metric datapoint attributes, etc. so + it is conceptually consistent with the rest of OTel. + +- We already have a lot of machinery that works well with this definition of attributes, + for example OTTL language has syntax for working with attributes, or Collector's pdata + API or Attribute value types in SDKs. All this code will no longer work as is if we + have a different data structure and needs to be re-implemented in a different way. + +### No Entity Events + +Entity signal allows recording the state of the entities. 
As the entity's state changes +events are emitted that describe the new state. In this proposal the entity's state is +(type,id,attributes) tuple, but we envision that in the future we may also want to add +more information to the entity signal, particularly to record the relationships between +entities (i.e the fact that a Process runs on a Host). + +### Merge Entity Events data into Resource + +If we eliminate the entity signal as a concept and put the entire entity's state into +the Resource then every time the entity's state changes we must emit one of +ResourceLogs/ResourceSpans/ResourceMetrics messages that includes the Resource that +represents the entity's state. + +However, what do we do if there are no logs or spans or metrics data points to +report? Do we emit a ResourceLogs/ResourceSpans/ResourceMetrics OTLP message with empty logs +or spans or metrics data points? Which one do we emit: ResourceLogs, ResourceSpans +or ResourceMetrics? + +What do we do when we want to add support for recording entity relationships in the +future? Do we add all that information to the Resource and bloat the Resource size? + +How do we report the EntityDelete event? + +All these questions don't have good answers. Attempting to shoehorn the entity +information into the Resource where it does not naturally fit is likely to result +in ugly and inefficient solutions. + +### Hierarchical ID Field + +We had an alternate proposal to retain the information about how the ID was +[composed from LID and IDCONTEXT](#entity-identification), essentially to record the +hierarchy of identification contexts in the ID data structure instead of flattening it +and losing the information about the composition process that resulted in the particular ID. + +There are a couple of reasons: + +- The flat ID structure is simpler. + +- There are no known use cases that require a hierarchical ID structure. The use case + of "record parental relationship between entities" will be handled explicitly via + separate relationship data structures (see [Future Work](#future-work)). + +## Open questions + +### Attribute Data Type + +The data model requires the Attributes field to use the extended +[any](../../specification/logs/data-model.md#type-any) +attribute values, that allows more complex data types. This is different from the data +type used by the Id field, which is more restricted in the shape. + +Are we happy with this discrepancy? + +Here is corresponding +[TODO item](https://github.com/orgs/open-telemetry/projects/85/views/1?pane=issue&itemId=62411493). + +### Classes of Entity Types + +Do we need to differentiate between infrastructure entities (e.g. Pod, Host, Process) +and non-infrastructure entities (logical entities) such as Service? Is this distinction +important? + +Here is corresponding +[TODO item](https://github.com/orgs/open-telemetry/projects/85/views/1?pane=issue&itemId=62411407). + +### Multiple Observers + +The same entity may be observed by different observers simultaneously. For example the +information about a Host may be reported by the agent that runs on the host. At the same +time more information about that same host may be obtained via cloud provider's API. + +The information obtained by different observers can be complementary, they don't +necessarily have access exactly to the same data. It can be very useful to combine this +information in the backend and make it all available to the user. 
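+
+For example, a host might be observed both by an agent running on the host and by a
+poller that queries the cloud provider's API. The sketch below (attribute values are
+purely illustrative) shows two EntityState events for the same entity that such
+observers could emit:
+
+```json5
+// Two EntityState events describing the same "host" entity, produced by two observers.
+[
+  {
+    // Reported by an agent running on the host itself.
+    "Timestamp": 1714491491000000000,
+    "Type": "host",
+    "Id": { "host.id": "fdbf79e8af94cb7f9e8df36789187052" },
+    "Attributes": { "host.name": "myhost", "os.type": "linux" }
+  },
+  {
+    // Reported by a poller using the cloud provider's API.
+    "Timestamp": 1714491500000000000,
+    "Type": "host",
+    "Id": { "host.id": "fdbf79e8af94cb7f9e8df36789187052" },
+    "Attributes": { "host.type": "n2-standard-4", "host.image.id": "img-0a1b2c3d" }
+  }
+]
+```
+
+Combining the descriptive attributes from both events would give a more complete picture
+of the host than either observer can produce on its own.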
+ +However, it is not possible for multiple observers to simultaneously use EntityState +events as they are defined earlier in this document, since the information in the event +will overwrite information in the previously received event about that same entity. + +A possible way to allow multiple observers to report portions of information about the +same entity simultaneously is to indicate the observer in the EntityState event by adding +an "ObserverId" field. EntityState event will then look like this: + +|Field|Type| +|---|---| +|Timestamp|nanoseconds| +|Interval|milliseconds| +|Type|| +|Id|| +|Attributes|| +|ObserverId|string or bytes| + +ObserverId field can be optional. Attributes from EntityState events that contain +different ObserverId values will be merged in the backend. Attributes from EntityState +events that contain the same ObserverId value will overwrite attributes from the previous +reporting of the EntityState event from that observer. + +Here is corresponding +[TODO item](https://github.com/orgs/open-telemetry/projects/85/views/1?pane=issue&itemId=62411289). + +### Is Type part of Entity's identity? + +Is the Type field part of the entity's identity together with the Id field? + +For example let's assume we have a Host and an OTel Collector running on the Host. +The Host's Id will contain one attribute: `host.id`, and the Type of the entity will be +"host". The Collector technically speaking can be also identified by one attribute +`host.id` and the Type of the entity will be "otel.collector". This only works if we +consider the Type field to be part of the entity's identity. + +If the Type field is not part of identity then in the above example we require that the +entity that describes the Collector has some other attribute in its Id (for example +`agent.type` attribute [if it gets accepted](https://github.com/open-telemetry/semantic-conventions/pull/950)). + +Here is corresponding +[TODO item](https://github.com/orgs/open-telemetry/projects/85?pane=issue&itemId=57053320). + +### Choosing from Multiple Ids + +Sometimes the same entity may be identified in more than one way. For example a Pod can +be identified by its `k8s.pod.uid` but it can be also identified by a pair of +`k8s.namespace.name`, `k8s.pod.name` attributes. + +We need to provide a recommendation for the cases when more than one valid identifier +exists about how to make a choice between the identifiers. + +Here is corresponding +[TODO item](https://github.com/orgs/open-telemetry/projects/85?pane=issue&itemId=57053415). + +## Future Work + +This OTEP is step 1 of defining the Entities data model. It will be followed by other +OTEPs that cover the following topics: + +- How the existing Resource concept will be modified to link existing + signals to entities. + +- How relationships between entities are modeled. + +- Representation of entity data over the wire and the transmission + protocol for entities. + +- Add transformations that describe entity semantic convention changes in + OpenTelemetry Schema Files. + +We will possibly also submit additional OTEPs that address the Open Questions. 
+ +## References + +- [OpenTelemetry Proposal: Resources and Entities](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit) +- [OpenTelemetry Entity Data Model](https://docs.google.com/document/d/1FdhTOvB1xhx7Ks7dFW6Ht1Vfw2myU6vyKtEfr_pqZPw/edit) +- [OpenTelemetry Entity Identification](https://docs.google.com/document/d/1hJIAIMsRCgZs-poRsw3lnirP14d3sMfn1LB08C4LCDw/edit) +- [OpenTelemetry Resources - Principles and Characteristics](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit) diff --git a/oteps/entities/0264-resource-and-entities.md b/oteps/entities/0264-resource-and-entities.md new file mode 100644 index 00000000000..6cb0e70b9c5 --- /dev/null +++ b/oteps/entities/0264-resource-and-entities.md @@ -0,0 +1,863 @@ +# Resource and Entities - Data Model Part 2 + +This is a proposal to address Resource and Entity data model interactions, +including a path forward to address immediate friction and issues in the +current resource specification. + +It is an expansion on the [previous entity proposal](0256-entities-data-model.md). + + + +- [Motivation](#motivation) +- [Design](#design) + * [Approach - Resource Improvements](#approach---resource-improvements) + + [Resource Provider](#resource-provider) + + [Entity Detector](#entity-detector) + + [Entity Merging and Resource](#entity-merging-and-resource) + + [Environment Variable Detector](#environment-variable-detector) + * [Interactions with OpenTelemetry Collector](#interactions-with-opentelemetry-collector) +- [Datamodel Changes](#datamodel-changes) + * [Resource](#resource) + * [ResourceEntityRef](#resourceentityref) + * [Resource Identity](#resource-identity) +- [How this proposal solves the problems that motivated it](#how-this-proposal-solves-the-problems-that-motivated-it) + * [Problem 1: Commingling of Entities](#problem-1-commingling-of-entities) + * [Problem 2: Lack of Precise Identity](#problem-2-lack-of-precise-identity) + * [Problem 3: Lack of Mutable Attributes](#problem-3-lack-of-mutable-attributes) + * [Problem 4: Metric Cardinality Problem](#problem-4-metric-cardinality-problem) +- [Entity WG Rubric](#entity-wg-rubric) + * [Resource detectors (soon to be entity detectors) need to be composable / disjoint](#resource-detectors-soon-to-be-entity-detectors-need-to-be-composable--disjoint) + * [New entities added by extension should not break existing code](#new-entities-added-by-extension-should-not-break-existing-code) + * [Navigational attributes need to exist and can be used to identify an entity but could be augmented with UUID or other aspects. - Having ONLY a UUID for entity identification is not good enough](#navigational-attributes-need-to-exist-and-can-be-used-to-identify-an-entity-but-could-be-augmented-with-uuid-or-other-aspects---having-only-a-uuid-for-entity-identification-is-not-good-enough) + * [Collector augmentation / enrichment (resource, e.g.) - Should be extensible and not hard-coded. 
We need a general algorithm not specific rulesets](#collector-augmentation--enrichment-resource-eg---should-be-extensible-and-not-hard-coded-we-need-a-general-algorithm-not-specific-rulesets) + * [Users are expected to provide / prioritize "detectors" and determine which entity is "producing" or most-important for a signal](#users-are-expected-to-provide--prioritize-detectors-and-determine-which-entity-is-producing-or-most-important-for-a-signal) + * [For an SDK - ALL telemetry should be associated with the same set of entities (resource labels)](#for-an-sdk---all-telemetry-should-be-associated-with-the-same-set-of-entities-resource-labels) +- [Open Questions](#open-questions) + * [How to attach Entity "bundle" information in Resource?](#how-to-attach-entity-bundle-information-in-resource) + * [How to deal with Resource/Entities whose lifecycle does not match the SDK?](#how-to-deal-with-resourceentities-whose-lifecycle-does-not-match-the-sdk) + * [How to deal with Prometheus Compatibility for non-SDK telemetry?](#how-to-deal-with-prometheus-compatibility-for-non-sdk-telemetry) + * [Should entities have a domain?](#should-entities-have-a-domain) + * [Should resources have only one associated entity?](#should-resources-have-only-one-associated-entity) + * [What identity should entities use (LID, UUID / GUID, or other)?](#what-identity-should-entities-use-lid-uuid--guid-or-other) + * [What happens if existing Resource translation in the collector remove resource attributes an Entity relies on?](#what-happens-if-existing-resource-translation-in-the-collector-remove-resource-attributes-an-entity-relies-on) + * [What about advanced entity interaction in the Collector?](#what-about-advanced-entity-interaction-in-the-collector) +- [Trade-offs and mitigations](#trade-offs-and-mitigations) + * [Why don't we download schema url contents?](#why-dont-we-download-schema-url-contents) +- [Prior art and alternatives](#prior-art-and-alternatives) +- [Future Posibilities](#future-posibilities) +- [Use Cases](#use-cases) + * [SDK - Multiple Detectors of the same Entity type](#sdk---multiple-detectors-of-the-same-entity-type) + * [SDK and Collector - Simple coordination](#sdk-and-collector---simple-coordination) + * [SDK and Collector - Entity coordination with descriptive attributes](#sdk-and-collector---entity-coordination-with-descriptive-attributes) + * [SDK and Collector - Entity coordination with conflicts](#sdk-and-collector---entity-coordination-with-conflicts) + * [SDK and Collector - Entity coordination across versions](#sdk-and-collector---entity-coordination-across-versions) +- [Collection of Resource detectors and attributes used](#collection-of-resource-detectors-and-attributes-used) + * [Implications](#implications) + * [What could this mean for choosing entities that belong on resource?](#what-could-this-mean-for-choosing-entities-that-belong-on-resource) + + + +## Motivation + +This proposal attempts to focus on the following problems within OpenTelemetry to unblock multiple working groups: + +- Allowing mutating attributes to participate in Resource ([OTEP 208](https://github.com/open-telemetry/oteps/pull/208)). +- Allow Resource to handle entities whose lifetimes don't match the SDK's lifetime ([OTEP 208](https://github.com/open-telemetry/oteps/pull/208)). +- Provide support for async resource lookup ([spec#952](https://github.com/open-telemetry/opentelemetry-specification/issues/952)). 
+- Fix current Resource merge rules in the specification, which most implementations violate ([oteps#208](https://github.com/open-telemetry/oteps/pull/208), [spec#3382](https://github.com/open-telemetry/opentelemetry-specification/issues/3382), [spec#3710](https://github.com/open-telemetry/opentelemetry-specification/issues/3710)).
+- Allow semantic convention resource modeling to progress ([spec#605](https://github.com/open-telemetry/opentelemetry-specification/issues/605), [spec#559](https://github.com/open-telemetry/opentelemetry-specification/issues/559), etc.).
+
+## Design
+
+### Approach - Resource Improvements
+
+Let's focus on outlining Entity detectors and Resource composition. This has a higher priority for fixing within OpenTelemetry and needs to be unblocked sooner; from there we can work back to the data model and Collector use cases.
+
+We define the following SDK components:
+
+- **Resource Detectors (legacy)**: We preserve existing resource detectors. They have the same behavior and interfaces as today.
+- **Entity Detectors (new)**: Detect entities that are relevant to the current instance of the SDK. For example, this would detect a service entity for the current SDK, or its process. Every entity must have some relation to the current SDK.
+- **Resource Provider (new)**: A component responsible for taking Resource and Entity detectors and doing the following:
+  - Constructing a Resource for the SDK from detectors.
+  - Dealing with conflicts between detectors.
+  - Providing SDK-internal access to detected Resources for reporting via the Log signal on configured LogProviders.
+  - *(new) Managing Entity changes during the SDK lifetime, specifically dealing with entities that have lifetimes shorter than the SDK.*
+
+#### Resource Provider
+
+The SDK Resource Provider is responsible for running all configured Resource and Entity Detectors. These run in a priority order that is user-controlled, with an OpenTelemetry-provided default.
+
+- The Resource Provider will detect conflicts when entities of the same type are discovered and choose one to use.
+- When using Entity Detectors and Resource detectors together, the following merge rules will be used:
+  - Entity merging will occur first, resulting in an "Entity Merged" Resource (see the [algorithm here](#entity-merging-and-resource)).
+  - Resource detectors otherwise follow existing merge semantics.
+  - The Specification merge rules will be updated to account for violations prevalent in ALL implementations of resource detection.
+    - Specifically: this means the [rules around merging Resource across schema-url will be dropped](../../specification/resource/sdk.md#merge). Instead, only conflicting attributes will be dropped.
+    - SchemaURL on Resource will be deprecated, with entity-specific schema-url replacing it. SDKs will only fill out SchemaURL on Resource when SchemaURL matches across all entities discovered. Additionally, only existing stable resource attributes can be used in Resource SchemaURL in stable OpenTelemetry components (specifically, `service.*` and `sdk.*` are the only stabilized resource conventions). Given prevalent concerns of implementations around the Resource merge specification, we suspect the impact of this deprecation to be minimal, as existing usage was within the "experimental" phase of semantic conventions.
+  - An OOTB ["Env Variable Entity Detector"](#environment-variable-detector) will be specified and provided vs. requiring SDK-wide ENV variables for resource detection.
+- *Additionally, the Resource Provider would be responsible for understanding Entity lifecycle events, for Entities whose lifetimes do not match or exceed the SDK's own lifetime (e.g. a browser session).*
+
+#### Entity Detector
+
+The Entity Detector in the SDK is responsible for detecting possible entities that could identify the SDK (called "associated entities"). For example, if the SDK is running in a Kubernetes pod, it may provide an Entity for that pod. SDK Entity Detectors are only required to provide identifying attributes, but may provide descriptive attributes to ensure the combined Resource contains attributes similar to what SDKs produce today.
+
+An Entity Detector would have an API similar to:
+
+```rust
+trait EntityDetector {
+    fn detect_entities(&self) -> Result<Vec<Entity>, EntityDetectionError>;
+}
+```
+
+Where `Result` is the equivalent of an error channel in the language of choice (e.g. in Go this would be `entities, err := e.detectEntities()`).
+
+An Entity Detector MUST NOT provide two entities of the same entity type (e.g. two `host` or two `service` entities).
+
+#### Entity Merging and Resource
+
+The most important aspect of this design is how Entities will be merged to construct a Resource.
+
+We provide a simple algorithm for this behavior:
+
+- Construct a set of detected entities, `E`
+  - All entity detectors are sorted by priority (highest first)
+  - For each entity detector `D`, detect entities
+    - For each entity detected, `d'`
+      - If an entity `e'` exists in `E` with the same entity type as `d'`, do one of the following:
+        - If the entity identity and schema_url are the same, merge the descriptive attributes of `d'` into `e'`:
+          - For each descriptive attribute `da'` in `d'`
+            - If `da'.key` does not exist in `e'`, then add `da'` to `e'`
+            - otherwise, ignore.
+        - If the entity identity is the same, but schema_url is different: drop the new entity `d'`.
+          *Note: We could offer configuration in this case.*
+        - If the entity identity is different: drop the new entity `d'`.
+      - Otherwise, add the entity `d'` to the set `E`
+- Construct a Resource from the set `E`.
+  - If all entities within `E` have the same `schema_url`, set the resource's
+    `schema_url` to match.
+  - Otherwise, leave the Resource `schema_url` blank.
+
+Any implementation that achieves the same result as this algorithm is acceptable.
+
+#### Environment Variable Detector
+
+An Entity Detector will be specified to allow a Platform to inject entity identity information into workloads running on that platform. For example, the OpenTelemetry Operator could inject information about the Kubernetes Deployment + Container into the environment, which SDKs can elect to interact with (through configuration of the Environment Variable Entity Detector). Here, Platform means an environment that can run workloads and provide identity for those workloads, e.g. Kubernetes, Spark, Cloud environments, etc.
+
+See [#3966](https://github.com/open-telemetry/opentelemetry-specification/issues/3966) for context on this issue.
+
+While the details of the ENV variables will be subject to change, it would look something like the following:
+
+```bash
+export OTEL_DETECTED_ENTITIES=k8s.deployment[k8s.deployment.name=my-program],k8s.pod[k8s.pod.name=my-program-2314,k8s.namespace=default]
+```
+
+The minimum requirements of this entity detector are:
+
+- ENV variable(s) can specify multiple entities (resource attribute bundles).
+- ENV variable(s) can be easily appended to or leveraged by multiple participating systems, if needed.
+- Entities discovered via ENV variable(s) can participate in Resource Provider generically, i.e. resolving conflicting definitions. +- ENV variable(s) have a priority that can be influenced by platform entity providers (e.g. prepending vs. appending) + +The actual design for this ENV variable interaction would follow the approval of this OTEP. + +### Interactions with OpenTelemetry Collector + +The OpenTelemetry collector can be updated to optionally interact with Entity on Resource. A new entity-focused resource detection process can be created which allows add/override behavior at the entity level, rather than individual attribute level. + +For example, the existing resource detector looks like this: + +```yaml +processors: + resourcedetection/docker: + detectors: [env, docker] + timeout: 2s + override: false +``` + +The future entity-based detector would look exactly the same, but interact with the entity model of resource: + +```yaml +processor: + entityresourcedetection: + # Order determines override behavior + detectors: [env, docker] + # False means only append if entity doesn't already exist. + override: false +``` + +The list of detectors is given in priority order (first wins, in event of a tie, outside of override configuration). The processor may need to be updated to allow the override flag to apply to each individual detector. + +The rules for attributes would follow entity merging rules, as defined for the SDK resource proivder. + +Note: While this proposals shows a new processor replacing the `resourcedetection` processor, the details of whether to modify-in-place the existing `resourcedetection` processor or create a new one would be determined as a follow up to this design. Ideally, we don't want users to need new configuration for resource in the otel collector. + +## Datamodel Changes + +Given our desired design and algorithms for detecting, merging and manipulating Entities, we need the ability to denote how entity and resource relate. These changes must not break existing usage of Resource, therefore: + +- The Entity model must be *layered on top of* the Resource model. A system does not need to interact with entities for correct behavior. +- Existing key usage of Resource must remain when using Entities, specifically navigationality (see: [OpenTelemetry Resources: Principles and Characteristics](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit)) +- Downstream components should be able to engage with the Entity model in Resource. + +The following changes are made: + +### Resource + +| Field | Type | Description | Changes | +| ----- | ---- | ----------- | ------- | +| schema_url | string | The Schema URL, if known. This is the identifier of the Schema that the resource data is recorded in. This field is deprecated and should no longer be used. | Will be deprecated | +| dropped_attributes_count | integer | dropped_attributes_count is the number of dropped attributes. If the value is 0, then no attributes were dropped. | Unchanged | +| attributes | repeated KeyValue | Set of attributes that describe the resource.

Attribute keys MUST be unique (it is not allowed to have more than one attribute with the same key).| Unchanged | +| entities | repeated ResourceEntityRef | Set of entities that participate in this Resource. | Added | + +The DataModel would ensure that attributes in Resource are produced from both the identifying and descriptive attributes of Entity. This does not mean the protocol needs to transmit duplicate data, that design is TBD. + +### ResourceEntityRef + +The entityref data model, would have the following changes from the original [entity OTEP](https://github.com/open-telemetry/oteps/blob/main/text/entities/0256-entities-data-model.md) to denote references within Resource: + +| Field | Type | Description | Changes | +| ----- | ---- | ----------- | ------- | +| schema_url | string | The Schema URL, if known. This is the identifier of the Schema that the entity data is recorded in. To learn more about Schema URL ([see docs](https://opentelemetry.io/docs/specs/otel/schemas/#schema-url)) | added | +| type | string | Defines the type of the entity. MUST not change during the lifetime of the entity. For example: "service" or "host". This field is required and MUST not be empty for valid entities. | unchanged | +| identifying_attributes_keys | repeated string | Attribute Keys that identify the entity.
MUST not change during the lifetime of the entity. The Id must contain at least one attribute.

These keys MUST exists in Resource.attributes.

Follows OpenTelemetry common attribute definition. SHOULD follow OpenTelemetry semantic conventions for attributes.| now a reference | +| descriptive_attributes_keys | repeated string | Descriptive (non-identifying) attribute keys of the entity.
MAY change over the lifetime of the entity. MAY be empty. These attribute keys are not part of entity's identity.

These keys MUST exist in Resource.attributes.

Follows any value definition in the OpenTelemetry spec - it can be a scalar value, byte array, an array or map of values. Arbitrary deep nesting of values for arrays and maps is allowed.

SHOULD follow OpenTelemetry semantic conventions for attributes.| now a reference | + +### Resource Identity + +OpenTelemetry resource identity will be modified as follows: + +- When `entities` is empty on resource, then its identity is the collection + of all `attributes` (both key and values). +- When `entities` is non-empty on resource, then its identity is the collection + of all `attributes` where the key is not found in `entities.descriptive_attributes_keys`. + +When grouping or mixing OTLP data, you can detect if two resources are the same +using its identity and merge descriptive attributes (if applicable) using the +entity merge algorithm (described above) which will be formalized in the data model. + +## How this proposal solves the problems that motivated it + +Let's look at some motivating problems from the [Entities Proposal](https://docs.google.com/document/d/1VUdBRInLEhO_0ABAoiLEssB1CQO_IcD5zDnaMEha42w/edit#heading=h.atg5m85uw9w8): + +### Problem 1: Commingling of Entities + +We embrace the need for commingling entities in Resource and allow downstream users to interact with the individual entities rather than erasing these details. + +### Problem 2: Lack of Precise Identity + +Identity is now clearly delineated from description via the Entity portion of Resource. When Entity is used for Resource, only identifying attributes need to be interacted with to create resource identity. + +### Problem 3: Lack of Mutable Attributes + +This proposal offers two solutions going forward to this: + +- Descriptive attributes may be mutated without violating Resource identity +- Entities whose lifetimes do not match SDK may be attached/removed from Resource. + +### Problem 4: Metric Cardinality Problem + +Via solution to (2) we can leverage an identity synthesized from identifying attributes on Entity. By directly modeling entity lifetimes, we guarantee that identity changes in Resource ONLY occur when source of telemetry changes. This solves unintended metric cardinality problems (while leaving those that are necessary to deal with, e.g. collecting metrics from phones or browser instances where intrinsic cardinality is high). + +## Entity WG Rubric + +The Entities WG came up with a rubric to evaluate solutions based on shared +beliefs and goals for the overall effort. Let's look at how each item is +achieved: + +### Resource detectors (soon to be entity detectors) need to be composable / disjoint + +Entity detection and Resource Manager now fulfill this need. + +### New entities added by extension should not break existing code + +Users will need to configure a new Entity detector for new entities being modelled. + +### Navigational attributes need to exist and can be used to identify an entity but could be augmented with UUID or other aspects. - Having ONLY a UUID for entity identification is not good enough + +Resource will still be composed of identifying and descriptive attributes of Entity, allowing baseline navigational attributes users already expect from resource. + +### Collector augmentation / enrichment (resource, e.g.) - Should be extensible and not hard-coded. We need a general algorithm not specific rulesets + +The concept of "Entity" is a new definition for Resource. Where previously, resource was a collection of attributes and users would interact with each +individually, now there is a "bundle" of attributes called an Entity. Entities have an identity and descriptions, and the collector is able to +identify conflicts against the set of attributes that make up an Entity. 
The merge rules defined here give the collector a general way to interact with "type", "identifying attributes" and "descriptive attributes",
+rather than hard-coded rules that have to understand nuances such as when `host.id` influences `host.name`.
+
+### Users are expected to provide / prioritize "detectors" and determine which entity is "producing" or most-important for a signal
+
+The Resource Provider allows users to configure the priority of Entity Detectors.
+
+### For an SDK - ALL telemetry should be associated with the same set of entities (resource labels)
+
+The Resource Provider is responsible for resolving entities into a cohesive Resource that meets the same demands as Resource today.
+
+## Open Questions
+
+The following remain open questions:
+
+### How to attach Entity "bundle" information in Resource?
+
+The protocol today requires a raw grab bag of Attributes on Resource. We cannot break this going forward. However, Entities represent a new mechanism for "bundling" attributes on Resource and interacting with these bundles. We do not want this to bloat the protocol, nor do we want it to cause oddities.
+
+Going forward, we have a set of options:
+
+- Duplicate attributes in the `Entity` section of Resource.
+- Reference attributes of Resource in the entity.
+- Only identify the Entity id and keep the attribute<->entity association out of band.
+- Extend Attribute on Resource so that we can track the entity type per Key-Value (across any attribute in OTLP).
+
+The third option prevents generic code from interacting with Resource and Entity without understanding the model of each. The first keeps all usage of entity simple at the expense of duplicating information, and the second is awkward to interact with from an OTLP usage perspective. The fourth violates our stability policy for OTLP.
+
+### How to deal with Resource/Entities whose lifecycle does not match the SDK?
+
+This proposal motivates a Resource Provider in the SDK whose job could include managing changes in entity lifetimes, but it does not account for how these changes would be broadcast across TracerProvider, LogProvider, MeterProvider, etc. That would be addressed in a follow-on OTEP.
+
+### How to deal with Prometheus Compatibility for non-SDK telemetry?
+
+Today, [Prometheus compatibility](../../specification/compatibility/prometheus_and_openmetrics.md) relies on two key attributes in Resource: `service.name` and `service.instance.id`. These are not guaranteed to exist outside of OpenTelemetry SDK generation. While this question is not fully answered, we believe outlining identity in all resources within OpenTelemetry allows us to define a solution in the future while preserving compatibility with what works today.
+
+Here's a list of requirements for the solution:
+
+- Existing Prometheus/OpenTelemetry users should be able to migrate from where they are today.
+- Any solution MUST work with the [info-typed metrics](https://github.com/prometheus/proposals/blob/main/proposals/2024-04-10-native-support-for-info-metrics-metadata.md#goals) being added in Prometheus.
+  - Resource descriptive attributes should leverage `info()` or metadata.
+  - Resource identifying attributes need more thought/design from the OpenTelemetry semconv + entities WG.
+  - Note: the current `info()` design will only work with the `target_info` metric by default (other info metrics can be specified per `info` call), and relies on `job`/`instance` labels for joins. These labels MUST be generated by the OTLP endpoint in Prometheus.
+- (desired) Users should be able to correlate metric timeseries to other signals via Resource attributes showing up as labels in Prometheus.
+- (desired) Conversion from `OTLP -> Prometheus` can be reversed such that `OTLP -> Prometheus -> OTLP` is non-lossy.
+
+Here are a few (non-exhaustive) options for what this could look like:
+
+- Option #1 - Stay with what we have today
+  - `target_info` continues to exist as it is, with all resource attributes.
+  - Prometheus OTLP ingestion continues to support promoting resource attributes to metric labels.
+- Option #2 - Promote all identifying attributes
+  - By default all identifying labels on Resource are promoted to resource attributes.
+  - All descriptive labels are placed on `target_info`.
+  - (likely) `job`/`instance` will need to be synthesized for resources lacking a `service` entity.
+- Option #3 - Encode entities into Prometheus as info metrics
+  - Create `{entity_type}_entity_info` metrics.
+  - Synthesize `job`/`instance` labels for joins between all `*_info` metrics.
+  - Expand the scope of the info-typed metrics work in Prometheus to work with this encoding.
+- Option #4 - Find solutions leveraging the [metadata design](https://docs.google.com/document/d/1epBslSSwRO2do4armx40fruStJy_PS6thROnPeDifz8/edit#heading=h.5sybau7waq2q)
+
+These designs will be explored and evaluated in light of the requirements. For now, Prometheus compatibility will continue with Option #1 as we work together towards building a better future for resource in Prometheus.
+
+### Should entities have a domain?
+
+Is it worth having a `domain` in addition to type for an entity? We could force each entity to exist in one domain and leverage domain generically in resource management. Entity Detectors would be responsible for an entire domain, selecting only ONE to apply to a resource. Domains could be layered, e.g. a Cloud-specific domain may layer on top of a Kubernetes domain, where a "GKE cluster entity" identifies *which* Kubernetes cluster a Kubernetes infra entity is part of. This layering could be done naively, via automatic join of participating entities, or via explicit relationships derived from GKE-specific hooks.
+
+It's unclear if this is needed initially, and we believe this could be layered in later.
+
+### Should resources have only one associated entity?
+
+Given the problems leading to the Entities working group, and the needs of existing Resource users today, we think it is infeasible and unscalable to limit resource to only one entity. This would place restrictions on modeling Entities that would require OpenTelemetry to be the sole source of entity definitions and hurt building an open and extensible ecosystem. Additionally, it would need careful definition of solutions for the following problems/rubrics:
+
+- New entities added by extension should not break existing code.
+- Collector augmentation / enrichment (resource, e.g.) - Should be extensible and not hard-coded. We need a general algorithm, not specific rulesets.
+
+### What identity should entities use (LID, UUID / GUID, or other)?
+
+One of the largest questions in the first entities' OTEP was how to identify an entity. This was an attempt to unify the need for navigational attributes with the notion that only identifying attributes of an Entity would show up in Resource going forward. This restriction is no longer necessary in this proposal, and we should reconsider how to model identity for an Entity.
+
+This can be done in follow-up design / OTEPs.
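+
+As a purely illustrative sketch of the trade-off, the same host entity could be
+identified either by meaningful attributes (the LID/GID approach of the first OTEP) or
+by an opaque generated identifier; the attribute name `host.entity.uuid` below is
+hypothetical and not part of any semantic convention:
+
+```json5
+[
+  {
+    // (a) Identity composed of meaningful, navigational attributes.
+    "Type": "host",
+    "Id": { "host.id": "fdbf79e8af94cb7f9e8df36789187052" }
+  },
+  {
+    // (b) Opaque identity: a generated UUID, with the navigational attributes
+    // kept only as descriptive attributes.
+    "Type": "host",
+    "Id": { "host.entity.uuid": "7c2f1f4e-4ad9-49a9-96f6-1b2c3d4e5f60" },
+    "Attributes": { "host.id": "fdbf79e8af94cb7f9e8df36789187052", "host.name": "myhost" }
+  }
+]
+```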
+ +### What happens if existing Resource translation in the collector remove resource attributes an Entity relies on? + +While we expect the collector to be the first component to start engaging with Entities in an architecture, this could lead to data model violations. We have a few options to deal with this issue: + +- Consider this a bug and warn users not to do it. +- Specify that missing attribute keys are acceptable for descriptive attribtues. +- Specify that missing attribute keys denote that entities are unusable for that batch of telemetry, and treat the content as malformed. + +### What about advanced entity interaction in the Collector? + +One problem that motivated this design is the issue of "local resource detection" vs. "remote signal collection" in the OpenTelemetry collector. That is, I have a process running on a machine writing to an OpenTelemetry +collector running on a different machine. The current `resourcedetectionprocessor` in the collector appends attributes to resource based on discovering *where the collector is running*. However, +as the collector could determine that telemetry has come from a different machine, it could also avoid adding resource attributes that are not relevant to incoming data. + +Today, `resourcedetectionprocessor` is naive, as is the algorithm proposed in this OTEP. We believe that a more sophisticated solution could be created where the collector would know not to join entities onto a +resource based on more advanced knowledge of the communication protocol used to obtain the data (e.g. using the ip address of the sender on an OTLP server). + +## Trade-offs and mitigations + +The design proposed here attempts to balance non-breaking (backwards and forwards compatible) changes with the need to improve problematic issues in the Specification. Given the inability of most SDKs to implement the current Resource merge specification, breaking this should have little effect on actual users. Instead, the proposed merge specification should allow implementations to match current behavior and expectation, while evolving for users who engage with the new model. + +### Why don't we download schema url contents? + +OpenTelemetry needs to work in environments that have no/limited access to the external internet. We entertained, and +dismissed merging solutions that *require* access to the contents of `schema_url` to work. While the core algorithm +*cannot require* this access, we *should* be able to provide improved processing and algorithms that may leverage this data. + +For example: + +- Within an SDK, we can registry entity schema information with `EntityDetector`. +- The OpenTelemetry Collector can allow registered `schema_url` via configuration + or (optionally) download schema on demand. + +This design does not prevent these solutions, but provides the baseline/fallback +where `schema_url` is not accessible and entities must still be usable. + +## Prior art and alternatives + +Previously, we have a few unaccepted oteps, e.g. ([OTEP 208](https://github.com/open-telemetry/oteps/pull/208)). Additionally, there are some alternatives that were considered in the Entities WG and rejected. + +Below is a brief discussion of some design decisions: + +- **Only associating one entity with a Resource.** This was rejected, as too high a friction point in evolving semantic conventions and allowing independent systems to coordinate identity + entities within the OpenTelemetry ecosystem. 
Eventually, this would force OpenTelemetry to model all possibly entities in the world and understand their interaction or otherwise prevent non-OpenTelemetry instrumentation from interacting with OpenTelemetry entities. +- **Embed fully Entity in Resource.** This was rejected because it makes it easy/trivial for Resource attributes and Entities to diverge. This would prevent the backwards/forwards compatibility goals and also require all participating OTLP users to leverage entities. Entity should be an opt-in / additional feature that may or may not be engaged with, depending on user need. +- **Re-using resource detection as-is** This was rejected as not having a viable compatibility path forward. Creating a new set of components that can preserve existing behavior while allowing users to adopt the new functionality means that users have better control of when they see / change system behavior, and adoption is more obvious across the ecosystem. + +## Future Posibilities + +This proposal opens the door for addressing issues where an Entity's lifetime does not match an SDK's lifetime, in addition to providing a data model where mutable (descriptive) attributes can be changed over the lifetime of a resource without affecting its identity. We expect a follow-on OTEP which directly handles this issue. + +## Use Cases + +Below are a set of use cases to help motivate this design. + +### SDK - Multiple Detectors of the same Entity type + +Let's consider the interaction of the SDK in the presence of multiple registered +entity detectors: + +```mermaid +flowchart LR + SDK["`**SDK**`"] -->|OTLP| BACKEND["`**Backend**`"] + SDK -.- RC((Resource Provider)) + RC -.- OTEL_DETECTOR((OpenTelemetry Default Resource Detection)) + RC -.- GCP_DETECTOR((Google Cloud Specific Resource Detection)) + GCP_DETECTOR -. Detects .-> GCE{{gcp.gce}} + GCP_DETECTOR -. Detects .-> GCPHOST{{"host (gcp)"}} + OTEL_DETECTOR -. Detects .-> HOST{{"host (generic)"}} + OTEL_DETECTOR -. Detects .-> PROCESS{{process}} + OTEL_DETECTOR -. Detects .-> SERVICE{{service}} +``` + +Here, there is a service running on Google Compute Engine. The user +has configured a Google Cloud specific set of entity detectors. Both the +built in OpenTelemetry detection and the configured Google Cloud detection +discover a `host` entity. + +The following outcome would occur: + +- The resulting resource would have all of the following entities: `host`, `process`, `service` and `gcp.gce` +- The user-configured resource detector would take priority over built in: the `host` defined from the Google Cloud detection would "win" and be included in resource. + - This means `host.id` e.g. could be the id discovered for GCE VMs. Similarly for other cloud provider detection, like Amazon EC2 where VMs are given a unique ID by the Cloud Provider, rather than a generic machine ID, e.g. + - This matches existing behavior/expectations today for AWS, GCP, etc. on what `host.id` would mean. +- Users would be able to configure which host wins, by swapping the priority order of "default" vs. cloud-specific detection. + +### SDK and Collector - Simple coordination + +Let's consider the interaction of resource, entity in the presence of an SDK +and a Collector: + +```mermaid +flowchart LR + SDK["`**SDK**`"] -->|OTLP| COLLECTOR["`**Collector**`"] + COLLECTOR -->|OTLP| BACKEND["`**Backend**`"] + SDK -.- RC((Resource Provider)) + COLLECTOR -.- RP((Resource Processor)) + RP -. Detects .-> EC2{{aws.ec2}} + RP -. Detects .-> HOST{{host}} + RC -. 
+  RC -. Detects .-> SERVICE{{service}}
+```
+
+Here, an SDK is running on Amazon EC2. It is configured with resource detection
+that finds a `process` and `service` entity. The SDK is sending data to an
+OpenTelemetry Collector that has a resource processor configured to detect
+the `ec2` and `host` entities.
+
+The resulting OTLP from the collector would contain a resource with all
+of the entities (`process`, `service`, `ec2`, and `host`). This is because
+the entities are all disjoint.
+
+*Note: this matches today's behavior of existing resource detection and the OpenTelemetry collector, where all attributes wind up on resource.*
+
+### SDK and Collector - Entity coordination with descriptive attributes
+
+Let's consider the interaction of resource and entity where both the SDK and the Collector detect an entity:
+
+```mermaid
+flowchart LR
+  SDK["`**SDK**`"] -->|OTLP| COLLECTOR["`**Collector**`"]
+  COLLECTOR -->|OTLP| BACKEND["`**Backend**`"]
+  SDK -.- RC((Resource Provider))
+  COLLECTOR -.- RP((Resource Processor))
+  RP -. Detects .-> HOST2{{host}}
+  RC -. Detects .-> HOST{{host}}
+  RC -. Detects .-> SERVICE{{service}}
+```
+
+Here, an SDK is running on a machine (physical or virtual). The SDK is
+configured to detect the host it is running on. The collector is also running
+on a machine (physical or virtual). Both the SDK and the Collector detect
+a `host` entity (with the same identity).
+
+The behavior would be as follows:
+
+- By default, the collector would append any missing descriptive attributes
+  from its `host` entity to the incoming `host` entity and resource.
+- If the collector's processor is configured with `override: true`, then the
+  host entity from the SDK would be dropped in favor of the collector's `host`
+  entity. All identifying+descriptive attributes from the original entity +
+  resource would be removed and those detected in the collector would replace them.
+
+This allows the collector to enrich or enhance resource attributes without altering the *identity* of the source.
+
+### SDK and Collector - Entity coordination with conflicts
+
+Let's consider the interaction of resource and entity where there is an identity conflict between the SDK and the Collector:
+
+```mermaid
+flowchart LR
+  SDK["`**SDK**`"] -->|OTLP| COLLECTOR["`**Collector**`"]
+  COLLECTOR -->|OTLP| BACKEND["`**Backend**`"]
+  SDK -.- RC((Resource Provider))
+  COLLECTOR -.- RP((Resource Processor))
+  RP -. Detects .-> HOST2{{host 2}}
+  RC -. Detects .-> HOST{{host 1}}
+  RC -. Detects .-> SERVICE{{service}}
+```
+
+Here, an SDK is running on a machine (physical or virtual). The SDK is
+configured to detect the host it is running on. The collector is also running
+on a machine (physical or virtual). Both the SDK and the Collector detect
+a `host` entity. However, the `host` entity has a *different identity* between
+the SDK and Collector.
+
+The behavior would be as follows:
+
+- The default would *drop* the entity detected by the collector, as the
+  entity identity does not match. This would mean, e.g., descriptive host
+  attributes from the collector are **not** added to the Resource in OTLP.
+- If the collector's processor is configured with `override: true`, then the
+  host entity from the SDK would be dropped in favor of the collector's `host`
+  entity. All identifying+descriptive attributes from the original entity +
+  resource would be removed and those detected in the collector would replace them.
+
+The default behavior is useful when the SDK and Collector are run on different
+machines.
Unlike today's resource detection, this could prevent `host` +descriptive attributes that were not detected by the SDK from being added to the +resource. + +The `override` behavior could also ensure that attributes which should be +detected and reported together are replaced together. Today, it's possible the +collector may detect and override some, but not all attributes from the SDK. + +### SDK and Collector - Entity coordination across versions + +Let's look at SDK + collector coordination where semantic version differences +can occur between components within the system. + +```mermaid +flowchart LR + SDK["`**SDK**`"] -->|OTLP| COLLECTOR["`**Collector**`"] + COLLECTOR -->|OTLP| BACKEND["`**Backend**`"] + SDK -.- RC((Resource Provider)) + COLLECTOR -.- RP((Resource Processor)) + RP -. Detects .-> POD{{"`k8s.pod + *schema: 1.26.0* + `"}} + RP -. Detects .-> DEPLOYMENT{{k8s.deployment}} + RC -. Detects .-> POD2{{"`k8s.pod + *schema: 1.25.0* + `"}} + RC -. Detects .-> SERVICE{{service}} +``` + +Here, an SDK is communicating with a Collector. The SDK and the collector +are both participating in resource detection through the use of entities, +however the installed versions of software are leverage different standard +versions between the collector and the SDK. + +Ideally, we'd like a solution where: + +- The user can ensure only attributes related to previously undiscovered, + but relevant, entities can be added in Resource (specifically, `k8s.deployment`). +- The user can address issues where schema version `1.26.0` and `1.25.0` may + have different attributes for the same entity. +- We have default rules and merging that requires the least amount of + configuration or customization for users to achieve their desired + attributes in resource. + +## Collection of Resource detectors and attributes used + +- Collector + - [system](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/system/metadata.yaml) + - host.arch + - host.name + - host.id + - host.ip + - host.mac + - host.cpu.vendor.id + - host.cpu.family + - host.cpu.model.id + - host.cpu.model.name + - host.cpu.stepping + - host.cpu.cache.l2.size + - os.description + - os.type + - [docker](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/docker/metadata.yaml) + - host.name + - os.type + - [heroku](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/heroku/metadata.yaml) + - cloud.provider + - heroku.app.id + - heroku.dyno.id + - heroku.release.commit + - heroku.release.creation_timestamp + - service.instance.id + - service.name + - service.version + - [gcp](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/gcp/metadata.yaml) + - gke + - cloud.provider + - cloud.platform + - cloud.account.id + - cloud.region + - cloud.availability_zone + - k8s.cluster.name + - host.id + - host.name + - gce + - cloud.provider + - cloud.platform + - cloud.account.id + - cloud.region + - cloud.availability_zone + - host.id + - host.name + - host.type + - (optional) gcp.gce.instance.hostname + - (optional) gcp.gce.instance.name + - AWS + - [ec2](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/aws/ec2/metadata.yaml) + - cloud.provider + - cloud.platform + - cloud.account.id + - cloud.region + - 
cloud.availability_zone + - host.id + - host.image.id + - host.name + - host.type + - [ecs](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/aws/ecs/metadata.yaml) + - cloud.provider + - cloud.platform + - cloud.account.id + - cloud.region + - cloud.availability_zone + - aws.ecs.cluster.arn + - aws.ecs.task.arn + - aws.ecs.task.family + - aws.ecs.task.id + - aws.ecs.task.revision + - aws.ecs.launchtype (V4 only) + - aws.log.group.names (V4 only) + - aws.log.group.arns (V4 only) + - aws.log.stream.names (V4 only) + - aws.log.stream.arns (V4 only) + - [elastic_beanstalk](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/resourcedetectionprocessor/internal/aws/elasticbeanstalk/metadata.yaml) + - cloud.provider + - cloud.platform + - deployment.environment + - service.instance.id + - service.version + - eks + - cloud.provider + - cloud.platform + - k8s.cluster.name + - lambda + - cloud.provider + - cloud.platform + - cloud.region + - faas.name + - faas.version + - faas.instance + - faas.max_memory + - aws.log.group.names + - aws.log.stream.names + - Azure + - cloud.provider + - cloud.platform + - cloud.region + - cloud.account.id + - host.id + - host.name + - azure.vm.name + - azure.vm.size + - azure.vm.scaleset.name + - azure.resourcegroup.name + - Azure aks + - cloud.provider + - cloud.platform + - k8s.cluster.name + - Consul + - cloud.region + - host.id + - host.name + - *exploded consul metadata* + - k8s Node + - k8s.node.uid + - Openshift + - cloud.provider + - cloud.platform + - cloud.region + - k8s.cluster.name +- Java Resource Detection + - SDK-Default + - service.name + - telemetry.sdk.version + - telemetry.sdk.language + - telemetry.sdk.name + - [process](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/691de74a4b0539c1329222aefb962c232028032b/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/ProcessResource.java#L60) + - process.pid + - process.command_line + - process.command_args + - process.executable.path + - [host](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/HostResource.java#L31) + - host.name + - host.arch + - [container](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/ContainerResource.java) + - container.id + - [os](https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/instrumentation/resources/library/src/main/java/io/opentelemetry/instrumentation/resources/OsResource.java) + - os.type + - [AWS](https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/aws-resources) + - EC2 + - host.id + - cloud.availability_zone + - host.type + - host.image.id + - cloud.account.id + - cloud.region + - host.name + - ECS + - cloud.provider + - cloud.platform + - aws.log.group.names + - aws.log.stream.names + - EKS + - cloud.provider + - cloud.platform + - k8s.cluster.name + - container.id + - Lambda + - cloud.platform + - cloud.region + - faas.name + - faas.version + - [GCP](https://github.com/open-telemetry/opentelemetry-java-contrib/tree/main/gcp-resources) + - cloud.provider + - cloud.platform + - cloud.account.id + - cloud.availability_zone + - cloud.region + - host.id + - host.name + - host.type + - k8s.pod.name + - 
k8s.namespace.name + - k8s.container.name + - k8s.cluster.name + - faas.name + - faas.instance + - Go + - [container](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/container.go) + - container.id + - [host](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/host_id.go) + - host.id + - [os](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/os.go) + - os.name + - [process](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/process.go) + - process.pid + - process.executable.name + - process.executable.path + - process.command_line + - process.command_args + - process.owner + - [builtin](https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/resource/builtin.go) + - service.instance.id + - service.name +- [OTEL operator](https://github.com/open-telemetry/opentelemetry-operator/blob/a1e8f927909b81eb368c0483940e0b90d7fdb057/pkg/instrumentation/sdk_test.go#L752) injected ENV variables + - service.instance.id + - service.name + - service.version + - k8s.namespace.name + - k8s.pod.name + - k8s.node.name + - k8s.container.name + +### Implications + +Some initial thoughts on implications: + +AWS, Azure, GCP, Heroku, etc. all provide the following "bundles" of resource: + +- `cloud.*` +- `faas.*`, when relevant +- `host.*`, when relevant +- `k8s.cluster.*`, when relevant +- `service.*` when relevant +- `container.*` for a subset of k8s providers + +"system" detection provides the following: + +- `host.*` +- `os.*` +- `process.*` for SDKs +- `container.*` for Docker images + +SDK specific detection provides the following: + +- `sdk.*` +- `service.*` + +The OTEL operator for k8s provides the following via ENV variables: + +- `k8s.namespace.*` +- `k8s.node.*` +- `k8s.pod.*` +- `k8s.container.*` +- `service.*` + +### What could this mean for choosing entities that belong on resource? + +Let's look at an example of a container running in kubernetes, specifically EKS. + +If the OTEL operator, the SDK and the collector are all used, the following +attributes will wind up on resource: + +- `service.*` - from SDK and otel operator +- `sdk.*` - from SDK +- `process.*` - from SDK +- `host.*` - Note: from system detector on collector +- `container.*` - from EKS detector on SDK +- `k8s.namespace.*` - from otel operator +- `k8s.node.*` - from otel operator +- `k8s.pod.*` - from otel operator +- `k8s.container.*` - from otel operator +- `k8s.cluster.*` - from EKS detector on SDK or collector +- `cloud.*` - from EKS detector on SDK or collector + +A simple litmus test derived from this for when to include an entity on +Resource would be: "Any entity relevant to the produced telemetry should be +included". + +However, this can be refined. Resources today provide a [few key features](https://docs.google.com/document/d/1Xd1JP7eNhRpdz1RIBLeA1_4UYPRJaouloAYqldCeNSc/edit): + +- They provide identity - Uniquely identifying the origin of the data. +- They provide "navigationality" - allowing users to find the source of the data within their o11y and infrastructure tools. +- They allow aggregation / slicing of data on interesting domains. + +A litmus test for what entities to include on resource should be as follows: + +- Is the entity the source/origin of the data? +- Does the entity help navigate to the source of the data? (e.g. `k8s.cluster.*` helping find a `k8s.container.*`) +- Do want to easily slice/aggregate on an axis provided by the entity? (e.g. 
quickly filtering all CPU container usage metrics across a cluster to find overloaded nodes). + +If the answer to any question is yes, then include the entity on resource. diff --git a/oteps/experimental/0121-config-service.md b/oteps/experimental/0121-config-service.md new file mode 100644 index 00000000000..af43645cdce --- /dev/null +++ b/oteps/experimental/0121-config-service.md @@ -0,0 +1,127 @@ +# A Dynamic Configuration Service for the SDK + +This proposal is a request to develop a prototype to configure metric collection periods. Per-metric and tracing configuration is also intended to be added, with details left for a later iteration. + +It is related to [this pull request](https://github.com/open-telemetry/opentelemetry-proto/pull/155) + +## Motivation + +During normal use, users may wish to collect metrics every 10 minutes. Later, while investigating a production issue, the same user could easily increase information available for debugging by reconfiguring some of their processes to collect metrics every 30 seconds. Because this change is centralized and does not require redeploying with new configurations, there is lower friction and risk in updating the configurations. + +## Explanation + +This OTEP is a proposal for an experimental feature [open-telemetry/opentelemetry-specification#62](https://github.com/open-telemetry/opentelemetry-specification/pull/632), to be developed as a proof of concept. This means no development will be done inside either the OpenTelemetry SDK or the collector. Since this will be implemented in [opentelemetry-go-contrib](https://github.com/open-telemetry/opentelemetry-go-contrib) and [opentelemetry-collector-contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib), all of this functionality will be optional. + +The user, when instrumenting their application, can configure the SDK with the endpoint of their remote configuration service, the associated Resource, and a default config to be used if it fails to read from the configuration service. + +The user must then set up the config service. This can be done through the collector, which can be set up to expose an arbitrary configuration service implementation. Depending on implementation, this allows the collector to either act as a stand-alone configuration service, or as a bridge to remote configurations of the user's monitoring backend by 'translating' the monitoring backend's protocol to comply with the OpenTelemetry configuration protocol. + +## Internal details + +In the future, we intend to add per-metric configuration. For example, this would allow the user to collect 5xx server error counts ever minute, and CPU usage statistics every 10 minutes. The remote configuration protocol was designed with this in mind, meaning that it includes more details than simply the metric collection period. + +Our remote configuration protocol will support this call: + +``` +service MetricConfig { + rpc GetMetricConfig (MetricConfigRequest) returns (MetricConfigResponse); +} +``` + +A request to the config service will look like this: + +``` +message MetricConfigRequest{ + + // Required. The resource for which configuration should be returned. + opentelemetry.proto.resource.v1.Resource resource = 1; + + // Optional. The value of ConfigResponse.fingerprint for the last configuration + // that the caller received and successfully applied. + bytes last_known_fingerprint = 2; +} +``` + +While the response will look like this: + +``` +message MetricConfigResponse { + + // Optional. 
The fingerprint associated with this MetricConfigResponse. Each
+  // change in configs yields a different fingerprint.
+  bytes fingerprint = 1;
+
+  // A Schedule is used to apply a particular scheduling configuration to
+  // a metric. If a metric name matches a schedule's patterns, then the metric
+  // adopts the configuration specified by the schedule.
+  message Schedule {
+
+    // A light-weight pattern that can match 1 or more
+    // metrics, for which this schedule will apply. The string is used to
+    // match against metric names. It should not exceed 100k characters.
+    message Pattern {
+      oneof match {
+        string equals = 1;      // matches the metric name exactly
+        string starts_with = 2; // prefix-matches the metric name
+      }
+    }
+
+    // Metrics with names that match at least one rule in the inclusion_patterns are
+    // targeted by this schedule. Metrics that match at least one rule from the
+    // exclusion_patterns are not targeted for this schedule, even if they match an
+    // inclusion pattern.
+    //
+    // For this iteration, since we only want one Schedule that applies to all metrics,
+    // we will not check the inclusion_patterns and exclusion_patterns.
+    repeated Pattern exclusion_patterns = 1;
+    repeated Pattern inclusion_patterns = 2;
+
+    // Describes the collection period for each schedule in seconds.
+    int32 period_sec = 3;
+  }
+
+  // For this iteration, since we only want one Schedule that applies to all metrics,
+  // we will have a restriction that schedules must have a length of 1, and we will
+  // not check the patterns when we apply the collection period.
+  repeated Schedule schedules = 2;
+
+  // Optional. The client is suggested to wait this long (in seconds) before
+  // pinging the configuration service again.
+  int32 suggested_wait_time_sec = 3;
+}
+```
+
+The SDK will periodically read a config from the service using `GetMetricConfig`. This reading interval can be guided by the `suggested_wait_time_sec` value from the previous response. If the SDK fails to read a config, it will just use either the default config or the most recent successfully read config. If it reads a new config, it will apply it.
+
+Export frequency from the SDK depends on Schedules. There can only be one Schedule for now, which defines the schedule for all metrics. The schedule has a collection period (`period_sec`), which defines how often metrics are exported.
+
+In the future, we will add per-metric configuration. Each Schedule also has inclusion_patterns and exclusion_patterns. Any metric that matches at least one of the inclusion_patterns and none of the exclusion_patterns will be exported every collection period (e.g. every minute). A component will be added that can export metrics that match a pattern from one Schedule at that Schedule's collection period, while exporting other metrics that match the patterns of another Schedule at that other Schedule's collection period.
+
+The collector will support a new interface for a DynamicConfig service that can be used by an SDK, allowing a custom implementation of the configuration service protocol described above to act as an optional bridge between an SDK and an arbitrary configuration service. This interface can be implemented as a shim to support accessing remote configurations from arbitrary backends. The collector is configured to expose an endpoint for requests to the DynamicConfig service, and returns results on that endpoint.
+
+## Trade-offs and mitigations
+
+This feature will be implemented purely as an experiment, to demonstrate its viability and usefulness. More investigation can be done after a rough prototype is demonstrated.
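+
+For concreteness, a rough sketch of what the prototype's polling loop could look like follows. All names here are hypothetical, assuming generated Go bindings for the messages above; the real prototype lives in opentelemetry-go-contrib and may differ. The polling-interval trade-offs discussed below apply to the `wait` value in this sketch.
+
+```go
+// Hypothetical sketch: poll the config service, apply new configs, and
+// honor suggested_wait_time_sec. "pb" stands for generated bindings of
+// the protocol above; it is not a real published package.
+func pollMetricConfig(ctx context.Context, client pb.MetricConfigClient,
+	req *pb.MetricConfigRequest, apply func(*pb.MetricConfigResponse)) {
+	wait := 10 * time.Minute // fallback polling interval
+	for {
+		resp, err := client.GetMetricConfig(ctx, req)
+		if err == nil && !bytes.Equal(resp.Fingerprint, req.LastKnownFingerprint) {
+			apply(resp) // a new config was read: apply its schedule(s)
+			req.LastKnownFingerprint = resp.Fingerprint
+		}
+		// On error, keep the default or most recently applied config.
+		if err == nil && resp.SuggestedWaitTimeSec > 0 {
+			wait = time.Duration(resp.SuggestedWaitTimeSec) * time.Second
+		}
+		select {
+		case <-ctx.Done():
+			return
+		case <-time.After(wait):
+		}
+	}
+}
+```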
+
+As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-640582048), the configuration service can be a potential attack vector for an application instrumented with OpenTelemetry, depending on what we allow in the protocol. The remote configuration protocol should highlight that, for future changes, caution is needed in terms of the sorts of configurations we allow.
+
+Having a small polling interval (how often we read configs) would mean that config changes could be applied very quickly. However, this would also increase load on the configuration service. The typical use case probably does not need config changes to be applied immediately and config changes will likely be quite infrequent, so a typical polling interval probably needs to be no more frequent than every several minutes.
+
+## Prior art and alternatives
+
+Jaeger has the option of a Remote sampler, which allows reading from a central configuration, even dynamically with an Adaptive sampler.
+
+The main design alternative for remote configuration is a push mechanism versus a polling mechanism. The benefit of a mechanism where the configuration service pushes new configs is that it is less work for the user, since they do not need to set up a configuration service. There is also no load associated with polling the configuration service in the instrumented application, which would keep the OpenTelemetry SDK more lightweight.
+
+Using a polling mechanism may be more performant in the context of large distributed applications with many instrumented processes. This is a result of the instrumented processes polling the configuration service, rather than the config service having to push to every process. A polling mechanism is also compatible with more network protocols, not just gRPC.
+
+## Open questions
+
+- As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-640582048): what happens if a malicious/accidental config change overwhelms the application/monitoring system? Is it the responsibility of the user to be cautious while making config changes? Should we automatically decrease telemetry exporting if we can detect performance problems?
+
+## Future possibilities
+
+If this OTEP is implemented, there is the option to remotely and dynamically configure other things. As mentioned [here](https://github.com/open-telemetry/opentelemetry-proto/pull/155#issuecomment-639878490), possibilities include labels and aggregations. As mentioned [here](https://github.com/open-telemetry/oteps/pull/121#discussion_r447839301), it is also possible to configure the collector.
+
+It is intended to add per-metric configuration as well as tracing configuration in the future.
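+
+As a sketch of how the per-metric Schedules from the internal details could eventually be applied, the matching rule might look like the following. These are hypothetical helper functions over generated bindings of the protocol above, not part of this proposal:
+
+```go
+// Hypothetical sketch (names illustrative): a metric is targeted by a
+// Schedule if it matches an inclusion pattern and no exclusion pattern.
+func matches(p *pb.MetricConfigResponse_Schedule_Pattern, metricName string) bool {
+	switch m := p.Match.(type) {
+	case *pb.MetricConfigResponse_Schedule_Pattern_Equals:
+		return metricName == m.Equals
+	case *pb.MetricConfigResponse_Schedule_Pattern_StartsWith:
+		return strings.HasPrefix(metricName, m.StartsWith)
+	}
+	return false
+}
+
+func scheduleApplies(s *pb.MetricConfigResponse_Schedule, metricName string) bool {
+	for _, p := range s.ExclusionPatterns {
+		if matches(p, metricName) {
+			return false // excluded, even if an inclusion pattern also matches
+		}
+	}
+	for _, p := range s.InclusionPatterns {
+		if matches(p, metricName) {
+			return true
+		}
+	}
+	return false
+}
+```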
diff --git a/oteps/images/otlp-client-server.png b/oteps/images/otlp-client-server.png new file mode 100644 index 00000000000..664fdd1eadb Binary files /dev/null and b/oteps/images/otlp-client-server.png differ diff --git a/oteps/images/otlp-concurrent.png b/oteps/images/otlp-concurrent.png new file mode 100644 index 00000000000..17d1ae18ec6 Binary files /dev/null and b/oteps/images/otlp-concurrent.png differ diff --git a/oteps/images/otlp-multi-destination.png b/oteps/images/otlp-multi-destination.png new file mode 100644 index 00000000000..743a9020bd9 Binary files /dev/null and b/oteps/images/otlp-multi-destination.png differ diff --git a/oteps/images/otlp-request-response.png b/oteps/images/otlp-request-response.png new file mode 100644 index 00000000000..87c133fb59d Binary files /dev/null and b/oteps/images/otlp-request-response.png differ diff --git a/oteps/images/otlp-sequential.png b/oteps/images/otlp-sequential.png new file mode 100644 index 00000000000..f4b9573356f Binary files /dev/null and b/oteps/images/otlp-sequential.png differ diff --git a/oteps/img/0066_context_propagation_details.png b/oteps/img/0066_context_propagation_details.png new file mode 100644 index 00000000000..459c847aa06 Binary files /dev/null and b/oteps/img/0066_context_propagation_details.png differ diff --git a/oteps/img/0066_context_propagation_overview.png b/oteps/img/0066_context_propagation_overview.png new file mode 100644 index 00000000000..70afb512f12 Binary files /dev/null and b/oteps/img/0066_context_propagation_overview.png differ diff --git a/oteps/img/0143_api_lifecycle.png b/oteps/img/0143_api_lifecycle.png new file mode 100644 index 00000000000..04efb594779 Binary files /dev/null and b/oteps/img/0143_api_lifecycle.png differ diff --git a/oteps/img/0143_cross_cutting.png b/oteps/img/0143_cross_cutting.png new file mode 100644 index 00000000000..66216f6cb61 Binary files /dev/null and b/oteps/img/0143_cross_cutting.png differ diff --git a/oteps/img/0143_long_term.png b/oteps/img/0143_long_term.png new file mode 100644 index 00000000000..2e41399e28c Binary files /dev/null and b/oteps/img/0143_long_term.png differ diff --git a/oteps/img/0152-collector.png b/oteps/img/0152-collector.png new file mode 100644 index 00000000000..98fa9797b66 Binary files /dev/null and b/oteps/img/0152-collector.png differ diff --git a/oteps/img/0152-otel-schema.png b/oteps/img/0152-otel-schema.png new file mode 100644 index 00000000000..cc6cbe7f81f Binary files /dev/null and b/oteps/img/0152-otel-schema.png differ diff --git a/oteps/img/0152-query-translate.png b/oteps/img/0152-query-translate.png new file mode 100644 index 00000000000..016e124d009 Binary files /dev/null and b/oteps/img/0152-query-translate.png differ diff --git a/oteps/img/0152-source-and-backend.png b/oteps/img/0152-source-and-backend.png new file mode 100644 index 00000000000..8ac5fef7f3a Binary files /dev/null and b/oteps/img/0152-source-and-backend.png differ diff --git a/oteps/img/0156-arrow-ecosystem.svg b/oteps/img/0156-arrow-ecosystem.svg new file mode 100644 index 00000000000..61b319d24d0 --- /dev/null +++ b/oteps/img/0156-arrow-ecosystem.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0156-resource-events.svg b/oteps/img/0156-resource-events.svg new file mode 100644 index 00000000000..8e32f99ddd2 --- /dev/null +++ b/oteps/img/0156-resource-events.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0156_All trials.png b/oteps/img/0156_All trials.png new file mode 100644 index 00000000000..0a9bdce47a4 
Binary files /dev/null and b/oteps/img/0156_All trials.png differ diff --git a/oteps/img/0156_Best trials.png b/oteps/img/0156_Best trials.png new file mode 100644 index 00000000000..c89aaafb6f5 Binary files /dev/null and b/oteps/img/0156_Best trials.png differ diff --git a/oteps/img/0156_OTEL - Arrow IPC.png b/oteps/img/0156_OTEL - Arrow IPC.png new file mode 100644 index 00000000000..b65a246912d Binary files /dev/null and b/oteps/img/0156_OTEL - Arrow IPC.png differ diff --git a/oteps/img/0156_OTEL - HowToUseArrow.png b/oteps/img/0156_OTEL - HowToUseArrow.png new file mode 100644 index 00000000000..154e1d863f0 Binary files /dev/null and b/oteps/img/0156_OTEL - HowToUseArrow.png differ diff --git a/oteps/img/0156_OTEL - ProtocolSeqDiagram.png b/oteps/img/0156_OTEL - ProtocolSeqDiagram.png new file mode 100644 index 00000000000..c623f2349b5 Binary files /dev/null and b/oteps/img/0156_OTEL - ProtocolSeqDiagram.png differ diff --git a/oteps/img/0156_OTEL - Row vs Column.png b/oteps/img/0156_OTEL - Row vs Column.png new file mode 100644 index 00000000000..ea11561f671 Binary files /dev/null and b/oteps/img/0156_OTEL - Row vs Column.png differ diff --git a/oteps/img/0156_OTEL-Metric-Model.png b/oteps/img/0156_OTEL-Metric-Model.png new file mode 100644 index 00000000000..1bd38128c9c Binary files /dev/null and b/oteps/img/0156_OTEL-Metric-Model.png differ diff --git a/oteps/img/0156_RecordBatch.png b/oteps/img/0156_RecordBatch.png new file mode 100644 index 00000000000..f32969ca9ac Binary files /dev/null and b/oteps/img/0156_RecordBatch.png differ diff --git a/oteps/img/0156_collector_internal_overview.png b/oteps/img/0156_collector_internal_overview.png new file mode 100644 index 00000000000..0d86e6b5de4 Binary files /dev/null and b/oteps/img/0156_collector_internal_overview.png differ diff --git a/oteps/img/0156_collector_phase_2.png b/oteps/img/0156_collector_phase_2.png new file mode 100644 index 00000000000..829c39d2cd3 Binary files /dev/null and b/oteps/img/0156_collector_phase_2.png differ diff --git a/oteps/img/0156_compression_ratio_summary_multivariate_metrics.png b/oteps/img/0156_compression_ratio_summary_multivariate_metrics.png new file mode 100644 index 00000000000..a09e4fb07f8 Binary files /dev/null and b/oteps/img/0156_compression_ratio_summary_multivariate_metrics.png differ diff --git a/oteps/img/0156_compression_ratio_summary_std_metrics.png b/oteps/img/0156_compression_ratio_summary_std_metrics.png new file mode 100644 index 00000000000..ca39029c3a2 Binary files /dev/null and b/oteps/img/0156_compression_ratio_summary_std_metrics.png differ diff --git a/oteps/img/0156_logs_bytes.png b/oteps/img/0156_logs_bytes.png new file mode 100644 index 00000000000..52d8374e6af Binary files /dev/null and b/oteps/img/0156_logs_bytes.png differ diff --git a/oteps/img/0156_logs_schema.png b/oteps/img/0156_logs_schema.png new file mode 100644 index 00000000000..9e3eb4be5f0 Binary files /dev/null and b/oteps/img/0156_logs_schema.png differ diff --git a/oteps/img/0156_logs_step_times.png b/oteps/img/0156_logs_step_times.png new file mode 100644 index 00000000000..a3e98c2efaf Binary files /dev/null and b/oteps/img/0156_logs_step_times.png differ diff --git a/oteps/img/0156_logs_step_times_phase1.png b/oteps/img/0156_logs_step_times_phase1.png new file mode 100644 index 00000000000..d1f50d02f33 Binary files /dev/null and b/oteps/img/0156_logs_step_times_phase1.png differ diff --git a/oteps/img/0156_metrics_schema.png b/oteps/img/0156_metrics_schema.png new file mode 100644 index 
00000000000..2d94e64bbf4 Binary files /dev/null and b/oteps/img/0156_metrics_schema.png differ diff --git a/oteps/img/0156_metrics_small_batches.png b/oteps/img/0156_metrics_small_batches.png new file mode 100644 index 00000000000..74db3c7c791 Binary files /dev/null and b/oteps/img/0156_metrics_small_batches.png differ diff --git a/oteps/img/0156_metrics_step_times.png b/oteps/img/0156_metrics_step_times.png new file mode 100644 index 00000000000..3e792a99165 Binary files /dev/null and b/oteps/img/0156_metrics_step_times.png differ diff --git a/oteps/img/0156_metrics_step_times_phase1.png b/oteps/img/0156_metrics_step_times_phase1.png new file mode 100644 index 00000000000..1f9c40c778f Binary files /dev/null and b/oteps/img/0156_metrics_step_times_phase1.png differ diff --git a/oteps/img/0156_multivariate_metrics_bytes.png b/oteps/img/0156_multivariate_metrics_bytes.png new file mode 100644 index 00000000000..f7caf50f26e Binary files /dev/null and b/oteps/img/0156_multivariate_metrics_bytes.png differ diff --git a/oteps/img/0156_summary.png b/oteps/img/0156_summary.png new file mode 100644 index 00000000000..8b83027c37d Binary files /dev/null and b/oteps/img/0156_summary.png differ diff --git a/oteps/img/0156_summary_time_spent.png b/oteps/img/0156_summary_time_spent.png new file mode 100644 index 00000000000..9ccdc411817 Binary files /dev/null and b/oteps/img/0156_summary_time_spent.png differ diff --git a/oteps/img/0156_traces_schema.png b/oteps/img/0156_traces_schema.png new file mode 100644 index 00000000000..98a20ee4ef9 Binary files /dev/null and b/oteps/img/0156_traces_schema.png differ diff --git a/oteps/img/0156_traces_step_times_phase1.png b/oteps/img/0156_traces_step_times_phase1.png new file mode 100644 index 00000000000..dba3d5b35f3 Binary files /dev/null and b/oteps/img/0156_traces_step_times_phase1.png differ diff --git a/oteps/img/0156_traffic_reduction_use_case.png b/oteps/img/0156_traffic_reduction_use_case.png new file mode 100644 index 00000000000..12b9ae34fc0 Binary files /dev/null and b/oteps/img/0156_traffic_reduction_use_case.png differ diff --git a/oteps/img/0156_univariate_metrics_bytes.png b/oteps/img/0156_univariate_metrics_bytes.png new file mode 100644 index 00000000000..6c23d7725fe Binary files /dev/null and b/oteps/img/0156_univariate_metrics_bytes.png differ diff --git a/oteps/img/0201-scope-multiplexing.png b/oteps/img/0201-scope-multiplexing.png new file mode 100644 index 00000000000..2407e6c1b5e Binary files /dev/null and b/oteps/img/0201-scope-multiplexing.png differ diff --git a/oteps/img/0235-sampling-threshold-calculation.png b/oteps/img/0235-sampling-threshold-calculation.png new file mode 100644 index 00000000000..b4a9063b6d3 Binary files /dev/null and b/oteps/img/0235-sampling-threshold-calculation.png differ diff --git a/oteps/img/0243-otel-weaver-component-schema.svg b/oteps/img/0243-otel-weaver-component-schema.svg new file mode 100644 index 00000000000..589be2e8c4e --- /dev/null +++ b/oteps/img/0243-otel-weaver-component-schema.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-concepts.svg b/oteps/img/0243-otel-weaver-concepts.svg new file mode 100644 index 00000000000..40594b2712a --- /dev/null +++ b/oteps/img/0243-otel-weaver-concepts.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-dev-strategies.svg b/oteps/img/0243-otel-weaver-dev-strategies.svg new file mode 100644 index 00000000000..fd1d893a362 --- /dev/null +++ b/oteps/img/0243-otel-weaver-dev-strategies.svg @@ 
-0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-hierarchy.svg b/oteps/img/0243-otel-weaver-hierarchy.svg new file mode 100644 index 00000000000..b7ea11580a0 --- /dev/null +++ b/oteps/img/0243-otel-weaver-hierarchy.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-overview.svg b/oteps/img/0243-otel-weaver-overview.svg new file mode 100644 index 00000000000..33a17a5ec69 --- /dev/null +++ b/oteps/img/0243-otel-weaver-overview.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-resolved-schema.svg b/oteps/img/0243-otel-weaver-resolved-schema.svg new file mode 100644 index 00000000000..1424b913b8a --- /dev/null +++ b/oteps/img/0243-otel-weaver-resolved-schema.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-responsibilities-properties.svg b/oteps/img/0243-otel-weaver-responsibilities-properties.svg new file mode 100644 index 00000000000..273c342c3b8 --- /dev/null +++ b/oteps/img/0243-otel-weaver-responsibilities-properties.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0243-otel-weaver-use-cases.svg b/oteps/img/0243-otel-weaver-use-cases.svg new file mode 100644 index 00000000000..77597df176e --- /dev/null +++ b/oteps/img/0243-otel-weaver-use-cases.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/oteps/img/0258-env-context-opentofu-trace.png b/oteps/img/0258-env-context-opentofu-trace.png new file mode 100644 index 00000000000..09bff16ebde Binary files /dev/null and b/oteps/img/0258-env-context-opentofu-trace.png differ diff --git a/oteps/img/0258-env-context-opentofu-tracing.png b/oteps/img/0258-env-context-opentofu-tracing.png new file mode 100644 index 00000000000..01adf6ed781 Binary files /dev/null and b/oteps/img/0258-env-context-opentofu-tracing.png differ diff --git a/oteps/img/0258-env-context-parent-child-process.png b/oteps/img/0258-env-context-parent-child-process.png new file mode 100644 index 00000000000..edaf1188f4d Binary files /dev/null and b/oteps/img/0258-env-context-parent-child-process.png differ diff --git a/oteps/logs/0091-logs-vocabulary.md b/oteps/logs/0091-logs-vocabulary.md new file mode 100644 index 00000000000..b9934a3c363 --- /dev/null +++ b/oteps/logs/0091-logs-vocabulary.md @@ -0,0 +1,58 @@ +# Logs: Vocabulary + +This documents defines the vocabulary for logs to be used across OpenTelemetry project. + +## Motivation + +We need a common language and common understanding of terms that we use to +avoid the chaos experienced by the builders of the Tower of Babel. + +## Proposal + +OpenTelemetry specification already contains a [vocabulary](../../specification/overview.md) +for Traces, Metrics and other relevant concepts. + +This proposal is to add the following concepts to the vocabulary. + +### Log Record + +A recording of an event. Typically the record includes a timestamp indicating +when the event happened as well as other data that describes what happened, +where it happened, etc. + +Also known as Log Entry. + +### Log + +Sometimes used to refer to a collection of Log Records. May be ambiguous, since +people also sometimes use `Log` to refer to a single `Log Record`, thus this +term should be used carefully and in the context where ambiguity is possible +additional qualifiers should be used (e.g. `Log Record`). + +### Embedded Log + +`Log Records` embedded inside a [Span](../../specification/trace/api.md#span) +object, in the [Events](../../specification/trace/api.md#add-events) list. 
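+
+As an illustration only (this vocabulary predates any specific API, and the names below come from today's Go tracing API rather than from this document), an Embedded Log is what an SDK records as an event on a span:
+
+```go
+package example
+
+import (
+	"context"
+
+	"go.opentelemetry.io/otel"
+	"go.opentelemetry.io/otel/attribute"
+	"go.opentelemetry.io/otel/trace"
+)
+
+// recordEmbeddedLog stores a log record in the Events list of a Span,
+// i.e. an Embedded Log rather than a record written standalone.
+func recordEmbeddedLog(ctx context.Context) {
+	tracer := otel.Tracer("example")
+	_, span := tracer.Start(ctx, "HTTP GET /checkout")
+	defer span.End()
+
+	// The event name, its (implicit) timestamp and its attributes together
+	// form the embedded log record.
+	span.AddEvent("cache miss", trace.WithAttributes(
+		attribute.String("cache.key", "user:1138"),
+	))
+}
+```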
+ +### Standalone Log + +`Log Records` that are not embedded inside a `Span` and are recorded elsewhere. + +### Log Attributes + +Key/value pairs contained in a `Log Record`. + +### Structured Logs + +Logs that are recorded in a format which has a well-defined structure that allows +to differentiate between different elements of a Log Record (e.g. the Timestamp, +the Attributes, etc). The _Syslog protocol_ ([RFC 5425](https://tools.ietf.org/html/rfc5424)), +for example, defines a `structured-data` format. + +### Flat File Logs + +Logs recorded in text files, often one line per log record (although multiline +records are possible too). There is no common industry agreement whether +logs written to text files in more structured formats (e.g. JSON files) +are considered Flat File Logs or not. Where such distinction is important it is +recommended to call it out specifically. diff --git a/oteps/logs/0092-logs-vision.md b/oteps/logs/0092-logs-vision.md new file mode 100644 index 00000000000..72b906623e0 --- /dev/null +++ b/oteps/logs/0092-logs-vision.md @@ -0,0 +1,126 @@ +# OpenTelemetry Logs Vision + +The following are high-level items that define our long-term vision for +Logs support in OpenTelemetry project, what we aspire to achieve. + +This a vision document that reflects our current desires. It is not a commitment +to implement everything precisely as listed. The primary purpose of this +document is to ensure all contributors work in alignment. As our vision changes +over time maintainers reserve the right to add, modify, and remove items from +this document. + +This document uses vocabulary introduced in . + +## First-class Citizen + +Logs are a first-class citizen in observability, along with traces and metrics. +We will aim to have best-in-class support for logs at OpenTelemetry. + +## Correlation + +OpenTelemetry will define how logs will be correlated with traces and metrics +and how this correlation information will be stored. + +Correlation will work across 2 major dimensions: + +- To correlate telemetry emitted for the same request (also known as Request + or Trace Context Correlation), +- To correlate telemetry emitted from the same source (also known as Resource + Context Correlation). + +## Logs Data Model + +We will design a Log Data model that will aim to correctly represent all types +of logs. The purpose of the data model is to have a common understanding of what +a log record is, what data needs to be recorded, transferred, stored and +interpreted by a logging system. + +Existing log formats can be unambiguously mapped to this data model. Reverse +mapping from this data model is also possible to the extent that the target log +format has equivalent capabilities. + +We will produce mapping recommendations for commonly used log formats. + +## Log Protocol + +Armed with the Log Data model we will aim to design a high performance protocol +for logs, which will pursue the same [design goals](https://github.com/open-telemetry/opentelemetry-proto/blob/main/docs/design-goals.md) +as we had for the traces and metrics protocol. + +Most notably the protocol will aim to be highly reliable, have low resource +consumption, be suitable for all participant nodes, ensure high throughput, +allow backpressure signalling and be load-balancer friendly (see the design +goals link above for clarifications). + +The reason for this design is to have a single OpenTelemetry protocol that can +deliver logs, traces and metrics via one connection and satisfy all design +goals. 
+ +## Unified Collection + +We aim to have high-performance, unified +[Collector](https://github.com/open-telemetry/opentelemetry-collector/) that +support logs, traces and metrics in one package, symmetrically and uniformly for +all 3 types of telemetry data (see also +[Collector vision](https://github.com/open-telemetry/opentelemetry-collector/blob/8310e665ec1babfd56ca5b1cfec91c1f997f4f2c/docs/vision.md)). + +The unified Collector will support multiple log protocols including the newly +designed OpenTelemetry log protocol. + +Unified collection is important for the following reasons: + +- One agent (or one collector) to deploy and manage. +- One place of configuration for target endpoints, authentication tokens, etc. +- Uniform tagging of all 3 types of telemetry data (enrichment by attributes + of resources where the data comes from or by user-defined attributes), + enabling correct correlation across Resource dimensions later on the backend. + +## Cloud Native + +We will have best-in-class support for logs emitted in cloud native environments +(e.g. Kubernetes, serverless, etc), including legacy applications running +in such environments. This is in line with our CNCF mission. + +## Support Legacy + +We will produce guidelines on how legacy applications can emit logs in a +manner that makes them compatible with OpenTelemetry's approach and enables +telemetry data correlation. We will also have a reasonable story around +logs that are emitted by sources over which we may have no control and which +emit logs in pre-defined formats via pre-defined mediums (e.g. flat file logs, +Syslog, etc). + +We will have technical solutions or guidelines for using popular logging +libraries in a OpenTelemetry-compatible manner and we may produce logging +libraries for languages where gaps exist. + +This is important because we believe software that was created before +OpenTelemetry should not be disregarded and should benefit from OpenTelemetry +efforts where possible. + +### Auto-instrumentation + +To enable functionality that requires modification of how logs are emitted we +will work on auto-instrumenting solutions. This will reduce the adoption barrier +for existing deployments. + +### Applicable to All Log Sources + +Logging support at OpenTelemetry will be applicable to all sorts of log sources: +system logs, infrastructure logs, third-party and first-party application logs. + +### Standalone and Embedded Logs + +OpenTelemetry will support both logs embedded inside [Spans](../../specification/trace/api.md#span) +and standalone logs recorded elsewhere. The support of embedded logs is +important for OpenTelemetry's primary use cases, where errors and exceptions +need to be embedded in Spans. The support of standalone logs is important for +legacy applications which may not emit Spans at all. + +## Legacy Use Cases + +Logging technology has a decades-long history. There exists a large number of +logging libraries, collection agents, network protocols, open-source and +proprietary backends. We recognize this fact and aim to make our proposals in a +manner that honours valid legacy use-cases, while at the same time suggests +better solutions where they are due. diff --git a/oteps/logs/0097-log-data-model.md b/oteps/logs/0097-log-data-model.md new file mode 100644 index 00000000000..10ca034180e --- /dev/null +++ b/oteps/logs/0097-log-data-model.md @@ -0,0 +1,1353 @@ +# Log Data Model + +Introduce Data Model for Log Records as it is understood by OpenTelemetry. 
+ +* [Motivation](#motivation) +* [Design Notes](#design-notes) + * [Requirements](#requirements) + * [Field Kinds](#field-kinds) +* [Log and Event Record Definition](#log-and-event-record-definition) + * [Field: `Timestamp`](#field-timestamp) + * [Trace Context Fields](#trace-context-fields) + * [Field: `TraceId`](#field-traceid) + * [Field: `SpanId`](#field-spanid) + * [Field: `TraceFlags`](#field-traceflags) + * [Severity Fields](#severity-fields) + * [Field: `SeverityText`](#field-severitytext) + * [Field: `SeverityNumber`](#field-severitynumber) + * [Mapping of `SeverityNumber`](#mapping-of-severitynumber) + * [Reverse Mapping](#reverse-mapping) + * [Error Semantics](#error-semantics) + * [Displaying Severity](#displaying-severity) + * [Comparing Severity](#comparing-severity) + * [Field: `Name`](#field-name) + * [Field: `Body`](#field-body) + * [Field: `Resource`](#field-resource) + * [Field: `Attributes`](#field-attributes) +* [Example Log Records](#example-log-records) +* [Questions Resolved during OTEP discussion](#questions-resolved-during-otep-discussion) + * [TraceFlags vs TraceParent and TraceState](#traceflags-vs-traceparent-and-tracestate) + * [Severity Fields](#severity-fields-1) + * [Timestamp Requirements](#timestamp-requirements) + * [Security Logs](#security-logs) +* [Alternate Design](#alternate-design) +* [Prior Art](#prior-art) + * [RFC5424 Syslog](#rfc5424-syslog) + * [Fluentd Forward Protocol Model](#fluentd-forward-protocol-model) +* [Appendix A. Example Mappings](#appendix-a-example-mappings) + * [RFC5424 Syslog](#rfc5424-syslog-1) + * [Windows Event Log](#windows-event-log) + * [SignalFx Events](#signalfx-events) + * [Splunk HEC](#splunk-hec) + * [Log4j](#log4j) + * [Zap](#zap) + * [Apache HTTP Server access log](#apache-http-server-access-log) + * [CloudTrail Log Event](#cloudtrail-log-event) + * [Google Cloud Logging](#google-cloud-logging) +* [Elastic Common Schema](#elastic-common-schema) +* [Appendix B: `SeverityNumber` example mappings](#appendix-b-severitynumber-example-mappings) +* [References](#references) + +## Motivation + +This is a proposal of a data model and semantic conventions that allow to +represent logs from various sources: application log files, machine generated +events, system logs, etc. Existing log formats can be unambiguously mapped to +this data model. Reverse mapping from this data model is also possible to the +extent that the target log format has equivalent capabilities. + +The purpose of the data model is to have a common understanding of what a log +record is, what data needs to be recorded, transferred, stored and interpreted +by a logging system. + +This proposal defines a data model for [Standalone +Logs](https://github.com/open-telemetry/oteps/blob/master/text/logs/0091-logs-vocabulary.md#standalone-log). +Relevant parts of it may be adopted for +[Embedded Logs](https://github.com/open-telemetry/oteps/blob/master/text/logs/0091-logs-vocabulary.md#embedded-log) +in a future OTEP. + +## Design Notes + +### Requirements + +The Data Model was designed to satisfy the following requirements: + +- It should be possible to unambiguously map existing log formats to this Data + Model. Translating log data from an arbitrary log format to this Data Model + and back should ideally result in identical data. + +- Mappings of other log formats to this Data Model should be semantically + meaningful. The Data Model must preserve the semantics of particular elements + of existing log formats. 
+ +- Translating log data from an arbitrary log format A to this Data Model and + then translating from the Data Model to another log format B ideally must + result in a meaningful translation of log data that is no worse than a + reasonable direct translation from log format A to log format B. + +- It should be possible to efficiently represent the Data Model in concrete + implementations that require the data to be stored or transmitted. We + primarily care about 2 aspects of efficiency: CPU usage for + serialization/deserialization and space requirements in serialized form. This + is an indirect requirement that is affected by the specific representation of + the Data Model rather than the Data Model itself, but is still useful to keep + in mind. + +The Data Model aims to successfully represent 3 sorts of logs and events: + +- System Formats. These are logs and events generated by the operating system + and over which we have no control - we cannot change the format or affect what + information is included (unless the data is generated by an application which + we can modify). An example of system format is Syslog. + +- Third-party Applications. These are generated by third-party applications. We + may have certain control over what information is included, e.g. customize the + format. An example is Apache log file. + +- First-party Applications. These are applications that we develop and we have + some control over how the logs and events are generated and what information + we include in the logs. We can likely modify the source code of the + application if needed. + +### Field Kinds + +This Data Model defines a logical model for a log record (irrespective of the +physical format and encoding of the record). Each record contains 2 kinds of +fields: + +- Named top-level fields of specific type and meaning. + +- Fields stored in the key/value pair lists, which can contain arbitrary values + of different types. The keys and values for well-known fields follow semantic + conventions for key names and possible values that allow all parties that work + with the field to have the same interpretation of the data. See references to + semantic conventions for `Resource` and `Attributes` fields and examples in + [Appendix A](#appendix-a-example-mappings). + +The reasons for having these 2 kinds of fields are: + +- Ability to efficiently represent named top-level fields, which are almost + always present (e.g. when using encodings like Protocol Buffers where fields + are enumerated but not named on the wire). + +- Ability to enforce types of named fields, which is very useful for compiled + languages with type checks. + +- Flexibility to represent less frequent data via key/value pair lists. This + includes well-known data that has standardized semantics as well as arbitrary + custom data that the application may want to include in the logs. + +When designing this data model we followed the following reasoning to make a +decision about when to use a top-level named field: + +- The field needs to be either mandatory for all records or be frequently + present in well-known log and event formats (such as `Timestamp`) or is + expected to be often present in log records in upcoming logging systems (such + as `TraceId`). + +- The field’s semantics must be the same for all known log and event formats and + can be mapped directly and unambiguously to this data model. + +Both of the above conditions were required to give the field a place in the +top-level structure of the record. 
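+
+As a rough illustration of these two kinds of fields (a sketch only — field names follow the definitions in the next section, and no concrete encoding is implied), a record could be modeled in Go as:
+
+```go
+package model
+
+// LogRecord sketches the logical record: named top-level fields with fixed
+// types, plus key/value pair lists holding arbitrary "any"-typed values.
+type LogRecord struct {
+	Timestamp      uint64         // nanoseconds since Unix epoch, optional
+	TraceID        []byte         // W3C trace id, optional
+	SpanID         []byte         // W3C span id, optional
+	TraceFlags     byte           // W3C trace flags, optional
+	SeverityText   string         // original severity text, e.g. "Informational"
+	SeverityNumber int            // normalized severity, 1..24
+	Name           string         // short event identifier
+	Body           any            // scalar, array or map; may nest arbitrarily
+	Resource       map[string]any // describes the source of the log
+	Attributes     map[string]any // information about the specific occurrence
+}
+```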
+ +## Log and Event Record Definition + +Note: below we use type `any`, which can be a scalar value (number, string or +boolean), or an array or map of values. Arbitrary deep nesting of values for +arrays and maps is allowed (essentially allow to represent an equivalent of a +JSON object). + +[Appendix A](#appendix-a-example-mappings) contains many examples that show how +existing log formats map to the fields defined below. If there are questions +about the meaning of the field reviewing the examples may be helpful. + +Here is the list of fields in a log record: + +Field Name |Description +---------------|-------------------------------------------- +Timestamp |Time when the event occurred. +TraceId |Request trace id. +SpanId |Request span id. +TraceFlags |W3C trace flag. +SeverityText |The severity text (also known as log level). +SeverityNumber |Numerical value of the severity. +Name |Short event identifier. +Body |The body of the log record. +Resource |Describes the source of the log. +Attributes |Additional information about the event. + +Below is the detailed description of each field. + +### Field: `Timestamp` + +Type: Timestamp, uint64 nanoseconds since Unix epoch. + +Description: Time when the event occurred measured by the origin clock. This +field is optional, it may be missing if the timestamp is unknown. + +### Trace Context Fields + +#### Field: `TraceId` + +Type: byte sequence. + +Description: Request trace id as defined in +[W3C Trace Context](https://www.w3.org/TR/trace-context/#trace-id). Can be set +for logs that are part of request processing and have an assigned trace id. This +field is optional. + +#### Field: `SpanId` + +Type: byte sequence. + +Description: Span id. Can be set for logs that are part of a particular +processing span. If SpanId is present TraceId SHOULD be also present. This field +is optional. + +#### Field: `TraceFlags` + +Type: byte. + +Description: Trace flag as defined in +[W3C Trace Context](https://www.w3.org/TR/trace-context/#trace-flags) +specification. At the time of writing the specification defines one flag - the +SAMPLED flag. This field is optional. + +### Severity Fields + +#### Field: `SeverityText` + +Type: string. + +Description: severity text (also known as log level). This is the original +string representation of the severity as it is known at the source. If this +field is missing and `SeverityNumber` is present then the short name that +corresponds to the `SeverityNumber` may be used as a substitution. This field is +optional. + +#### Field: `SeverityNumber` + +Type: number. + +Description: numerical value of the severity, normalized to values described in +this document. This field is optional. + +`SeverityNumber` is an integer number. Smaller numerical values correspond to +less severe events (such as debug events), larger numerical values correspond to +more severe events (such as errors and critical events). The following table +defines the meaning of `SeverityNumber` value: + +SeverityNumber range|Range name|Meaning +--------------------|----------|------- +1-4 |TRACE |A fine-grained debugging event. Typically disabled in default configurations. +5-8 |DEBUG |A debugging event. +9-12 |INFO |An informational event. Indicates that an event happened. +13-16 |WARN |A warning event. Not an error but is likely more important than an informational event. +17-20 |ERROR |An error event. Something went wrong. +21-24 |FATAL |A fatal error such as application or system crash. 
+ +Smaller numerical values in each range represent less important (less severe) +events. Larger numerical values in each range represent more important (more +severe) events. For example `SeverityNumber=17` describes an error that is less +critical than an error with `SeverityNumber=20`. + +#### Mapping of `SeverityNumber` + +Mappings from existing logging systems and formats (or **source format** for +short) must define how severity (or log level) of that particular format +corresponds to `SeverityNumber` of this data model based on the meaning given +for each range in the above table. + +If the source format has more than one severity that matches a single range in +this table then the severities of the source format must be assigned numerical +values from that range according to how severe (important) the source severity +is. + +For example if the source format defines "Error" and "Critical" as error events +and "Critical" is a more important and more severe situation then we can choose +the following `SeverityNumber` values for the mapping: "Error"->17, +"Critical"->18. + +If the source format has only a single severity that matches the meaning of the +range then it is recommended to assign that severity the smallest value of the +range. + +For example if the source format has an "Informational" log level and no other +log levels with similar meaning then it is recommended to use +`SeverityNumber=9` for "Informational". + +Source formats that do not define a concept of severity or log level MAY omit +`SeverityNumber` and `SeverityText` fields. Backend and UI may represent log +records with missing severity information distinctly or may interpret log +records with missing `SeverityNumber` and `SeverityText` fields as if the +`SeverityNumber` was set equal to INFO (numeric value of 9). + +#### Reverse Mapping + +When performing a reverse mapping from `SeverityNumber` to a specific format +and the `SeverityNumber` has no corresponding mapping entry for that format +then it is recommended to choose the target severity that is in the same +severity range and is closest numerically. + +For example Zap has only one severity in the INFO range, called "Info". When +doing reverse mapping all `SeverityNumber` values in INFO range (numeric 9-12) +will be mapped to Zap’s "Info" level. + +#### Error Semantics + +If `SeverityNumber` is present and has a value of ERROR (numeric 17) or higher +then it is an indication that the log record represents an erroneous situation. +It is up to the reader of this value to make a decision on how to use this fact +(e.g. UIs may display such errors in a different color or have a feature to find +all erroneous log records). + +If the log record represents an erroneous event and the source format does not +define a severity or log level concept then it is recommended to set +`SeverityNumber` to ERROR (numeric 17) during the mapping process. If the log +record represents a non-erroneous event the `SeverityNumber` field may be +omitted or may be set to any numeric value less than ERROR (numeric 17). The +recommended value in this case is INFO (numeric 9). See +[Appendix B](#appendix-b-severitynumber-example-mappings) for more mapping +examples. + +#### Displaying Severity + +The following table defines the recommended short name for each +`SeverityNumber` value. 
The short name can be used for example for representing +the `SeverityNumber` in the UI: + +SeverityNumber|Short Name +--------------|---------- +1 |TRACE +2 |TRACE2 +3 |TRACE3 +4 |TRACE4 +5 |DEBUG +6 |DEBUG2 +7 |DEBUG3 +8 |DEBUG4 +9 |INFO +10 |INFO2 +11 |INFO3 +12 |INFO4 +13 |WARN +14 |WARN2 +15 |WARN3 +16 |WARN4 +17 |ERROR +18 |ERROR2 +19 |ERROR3 +20 |ERROR4 +21 |FATAL +22 |FATAL2 +23 |FATAL3 +24 |FATAL4 + +When an individual log record is displayed it is recommended to show both +`SeverityText` and `SeverityNumber` values. A recommended combined string in +this case begins with the short name followed by `SeverityText` in parenthesis. + +For example "Informational" Syslog record will be displayed as **INFO +(Informational)**. When for a particular log record the `SeverityNumber` is +defined but the `SeverityText` is missing it is recommended to only show the +short name, e.g. **INFO**. + +When drop down lists (or other UI elements that are intended to represent the +possible set of values) are used for representing the severity it is preferable +to display the short name in such UI elements. + +For example a dropdown list of severities that allows filtering log records by +severities is likely to be more usable if it contains the short names of +`SeverityNumber` (and thus has a limited upper bound of elements) compared to a +dropdown list, which lists all distinct `SeverityText` values that are known to +the system (which can be a large number of elements, often differing only in +capitalization or abbreviated, e.g. "Info" vs "Information"). + +#### Comparing Severity + +In the contexts where severity participates in less-than / greater-than +comparisons `SeverityNumber` field should be used. `SeverityNumber` can be +compared to another `SeverityNumber` or to numbers in the 1..24 range (or to the +corresponding short names). + +When severity is used in equality or inequality comparisons (for example in +filters in the UIs) the recommendation is to attempt to use both `SeverityText` +and short name of `SeverityNumber` to perform matches (i.e. equality with either +of these fields should be considered a match). For example if we have a record +with `SeverityText` field equal to "Informational" and `SeverityNumber` field +equal to INFO then it may be preferable from the user experience perspective to +ensure that **severity="Informational"** and **severity="INFO"** conditions both +to are TRUE for that record. + +### Field: `Name` + +Type: string. + +Description: Short event identifier that does not contain varying parts. +`Name` describes what happened (e.g. "ProcessStarted"). Recommended to be +no longer than 50 characters. Not guaranteed to be unique in any way. Typically +used for filtering and grouping purposes in backends. This field is optional. + +### Field: `Body` + +Type: any. + +Description: A value containing the body of the log record (see the description +of `any` type above). Can be for example a human-readable string message +(including multi-line) describing the event in a free form or it can be a +structured data composed of arrays and maps of other values. Can vary for each +occurrence of the event coming from the same source. This field is optional. + +### Field: `Resource` + +Type: key/value pair list. + +Description: Describes the source of the log, aka +[resource](../../specification/overview.md#resources). +"key" of each pair is a `string` and "value" is of `any` type. 
Multiple +occurrences of events coming from the same event source can happen across time +and they all have the same value of `Resource`. Can contain for example +information about the application that emits the record or about the +infrastructure where the application runs. Data formats that represent this data +model may be designed in a manner that allows the `Resource` field to be +recorded only once per batch of log records that come from the same source. +SHOULD follow OpenTelemetry +[semantic conventions for Resources](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/resource). +This field is optional. + +### Field: `Attributes` + +Type: key/value pair list. + +Description: Additional information about the specific event occurrence. "key" +of each pair is a `string` and "value" is of `any` type. Unlike the `Resource` +field, which is fixed for a particular source, `Attributes` can vary for each +occurrence of the event coming from the same source. Can contain information +about the request context (other than TraceId/SpanId). SHOULD follow +OpenTelemetry +[semantic conventions for Attributes](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/attribute-naming.md). +This field is optional. + +## Example Log Records + +Below are examples that show one possible representation of log records in JSON. +These are just examples to help understand the data model. Don’t treat the +examples as _the_ way to represent this data model in JSON. + +This document does not define the actual encoding and format of the log record +representation. Format definitions will be done in separate OTEPs (e.g. the log +records may be represented as msgpack, JSON, Protocol Buffer messages, etc). + +Example 1 + +```javascript +{ + "Timestamp": 1586960586000, // JSON needs to make a decision about + // how to represent nanoseconds. + "Attributes": { + "http.status_code": 500, + "http.url": "http://example.com", + "my.custom.application.tag": "hello", + }, + "Resource": { + "service.name": "donut_shop", + "service.version": "semver:2.0.0", + "k8s.pod.uid": "1138528c-c36e-11e9-a1a7-42010a800198", + }, + "TraceId": "f4dbb3edd765f620", // this is a byte sequence + // (hex-encoded in JSON) + "SpanId": "43222c2d51a7abe3", + "SeverityText": "INFO", + "SeverityNumber": 9, + "Body": "20200415T072306-0700 INFO I like donuts" +} +``` + +Example 2 + +```javascript +{ + "Timestamp": 1586960586000, + ... + "Body": { + "i": "am", + "an": "event", + "of": { + "some": "complexity" + } + } +} +``` + +Example 3 + +```javascript +{ + "Timestamp": 1586960586000, + "Attributes":{ + "http.scheme":"https", + "http.host":"donut.mycie.com", + "http.target":"/order", + "http.method":"post", + "http.status_code":500, + "http.flavor":"1.1", + "http.user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36", + } +} +``` + +## Questions Resolved during OTEP discussion + +These were Open Questions that were discussed and resolved in +[OTEP Pull Request]( https://github.com/open-telemetry/oteps/pull/97) + +### TraceFlags vs TraceParent and TraceState + +Question: Should we store entire +[W3C Trace Context](https://www.w3.org/TR/trace-context/), including +`traceparent` and `tracestate` fields instead of only `TraceFlags`? + +Answer: the discussion did not reveal any evidence that `traceparent` and +`tracestate` are needed. + +### Severity Fields + +Question: Is `SeverityText`/`SeverityNumber` fields design good enough? 
+ +Answer: Discussions have shown that the design is reasonable. + +### Timestamp Requirements + +Question: Early draft of this proposal specified that `Timestamp` should be +populated from a monotonic, NTP-synchronized source. We removed this requirement +to avoid confusion. Do we need any requirements for timestamp sources? + +Answer: discussions revealed that it is not data model's responsibility to +specify such requirements. + +### Security Logs + +Question: Is there a need for special treatment of security logs? + +Answer: discussions in the OTEP did not reveal the need for any special +treatment of security logs in the context of the data model proposal. + +## Alternate Design + +An +[alternate design](https://docs.google.com/document/d/1ix9_4TQO3o-qyeyNhcOmqAc1MTyr-wnXxxsdWgCMn9c/edit?ts=5e990fe2#heading=h.cw69q2ga62p6) +that used an envelop approach was considered but I did not find it to be overall +better than this one. + +## Prior Art + +### RFC5424 Syslog + +[RFC5424](https://tools.ietf.org/html/rfc5424) defines structured log data +format and protocol. The protocol is ubiquitous (although unfortunately many +implementations don’t follow structured data recommendations). Here are some +drawbacks that do not make Syslog a serious contender for a data model: + +- While it allows structured attributes the body of the message can be only a + string. + +- Severity is hard-coded to 8 possible numeric values, and does not allow custom + severity texts. + +- Structured data does not allow arbitrary nesting and is 2-level only. + +- No clear separate place to specify data source (aka resource). There are a + couple hard-coded fields that serve this purpose in a limited way (HOSTNAME, + APP-NAME, FACILITY). + +### Fluentd Forward Protocol Model + +[Forward protocol](https://github.com/fluent/fluentd/wiki/Forward-Protocol-Specification-v1) +defines a log Entry concept as a timestamped record. The record consists of 2 +elements: a tag and a map of arbitrary key/value pairs. + +The model is universal enough to represent any log record. However, here are +some drawbacks: + +- All attributes of a record are represented via generic key/value pairs (except + tag and timestamp). This misses the optimization opportunities (see [Design + Notes](#design-notes)). + +- There is no clear separate place to specify data source (aka resource). + +- There is no mention of how exactly keys should be named and what are expected + values. This lack of any naming convention or standardization of key/value + pairs makes interoperability difficult. + +## Appendix A. Example Mappings + +This section contains examples of mapping of other events and logs formats to +this data model. + +### RFC5424 Syslog + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
PropertyTypeDescriptionMaps to Unified Model Field
TIMESTAMPTimestampTime when an event occurred measured by the origin clock.Timestamp
SEVERITYenumDefines the importance of the event. Example: `Debug`Severity
FACILITYenumDescribes where the event originated. A predefined list of Unix processes. Part of event source identity. Example: `mail system`Attributes["syslog.facility"]
VERSIONnumberMeta: protocol version, orthogonal to the event.Attributes["syslog.version"]
HOSTNAMEstringDescribes the location where the event originated. Possible values are FQDN, IP address, etc.Resource["host.hostname"]
APP-NAMEstringUser-defined app name. Part of event source identity.Resource["service.name"]
PROCIDstringNot well defined. May be used as a meta field for protocol operation purposes or may be part of event source identity.Attributes["syslog.procid"]
MSGIDstringDefines the type of the event. Part of event source identity. Example: "TCPIN"Name
STRUCTURED-DATAarray of maps of string to stringA variety of use cases depending on the SDID: +Can describe event source identity +Can include data that describes particular occurrence of the event. +Can be meta-information, e.g. quality of timestamp value.SDID origin.swVersion map to Resource["service.version"] + +SDID origin.ip map to attribute[net.host.ip"] + +Rest of SDIDs -> Attributes["syslog.*"]
MSGstringFree-form text message about the event. Typically human readable.Body
+ +### Windows Event Log + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
PropertyTypeDescriptionMaps to Unified Model Field
TimeCreatedTimestampThe time stamp that identifies when the event was logged.Timestamp
LevelenumContains the severity level of the event.Severity
ComputerstringThe name of the computer on which the event occurred.Resource["host.hostname"]
EventIDuintThe identifier that the provider used to identify the event.Name
MessagestringThe message string.Body
Rest of the fields.anyAll other fields in the event.Attributes["winlog.*"]
+ +### SignalFx Events + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
TimestampTimestampTime when the event occurred measured by the origin clock.Timestamp
EventTypestringShort machine understandable string describing the event type. SignalFx specific concept. Non-namespaced. Example: k8s Event Reason field.Name
CategoryenumDescribes where the event originated and why. SignalFx specific concept. Example: AGENT. If this attribute is not present on the SignalFx Event, it should be set to the null attribute value in the LogRecord -- this will allow unambigous identification of SignalFx events when they are represented as LogRecords.Attributes["com.splunk.signalfx.event_category"]
Dimensionsmap of string to stringHelps to define the identity of the event source together with EventType and Category. Multiple occurrences of events coming from the same event source can happen across time and they all have the same value of Dimensions. In SignalFx, event Dimensions, along with the EventType, determine individual Event Time Series (ETS).Attributes
Propertiesmap of string to anyAdditional information about the specific event occurrence. Unlike Dimensions which are fixed for a particular event source, Properties can have different values for each occurrence of the event coming from the same event source. In SignalFx, event Properties are considered additional metadata about an event and do not factor into the identity of an Event Time Series (ETS).Attributes["com.splunk.signalfx.event_properties"]
+ +### Splunk HEC + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
timenumeric, stringThe event time in epoch time format, in seconds.Timestamp
hoststringThe host value to assign to the event data. This is typically the host name of the client that you are sending data from.Resource["host.hostname"]
sourcestringThe source value to assign to the event data. For example, if you are sending data from an app you are developing, you could set this key to the name of the app.Resource["service.name"]
sourcetypestringThe sourcetype value to assign to the event data.Attributes["source.type"]
eventanyThe JSON representation of the raw body of the event. It can be a string, number, string array, number array, JSON object, or a JSON array.Body
fieldsMap of anySpecifies a JSON object that contains explicit custom fields.Attributes
indexstringThe name of the index by which the event data is to be indexed. The index you specify here must be within the list of allowed indexes if the token has the indexes parameter set.TBD, most like will go to attributes
+ +### Log4j + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
InstantTimestampTime when an event occurred measured by the origin clock.Timestamp
LevelenumLog level.Severity
MessagestringHuman readable message.Body
All other fieldsanyStructured data.Attributes
+ +### Zap + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
tsTimestampTime when an event occurred measured by the origin clock.Timestamp
levelenumLogging level.Severity
callerstringCalling function's filename and line number. +Attributes, key=TBD
msgstringHuman readable message.Body
All other fieldsanyStructured data.Attributes
+ +### Apache HTTP Server access log + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
%tTimestampTime when an event occurred measured by the origin clock.Timestamp
%astringClient IPAttributes["net.peer.ip"]
%AstringServer IPAttributes["net.host.ip"]
%hstringRemote hostname. Attributes["net.peer.name"]
%mstringThe request method.Attributes["http.method"]
%v,%p,%U,%qstringMultiple fields that can be composed into URL.Attributes["http.url"]
%>sstringResponse status.Attributes["http.status_code"]
All other fieldsanyStructured data.Attributes, key=TBD
+ +### CloudTrail Log Event + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
eventTimestringThe date and time the request was made, in coordinated universal time (UTC).Timestamp
eventSourcestringThe service that the request was made to. This name is typically a short form of the service name without spaces plus .amazonaws.com.Resource["service.name"]?
awsRegionstringThe AWS region that the request was made to, such as us-east-2.Resource["cloud.region"]
sourceIPAddressstringThe IP address that the request was made from.Resource["net.peer.ip"] or Resource["net.host.ip"]? TBD
errorCodestringThe AWS service error if the request returns an error.Name
errorMessagestringIf the request returns an error, the description of the error.Body
All other fields*Attributes["cloudtrail.*"]
+ +### Google Cloud Logging + +Field | Type | Description | Maps to Unified Model Field +-----------------|--------------------| ------------------------------------------------------- | --------------------------- +timestamp | string | The time the event described by the log entry occurred. | Timestamp +resource | MonitoredResource | The monitored resource that produced this log entry. | Resource +log_name | string | The URL-encoded LOG_ID suffix of the log_name field identifies which log stream this entry belongs to. | Name +json_payload | google.protobuf.Struct | The log entry payload, represented as a structure that is expressed as a JSON object. | Body +proto_payload | google.protobuf.Any | The log entry payload, represented as a protocol buffer. | Body +text_payload | string | The log entry payload, represented as a Unicode string (UTF-8). | Body +severity | LogSeverity | The severity of the log entry. | Severity +trace | string | The trace associated with the log entry, if any. | TraceId +span_id | string | The span ID within the trace associated with the log entry. | SpanId +labels | map | A set of user-defined (key, value) data that provides additional information about the log entry. | Attributes +All other fields | | | Attributes["google.*"] + +## Elastic Common Schema + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
FieldTypeDescriptionMaps to Unified Model Field
@timestampdatetimeTime the event was recordedtimestamp
messagestringAny type of messagebody
labelskey/valueArbitrary labels related to the eventattributes[*]
tagsarray of stringList of values related to the event?
trace.idstringTrace IDTraceId
span.id*stringSpan IDSpanId
agent.ephemeral_idstringEphemeral ID created by agent**resource
agent.idstringUnique identifier of this agent**resource
agent.namestringName given to the agentresource["telemetry.sdk.name"]
agent.typestringType of agentresource["telemetry.sdk.language"]
agent.versionstringVersion of agentresource["telemetry.sdk.version"]
source.ip, client.ipstringThe IP address that the request was made from.attributes["net.peer.ip"] or attributes["net.host.ip"]
cloud.account.idstringID of the account in the given cloudresource["cloud.account.id"]
cloud.availability_zonestringAvailability zone in which this host is running.resource["cloud.zone"]
cloud.instance.idstringInstance ID of the host machine.**resource
cloud.instance.namestringInstance name of the host machine.**resource
cloud.machine.typestringMachine type of the host machine.**resource
cloud.providerstringName of the cloud provider. Example values are aws, azure, gcp, or digitalocean.resource["cloud.provider"]
cloud.regionstringRegion in which this host is running.resource["cloud.region"]
cloud.image.id*stringresource["host.image.name"]
container.idstringUnique container idresource["container.id"]
container.image.namestringName of the image the container was built on.resource["container.image.name"]
container.image.tagArray of stringContainer image tags.**resource
container.labelskey/valueImage labels.attributes[*]
container.namestringContainer name.resource["container.name"]
container.runtimestringRuntime managing this container. Example: "docker"**resource
destination.addressstringDestination address for the eventattributes["destination.address"]
error.codestringError code describing the error.attributes["error.code"]
error.idstringUnique identifier for the error.attributes["error.id"]
error.messagestringError message.attributes["error.message"]
error.stack_tracestringThe stack trace of this error in plain text.attributes["error.stack_trace]
host.architecturestringOperating system architecture**resource
host.domainstringName of the domain of which the host is a member. + +For example, on Windows this could be the host’s Active Directory domain or NetBIOS domain name. For Linux this could be the domain of the host’s LDAP provider.**resource
host.hostnamestringHostname of the host. + +It normally contains what the hostname command returns on the host machine.resource["host.hostname"]
host.idstringUnique host id.resource["host.id"]
host.ipArray of stringHost IPresource["host.ip"]
host.macarray of stringMAC addresses of the hostresource["host.mac"]
host.namestringName of the host. + +It may contain what hostname returns on Unix systems, the fully qualified, or a name specified by the user. resource["host.name"]
host.typestringType of host.resource["host.type"]
host.uptimestringSeconds the host has been up.?
service.ephemeral_id + +stringEphemeral identifier of this service**resource
service.idstringUnique identifier of the running service. If the service is comprised of many nodes, the service.id should be the same for all nodes.**resource
service.namestringName of the service data is collected from.resource["service.name"]
service.node.namestringSpecific node serving that serviceresource["service.instance.id"]
service.statestringCurrent state of the service.attributes["service.state"]
service.typestringThe type of the service data is collected from.**resource
service.versionstringVersion of the service the data was collected from.resource["service.version"]
+ +\* Not yet formalized into ECS. + +\*\* A resource that doesn’t exist in the +[OpenTelemetry resource semantic convention](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/resource). + +This is a selection of the most relevant fields. See +[for the full reference](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html) +for an exhaustive list. + +## Appendix B: `SeverityNumber` example mappings + +|Syslog |WinEvtLog |Log4j |Zap |java.util.logging|SeverityNumber| +|-------------|-----------|------|------|-----------------|--------------| +| | |TRACE | | FINEST |TRACE | +|Debug |Verbose |DEBUG |Debug | FINER |DEBUG | +| | | | | FINE |DEBUG2 | +| | | | | CONFIG |DEBUG3 | +|Informational|Information|INFO |Info | INFO |INFO | +|Notice | | | | |INFO2 | +|Warning |Warning |WARN |Warn | WARNING |WARN | +|Error |Error |ERROR |Error | SEVERE |ERROR | +|Critical |Critical | |Dpanic| |ERROR2 | +|Emergency | | |Panic | |ERROR3 | +|Alert | |FATAL |Fatal | |FATAL | + +## References + +- [Draft discussion of Data Model](https://docs.google.com/document/d/1ix9_4TQO3o-qyeyNhcOmqAc1MTyr-wnXxxsdWgCMn9c/edit#) + +- [Discussion of Severity field](https://docs.google.com/document/d/1WQDz1jF0yKBXe3OibXWfy3g6lor9SvjZ4xT-8uuDCiA/edit#) diff --git a/oteps/logs/0130-logs-1.0ga-definition.md b/oteps/logs/0130-logs-1.0ga-definition.md new file mode 100644 index 00000000000..9e42e50cbf5 --- /dev/null +++ b/oteps/logs/0130-logs-1.0ga-definition.md @@ -0,0 +1,63 @@ +# Logs GA Scope + +This document defines what's in scope for OpenTelemetry Logs General +Availability release. + +## Motivation + +Clearly defined scope is important to align all logs contributors and to make +sure we know what target we are working towards. Note that some of the listed +items are already fully or partially implemented but are still listed for +completeness. + +General Availability of OpenTelemetry Logs is expected after OpenTelemetry 1.0 +GA (which will only include traces and metrics). + +## Logs Roadmap Items + +### Specification and SDK + +- Write guidelines and specification for logging libraries to support + OpenTelemetry-compliant logs. + +- Implement OpenTelemetry logs SDK for Java that support trace context + extraction from the current execution context. + +- Show full OpenTelemetry-compliant implementation of an "addon" to one of the + popular logging libraries for Java (e.g. Log4J, SLF4J, etc). Use OpenTelemetry + SDK to automatically include extracted trace context in the logs and support + exporting via OTLP. + +- Implement an example Java application that shows how to emit correlated traces + and logs. Use the supported popular logging library together with + OpenTelemetry SDK, and export logs using OTLP. + +- Add Logs support to OTLP specification. + +### Collector + +- Implement receiver and exporter for OTLP logs in Collector. + +- Implement "array" value type for in-memory data structures (it is already part + of OTLP but is not supported by the Collector yet). + +- Implement log exporters in Collector for a few vendor formats from + participating vendors. + +- Implement Fluent Forward protocol receiver to receive logs from + FluentBit/FluentD. + +- Add support for log data type to the following processors: `resource`, + `batch`, `attributes`, `k8s_tagger`, `resourcedetection`. + +- Add end-to-end performance tests for log forwarding (similar to existing trace + and metric tests) at least for OTLP and Fluent Forward protocols. 
+ +- Test operation of Collector together with at least one other logging agent + (e.g. FluentBit), allowing to read file logs as described here. Publish test + results (including performance). + +- Implement an example that shows how to use OpenTelemetry Collector to collect + correlated logs, traces and metrics from a distributed microservices + application (preferably running in a cloud-native control plane like + Kubernetes) diff --git a/oteps/logs/0150-logging-library-sdk.md b/oteps/logs/0150-logging-library-sdk.md new file mode 100644 index 00000000000..2c7d4033406 --- /dev/null +++ b/oteps/logs/0150-logging-library-sdk.md @@ -0,0 +1,269 @@ +# Logging Library SDK Prototype Specification + +Initial draft specification to create prototypes of OpenTelemetry Logging +Library SDKs. + +Status: for prototyping. Don't merge to main OpenTelemetry specification until +prototyping is done. + +## Motivation + +This is a draft proposal for OpenTelemetry Logging Library SDK specification. +The purpose of his proposal is to lay out the initial understanding of +what Logging Library SDKs will look like. + +We would like to create prototypes based on this proposal in a few languages. +The learnings from the prototypes will then allow us to refine this proposal and +eventually make the proposed approach part of the OpenTelemetry specification. + +The approach proposed in this OTEP is not intended to be merged into the +OpenTelemetry specification after the OTEP itself is approved. We want to have +successful prototypes first. This OTEP is first and foremost a specification and +guidelines on how to create the prototypes. + +The specification defines how the OpenTelemetry Logging Library SDK exposes its +functionality to authors of extensions to language-specific 3rd party logging +libraries and to end users that want to produce logs in the +[OpenTelemetry manner](../../specification/logs/README.md). + +The specification defines SDK elements that to some extent mirror the +OpenTelemetry +[Trace SDK](../../specification/trace). +This ensures uniformity and consistency of the OpenTelemetry specification and +of the implementations across traces and logs. For additional clarity the +definitions in this document refer to the Trace analogs where appropriate. + +The descriptions of interfaces and methods in this document are intentionally +very brief. Detailed and precise descriptions will be borrowed from Trace +specification when this OTEP is submitted as a Log specification PR. I kept the +descriptions as short as possible in this document to make reviewing it easy and +fast. + +## Specification + +Many existing logging libraries have some sort of extension mechanism that +allows to customize how log records are encoded and delivered to their +destinations (for example Appender in Log4j or Core in zapcore). The +OpenTelemetry Logging Library SDK is intended to be used by such extensions to +emit logs in OpenTelemetry formats. + +Note: The functionality that this document proposes will be an SDK package. A +logging-related API package may be added in the future if we decide to have an +end-user callable logging API. Until then the functions and methods callable +from this SDK package are intended to be used by Logging Libraries only and are +NOT intended to be used by the end user and will NOT be exposed in the +OpenTelemetry API package. + +We will have the following in the SDK. + +### LogEmitterProvider + +Methods: + +- Get LogEmitter. 
Accepts the instrumentation library name and version and + returns a LogEmitter associated with the instrumentation library. +- Shutdown. +- ForceFlush. + +LogEmitterProvider can be configured at startup time, to be associated with a +Resource and with LogProcessor/LogExporter pipeline. + +### LogEmitter + +Methods: + +- Emit(LogRecord). Emits a log record. The LogRecord and the Resource and + Instrumentation library associated with the LogEmitter will be converted into + a readable LogData and will be pushed through the SDK and the configured + LogProcessors and LogExporter. It is expected that the caller will populate + trace context related fields (TraceId,SpanId,TraceFlags) if applicable before + making the call. Open Question: do we need to also pass the Baggage so that + log processors and exporters can use it if they see the need? + + Note: some languages may opt to avoid having a LogRecord data type and instead + use a more idiomatic builder pattern to prepare and emit a log record (see + e.g. + [Java discussion](https://github.com/open-telemetry/opentelemetry-java/pull/3759#discussion_r738019425)) + +- Flush. + +### LogRecord + +See LogRecord +[data model](../../specification/logs/data-model.md) +for the list of fields. + +Open Question: should LoggerName field be added to the data model to allow +logging libraries to supply it? We have an +[open PR](https://github.com/open-telemetry/opentelemetry-specification/pull/1236). + +### LogProcessor + +Plugin interface. Analog of SpanProcessor. Interface to hook the log record +emitting action. + +Methods: + +- Emit(LogData). Call when a log record is ready to be processed and exported. +- Shutdown. +- ForceFlush. + +Built-in implementations: SimpleLogProcessor, BatchLogProcessor. + +### LogData + +Readable LogRecord data plus associated Resource and InstrumentationLibrary. +Analog of SpanData. + +### LogExporter + +Plugin interface. Analog of SpanExporter. Allows to implement protocol-specific +exporters so that they can be plugged into OpenTelemetry SDK and support sending +of log data. + +Methods: + +- Export(batch). Exports a batch of LogData. +- Shutdown. + +## Usage + +### How to Create Log4J Style Appender + +An Appender implementation can be used to allow emitting log records via +OpenTelemetry Logging Library exporters. This approach is typically used for +applications which are fine with changing the log transport and is +[one of the supported](../../specification/logs/README.md#direct-to-collector) +log collection approaches. + +The Appender implementation will typically acquire a LogEmitter from the global +LogEmitterProvider at startup time, then call LogEmitter.Emit for log records +received from the application. + +For languages with implicit Context, the Appender may call Context API to get +the currently +[active Span](../../specification/trace/api.md#context-interaction) +and populate TraceId, SpanId, TraceFlags fields of the LogRecord before emitting +it. The log library may also have an alternate way to inject the context into +log records (e.g. MDC in Log4j). + +![Appender](images/otep0150/appender.png) + +This same approach can be also used for example for: + +- Python logging library by creating a Handler. +- Go zap logging library by implementing the Core interface. Note that since + there is no implicit Context in Go it is not possible to get and use the + active Span. 
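As a concrete illustration of the appender flow described in this section, here is a rough Java sketch. It assumes the hypothetical `LogEmitterProvider`, `LogEmitter`, and `LogRecord` types proposed above (with a builder-style `LogRecord`, which some languages may prefer); the Log4j-specific plumbing (extending `AbstractAppender`, plugin registration, layouts) is deliberately omitted.

```java
// Sketch only: LogEmitterProvider, LogEmitter, and LogRecord are the hypothetical
// SDK types proposed in this OTEP, not a published API. Log4j wiring is omitted.
public final class OtelAppenderSketch {

  private final LogEmitter emitter;

  public OtelAppenderSketch(LogEmitterProvider provider) {
    // Acquire the emitter once at startup, keyed by instrumentation library name/version.
    this.emitter = provider.getLogEmitter("my-log4j-appender", "0.1.0");
  }

  // Called by the logging library for each log event it dispatches.
  public void append(long epochMillis, String severityText, int severityNumber, String message) {
    LogRecord record = LogRecord.builder()
        .setTimestamp(epochMillis * 1_000_000L)   // the data model timestamp is in nanoseconds
        .setSeverityText(severityText)
        .setSeverityNumber(severityNumber)        // e.g. 9 for INFO, 17 for ERROR
        .setBody(message)
        // In Java the Context is implicit, so the appender could also copy the active
        // span's TraceId/SpanId/TraceFlags into the record here before emitting.
        .build();
    emitter.emit(record);
  }
}
```

A Python `Handler` or a zap `Core` implementation would follow the same shape; in Go the implicit-context lookup is not available, as noted above.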
+ +Appenders can be created in OpenTelemetry language libraries by OpenTelemetry +maintainers, or by 3rd parties for any logging library that supports a similar +extension mechanism. This specification recommends each OpenTelemetry language +library to include out-of-the-box Appender implementation for at least one most +popular logging library. + +### Logging to File + +One of the possible approaches to emit and collect logs that OpenTelemetry +supports is via intermediary files. When configuring the LogEmitterProvider, +OTLP File exporter should be used to write logs to a file or stdout in either +OTLP JSON or OTLP Protobuf binary format. + +![Logging to File](images/otep0150/otlp-file.png) + +### Logging Directly to OTLP Network Destination + +The approach is the same as for logging to a file, except OTLP/gRPC or OTLP/HTTP +exporter implementation is used. + +### Implicit Context Injection + +When Context is implicitly available (e.g. in Java) it may be fetched by the log +library extension synchronously for every log record by calling the +OpenTelemetry Context API and injecting the span context fields into the +LogRecord before emitting it. + +Some log libraries have mechanisms specifically tailored for injecting +contextual information into log records. An example of such a mechanism is Log4j +MDC. When available such mechanisms may be the preferable place to fetch the +span context and inject it into the log records, since it usually allows +fetching of the context to work correctly even when log records are emitted +asynchronously (which otherwise can result in the incorrect implicit context +being fetched. + +### Explicit Context + +In languages where the Context must be provided explicitly (e.g. Go) the end +user must capture the context and explicitly pass it to the logging subsystem in +order for trace context to be recorded in Log records. + +Support for OpenTelemetry for logging libraries in these languages typically can +be implemented in the form of logger wrappers that can capture the context once, +when the span is created and then use the wrapped logger to execute log +statements in a normal way. The wrapper will be responsible for injecting the +captured context in the log records. + +This specification does not define how exactly it is achieved since the actual +mechanism depends on the language and the particular logging library used. In +any case the wrappers are expected to make use of the Trace Context API to get +the current active span. + +See +[an example](https://docs.google.com/document/d/15vR7D1x2tKd7u3zaTF0yH1WaHkUr2T4hhr7OyiZgmBg/edit#heading=h.4xuru5ljcups) +of how it can be done for zap logging library for Go. + +### Custom LogExporter + +LogExporter implementations can be plugged into OpenTelemetry Logging Library to send logs via custom protocols. + +OTLP/gRPC, OTLP/HTTP, OTLP/File log exporters are provided with OpenTelemetry Logging Library out of the box. + +![Custom Exporter](images/otep0150/custom-exporter.png) + +### Custom LogProcessor + +LogProcessor implementations can be plugged into the OpenTelemetry Logging +Library to have custom processing of logs before they are exported. + +Simple and Batch processors should be provided by the OpenTelemetry Logging +Library out of the box. + +![Custom Processor](images/otep0150/custom-processor.png) + +## Prior art and alternatives + +This specification relies on OpenTelemetry Trace and API specification as much +as possible. It is written after examining existing logging libraries (see +References). 
+ +An alternative approach was considered where we had the traditional separation +into API and SDK packages which followed the OpenTelemetry Trace library more +closely. However we found that the alternative approach was artificially trying +to enforce uniformity at the cost of ending up with a worse and less coherent +solution from logs perspective and also would make future introduction of +user-facing logging API more complicated. + +## Open questions + +- Decide if want to pass the Baggage to LogEmitter's Emit method. For example + Spring Cloud Sleuth currently enables appending Sleuth Baggage to log records + via the slf4j MDC. See + [Spring documentation](https://docs.spring.io/spring-cloud-sleuth/docs/current/reference/htmlsingle/#features-baggage) + of this feature. + +- Make final decision on + [adding LoggerName](https://github.com/open-telemetry/opentelemetry-specification/pull/1236). + +## Future possibilities + +In the future we may also add end-user callable OpenTelemetry Logging API +specification. + +## References + +- Overview of logging library structure by + [David Poncelow](https://docs.google.com/document/d/15vR7D1x2tKd7u3zaTF0yH1WaHkUr2T4hhr7OyiZgmBg/edit#) +- Log4J + [Extending guide](https://logging.apache.org/log4j/2.x/manual/extending.html). +- Python [logging handlers](https://docs.python.org/3/library/logging.handlers.html). +- Zap [Core](https://pkg.go.dev/go.uber.org/zap@v1.16.0/zapcore#Core). diff --git a/oteps/logs/images/otep0150/appender.png b/oteps/logs/images/otep0150/appender.png new file mode 100644 index 00000000000..70ab7414a8b Binary files /dev/null and b/oteps/logs/images/otep0150/appender.png differ diff --git a/oteps/logs/images/otep0150/custom-exporter.png b/oteps/logs/images/otep0150/custom-exporter.png new file mode 100644 index 00000000000..c0809a581de Binary files /dev/null and b/oteps/logs/images/otep0150/custom-exporter.png differ diff --git a/oteps/logs/images/otep0150/custom-processor.png b/oteps/logs/images/otep0150/custom-processor.png new file mode 100644 index 00000000000..26325bf2218 Binary files /dev/null and b/oteps/logs/images/otep0150/custom-processor.png differ diff --git a/oteps/logs/images/otep0150/otlp-file.png b/oteps/logs/images/otep0150/otlp-file.png new file mode 100644 index 00000000000..80c9113a4c8 Binary files /dev/null and b/oteps/logs/images/otep0150/otlp-file.png differ diff --git a/oteps/metrics/0003-measure-metric-type.md b/oteps/metrics/0003-measure-metric-type.md new file mode 100644 index 00000000000..4f1d2a1ceb8 --- /dev/null +++ b/oteps/metrics/0003-measure-metric-type.md @@ -0,0 +1,179 @@ +# Consolidate pre-aggregated and raw metrics APIs + +## Foreword + +A working group convened on 8/21/2019 to discuss and debate the two metrics RFCs (0003 and 0004) and several surrounding concerns. This document has been revised with related updates that were agreed upon during this working session. See the [meeting notes](https://docs.google.com/document/d/1d0afxe3J6bQT-I6UbRXeIYNcTIyBQv4axfjKF4yvAPA/edit#). + +## Overview + +Introduce a `Measure` kind of metric object that supports a `Record` API method. Like existing `Gauge` and `Cumulative` metrics, the new `Measure` metric supports pre-defined labels. A new `RecordBatch` measurement API is introduced for recording multiple metric observations simultaneously. + +## Terminology + +This RFC changes how "Measure" is used in the OpenTelemetry metrics specification. Before, "Measure" was the name of a series of raw measurements. 
After, "Measure" is the kind of a metric object used for recording a series raw measurements. + +Since this document will be read in the future after the proposal has been written, uses of the word "current" lead to confusion. For this document, the term "preceding" refers to the state that was current prior to these changes. + +The preceding specification used the term `TimeSeries` to describe an instrument bound with a set of pre-defined labels. In this document, [the term "Handle" is used to describe an instrument with bound labels](0009-metric-handles.md). In a future OTEP this will be again changed to "Bound instrument". The term "Handle" is used throughout this document to refer to a bound instrument. + +## Motivation + +In the preceding `Metric.GetOrCreateTimeSeries` API for Gauges and Cumulatives, the caller obtains a `TimeSeries` handle for repeatedly recording metrics with certain pre-defined label values set. This enables an important optimization for exporting pre-aggregated metrics, since the implementation is able to compute the aggregate summary "entry" using a pointer or fast table lookup. The efficiency gain requires that the aggregation keys be a subset of the pre-defined labels. + +Application programs with long-lived objects and associated Metrics can take advantage of pre-defined labels by computing label values once per object (e.g., in a constructor), rather than once per call site. In this way, the use of pre-defined labels improves the usability of the API as well as makes an important optimization possible to the implementation. + +The preceding raw statistics API did not specify support for pre-defined labels. This RFC replaces the raw statistics API by a new, general-purpose kind of metric, `MeasureMetric`, generally intended for recording individual measurements like the preceding raw statistics API, with explicit support for pre-defined labels. + +The preceding raw statistics API supported all-or-none recording for interdependent measurements using a common label set. This RFC introduces a `RecordBatch` API to support recording batches of measurements in a single API call, where a `Measurement` is now defined as a pair of `MeasureMetric` and `Value` (integer or floating point). + +## Explanation + +The common use for `MeasureMetric`, like the preceding raw statistics API, is for reporting information about rates and distributions over structured, numerical event data. Measure metrics are the most general-purpose of metrics. Informally, the individual metric event has a logical format expressed as one primary key=value (the metric name and a numerical value) and any number of secondary key=values (the labels, resources, and context). + + metric_name=_number_ + pre_defined1=_any_value_ + pre_defined2=_any_value_ + ... + resource1=_any_value_ + resource2=_any_value_ + ... + context_tag1=_any_value_ + context_tag2=_any_value_ + ... + +Here, "pre_defined" keys are those captured in the metrics handle, "resource" keys are those configured when the SDK was initialized, and "context_tag" keys are those propagated via context. + +Events of this form can logically capture a single update to a named metric, whether a cumulative, gauge, or measure kind of metric. This logical structure defines a _low-level encoding_ of any metric event, across the three kinds of metric. 
An SDK could simply encode a stream of these events and the consumer, provided access to the metric definition, should be able to interpret these events according to the semantics prescribed for each kind of metric. + +## Metrics API concepts + +The `Meter` interface represents the metrics portion of the OpenTelemetry API. + +There are three kinds of metric instrument, `CumulativeMetric`, `GaugeMetric`, and `MeasureMetric`. + +Metric instruments are constructed through the `Meter` API. Constructing an instrument automatically registers it with the SDK. The common attributes of a metric instrument are: + +| Field | Description | +|------|-----------| +| Name | A string. | +| Kind | One of Cumulative, Gauge, or Measure. | +| Recommended Keys | Default aggregation keys. | +| Unit | The unit of measurement being recorded. | +| Description | Information about this metric. | + +See the specification for more information on these fields, including formatting and uniqueness requirements. To define a new metric, use one of the `Meter` API methods (e.g., with names like `NewCumulativeMetric`, `NewGaugeMetric`, or `NewMeasureMetric`). + +Metric instrument Handles combine a metric instrument with a set of pre-defined labels. Handles are obtained by calling a language-specific API method (e.g., `GetHandle`) on the metric instrument with certain label values. Handles may be used to `Set()`, `Add()`, or `Record()` metrics according to their kind. + +## Selecting Metric Kind + +By separation of API and implementation in OpenTelemetry, we know that an implementation is free to do _anything_ in response to a metric API call. By the low-level interpretation defined above, all metric events have the same structural representation, only their logical interpretation varies according to the metric definition. Therefore, we select metric kinds based on two primary concerns: + +1. What should be the default implementation behavior? Unless configured otherwise, how should the implementation treat this metric variable? +2. How will the program source code read? Each metric uses a different verb, which helps convey meaning and describe default behavior. Cumulatives have an `Add()` method. Gauges have a `Set()` method. Measures have a `Record()` method. + +To guide the user in selecting the right kind of metric for an application, we'll consider the following questions about the primary intent of reporting given data. We use "of primary interest" here to mean information that is almost certainly useful in understanding system behavior. Consider these questions: + +- Does the measurement represent a quantity of something? Is it also non-negative? +- Is the sum a matter of primary interest? +- Is the event count a matter of primary interest? +- Is the distribution (p50, p99, etc.) a matter of primary interest? + +The specification will be updated with the following guidance. + +### Cumulative metric + +Likely to be the most common kind of metric, cumulative metric events express the computation of a sum. Choose this kind of metric when the value is a quantity, the sum is of primary interest, and the event count and distribution are not of primary interest. To raise (or lower) a cumulative metric, call the `Add()` method. + +If the quantity in question is always non-negative, it implies that the sum is monotonic. This is the common case, `Monotonic(true)`, where cumulative sums only rise, and these metric instruments serve to compute a rate. 
For this reason, cumulative metrics have a `Monotonic(false)` option so that they can be declared as allowing negative inputs, the uncommon case. The SDK should reject negative inputs to monotonic cumulative metrics, but it is not required to.

For cumulative metrics, the default OpenTelemetry implementation exports the sum of event values taken over an interval of time.

### Gauge metric

Gauge metrics express a pre-calculated value that is either `Set()` by explicit instrumentation or observed through a callback. Generally, this kind of metric should be used when the metric cannot be expressed as a sum or a rate because the measurement interval is arbitrary. Use this kind of metric when the measurement is a computed value and the sum and event count are not of interest.

Only the gauge kind of metric supports observing the metric via a gauge `Observer` callback (as an option, see `0008-metric-observer.md`). Semantically, there is an important difference between explicitly setting a gauge and observing it through a callback. When the gauge is set explicitly, the `Set()` call happens inside an implicit or explicit context, and the implementation is free to associate the explicit `Set()` event with that context. When observing gauge metrics via a callback, there is no context associated with the event.

As a special case, to support existing metrics infrastructure and the `Observer` pattern, a gauge metric may be declared as a precomputed, monotonic sum using the `Monotonic(true)` option, in which case it may be used to define a rate. The initial value is presumed to be zero. The SDK should reject descending updates to monotonic gauges, but it is not required to.

For gauge metrics, the default OpenTelemetry implementation exports the last value that was explicitly `Set()`, or, if using a callback, the current value from the `Observer`.

### Measure metric

Measure metrics express a distribution of measured values. This kind of metric should be used when the count or rate of events is meaningful and either:

1. The sum is of interest in addition to the count (rate), or
2. Quantile information is of interest.

The key property of a measure metric event is that computing quantiles and/or summarizing a distribution (e.g., via a histogram) may be expensive. Not only will implementations have various capabilities and algorithms for this task, but users may also wish to control the quality and cost of aggregating measure metrics.

Like cumulative metrics, non-negative measures are an important case because they support rate calculations. Measure metrics are described as `Absolute(true)` when the inputs are non-negative. As an option, measure metrics may be declared as `Absolute(false)` to support positive and negative values. The SDK should reject negative measurements for Absolute measures, but it is not required to.

### Option to disable metrics by default

Metric instruments are enabled by default, meaning that SDKs will export metric data for these instruments without configuration. Metric instruments support a `Disabled` option, marking them as verbose sources of information that may be configured on an as-needed basis to control cost (e.g., using a "views" API).
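To make the guidance above concrete, the sketch below shows how the three kinds might be used. It is illustrative only: the constructor and handle method names (`newCumulativeMetric`, `getHandle`, and so on) follow the descriptions in this document and are not a finalized API.

```java
// Illustrative sketch only; constructor and handle method names follow the
// descriptions in this OTEP and are not a finalized API.
// Assume `meter` is the Meter obtained from the SDK at startup.
CumulativeMetric bytesRead = meter.newCumulativeMetric("bytes_read");          // sum is of primary interest
GaugeMetric queueDepth = meter.newGaugeMetric("queue_depth");                  // pre-computed value
MeasureMetric requestLatency = meter.newMeasureMetric("request_latency_ms");   // distribution is of interest

// Handles bind an instrument to pre-defined label values for efficient repeated use.
bytesRead.getHandle("service", "checkout").add(4096);         // Cumulative: Add(), non-negative by default
queueDepth.getHandle("service", "checkout").set(17);          // Gauge: Set()
requestLatency.getHandle("service", "checkout").record(12.3); // Measure: Record(), Absolute(true) by default
```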
+ +### Kind-specific option summary + +The kind-specific optional properties of a metric instrument are: + +| Property | Description | Metric kind | +|----------|-------------|-------------| +| Monotonic(true) | Indicates a cumulative that accepts only non-negative values | Cumulative (default) | +| | Indicate a gauge supports ascending value sequences starting at 0 | Gauge | +| Monotonic(false) | Indicates a cumulative that accepts positive and negative values | Cumulative | +| | Indicate a gauge that expresses a monotonic cumulative value | Gauge (default) | +| Absolute(true) | Indicates a measure that accepts non-negative values | Measure (default) | +| Absolute(false) | Indicates a measure that accepts positive and negative values | Measure | + +### RecordBatch API + +Applications sometimes want to act upon multiple metric instruments in a single API call, either because the values are inter-related to each other, or because it lowers overhead. RecordBatch logically updates each instrument in the batch using the supplied value. A single label set applies to the batch. + +A single measurement is defined as: + +- Instrument: the measure instrument (not a Handle) +- Value: the recorded floating point or integer data + +The batch measurement API uses a language-specific method name (e.g., `RecordBatch`). The entire batch of measurements takes place within a (implicit or explicit) context. + +## Prior art and alternatives + +Prometheus supports the notion of vector metrics, which are those that support pre-defined labels for a specific set of required keys. The vector-metric API supports a variety of methods like `WithLabelValues` to associate labels with a metric handle, similar to `GetHandle` in OpenTelemetry. As in this proposal, Prometheus supports a vector API for all metric types. + +## Open questions + +### `GetHandle` argument ordering + +Argument ordering has been proposed as the way to pass pre-defined label values in `GetHandle`. The argument list must match the parameter list exactly, and if it doesn't we generally find out at runtime or not at all. This model has more optimization potential, but is easier to misuse than the alternative. The alternative approach is to always pass label:value pairs to `GetOrCreateTimeseries`, as opposed to an ordered list of values. + +### `RecordBatch` argument ordering + +The discussion above can be had for the proposed `RecordBatch` method. It can be declared with an ordered list of metrics, then the `Record` API takes only an ordered list of numbers. Alternatively, and less prone to misuse, the `RecordBatch` API has been declared with a list of metric:number pairs. + +### Eliminate `GetDefaultHandle()` + +Instead of a mechanism to obtain a default handle, some languages may prefer to simply operate on the metric instrument directly in this case. Should OpenTelemetry eliminate `GetDefaultHandle` and instead specify that cumulative, gauge, and measure metric instruments implement `Add()`, `Set()`, and `Record()` with the same interpretation? + +If we eliminate `GetDefaultHandle()`, the SDK may keep a map of metric instrument to default handle on its own. + +### `RecordBatch` support for all metrics + +In the 8/21 working session, we agreed to limit `RecordBatch` to recording of simultaneous measure metrics, meaning to exclude cumulatives and gauges from batch recording. There are arguments in favor of supporting batch recording for all metric instruments. 
+ +- If atomicity (i.e., the all-or-none property) is the reason for batch reporting, it makes sense to include all the metric instruments in the API +- `RecordBatch` support for cumulatives and gauges will be natural for SDKs that act as forwarders for metric events . The natural implementation for `Add()` and `Set()` methods will be `RecordBatch` with a single event. +- Likewise, it is simple for an SDK that acts as an aggregator (not a forwarder) to redirect `Add()` and `Set()` APIs to the handle-specific `Add()` and `Set()` methods; while the SDK, as the implementation, still may (not must) treat these cumulative and gauge updates as atomic. + +Arguments against batch recording for all metric instruments: + +- The `Record` in `RecordBatch` suggests it is to be applied to measure metrics. This is due to measure metrics being the most general-purpose of metric instruments. + +## Issues addressed + +[Raw vs. other metrics / measurements are unclear](https://github.com/open-telemetry/opentelemetry-specification/issues/83) + +[Eliminate Measurement class to save on allocations](https://github.com/open-telemetry/opentelemetry-specification/issues/145) + +[Implement three more types of Metric](https://github.com/open-telemetry/opentelemetry-specification/issues/146) diff --git a/oteps/metrics/0008-metric-observer.md b/oteps/metrics/0008-metric-observer.md new file mode 100644 index 00000000000..4efc15a8ecf --- /dev/null +++ b/oteps/metrics/0008-metric-observer.md @@ -0,0 +1,37 @@ +# Metrics observer specification + +**Status:** Superceded entirely by [0072-metric-observer](0072-metric-observer.md) + +Propose metric `Observer` callbacks for context-free access to current Gauge instrument values on demand. + +## Motivation + +The current specification describes metric callbacks as an alternate means of generating metrics for the SDK, allowing the application to generate metrics only as often as desired by the monitoring infrastructure. This proposal limits callback metrics to only support gauge `Observer` callbacks, arguably the only important case. + +## Explanation + +Gauge metric instruments are typically used to reflect properties that are pre-computed by a system, where the measurement interval is arbitrary. When selecting a gauge, as opposed to the cumulative or measure kind of metric instrument, there could be significant computational cost in computing the current value. When this is the case, it is understandable that we are interested in computing them on demand to minimize cost. + +Why are gauges different than cumulative and measure instruments? Measure instruments, by definition, carry information in the individual event, so the callback cannot optimize any better than the SDK can in this case. Cumulative instruments are more commonly used to record amounts that are readily available, such as the number of bytes read or written, and while this may not always be true, recall the special case of `NonDescending` gauges. + +`NonDescending` gauges owe their existence to this case, that we support non-negative cumulative metrics which, being expensive to compute, are recommended for use with `Observer` callbacks. For example, if it requires a system call or more to compute a non-descending sum, such as the _cpu seconds_ consumed by the process, we should declare a non-descending gauge `Observer` for the instrument, instead of a cumulative. This allows the cost of the metric to be reduced according to the desired monitoring frequency. 
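For illustration, here is a rough sketch of how such an observer might be declared. The constructor name, callback shape, and helper functions are hypothetical; the point is that the callback reports the current value(s) on demand rather than per event.

```java
// Hypothetical sketch of registering a gauge Observer; type, method, and helper
// names are illustrative only. Assume `meter` is the Meter instance.
GaugeMetric cpuSeconds = meter.newGaugeObserver(
    "process_cpu_seconds",
    /* nonDescending = */ true,
    () -> {
      // Invoked on demand (e.g. once per collection interval) rather than per event,
      // so the expensive system call is paid only as often as monitoring requires.
      java.util.Map<LabelSet, Double> values = new java.util.HashMap<>();
      values.put(meter.labels("state", "user"), readUserCpuSeconds());     // hypothetical helper
      values.put(meter.labels("state", "system"), readSystemCpuSeconds()); // hypothetical helper
      return values; // the current value(s), keyed by label set
    });
```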
+ +One significant difference between gauges that are explicitly `Set()`, as compared with observer callbacks, is that `Set()` happens inside a context, whereas the observer callback does not. + +## Details + +Observer callbacks are only supported for gauge metric instruments. Use the language-specific constructor for an Observer gauge (e.g., `metric.NewFloat64Observer()`). Observer gauges support the `NonDescending` option. + +Callbacks return a map from _label set_ to gauge value. Gauges declared with observer callbacks cannot also be `Set`. + +Callbacks should avoid blocking. The implementation may be required to cancel computation if the callback blocks for too long. + +Callbacks must not be called synchronously with application code via any OpenTelemetry API. Implementations that cannot provide this guarantee should prefer not to implement observer callbacks. + +Callbacks may be called synchronously in the SDK on behalf of an exporter. + +Callbacks should avoid calling OpenTelemetry APIs, but we recognize this may be impossible to enforce. + +## Trade-offs and mitigations + +Callbacks are a relatively dangerous programming pattern, which may require care to avoid deadlocks between the application and the API or the SDK. Implementations may consider preventing deadlocks through runtime callstack introspection, to make these interfaces absolutely safe. diff --git a/oteps/metrics/0009-metric-handles.md b/oteps/metrics/0009-metric-handles.md new file mode 100644 index 00000000000..a8370d5a335 --- /dev/null +++ b/oteps/metrics/0009-metric-handles.md @@ -0,0 +1,37 @@ +# Metric Handle API specification + +Specify the behavior of the Metrics API "Handle" type, for efficient repeated-use of metric instruments. + +## Motivation + +The specification currently names this concept "TimeSeries", the type returned by `GetOrCreateTimeseries`, which supports binding a metric to a pre-defined set of labels for repeated use. This proposal renames these "Handle" and `GetHandle`, respectively, and adds further detail to the API specification for handles. + +## Explanation + +The `TimeSeries` is referred to as a "Handle", as the former name suggests an implementation, not an API concept. "Handle", we feel, is more descriptive of the intended use. Likewise with `GetOrCreateTimeSeries` to `GetHandle` and `GetDefaultTimeSeries` to `GetDefaultHandle`, these names suggest an implementation and not the intended use. + +Applications are encouraged to re-use metric handles for efficiency. + +Handles are useful to reduce the cost of repeatedly recording a metric instrument (cumulative, gauge, or measure) with a pre-defined set of label values. + +`GetHandle` gets a new handle given a [`LabelSet`](./0049-metric-label-set.md). + +As a language-optional feature, the API may provide an _ordered_ form of the API for supplying labels in known order. The ordered label-value API is provided as a (language-optional) potential optimization that facilitates a simple lookup for the SDK. In this ordered-value form, the API is permitted to throw an exception or return an error when there is a mismatch in the arguments to `GetHandle`, although languages without strong type-checking may wish to omit this feature. When label values are accepted in any order, SDKs may be forced to canonicalize the labels in order to find an existing metrics handle, but they must not throw exceptions. + +`GetHandle` supports arbitrary label sets. 
There is no requirement that the LabelSet used to construct a handle covers the recommended aggregation keys of a metric instrument.
+
+## Internal details
+
+Because each of the metric kinds supports a different operation (`Add()`, `Set()`, and `Record()`), there are logically distinct kinds of handle. The names of the distinct handle types should reflect their instrument kind.
+
+The names (`Handle`, `GetHandle`, ...) are just language-neutral recommendations. Language APIs should feel free to choose type and method names with attention to the language's style.
+
+### Metric `Attachment` support
+
+OpenCensus has the notion of a metric attachment, allowing the application to include additional information associated with the event, for sampling purposes. Any label value not used for aggregation may be used as a sample "attachment", including the OpenTelemetry span context, to associate sample trace context with exported metrics.
+
+## Issues addressed
+
+[Agreements reached on handles and naming in the working group convened on 8/21/2019](https://docs.google.com/document/d/1d0afxe3J6bQT-I6UbRXeIYNcTIyBQv4axfjKF4yvAPA/edit#).
+
+[`record` should take a generic `Attachment` class instead of having tracing dependency](https://github.com/open-telemetry/opentelemetry-specification/issues/144)
diff --git a/oteps/metrics/0010-cumulative-to-counter.md b/oteps/metrics/0010-cumulative-to-counter.md
new file mode 100644
index 00000000000..1b6bc95b1f8
--- /dev/null
+++ b/oteps/metrics/0010-cumulative-to-counter.md
@@ -0,0 +1,44 @@
+# Rename "Cumulative" to "Counter" in the metrics API
+
+Prefer the name "Counter" as opposed to "Cumulative".
+
+## Motivation
+
+Informally speaking, it seems that OpenTelemetry community members would prefer to call Cumulative metric instruments "Counters". During conversation (e.g., in the 8/21 working session), this has become clear.
+
+Counter is a noun, like the other kinds Gauge and Measure. Cumulative is an adjective, so while "Cumulative instrument" makes sense, it describes a "Counter".
+
+## Explanation
+
+This will eliminate the cognitive cost of mapping "cumulative" to "counter" when speaking about these APIs.
+
+This is the term used for a cumulative metric instrument, for example, in [Statsd](https://github.com/statsd/statsd/blob/master/docs/metric_types.md) and [Prometheus](https://prometheus.io/docs/concepts/metric_types/#counter).
+
+However, we have identified important sub-cases of Counter that are treated as follows. Counters have an option:
+
+- True-cumulative Counter: By default, `Add()` arguments must be >= 0.
+- Bi-directional Counter: As an option, `Add()` arguments may be positive or negative.
+
+Gauges are sometimes used to monitor non-descending quantities (e.g., cpu usage), so as an option:
+
+- Bi-directional Gauge: By default, `Set()` arguments may be positive or negative.
+- Uni-directional Gauge: As an option, `Set()` arguments must change by >= 0.
+
+Uni-directional Gauge instruments are typically used in metric `Observer` callbacks where the observed value is cumulative.
+
+## Trade-offs and mitigations
+
+Other ways to describe the distinction between true-cumulative and bi-directional Counters are:
+
+- Additive (vs. Cumulative)
+- GaugeDelta (vs. Gauge)
+
+It is possible that reducing all of these cases into the broad term "Counter" creates more confusion than it addresses.
+
+## Internal details
+
+Simply replace every "Cumulative" with "Counter", then edit for grammar.
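+
+As a purely illustrative sketch of how the renamed instrument and its options might read (the constructor and option names below are hypothetical, following the style of the other examples in these OTEPs, and are not part of this proposal's normative text):
+
+```golang
+// Hypothetical names, for illustration only.
+requests := metric.NewInt64Counter("requests.total") // true-cumulative: Add() arguments must be >= 0
+
+queueDepthDelta := metric.NewInt64Counter("queue.depth.delta",
+    metric.WithBidirectional(), // bi-directional: Add() also accepts negative values
+)
+
+requests.Add(1, labels)
+queueDepthDelta.Add(-3, labels)
+```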
+ +## Prior art and alternatives + +In a survey of existing metrics libraries, Counter is far more common. diff --git a/oteps/metrics/0049-metric-label-set.md b/oteps/metrics/0049-metric-label-set.md new file mode 100644 index 00000000000..b2c7096aa69 --- /dev/null +++ b/oteps/metrics/0049-metric-label-set.md @@ -0,0 +1,109 @@ +# Metric `LabelSet` specification + +Introduce a first-class `LabelSet` API type as a handle on a pre-defined set of labels for the Metrics API. + +## Motivation + +Labels are the term for key-value pairs used in the OpenTelemetry Metrics API. Treatment of labels in the Metrics API is especially important for performance across a variety of export strategies. + +Label serialization is often one of the most expensive tasks when processing metric events. Creating a `LabelSet` once and re-using it many times can greatly reduce the overall cost of processing many events. + +The Metrics API supports three calling conventions: the Handle convention, the Direct convention, and the Batch convention. Each of these conventions stands to benefit when a `LabelSet` is re-used, as it allows the SDK to process the label set once instead of once per call. Whenever more than one handle will be created with the same labels, more than one instrument will be called directly with the same labels, or more than one batch of metric events will be recorded with the same labels, re-using a `LabelSet` makes it possible for the SDK to improve performance. + +## Explanation + +Metric instrument APIs which presently take labels in the form `{ Key: Value, ... }` will be updated to take an explicit `LabelSet`. The `Meter.Labels()` API method supports getting a `LabelSet` from the API, allowing the programmer to acquire a pre-defined label set. Here are several examples of `LabelSet` re-use. Assume we have two instruments: + +```golang +var ( + cumulative = metric.NewFloat64Cumulative("my_counter") + gauge = metric.NewFloat64Gauge("my_gauge") +) +``` + +Use a `LabelSet` to construct multiple Handles: + +```golang +var ( + labels = meter.Labels({ "required_key1": value1, "required_key2": value2 }) + chandle = cumulative.GetHandle(labels) + ghandle = gauge.GetHandle(labels) +) +for ... { + // ... + chandle.Add(...) + ghandle.Set(...) +} +``` + +Use a `LabelSet` to make multiple Direct calls: + +```golang +labels := meter.Labels({ "required_key1": value1, "required_key2": value2 }) +cumulative.Add(quantity, labels) +gauge.Set(quantity, labels) +``` + +Of course, repeated calls to `Meter.RecordBatch()` could re-use a `LabelSet` as well. + +### Ordered `LabelSet` option + +As a language-level decision, APIs may support _ordered_ LabelSet +construction, in which a pre-defined set of ordered label keys is +defined such that values can be supplied in order. This allows a +faster code path to construct the `LabelSet`. For example, + +```golang + +var rpcLabelKeys = meter.OrderedLabelKeys("a", "b", "c") + +for _, input := range stream { + labels := rpcLabelKeys.Values(1, 2, 3) // a=1, b=2, c=3 + + // ... +} +``` + +This is specified as a language-optional feature because its safety, +and therefore its value as an input for monitoring, depends on the +availability of type-checking in the source language. Passing +unordered labels (i.e., a list of bound keys and values) to +`Meter.Labels(...)` is considered the safer alternative. + +### Interaction with "Named" Meters + +LabelSet values may be used with any named Meter originating from the +same Meter provider. 
That is, LabelSets acquired through a named +Meter may be used by any Meter from the same Meter provider. + +## Internal details + +Metric SDKs that do not or cannot take advantage of the LabelSet optimizations are not especially burdened by having to support these APIs. It is trivial to supply an implementation of `LabelSet` that simply stores a list of labels. This may not be acceptable in performance-critical applications, but this is the common case in many metrics and diagnostics APIs today. + +## Trade-offs and mitigations + +In languages where overloading is a standard convenience, the metrics API may elect to offer alternate forms that elide the call to `Meter.Labels()`, for example: + +``` +instrument.GetHandle({ Key: Value, ... }) +``` + +as opposed to this: + +``` +instrument.GetHandle(meter.Labels({ Key: Value, ... })) +``` + +A key distinction between `LabelSet` and similar concepts in existing metrics libraries is that it is a _write-only_ structure. `LabelSet` allows the developer to input metric labels without being able to read them back. This avoids forcing the SDK to retain a reference to memory that is not required. + +## Prior art and alternatives + +Some existing metrics APIs support this concept. For example, see `Scope` in the [Tally metric API for Go](https://godoc.org/github.com/uber-go/tally#Scope). + +Some libraries take `LabelSet` one step further. In the future, we may add to the the `LabelSet` API a method to extend the label set with additional labels. For example: + +``` +serviceLabels := meter.Labels({ "k1": "v1", "k2": "v2" }) +// ... +requestLabels := serviceLabels.With({ "k3": "v3", "k4": "v4" }) +``` diff --git a/oteps/metrics/0070-metric-bound-instrument.md b/oteps/metrics/0070-metric-bound-instrument.md new file mode 100644 index 00000000000..357a24f63d5 --- /dev/null +++ b/oteps/metrics/0070-metric-bound-instrument.md @@ -0,0 +1,61 @@ +# Rename metric instrument Handles to "Bound Instruments" + +The OpenTelemetry metrics API specification refers to a concept known +as ["metric handles"](0009-metric-handles.md), which is a metric +instrument bound to a `LabelSet`. This OTEP proposes to change that +term to "bound instruments" to avoid the more-generic term "handle". + +The corresponding method to create a bound instrument will be renamed +"Bind" as opposed to "GetHandle". + +## Motivation + +The term "Handle" is widely seen as too general for its purpose in the +metrics API. Rather than re-use a widely-used noun for this concept, +we instead will re-use the metric "instrument" noun and apply an +adjective, "bound" to convey that it has been bound to a `LabelSet`. + +## Explanation + +"Handle" has been confusing from the start. However it was preceded by +other potentially confusing terms (e.g., "TimeSeries", "Entry"). The +term "Bound instrument" was initially suggested +[here](https://github.com/open-telemetry/opentelemetry-specification/pull/299#discussion_r334211154) +and widely accepted. + +## Internal details + +This is a simple renaming. All uses of "handle" will be replaced by +"bound instrument" in the specification. All uses of the `GetHandle` +method become `Bind`. + +Note that the phrase "bound instrument" may not appear directly in the +user-facing API, nor is it required to, whereas the method `GetHandle` +is a specified method on metric instruments. + +The newly-named `Bind()` method returns a bound instrument type. The +name of the returned type may simply take the name of its instrument +with the prefix `Bound`. 
For example, an `Int64Counter` instrument's +`Bind()` method should return a `BoundInt64Counter` type. + +As usual, the spelling and capitalization of these names are just +recommendations, individual language committees should select names +that are well suited to their language and existing API style. + +## Trade-offs and mitigations + +This is widely seen as an improvement, based on informal discussions. + +## Prior art and alternatives + +The OpenCensus libraries named this concept "Entries", with a +`GetEntry` method, as they are entries some kind of map. + +The earliest appearance in OpenTelemetry renamed these "TimeSeries", +hoping to improve matters, but "TimeSeries" more commonly refers to +the output the bound instruments, after aggregation. "Handle" was +decided upon in an August 2019 working group on metrics. + +The Prometheus library refers to unbound instruments as "Vectors" and +supports a variety of "With" methods to bind labels with the vector to +yield a bound instrument. diff --git a/oteps/metrics/0072-metric-observer.md b/oteps/metrics/0072-metric-observer.md new file mode 100644 index 00000000000..eeecb8537ab --- /dev/null +++ b/oteps/metrics/0072-metric-observer.md @@ -0,0 +1,184 @@ +# Metric observer specification (refinement) + +The metric observer gauge was described in [OTEP +0008](0008-metric-observer.md) but left out of the current metrics +specification because the prior OTEP did not clarify the valid calling +conventions for observer gauge metric instruments. This proposal +completely replaces OTEP 0008. + +## Motivation + +An [earlier version of the metrics specification]( +https://github.com/open-telemetry/opentelemetry-specification/blob/597718b3fcfaf10bcf45d93f99b66f94a28048cb/specification/api-metrics.md) +described metric callbacks as an alternate means of generating metric +events, allowing the application to generate metric events only as +often as desired by the collection interval. It specified this +support for all instrument kinds. + +This proposal restores the ability to use callbacks only with a +dedicated `Observer` kind of instrument with the same semantics as +Gauge instruments. Like a Gauge instrument, Observer instruments are +used to report the current value of a variable. + +We may ask, why should Observer instruments be a first-class part of +the API, as opposed to simply registering non-instrument-specific +callbacks to call user-level code on the metrics collection interval? +That would permit the use of ordinary Gauge instruments as a stand-in +for the Observer instrument proposed here. The approach proposed here +is more flexible because it permits the Meter implementation to +control the collection interval on a per-instrument basis as well as +to disable instruments. + +## Explanation + +Gauge metric instruments are typically used to reflect properties that +are pre-computed or instantaneously read by a system, where the +measurement interval is arbitrary. When selecting a Gauge, as opposed +to the Counter or measure kind of metric instrument, there could be +significant computational cost in computing or reading the current +value. When this is the case, it is understandable that we are +interested in providing values on demand, as an optimization. + +The optimization aspect of Observer instruments is critical to their +purpose. 
If the simpler alternative suggested above--registering non-instrument-specific callbacks--were implemented instead, callers would demand a way to ask whether an instrument was "recording" or not, similar to the [`Span.IsRecording` API](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#isrecording).
+
+Observer instruments are semantically equivalent to gauge instruments, except they support callbacks instead of a `Set()` operation. Observer callbacks support `Observe()` instead. Why support callbacks with Gauge semantics but not do the same for Counter and Measure semantics?
+
+### Why not Measure callbacks?
+
+Measure instruments, by definition, carry information about the individual measurements, so there is no benefit to be had in deferring evaluation to a callback. Observer callbacks are designed to reduce the number of measurements, which is incompatible with the semantics of Measure instruments.
+
+### Why not Counter callbacks?
+
+Counter instruments can be expressed as Observer instruments when they are expensive to pre-compute or will be instantaneously read. There are two ways these can be treated using Observer instrument semantics.
+
+Observer instruments, like Gauge instruments, use a "last value" aggregation by default. With this default interpretation in mind, a monotonic Counter can be expressed as a monotonic Observer instrument simply by reporting the current sum from `Observe()`, in which case the "last value" may be interpreted directly as a sum. Systems with support for rate calculations over current sums (e.g., Prometheus) will be able to use these metrics directly. Non-monotonic Counters may be expressed as their current value, but they cannot meaningfully be aggregated in this way.
+
+The preferred way to `Observe()` Counter-like data from an Observer instrument callback is to report deltas in the callback and configure a Sum aggregation in the exporter. Data reported in this way will support rate calculations just as if they were true Counters.
+
+### Differences between Gauge and Observer
+
+One significant difference between gauges that are explicitly `Set()`, as compared with observer callbacks, is that `Set()` happens inside a context (i.e., its distributed context), whereas the observer callback does not execute with any distributed context.
+
+Whereas Gauge values do have context at the moment `Set()` is called, Observer callbacks do not. Observer instruments are appropriate for reporting values that are not request specific.
+
+## Details
+
+Observer instruments are semantically equivalent to Gauge instruments but use different calling conventions. Use the language-specific constructor for an Observer instrument (e.g., `metric.NewFloat64Observer()`). Observer instruments support the `Monotonic` and `NonMonotonic` options, the same as Gauge instruments.
+
+Callbacks should avoid blocking. The implementation may be required to cancel computation if the callback blocks for too long.
+
+Callbacks must not be called synchronously with application code via any OpenTelemetry API. This prevents the application from potentially deadlocking itself by being called synchronously from its own thread. Implementations that cannot provide this guarantee should prefer not to implement Observer instruments.
+
+Callbacks may be called synchronously in the SDK on behalf of an exporter, provided it does not contradict the requirement above.
+
+Callbacks should avoid calling OpenTelemetry APIs other than the interface provided to `Observe()`. This prevents the SDK from potentially deadlocking itself by being called synchronously from its own thread. We recognize this may be impossible or expensive to enforce. SDKs should document how they respond to such attempts at re-entry.
+
+### Observer calling conventions
+
+Observer callbacks are called with an `ObserverResult`, an interface that supports capturing events directly in the callback, as follows.
+
+To capture an observation with a specific `LabelSet`, call the `ObserverResult` directly using `ObserverResult.Observe(value, LabelSet)`.
+
+There is no equivalent of a "bound" observer instrument as there is with Counter, Gauge, and Measure instruments. A bound calling convention is not needed for Observer instruments because there is little if any performance benefit in doing so (as Observer instruments are called during collection, there is no need to maintain "active" records concurrent with collection).
+
+Multiple observations are permitted in a single callback invocation.
+
+The `ObserverResult` passed to a callback should not be used outside the invocation to which it is passed.
+
+#### One callback per instrument
+
+The API _could_ support registering independent callbacks tied to registered ("bound") label sets; instead, it takes the approach of supporting one callback per instrument. There are two cases to consider: (a) where the source of an instrument's values provides one value at a time, and (b) where the source of an instrument's values provides several values at once.
+
+The decision to support one callback per instrument is justified because it is relatively easy in case (a) above to call the source multiple times for multiple values, while it is relatively difficult in case (b) above to call the source once and report values from multiple callbacks.
+
+### Pseudocode
+
+An example:
+
+```
+class YourClass {
+  private static final Meter meter = ...;
+  private static final ObserverDouble cpuLoad = ...;
+
+  void init() {
+    LabelSet labelSet = meter.createLabelSet("low_power", isLowPowerMode());
+    cpuLoad.setCallback(
+      new ObserverDouble.Callback() {
+        @Override
+        public void update(Result result) {
+          result.Observe(getCPULoad(), labelSet);
+        }
+      });
+  }
+}
+```
+
+## Trade-offs and mitigations
+
+Callbacks are a relatively dangerous programming pattern, which may require care to avoid deadlocks between the application and the API or the SDK. Implementations SHOULD consider preventing deadlocks through any means that are safe and economical.
diff --git a/oteps/metrics/0080-remove-metric-gauge.md b/oteps/metrics/0080-remove-metric-gauge.md
new file mode 100644
index 00000000000..59055d22075
--- /dev/null
+++ b/oteps/metrics/0080-remove-metric-gauge.md
@@ -0,0 +1,130 @@
+# Remove the Metric API Gauge instrument
+
+The [Observer instrument](./0072-metric-observer.md) is semantically identical to the metric Gauge instrument, only it is reported via a callback instead of synchronous API calls. Implementation has shown that Gauge instruments are difficult to reason about because the semantics of a "last value" Aggregator have to address questions about statefulness--the SDK's ability to recall old values. Observer instruments avoid some of these concerns because they are reported once per collection period, making it easier to reason about "all values" in an aggregator.
+
+## Motivation
+
+Observer instruments improve on our ability to compute well-defined sum and average-value aggregations over a set of last-value aggregated data, compared with the existing Gauge instrument. Using data from an Observer instrument, we are easily able to pose queries about the current sum of all current values as well as the number of distinct values, which together define the average value.
+
+To do the same with synchronous Gauge instruments, the SDK would potentially be required to maintain state outside a single collection window, which complicates memory management. The SDK is required to maintain state about all distinct label sets over the query evaluation interval.
+
+The question is: how long should the SDK remember a gauge value? Observer instruments do not pose this complication, because observations are synchronized with collection instead of with the application.
+
+Unlike with Gauge instruments, Observer instruments naturally define the current set of all values for a single collection period, making sum and average-value aggregations possible without mention of the query evaluation interval, and without the implied additional state management.
+
+## Explanation
+
+The Gauge instrument's most significant feature is that its measurement interval is arbitrary -- controlled by the application through explicit, synchronous calls to `Set()`. It is used to report a current value in a synchronous context, meaning the metric event is associated with a label set determined by some "request".
+
+This proposal recommends that synchronously reporting Gauge values can always be accomplished using one of the three other kinds of instrument.
+
+It was _already_ recommended in the specification that if the instrument reports values you would naturally sum, you should have used a Counter in the first place. These are not really "current" values when reported; they are current contributions to the sum. We still recommend Counters in this case.
+
+If the gauge reports values where you would naturally average the last value across distinct label sets, use a Measure instrument. Configure the instrument for last-value aggregation. Since last-value aggregation is not the default for Measure instruments, this will be non-standard and require extra configuration.
+
+If the gauge reports values where you would naturally sum the last value across distinct label sets, use an Observer instrument. The current set of entities (e.g., shards, active users, etc.) contributes a last value that should be summed. These are different from Counter instruments because we are not interested in a sum across time; we are interested in a sum across distinct instances.
+
+### Example: Reporting per-request CPU usage
+
+Use a counter to report a quantity that is naturally summed over time, such as CPU usage.
+
+### Example: Reporting per-shard memory holdings
+
+There are a number of current shards holding variable amounts of memory in a widely-used library. Observe the current allocation per shard using an Observer instrument. These can be aggregated across hosts to compute cluster-wide memory holdings by shard, for example.
+
+It does not make sense to compute a sum of memory holdings over multiple periods, as these are not additive quantities. It does make sense to sum the last value across hosts.
+
+### Example: Reporting a per-request finishing account balance
+
+There is a number that rises and falls, such as a bank account balance.
+This was being `Set()` at the finish of all transactions. Replace it +with a Measure instrument and `Record()` the last value. + +Similar cases: report a cpu load, specific temperature, fan speed, or +altitude measurement associated with a request. + +## Internal details + +The Gauge instrument will be removed from the specification at the +same time the Observer instrument is added. This will make the +transition easier because in many cases, Observer instruments simply +replace Gauge instruments in the text. + +## Trade-offs and mitigations + +Not much is lost to the user from removing Gauge instruments. + +There may be situations where an Observer instrument is the natural +choice but it is undesirable to be interrupted by the Metric SDK in +order to execute an Observer callback. Situations where Observer +semantics are correct (not Counter, not Measure) but a synchronous API +is more acceptable are expected to be very rare. + +To address such rare cases, here are two possibilities: + +1. Implement a Gauge Set instrument backed by an Observer instrument. +The Gauge Set's job is to maintain the current set of label sets +(e.g., explicitly managed or by time-limit) and their last value, to +be reported by the Observer at each collection interval. +2. Implement an application-specific metric collection API that would +allow the application to synchronize with the SDK on collection +intervals. For example, a transactional API allowing the application +to BEGIN and END synchronously reporting Observer instrument +observations. + +## Prior art and alternatives + +Many existing Metric libraries support both synchronous and +asynchronous Gauge-like instruments. + +See the initial discussion in [Spec issue +412](https://github.com/open-telemetry/opentelemetry-specification/issues/412). diff --git a/oteps/metrics/0088-metric-instrument-optional-refinements.md b/oteps/metrics/0088-metric-instrument-optional-refinements.md new file mode 100644 index 00000000000..46bff0eb82c --- /dev/null +++ b/oteps/metrics/0088-metric-instrument-optional-refinements.md @@ -0,0 +1,526 @@ +# Metric Instruments + +Removes the optional semantic declarations `Monotonic` and `Absolute` +for metric instruments, declares the Measure and Observer instruments +as _foundational_, and introduces a process for standardizing new +instrument _refinements_. + +Note that [OTEP 93](https://github.com/open-telemetry/oteps/pull/93) +contains a final proposal for the set of instruments, of which there +are seven. Note that [OTEP +96](https://github.com/open-telemetry/oteps/pull/96) contains a final +proposal for the names of the seven standard instruments. These three +OTEPs will be applied as a group to the specification, using the names +finalized in OTEP 96. + +## Motivation + +With the removal of Gauge instruments and the addition of Observer +instruments in the specification, the existing `Monotonic` and +`Absolute` options began to create confusion. For example, a Counter +instrument is used for capturing changes in a Sum, and we could say +that non-negative-valued metric events define a monotonic Counter, in +the sense that its Sum is monotonic. The confusion arises, in this +case, because `Absolute` refers to the captured values, whereas +`Monotonic` refers to the semantic output. + +From a different perspective, Counter instruments might be treated as +refinements of the Measure instrument. 
Whereas the Measure instrument +is used for capturing all-purpose synchronous measurements, the +Counter instrument is used specifically for synchronously capturing +measurements of changes in a sum, therefore it uses `Add()` instead of +`Record()`, and it specifies `Sum` as the standard aggregation. + +What this illustrates is that we have modeled this space poorly. This +proposal does not propose to change any existing metrics APIs, only +our understanding of the three instruments currently included in the +specification: Measure, Observer, and Counter. + +## Explanation + +The Measure and Observer instrument are defined as _foundational_ +here, in the sense that any kind of metric instrument must reduce to +one of these. The foundational instruments are unrestricted, in the +sense that metric events support any numerical value, positive or +negative, zero or infinity. + +The distinction between the two foundational instruments is whether +they are synchronous. Measure instruments are called synchronously by +the user, while Observer instruments are called asynchronously by the +implementation. Synchronous instruments (Measure and its refinements) +have three calling patterns (_Bound_, _Unbound_, and _Batch_) to +capture measurements. Asynchronous instruments (Observer and its +refinements) use callbacks to capture measurements. + +All measurement APIs produce metric events consisting of [timestamp, +instrument descriptor, label set, and numerical +value](../../specification/metrics/api.md#metric-event-format). Synchronous instrument +events additionally have [Context](../../specification/context/README.md), describing +properties of the associated trace and distributed correlation values. + +### Terminology: Kinds of Aggregation + +_Aggregation_ refers to the technique used to summarize many +measurements and/or observations into _some_ kind of summary of the +data. As detailed in the [metric SDK specification (TODO: +WIP)](https://github.com/open-telemetry/opentelemetry-specification/pull/347/files?short_path=5b01bbf#diff-5b01bbf3430dde7fc5789b5919d03001), +there are generally two relevant modes of aggregation: + +1. Within one collection interval, for one label set, the SDK's +`Aggregator.Add()` interface method incorporates one new measurement +value into the current aggregation value. This happens at run time, +therefore is referred to as _temporal aggregation_. This mode applies +only to Measure instruments. +2. Within one collection interval, when combining label sets, the +SDK's `Aggregator.Merge()` interface method incorporates two +aggregation values into one aggregation value. This is referred to as +_spatial aggregation_. This mode applies to both Measure and Observer +instruments. + +As discussed below, we are especially interested in aggregating rate +information, which sometimes requires that temporal and spatial +aggregation be treated differently. + +### Last-value relationship + +Observer instruments have a well-defined _Last Value_ measured by the +instrument, that can be useful in defining aggregations. The Last +Value of an Observer instrument is the value that was captured during +the last-completed collection interval, and it is a useful +relationship because it is defined without relation to collection +interval timing. The Last Value of an Observer is determined by the +single most-recently completed collection interval--it is not +necessary to consider prior collection intervals. 
The Last Value of +an Observer is undefined when it is not observed during a collection +interval. + +To maintain this property, we impose a requirement: two or more +`Observe()` calls with an identical LabelSet during a single Observer +callback invocation are treated as duplicates of each other, where the +last call to `Observe()` wins. + +Based on the Last Value relationship, we can ask and answer questions +such as "what is the average last value of a metric at a point in +time?". Observer instruments define the Last Value relationship +without referring to the collection interval and without ambiguity. + +### Last-value and Measure instruments + +Measure instruments do not define a Last Value relationship. One +reason is that [synchronous events can happen +simultaneously](../../specification/metrics/api.md#time). + +For Measure instruments, it is possible to compute an aggregation that +computes the last-captured value in a collection interval, but it is +potentially not unique and the result will vary depending on the +timing of the collection interval. For example, a synchronous metric +event that last took place one minute ago will appear as the last +value for collection intervals one minute or longer, but the last +value will be undefined if the collection interval is shorter than one +minute. + +### Aggregating changes to a sum: Rate calculation + +The former `Monotonic` option had been introduced in order to support +reporting of a current sum, such that a rate calculation is implied. +Here we defined _Rate_ as an aggregation, defined for a subset of +instruments, that may be calculated differently depending on how the +instrument is defined. The rate aggregation outputs the amount of +change in a quantity divided by the amount of change in time. + +A rate can be computed from values that are reported as differences, +referred to as _delta_ reporting here, or as sums, referred to as +_cumulative_ reporting here. The primary goal of the instrument +refinements introduced in this proposal is to facilitate rate +calculations in more than one way. + +When delta reporting, a rate is calculated by summing individual +measurements or observations. When cumulative reporting, a rate is +calculated by computing a difference between individual values. + +Note that cumulative-reported metric data requires special treatment +of the time dimension when computing rates. When aggregating across +the time dimension, the difference should be computed. When +aggregating across spatial dimensions, the sum should be computed. + +### Standard implementation of Measure and Observer + +OpenTelemetry specifies how the default SDK should treat metric +events, by default, when asked to export data from an instrument. +Measure and Observer instruments compute `Sum` and `Count` +aggregations, by default, in the standard implementation. This pair +of measurements, of course, defines an average value. There are no +restrictions placed on the numerical value in an event for the two +foundational instruments. + +### Refinements to Measure and Observer + +The `Monotonic` and `Absolute` options were removed in the 0.3 +specification. Here, we propose to regain the equivalent effects +through instrument refinements. Instrument refinements are added to +the foundational instruments, yielding new instruments with the same +calling patterns as the foundational instrument they refine. These +refinements support adding either a different standard implementation +or a restriction of the input domain to the instrument. 
+ +We have done away with instrument options, in other words, in favor of +optional metric instruments. Here we discuss four significant +instrument refinements. + +#### Non-negative + +For some instruments, such as those that measure real quantities, +negative values are meaningless. For example, it is impossible for a +person to weigh a negative amount. + +A non-negative instrument refinement accepts only non-negative values. +For instruments with this property, negative values are considered +measurement errors. Both Measure and Observer instruments support +non-negative refinements. + +#### Sum-only + +A sum-only instrument is one where only the sum is considered to be of +interest. For a sum-only instrument refinement, we have a semantic +property that two events with numeric values `M` and `N` are +semantically equivalent to a single event with value `M+N`. For +example, in a sum-only count of users arriving by bus to an event, we +are not concerned with the number of buses that arrived. + +A sum-only instrument is one where the number of events is not +counted, only the `Sum`. A key property of sum-only instruments is +that they always support a Rate aggregation, whether reporting delta- +or cumulative-values. Both Measure and Observer instruments support +sum-only refinements. + +#### Precomputed-sum + +A precomputed-sum refinement indicates that values reported through an +instrument are observed or measured in terms of a sum that changes +over time. Pre-computed sum instruments support cumulative reporting, +meaning the rate aggregation is defined by computing a difference +across timestamps or collection intervals. + +A precomputed sum refinement implies a sum-only refinement. Note that +values associated with a precomputed sum are still sums. Precomputed +sum values are combined using addition, when aggregating over the +spatial dimensions; only the time dimension receives special treatment. + +#### Non-negative-rate + +A non-negative-rate instrument refinement states that rate aggregation +produces only non-negative results. There are non-negative-rate cases +of interest for delta reporting and cumulative reporting, as follows. + +For delta reporting, any non-negative and sum-only instrument is also +a non-negative-rate instrument. + +For cumulative reporting, a sum-only and pre-computed sum instrument +does not necessarily have a non-negative rate, but adding an explicit +non-negative-rate refinement makes it the equivalent of `Monotonic` in +the 0.2 specification. + +For example, the CPU time used by a process, as read in successive +collection intervals, cannot change by a negative amount, because it +is impossible to use a negative amount of CPU time. CPU time a +typical value to report through an Observer instrument, so the rate +for a specific set of labels is defined by subtracting the prior +observation from the current observation. Using a non-negative-rate +refinement asserts that the values increases by a non-negative amount +on subsequent collection intervals. + +#### Discussion: Additive vs. Non-Additive numbers + +The refinements proposed above may leave us wondering about the +distinction between an unrefined Measure and the +_UpDownCumulativeCounter_. Both values are unrestricted, in terms of +range, so why should they be treated differently? + +The _UpDownCumulativeCounter_ has sum-only and precomputed-sum +refinements, which indicate that the numbers being observed are the +result of addition. 
These instruments have the additive property that observing `N` and `M` separately is equivalent to observing `N+M`. When performing spatial aggregation over data with these additive properties, it is natural to compute the sum.
+
+When performing spatial aggregation over data without additive properties, it is natural to combine the distributions. The distinction is about how we interpret the values when aggregating. Use one of the sum-only refinements to report a sum in the default configuration; otherwise, use one of the non-sum-only instruments to report a distribution.
+
+#### Language-level refinements
+
+OpenTelemetry implementations may wish to add instrument refinements to accommodate built-in types. Languages with distinct integer and floating point types should offer instrument refinements for each, leading to type names like `Int64Measure` and `Float64Measure`.
+
+A language with support for unsigned integer types may wish to create dedicated instruments to report these values, leading to type names like `UnsignedInt64Observer` and `UnsignedFloat64Observer`. These would naturally apply a non-negative refinement.
+
+Other uses for built-in type refinements involve the type for duration measurements. For example, where there is a built-in type for the difference between two clock measurements, OpenTelemetry APIs should offer a refinement to automatically apply the correct unit of time to the measurement.
+
+### Counter refinement
+
+Counter is a sum-only, non-negative, and thus non-negative-rate refinement of the Measure instrument.
+
+### Standardizing new instruments
+
+With these refinements we can exhaustively list each distinct kind of instrument. There are a total of twelve hypothetical instruments listed in the table below, of which only one has been standardized. Hypothetical future instrument names are _italicized_.
+
+| Foundation instrument | Sum-only? | Precomputed-sum? | Non-negative? | Non-negative-rate? | Instrument name _(hypothetical)_ |
+|--|--|--|--|--|--|
+| Measure | sum-only | | non-negative | non-negative-rate | Counter |
+| Measure | sum-only | precomputed-sum | | non-negative-rate | _CumulativeCounter_ |
+| Measure | sum-only | | | | _UpDownCounter_ |
+| Measure | sum-only | precomputed-sum | | | _UpDownCumulativeCounter_ |
+| Measure | | | non-negative | | _AbsoluteDistribution_ |
+| Measure | | | | | _Distribution_ |
+| Observer | sum-only | | non-negative | non-negative-rate | _DeltaObserver_ |
+| Observer | sum-only | precomputed-sum | | non-negative-rate | _CumulativeObserver_ |
+| Observer | sum-only | | | | _UpDownDeltaObserver_ |
+| Observer | sum-only | precomputed-sum | | | _UpDownCumulativeObserver_ |
+| Observer | | | non-negative | | _AbsoluteLastValueObserver_ |
+| Observer | | | | | _LastValueObserver_ |
+
+To arrive at this listing, several assumptions have been made. For example, the precomputed-sum and non-negative-rate refinements are only applicable in conjunction with a sum-only refinement.
+
+For the precomputed-sum instruments, we technically do not care whether the inputs are non-negative, because rate aggregation computes differences. However, it is useful for other aggregations to assume that precomputed sums start at zero, and we will ignore the case where a precomputed sum has an initial value other than zero.
+
+#### Gauge instrument
+
+A Measure instrument with a default Last Value aggregation could be defined, hypothetically named a _Gauge_ instrument.
This would offer convenience for users who want this behavior, for there is otherwise no standard Measure refinement with Last Value aggregation.
+
+Sum-only uses for this hypothetical instrument should instead use either _CumulativeCounter_ or _UpDownCumulativeCounter_, since they are reporting a sum. This (hypothetical) _Gauge_ instrument would be useful when a value is time-dependent and the average value is not of interest.
+
+## Internal details
+
+This is a change of understanding. It does not request any new instruments be created or APIs be changed, but it does specify how we should think about adding new instruments.
+
+No API changes are called for in this proposal.
+
+### Translation into well-known systems
+
+#### Prometheus
+
+The Prometheus system defines four kinds of [synchronous metric instrument](https://prometheus.io/docs/concepts/metric_types).
+
+| System | Metric Kind | Operation | Aggregation | Notes |
+| ---------- | ------------ | ------------------- | -------------------- | ------------------- |
+| Prometheus | Counter | Inc() | Sum | Sum of positive deltas |
+| Prometheus | Counter | Add() | Sum | Sum of positive deltas |
+| Prometheus | Gauge | Set() | Last Value | Non-additive or monotonic cumulative |
+| Prometheus | Gauge | Inc()/Dec() | Sum | Sum of deltas |
+| Prometheus | Gauge | Add()/Sub() | Sum | Sum of deltas |
+| Prometheus | Histogram | Observe() | Histogram | Non-negative values |
+| Prometheus | Summary | Observe() | Summary | Aggregation does not merge |
+
+Note that the Prometheus Gauge supports five methods (`Set`, `Inc`, `Dec`, `Add`, and `Sub`), one of which sets the last value while the others modify the last value. This interface is not compatible with OpenTelemetry, because it requires the SDK to maintain long-lived state about Gauge values in order to compute the last value following one of the additive methods (`Inc`, `Dec`, `Add`, and `Sub`).
+
+If we restrict Prometheus Gauges to support only a `Set` method, or to support only the additive methods, then we can model these two instruments separately, in a way that is compatible with OpenTelemetry. A Prometheus Gauge that is used exclusively with `Set()` can be modeled as a Measure instrument with Last Value aggregation. A Prometheus Gauge that is used exclusively with the additive methods can be modeled as an `UpDownCounter`.
+
+Prometheus has support for asynchronous reporting via the "Collector" interface, but this is a low-level API to support directly exporting encoded metric data. The Prometheus "Collector" interface could be used to implement Observer-like instruments, but they are not natively supported in Prometheus.
+
+#### Statsd
+
+The Statsd system supports only synchronous reporting.
+
+| System | Metric Event | Operation | Aggregation | Notes |
+| ------ | ------------ | ------------------- | -------------------- | ------------------- |
+| Statsd | Count | Count() | Sum | Sum of deltas |
+| Statsd | Gauge | Gauge() | Last Value | |
+| Statsd | Histogram | Histogram() | Histogram | |
+| Statsd | Distribution | Distribution() | _Not specified_ | A distribution summary |
+| Statsd | Timing | Timing() | _Not specified_ | Non-negative, distribution summary, Millisecond units |
+| Statsd | Set | Set() | Cardinality | Unique value count |
+
+The Statsd Count operation translates into either a Counter, if increments are non-negative, or an _UpDownCounter_ if values may be negative.
The Statsd Gauge operation translates into a Measure instrument configured with Last Value aggregation.
+
+The Histogram, Distribution, and Timing operations are semantically identical, but have different units and default behavior in statsd systems. Each of these distribution-valued instruments can be replaced using a Measure with a distribution-valued aggregation such as MinMaxSumCount, Histogram, Exact, or Summary.
+
+The Set operation does not have a direct replacement in OpenTelemetry; however, one can be constructed using a Measure or Observer instrument and a dummy value. Each distinct label set is naturally output each collection interval, whether reported synchronously or asynchronously, so the set size can be computed by using a metric label as the unique element and no aggregation operator.
+
+#### OpenCensus
+
+The OpenCensus system defines three kinds of instrument:
+
+| System | Metric Kind | Operation | Aggregation | Notes |
+| ------ | ---------------- | -------------- | ----------------- | ------------------- |
+| OpenCensus | Cumulative | Inc() | Sum | Positive deltas |
+| OpenCensus | Gauge | Set() | LastValue | |
+| OpenCensus | Gauge | Add() | Sum | Deltas |
+| OpenCensus | Raw-Stats | Record() | Sum, Count, Mean, or Distribution | |
+
+OpenCensus departed from convention with the introduction of a Views API, which makes it possible to support fewer kinds of instrument directly, since they can be configured in multiple ways.
+
+Like Prometheus, the combination of multiple APIs in the Gauge instrument is not compatible with OpenTelemetry. A Gauge used with Set() generally implies last-value aggregation, whereas a Gauge used with Add() is additive and uses Sum aggregation.
+
+Raw statistics can be aggregated using any aggregation, and all the OpenCensus aggregations have equivalents in OpenTelemetry.
+
+OpenCensus supported callback-oriented asynchronous forms of both Cumulative and Gauge instruments. An asynchronous Cumulative instrument would be replaced by a _CumulativeObserver_ in OpenTelemetry. An asynchronous Last-value Gauge would be replaced by an _AbsoluteLastValueObserver_ or just the unrestricted _LastValueObserver_. An asynchronous Additive Gauge would be replaced by a _DeltaObserver_.
+
+### Sample Proposal
+
+The information above will be used to propose a set of refinements for both synchronous and asynchronous instruments in a follow-on OTEP. What follows is a sample of the forthcoming proposal, to motivate the discussion here.
+
+#### Synchronous instruments
+
+The foundational `Measure` instrument without refinements or restrictions will be called a `Distribution` instrument.
+
+Along with `Counter` and `Distribution`, we recognize several less-common but still important cases and reasons why they should be standardized:
+
+- _UpDownCounter_: Support Prometheus additive Gauge instrument use
+- _Timing_: Support Prometheus and Statsd timing measurements.
+
+Instruments that are not standardized but may be in the future (and why):
+
+- _CumulativeCounter_: Support a synchronous monotonic cumulative instrument
+- _AbsoluteDistribution_: Support non-negative valued distributions
+
+Instruments that are probably not seen as widely useful:
+
+- _UpDownCumulativeCounter_: We believe this is better handled asynchronously.
+
+#### Observer instruments
+
+The foundational `Observer` instrument without refinements or restrictions shall be called a `LastValueObserver` instrument.
+
+We have identified important cases that should be standardized:
+
+- _CumulativeObserver_: Support a cumulative monotone counter
+- _DeltaObserver_: Support an asynchronous delta counter.
+
+Observer refinements that could be standardized in the future:
+
+- _UpDownCumulativeObserver_: Observe a non-monotonic cumulative counter
+- _UpDownDeltaObserver_: Observe positive and negative deltas
+- _AbsoluteLastValueObserver_: Observe non-negative current values.
+
+## Example: Observer aggregation
+
+Suppose you wish to capture the CPU usage of a process broken down by the CPU core ID. The operating system provides a mechanism to read the current usage from the `/proc` file system, which will be reported once per collection interval using an Observer instrument. Because this is a precomputed sum with a non-negative rate, use a _CumulativeObserver_ to report this quantity with a metric label indicating the CPU core ID.
+
+It will be common to compute a rate of CPU usage over this data. The rate can be calculated for an individual CPU core by computing a difference between the value of two metric events. To compute the aggregate rate across all cores, a spatial aggregation, these differences are added together.
+
+## Open Questions
+
+Are there still questions surrounding the former Monotonic refinement?
+
+Should the _CumulativeObserver_ instrument be named _MonotonicObserver_? In this proposal, we prefer _Cumulative_ and _UpDownCumulative_. _Cumulative_ is a good descriptive term in this setting (i.e., some additive values are _cumulative_, some are _delta_). Being _Cumulative_ and not _UpDownCumulative_ implies monotonicity in this proposal.
+
+For synchronous instruments, this proposal does not standardize _CumulativeCounter_. Such an instrument might be named _MonotonicCounter_.
+
+## Trade-offs and mitigations
+
+The trade-off explicitly introduced here is that we should prefer to create new instrument refinements, each for a dedicated purpose, rather than create generic instruments with support for multiple semantic options.
+
+## Prior art and alternatives
+
+The optional behaviors `Monotonic` and `Absolute` were first discussed in the August 2019 Metrics working group meeting.
+
+## Future possibilities
+
+A future OTEP will request the introduction of two standard refinements for the 0.4 API specification. This will be the `CumulativeObserver` instrument described above plus a synchronous timing instrument named `TimingMeasure` that is equivalent to _AbsoluteMeasure_ with the correct unit and a language-specific duration type for measuring time.
+
+If the above open question is decided in favor of treating the foundational instruments as abstract, instrument names like _NonAbsoluteMeasure_ and _NonAbsoluteCounter_ will need to be standardized.
diff --git a/oteps/metrics/0090-remove-labelset-from-metrics-api.md b/oteps/metrics/0090-remove-labelset-from-metrics-api.md
new file mode 100644
index 00000000000..aedeef3bea4
--- /dev/null
+++ b/oteps/metrics/0090-remove-labelset-from-metrics-api.md
@@ -0,0 +1,49 @@
+# Remove the LabelSet object from the metrics API
+
+The proposal is to remove the current [`LabelSet`](./0049-metric-label-set.md) API and change all the current APIs that accept a LabelSet to accept the labels directly (a list of key-values, or a map of key-values, based on the language capabilities).
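+
+To illustrate the shape of the proposed change, here is a hedged before/after sketch in the same pseudocode style as the examples in OTEP 0049 (the exact method signatures are assumptions, not specified behavior):
+
+```golang
+// Before: labels must first be encoded into a LabelSet.
+labels := meter.Labels({ "required_key1": value1, "required_key2": value2 })
+counter.Add(quantity, labels)
+
+// After (this proposal): the key-values are passed directly to the recording call.
+counter.Add(quantity, { "required_key1": value1, "required_key2": value2 })
+```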
+
+## Motivation
+
+The [`LabelSet`](./0049-metric-label-set.md) API type was added to serve as a handle on a pre-defined set of labels for the Metrics API.
+
+This API represents an optimization for the current metrics API that allows implementations to avoid encoding and checking label restrictions multiple times for the same set of labels. Usage and implementations of the metrics API have shown that LabelSet adds extra, unnecessary complexity with little benefit.
+
+Some users prefer to avoid this performance optimization for the benefit of cleaner code, and OpenTelemetry needs to address them as well, so it is important for OpenTelemetry to support record APIs where users can pass the labels directly.
+
+OpenTelemetry can always add this optimization later (a backwards-compatible change) if we determine that it is very important to have.
+
+## Trade-offs and mitigations
+
+In cases where performance matters, here are the ways to achieve almost the same performance:
+
+- In the current API, if a `LabelSet` is reused across multiple individual records across different instruments (one record to every instrument), then the user can use the batch recording mechanism, so internally the SDK can do the label encoding once.
+- In the current API, if a `LabelSet` is used multiple times to record to the same instrument, then the user can use instrument bindings.
+- In the current API, if a `LabelSet` is used across multiple batch recordings, and this pattern becomes very important, then OpenTelemetry can add support for batches to accept bindings.
+
+To ensure that the current batch recording can help in scenarios where some local conditions control which measurements are recorded, the recommendation is to have `newBatchRecorder` return an interface called `BatchRecorder` that can be used to add each `measurement` and, when all entries are added, call `record` to record all the `measurements`.
+
+## Prior art and alternatives
+
+Almost all existing Metric libraries do not require users to create something like LabelSet when recording a value.
diff --git a/oteps/metrics/0098-metric-instruments-explained.md b/oteps/metrics/0098-metric-instruments-explained.md
new file mode 100644
index 00000000000..f5b359b8044
--- /dev/null
+++ b/oteps/metrics/0098-metric-instruments-explained.md
@@ -0,0 +1,251 @@
+# Explain the metric instruments
+
+Propose and explain final names for the standard metric instruments theorized in [OTEP 88][otep-88] and address related confusion.
+
+## Motivation
+
+[OTEP 88][otep-88] introduced a logical structure for metric instruments with two foundational categories of instrument, called "synchronous" vs. "asynchronous", named "Measure" and "Observer" in the abstract sense. The proposal identified four kinds of "refinement" and mapped out the space of _possible_ instruments, while not proposing which would actually be included in the standard.
+
+[OTEP 93](https://github.com/open-telemetry/oteps/pull/93) proposed a list of six standard instruments, the most necessary and useful combination of instrument refinements, plus one special case used to record timing measurements. OTEP 93 was closed without merging after a more consistent approach to naming was uncovered. [OTEP 96](https://github.com/open-telemetry/oteps/pull/96) made another proposal, which was closed in favor of this one after more debate surfaced.
+ +This proposal finalizes the naming proposal for the standard instruments, seeking to address core confusion related to the "Measure" and "Observer" terms: + +1. [OTEP 88][otep-88] stipulates that the terms currently in use to name synchronous and asynchronous instruments--"Measure" and "Observer"--become _abstract_ terms. It also used phrases like "Measure-like" and "Observer-like" to discuss instruments with refinements. This proposal states that we shall prefer the adjectives, commonly abbreviated "Sync" and "Async", when describing the kind of an instrument. "Measure-like" means an instrument is synchronous. "Observer-like" means that an instrument is asynchronous. +2. There is inconsistency in the hypothetical naming scheme for instruments presented in [OTEP 88][otep-88]. Note that "Counter" and "Observer" end in "-er", a noun suffix used in the sense of "[person occupationally connected with](https://www.merriam-webster.com/dictionary/-er)", while the term "Measure" does not fit this pattern. This proposal proposes to replace the abstract term "Measure" by "Recorder", since the associated function name (verb) is specified as `Record()`. + +This proposal also repeats the current specification--and the justification--for the default aggregation of each standard instrument. + +## Explanation + +The following table summarizes the final proposed standard instruments resulting from this set of proposals. The columns are described in more detail below. + +| Existing name | **Standard name** | Instrument kind | Function name | Input temporal quality | Default aggregation | Rate support (Monotonic) | Notes | +| ------------- | ----------------------- | ----- | --------- | -------------- | ------------- | --- | ------------------------------------ | +| Counter | **Counter** | Sync | Add() | Delta | Sum | Yes | Per-request, part of a monotonic sum | +| | **UpDownCounter** | Sync | Add() | Delta | Sum | No | Per-request, part of a non-monotonic sum | +| Measure | **ValueRecorder** | Sync | Record() | Instantaneous | MinMaxSumCount | No | Per-request, any non-additive measurement | +| | **SumObserver** | Async | Observe() | Cumulative | Sum | Yes | Per-interval, reporting a monotonic sum | +| | **UpDownSumObserver** | Async | Observe() | Cumulative | Sum | No | Per-interval, reporting a non-monotonic sum | +| Observer | **ValueObserver** | Async | Observe() | Instantaneous | MinMaxSumCount | No | Per-interval, any non-additive measurement | + +There are three synchronous instruments and three asynchronous instruments in this proposal, although a hypothetical 10 instruments were discussed in [OTEP 88][otep-88]. Although we consider them rational and logical, two categories of instrument are excluded in this proposal: synchronous cumulative instruments and asynchronous delta instruments. + +Synchronous cumulative instruments are excluded from the standard based on the [OpenTelemetry library performance guidelines](../../specification/performance.md). To report a cumulative value correctly at runtime requires a degree of order dependence--thus synchronization--that OpenTelemetry API will not itself admit. In a hypothetical example, if two actors both synchronously modify a sum and were to capture it using a synchronous cumulative metric event, the OpenTelemetry library would have to guarantee those measurements were processed in order. 
The library guidelines do not support this level of synchronization; we cannot block for the sake of instrumentation, therefore we do not support synchronous cumulative instruments.
+
+Asynchronous delta instruments are excluded from the standard based on the lack of motivating examples, but we could also justify this as a desire to keep asynchronous callbacks stateless. An observer has to have memory in order to compute deltas; it is simpler for asynchronous code to report cumulative values.
+
+With six instruments in total, one may be curious--how does the historical Metrics API term _Gauge_ translate into this specification? _Gauge_, in Metrics API terminology, may cover all of these instrument use-cases with the exception of `Counter`. As defined in [OTEP 88][otep-88], the OpenTelemetry Metrics API will disambiguate these use-cases by requiring _single purpose instruments_. The choice of instrument implies a default interpretation, a standard aggregation, and suggests how to treat Metric data in observability systems, out of the box. Uses of `Gauge` translate into the various OpenTelemetry Metric instruments depending on what kind of value is being captured and whether the measurement is made synchronously or not.
+
+Summarizing the naming scheme:
+
+- If you've measured an amount of something that adds up to a total, where you are mainly interested in that total, use one of the additive instruments:
+  - If synchronous and monotonic, use `Counter` with non-negative values
+  - If synchronous and not monotonic, use `UpDownCounter` with arbitrary values
+  - If asynchronous and a cumulative, monotonic sum is measured, use `SumObserver`
+  - If asynchronous and a cumulative, arbitrary sum is measured, use `UpDownSumObserver`
+- If the measurements are non-additive or additive with an interest in the distribution, use an instantaneous instrument:
+  - If synchronous, use `ValueRecorder` to record a value that is part of a distribution
+  - If asynchronous, use `ValueObserver` to record a single measurement near the end of a collection interval.
+
+### Sync vs Async instruments
+
+Synchronous instruments are called in a request context, meaning they potentially have an associated tracing context and distributed correlation values. Multiple metric events may occur for a synchronous instrument within a given collection interval. Note that synchronous instruments may be called outside of a request context, such as for background computation. In these scenarios, we may simply consider the Context to be empty.
+
+Asynchronous instruments are reported by a callback, once per collection interval, and lack request context. They are permitted to report only one value per distinct label set per period. If the application observes multiple values in a single callback, for one collection interval, the last value "wins".
+
+### Temporal quality
+
+Measurements can be described in terms of their relationship with time. Note: although this term logically applies and is used throughout this OTEP, discussion in the Metrics SIG meeting (4/30/2020) leads us to exclude this term from use in documenting the Metric API. The explanation of terms here is consistent with the [terminology used in the protocol], but we will prefer to use these adjectives to describe properties of an aggregation, not properties of an instrument (despite this document continuing to use the terms freely). In the API specification, this distinction will be described using "additive synchronous" in contrast with "additive asynchronous".
+ +Delta measurements are those that measure a change to a sum. Delta instruments are usually selected because the program does not need to compute the sum for itself, but is able to measure the change. In these cases, it would require extra state for the user to report cumulative values and reporting deltas is natural. + +Cumulative measurements are those that report the current value of a sum. Cumulative instruments are usually selected because the program maintains a sum for its own purposes, or because changes in the sum are not instrumented. In these cases, it would require extra state for the user to report delta values and reporting cumulative values is natural. + +Instantaneous measurements are those that report a non-additive measurement, one where it is not natural to compute a sum. Instantaneous instruments are usually chosen when the distribution of values is of interest, not only the sum. + +The terms "Delta", "Cumulative", and "Instantaneous" as used in this proposal refer to measurement values passed to the Metric API. The argument to an (additive) instrument with the Delta temporal quality is the change in a sum. The argument to an (additive) instrument with the Cumulative temporal quality is itself a sum. The argument to an instrument with the Instantaneous temporal quality is simply a value. In the SDK specification, as measurements are aggregated and transformed for export, these terms will be used again, with the same meanings, to describe aggregates. + +### Function names + +Synchronous delta instruments support an `Add()` function, signifying that they add to a sum and are not cumulative. + +Synchronous instantaneous instruments support a `Record()` function, signifying that they capture individual events, not only a sum. + +Asynchronous instruments all support an `Observe()` function, signifying that they capture only one value per measurement interval. + +### Rate support + +Rate aggregation is supported for Counter and SumObserver instruments in the default implementation. + +The `UpDown-` forms of additive instrument are not suitable for aggregating rates because the up- and down-changes in state may cancel each other. + +Non-additive instruments can be used to derive a sum, meaning rate aggregation is possible when the values are non-negative. There is not a standard non-additive instrument with a non-negative refinement in the standard. + +### Default Aggregations + +Additive instruments use `Sum` aggregation by default, since by definition they are used when only the sum is of interest. + +Instantaneous instruments use `MinMaxSumCount` aggregation by default, which is an inexpensive way to summarize a distribution of values. + +## Detail + +Here we discuss the six proposed instruments individually and mention other names considered for each. + +### Counter + +`Counter` is the most common synchronous instrument. This instrument supports an `Add(delta)` function for reporting a sum, and is restricted to non-negative deltas. The default aggregation is `Sum`, as for any additive instrument, which are those instruments with Delta or Cumulative measurement kind. + +Example uses for `Counter`: + +- count the number of bytes received +- count the number of accounts created +- count the number of checkpoints run +- count a number of 5xx errors. + +These example instruments would be useful for monitoring the rate of any of these quantities. 
In these situations, it is usually more convenient to report a change of the associated sums, as the change happens, as opposed to maintaining and reporting the sum.
+
+Other names considered: `Adder`, `SumCounter`.
+
+### UpDownCounter
+
+`UpDownCounter` is similar to `Counter` except that `Add(delta)` supports negative deltas. This makes `UpDownCounter` not useful for computing a rate aggregation. It aggregates a `Sum`; the only difference is that the sum is non-monotonic. It is generally useful for counting changes in an amount of resources used, or any quantity that rises and falls, in a request context.
+
+Example uses for `UpDownCounter`:
+
+- count memory in use by instrumenting `new` and `delete`
+- count queue size by instrumenting `enqueue` and `dequeue`
+- count semaphore `up` and `down` operations.
+
+These example instruments would be useful for monitoring resource levels across a group of processes.
+
+Other names considered: `NonMonotonicCounter`.
+
+### ValueRecorder
+
+`ValueRecorder` is a non-additive synchronous instrument useful for recording any non-additive number, positive or negative. Values captured by a `ValueRecorder` are treated as individual events belonging to a distribution that is being summarized. `ValueRecorder` should be chosen either when capturing measurements that do not contribute meaningfully to a sum, or when capturing numbers that are additive in nature, but where the distribution of individual increments is considered interesting.
+
+One of the most common uses for `ValueRecorder` is to capture latency measurements. Latency measurements are not additive in the sense that there is little need to know the latency-sum of all processed requests. We use a `ValueRecorder` instrument to capture latency measurements typically because we are interested in knowing mean, median, and other summary statistics about individual events.
+
+The default aggregation for `ValueRecorder` computes the minimum and maximum values, the sum of event values, and the count of events, allowing the rate, the mean, and the range of input values to be monitored.
+
+Example uses for `ValueRecorder` that are non-additive:
+
+- capture any kind of timing information
+- capture the acceleration experienced by a pilot
+- capture nozzle pressure of a fuel injector
+- capture the velocity of a MIDI key-press.
+
+Example _additive_ uses of `ValueRecorder` capture measurements that are cumulative or delta values, but where we may have an interest in the distribution of values and not only the sum:
+
+- capture a request size
+- capture an account balance
+- capture a queue length
+- capture a number of board feet of lumber.
+
+These examples show that although they are additive in nature, choosing `ValueRecorder` as opposed to `Counter` or `UpDownCounter` implies an interest in more than the sum. If you did not care to collect information about the distribution, you would have chosen one of the additive instruments instead. Using `ValueRecorder` makes sense for distributions that are likely to be important in an observability setting.
+
+Use these instruments with caution, because they naturally cost more than additive measurements.
+
+Other names considered: `Distribution`, `Measure`, `LastValueRecorder`, `GaugeRecorder`, `DistributionRecorder`.
+
+### SumObserver
+
+`SumObserver` is the asynchronous instrument corresponding to `Counter`, used to capture a monotonic count. "Sum" appears in the name to remind users that it is a cumulative instrument.
Use a `SumObserver` to capture any value that starts at zero and rises throughout the process lifetime but never falls.
+
+Example uses for `SumObserver`:
+
+- capture process user/system CPU seconds
+- capture the number of cache misses.
+
+A `SumObserver` is a good choice in situations where a measurement is expensive to compute, such that it would be wasteful to compute on every request. For example, a system call is needed to capture process CPU usage, therefore it should be done periodically, not on each request. A `SumObserver` is also a good choice in situations where it would be impractical or wasteful to instrument individual deltas that comprise a sum. For example, even though the number of cache misses is a sum of individual cache-miss events, it would be too expensive to synchronously capture each event using a `Counter`.
+
+Other names considered: `CumulativeObserver`.
+
+### UpDownSumObserver
+
+`UpDownSumObserver` is the asynchronous instrument corresponding to `UpDownCounter`, used to capture a non-monotonic count. "Sum" appears in the name to remind users that it is a cumulative instrument. Use an `UpDownSumObserver` to capture any value that starts at zero and rises or falls throughout the process lifetime.
+
+Example uses for `UpDownSumObserver`:
+
+- capture process heap size
+- capture number of active shards
+- capture number of requests started/completed
+- capture current queue size.
+
+The same considerations mentioned for choosing `SumObserver` over the synchronous `Counter` apply for choosing `UpDownSumObserver` over the synchronous `UpDownCounter`. If a measurement is expensive to compute, or if the corresponding delta events happen so frequently that it would be impractical to instrument them, use an `UpDownSumObserver`.
+
+Other names considered: `UpDownCumulativeObserver`.
+
+### ValueObserver
+
+`ValueObserver` is the asynchronous instrument corresponding to `ValueRecorder`, used to capture non-additive measurements that are expensive to compute and/or are not request-oriented.
+
+Example uses for `ValueObserver`:
+
+- capture CPU fan speed
+- capture CPU temperature.
+
+Note that these examples use non-additive measurements. In the `ValueRecorder` case above, example uses were given for capturing synchronous cumulative measurements in a request context (e.g., current queue size seen by a request). In the asynchronous case, however, how should users decide whether to use `ValueObserver` as opposed to `UpDownSumObserver`?
+
+Consider how to report the (cumulative) size of a queue asynchronously. Both `ValueObserver` and `UpDownSumObserver` logically apply in this case. Asynchronous instruments capture only one measurement per interval, so in this example the `SumObserver` reports a current sum, while the `ValueObserver` reports a current sum (equal to the max and the min) and a count equal to 1. When there is no aggregation, these results are equivalent.
+
+The recommendation is to choose the instrument with the more-appropriate default aggregation. If you are observing a queue size across a group of machines and the only thing you want to know is the aggregate queue size, use `SumObserver`. If you are observing a queue size across a group of machines and you are interested in knowing the distribution of queue sizes across those machines, use `ValueObserver`.
+
+Other names considered: `GaugeObserver`, `LastValueObserver`, `DistributionObserver`.
+
+## Details Q&A
+
+### Why MinMaxSumCount for `ValueRecorder`, `ValueObserver`?
+
+There has been a question about the choice of `MinMaxSumCount` for the two non-additive instruments. The use of four values in the default aggregation for these instruments means that four values will be exported for these two instrument kinds. The choice of Min, Max, Sum, and Count was intended to be an inexpensive default, but there is an even-more-minimal default aggregation we could choose. The question was: Should "SumCount" be the default aggregation for these instruments? The use of "SumCount" implies the ability to monitor the rate and the average, but not the range of values.
+
+This proposal continues to specify the use of MinMaxSumCount for these two instruments. Our belief is that in cases where performance and cost are concerns, there is usually an additive instrument that can be applied at lower cost. In the case of `ValueObserver`, consider using a `SumObserver` or `UpDownSumObserver`. In the case of `ValueRecorder`, consider configuring a less expensive view of these instruments than the default.
+
+### `ValueObserver` temporal quality: Delta or Instantaneous?
+
+There has been a question about labeling `ValueObserver` measurements with the temporal quality Delta vs. Instantaneous. There is a related question: What does it mean to aggregate a Min and Max value for an asynchronous instrument, which may only produce one measurement per collection interval?
+
+The purpose of defining the default aggregation, when there is only one measurement per interval, is to specify how values will be aggregated across multiple collection intervals. When there is no aggregation being applied, the result of MinMaxSumCount aggregation for a single collection interval is a single measurement equal to the Min, the Max, and the Sum, as well as a Count equal to 1. Before we apply aggregation to a `ValueObserver` measurement, we can clearly define it as an Instantaneous measurement. A measurement, captured at an instant near the end of the collection interval, is neither a cumulative nor a delta with respect to the prior collection interval.
+
+[OTEP 88][otep-88] discusses the Last Value relationship to help address this question. After capturing a single `ValueObserver` measurement for a given instrument and label set, that measurement becomes the Last value associated with that instrument until the next measurement is taken.
+
+To aggregate `ValueObserver` measurements across spatial dimensions means to combine last values into a distribution at an effective moment in time. MinMaxSumCount aggregation, in this case, means computing the Min and Max values, the measurement sum, and the count of distinct label sets that contributed measurements. The aggregated result is considered instantaneous: it may have been computed using data points from different machines, potentially using different collection intervals. The aggregate value must be considered approximate, with respect to time, since it averages the results from uncoordinated collection intervals. We may have combined the last-value from a 1-minute collection interval with the last-value from a 10-second collection interval: the result is an instantaneous summary of the distribution across spatial dimensions.
+
+Aggregating `ValueObserver` measurements across the time dimension for a given instrument and label set yields a set of measurements that were taken across a span of time, but this does not automatically lead us to consider them delta measurements.
If we aggregate 10 consecutive collection intervals for a given label set, what we have is a distribution of instantaneous measurements with Count equal to 10, with the Min, Max and Sum serving to convey the average value and the range of values present in the distribution. The result is a time-averaged distribution of instantaneous measurements.
+
+Whether aggregating across time or space, it has been argued, the result of a `ValueObserver` instrument has the Instantaneous temporal quality.
+
+#### Temporal and spatial aggregation of `ValueObserver` measurements
+
+Aggregating `ValueObserver` measurements across both spatial and time dimensions must be done carefully to avoid a bias toward results computed over shorter collection intervals. A time-averaged aggregation across spatial dimensions must take the collection interval into account, which can be done as follows:
+
+1. Decide the time span being queried, say [T_begin, T_end].
+2. Divide the time span into a list of timestamps, say [T_begin, T_begin+(T_end-T_begin)/2, T_end].
+3. For each distinct label set and timestamp, compute the spatial aggregation using the last-value definition at that timestamp. This results in a set of timestamped aggregate measurements with comparable counts.
+4. Aggregate the timestamped measurements from step 3.
+
+Steps 2 and 3 ensure that measurements taken less frequently have equal representation in the output, by virtue of computing the spatial aggregation first. If we were to compute the temporal aggregation first, then aggregate across spatial dimensions, then instruments collected at a higher frequency would contribute correspondingly more points to the aggregation. Thus, we must aggregate `ValueObserver` instruments across spatial dimensions before averaging across time.
+
+## Open Questions
+
+### Timing instrument
+
+One potentially important special-purpose instrument, found in some metrics APIs, is a dedicated instrument for reporting timings. The rationale is that when reporting timings, getting the units right is important and often not easy. Many programming languages use a different type to represent time or a difference between times. To correctly report a timing distribution, OpenTelemetry requires using a `ValueRecorder` but also configuring it for the units output by the clock that was used.
+
+In the past, a proposal to create a dedicated `TimingValueRecorder` instrument was rejected. This instrument would be identical to a `ValueRecorder`, but its `Record()` method would be specialized for the correct type used to represent a duration, so that the units could be set correctly and automatically. A related pattern is a `Timer` or `StopWatch` instrument, one responsible for both measuring and capturing a timing.
+
+Should types such as these be added as helpers? For example, should `TimingValueRecorder` be a real instrument, or should it be a helper that wraps around a `ValueRecorder`? There is a concern that making `TimingValueRecorder` into a helper makes it less visible, less standard, and that not having it at all will encourage instrumentation mistakes.
+
+This may be revisited in the future.
+
+### Synchronous cumulative and asynchronous delta helpers
+
+A cumulative measurement can be converted into a delta measurement by remembering the last-reported value. A helper instrument could offer to emulate synchronous cumulative measurements by remembering the last-reported value and reporting deltas synchronously.
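+
+A minimal sketch of such a helper, assuming a Counter-like `add_fn(value, labels)`
+callable (the class and names here are illustrative, not part of any proposed API):
+
+```python
+# Illustrative only: remembers the last-reported cumulative value per label set
+# and reports deltas to a synchronous Add()-style instrument.
+import threading
+
+class CumulativeToDeltaHelper:
+    def __init__(self, add_fn):
+        self._add_fn = add_fn          # e.g. counter.add(value, labels)
+        self._last = {}                # last cumulative value per label set
+        self._lock = threading.Lock()  # the helper, not the SDK, provides ordering
+
+    def report(self, cumulative_value, labels):
+        key = tuple(sorted(labels.items()))
+        with self._lock:
+            delta = cumulative_value - self._last.get(key, 0)
+            self._last[key] = cumulative_value
+        if delta != 0:
+            self._add_fn(delta, labels)
+```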
+
+A delta measurement can be converted into a cumulative measurement by remembering the sum of all reported values. A helper instrument could offer to emulate asynchronous delta measurements in this way.
+
+Should helpers of this nature be standardized, if there is demand? These helpers are excluded from the standard because they carry a number of caveats, but as helpers they can easily do what an OpenTelemetry SDK cannot do in general. For example, we are avoiding synchronous cumulative instruments because they seem to imply ordering that an SDK is not required to support; however, an instrument helper that itself uses a lock can easily convert to deltas.
+
+Should such helpers be standardized? The answer is probably no.
+
+[otep-88]: ./0088-metric-instrument-optional-refinements.md
diff --git a/oteps/metrics/0108-naming-guidelines.md b/oteps/metrics/0108-naming-guidelines.md
new file mode 100644
index 00000000000..61f76681277
--- /dev/null
+++ b/oteps/metrics/0108-naming-guidelines.md
@@ -0,0 +1,23 @@
+# Metric instrument naming guidelines
+
+## Purpose
+
+Names and labels for metric instruments are primarily how humans interact with metric data -- users rely on these names to build dashboards and perform analysis. The names and hierarchical structure need to be understandable and discoverable during routine exploration -- and this becomes critical during incidents.
+
+To ensure these goals and consistency in future metric naming standards, this document outlines a meta-standard for these names.
+
+## Guidelines
+
+Metric names and labels exist within a single universe and a single hierarchy. Metric names and labels MUST be considered within the universe of all existing metric names. When defining new metric names and labels, consider the prior art of existing standard metrics and metrics from frameworks/libraries.
+
+Associated metrics SHOULD be nested together in a hierarchy based on their usage. Define a top-level hierarchy for common metric categories: for OS metrics, like CPU and network; for app runtimes, like GC internals. Libraries and frameworks should nest their metrics into a hierarchy as well. This aids in discovery and ad hoc comparison, and allows a user to find similar metrics given a certain metric.
+
+The hierarchical structure of metrics defines the namespacing. Supporting OpenTelemetry artifacts define the metric structures and hierarchies for some categories of metrics, and these can assist decisions when creating future metrics.
+
+Common labels SHOULD be consistently named. This aids in discoverability and disambiguates similar labels across metric names.
+
+["As a rule of thumb, **aggregations** over all the dimensions of a given metric **SHOULD** be meaningful,"](https://prometheus.io/docs/practices/naming/#metric-names) as Prometheus recommends.
+
+Semantic ambiguity SHOULD be avoided. Use prefixed metric names in cases where similar metrics have significantly different implementations across the breadth of all existing metrics. For example, every garbage collected runtime has slightly different strategies and measures. Using a single set of metric names for GC, not divided by the runtime, could create dissimilar comparisons and confusion for end users. (For example, prefer `runtime.java.gc*` over `runtime.gc.*`.) Measures of many operating system metrics, by contrast, are similar across systems.
+
+For conventional metrics, or metrics that have their units included in OpenTelemetry metadata (e.g. `metric.WithUnit` in Go), the metric name SHOULD NOT include the units.
Units may be included when they provide additional meaning to the metric name. Metrics MUST, above all, be understandable and usable.
diff --git a/oteps/metrics/0113-exemplars.md b/oteps/metrics/0113-exemplars.md
new file mode 100644
index 00000000000..d0a4be102cb
--- /dev/null
+++ b/oteps/metrics/0113-exemplars.md
@@ -0,0 +1,104 @@
+# Integrate Exemplars with Metrics
+
+This OTEP adds exemplar support to aggregations defined in the Metrics SDK.
+
+## Definition
+
+Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement.
+
+## Motivation
+
+Defining exemplar behaviour for aggregations allows OpenTelemetry to support exemplars in Google Cloud Monitoring.
+
+Exemplars provide a link between metrics and traces. Consider a user using a Histogram aggregation to track response latencies over time for a high QPS server. The histogram is composed of buckets based on the speed of the request, for example, "there were 55 requests that took 400-500 milliseconds". The user wants to troubleshoot slow requests, so they would need to find a trace where the latency was high. With exemplars, the user is able to get an exemplar trace from a high latency bucket, an exemplar trace from a low latency bucket, and compare them to figure out the reason for the high latency.
+
+Exemplars are meaningful for all aggregations where relevant traces can provide more context to the aggregation, as well as when exemplars can display specific information not otherwise shown in the aggregation (for example, the full set of labels where they otherwise might be aggregated away).
+
+## Internal details
+
+An exemplar is a `RawValue`, which is defined as:
+
+```
+message RawValue {
+  // Numerical value of the measurement that was recorded. Only one of these two fields is
+  // used for the data, depending on its type
+  double double_value = 1;
+  int64 int64_value = 2;
+
+  // Exact time that the measurement was recorded
+  fixed64 time_unix_nano = 3;
+
+  // 'label:value' map of all labels that were provided by the user recording the measurement
+  repeated opentelemetry.proto.common.v1.StringKeyValue labels = 4;
+
+  // Span ID of the current trace
+  optional bytes span_id = 5;
+
+  // Trace ID of the current trace
+  optional bytes trace_id = 6;
+
+  // When sample_count is non-zero, this exemplar has been chosen in a statistically
+  // unbiased way such that the exemplar is representative of `sample_count` individual events
+  optional double sample_count = 7;
+}
+```
+
+Exemplar collection should be enabled through an optional parameter (disabled by default), and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregators are as high performance as possible. Aggregators should also have a parameter to determine whether exemplars should only be collected if they are recorded during a sampled trace, or if tracing should have no effect on which exemplars are sampled. This allows aggregations to prioritize either the link between metrics and traces or the statistical significance of exemplars, when necessary.
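+
+A rough sketch of how an aggregator might apply these two options (illustrative
+pseudocode only; the class, parameter, and method names are hypothetical and not
+part of this proposal's API):
+
+```python
+# Hypothetical sketch: exemplar capture guarded by the two proposed options.
+import random
+
+class ExemplarReservoir:
+    def __init__(self, enabled=False, require_sampled_trace=True, max_exemplars=10):
+        self.enabled = enabled                        # disabled by default
+        self.require_sampled_trace = require_sampled_trace
+        self.max_exemplars = max_exemplars
+        self._exemplars = []
+        self._seen = 0
+
+    def offer(self, raw_value, trace_is_sampled):
+        if not self.enabled:
+            return  # no exemplar logic at all when disabled
+        if self.require_sampled_trace and not trace_is_sampled:
+            return  # only keep exemplars that link to a sampled trace
+        # Reservoir sampling keeps a statistically unbiased, bounded set.
+        self._seen += 1
+        if len(self._exemplars) < self.max_exemplars:
+            self._exemplars.append(raw_value)
+        else:
+            i = random.randrange(self._seen)
+            if i < self.max_exemplars:
+                self._exemplars[i] = raw_value
+```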
+
+[#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregators in the metrics SDK. Here we describe how exemplars could be implemented for each aggregator.
+
+### Exemplar behaviour for standard aggregators
+
+#### HistogramAggregator
+
+The HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are distributed across all buckets of the histogram (there should be one or more exemplars in every bucket that has a population of at least one sample-able measurement). Implementations SHOULD NOT retain an unbounded number of exemplars.
+
+#### Sketch
+
+A Sketch aggregator SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. (Specific details not defined, see open questions.)
+
+#### Last-Value
+
+Most (if not all) Last-Value aggregators operate asynchronously and do not ever interact with context. Since the value of a Last-Value is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Last-Value.
+
+#### Exact
+
+The Exact aggregator will function by maintaining a list of `RawValue`s, which contain all of the information exemplars would carry. Therefore the Exact aggregator will not need to maintain any exemplars.
+
+#### Counter
+
+Exemplars give value to counter aggregations in two ways: One, by tying metric and trace data together, and two, by providing necessary information to re-create the input distribution. When enabled, the aggregator will retain a bounded list of exemplars at each checkpoint, sampled from across the distribution of the data. Exemplars should be sampled in a statistically unbiased way.
+
+#### MinMaxSumCount
+
+Similar to Counter, MinMaxSumCount should retain a bounded list of exemplars that were sampled from across the input distribution in a statistically unbiased way.
+
+#### Custom Aggregators
+
+Custom aggregators MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregators should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Google Cloud Monitoring should only be retained if they were recorded within a sampled trace).
+
+Exemplars will always be retrieved from aggregations (by the exporter) as a list of RawValue objects. They will be communicated via a
+
+```
+repeated RawValue exemplars = 6
+```
+
+attribute on the `Metric` object.
+
+## Trade-offs and mitigations
+
+Performance (in terms of memory usage and to some extent time complexity) is the main concern of implementing exemplars. However, by making recording exemplars optional, there should be minimal overhead when exemplars are not enabled.
+
+## Prior art and alternatives
+
+Exemplars are implemented in [OpenCensus](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/Exemplars.md#exemplars), but only for HistogramAggregator. This OTEP is largely a port from the OpenCensus definition of exemplars, but it also adds exemplar support to other aggregators.
+
+[Cloud monitoring API doc for exemplars](https://cloud.google.com/monitoring/api/ref_v3/rpc/google.api#google.api.Distribution.Exemplar)
+
+## Open questions
+
+- Exemplars usually refer to a span in a sampled trace.
While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case? + +- We don’t have a strong grasp on how the sketch aggregator works in terms of implementation - so we don’t have enough information to design how exemplars should work properly. + +- The spec doesn't yet define a standard set of aggregations, just default aggregations for standard metric instruments. Since exemplars are always attached to particular aggregations, it's impossible to fully specify the behavior of exemplars. diff --git a/oteps/metrics/0126-Configurable-Metric-Aggregations.md b/oteps/metrics/0126-Configurable-Metric-Aggregations.md new file mode 100644 index 00000000000..2aa1707084d --- /dev/null +++ b/oteps/metrics/0126-Configurable-Metric-Aggregations.md @@ -0,0 +1,102 @@ +# A Proposal For SDK Support for Configurable Batching and Aggregations (Basic Views) + +Add support to the default SDK for the ability to configure Metric Aggregations. + +## Motivation + +OpenTelemetry's architecture separates the concerns of instrumentation and operation. The Metric Instruments +provided by the Metric API are all defined to have a default aggregation. And, by default, aggregations are +performed with all Labels being used to define a unit of aggregation. Although this is a good default +configuration for the SDK to provide, more configurability is needed. + +There are 3 main use-cases that this proposal is intended to address: + +1) The application developer/operator wishes to use an aggregation other than the default provided by the SDK +for a given instrument or set of instruments. +2) An exporter author wishes to inform the SDK what "Temporality" (delta vs. cumulative) the resulting metric +data points represent. "Delta" means only metric recordings since the last reporting interval are considered +in the aggregation, and "Cumulative" means that all metric recordings over the lifetime of the Instrument are +considered in the aggregation. +3) The application developer/operator wishes to constrain the cardinality of labels for metrics being reported +to the metric vendor/backend of choice. + +## Explanation + +I propose a new feature for the default SDK, available on the interface of the SDK's MeterProvider implementation, to configure +the batching strategies and aggregations that will be used by the SDK when metric recordings are made. This is the beginnings +of a "Views" API, but does not intend to implement the full View functionality from OpenCensus. + +The basic API has two parts. + +* InstrumentSelector - Enables specifying the selection of one or more instruments for the configuration to apply to. + - Selection options include: the instrument type (Counter, ValueRecorder, etc), and a regex for instrument name. + - If more than one option is provided, they are considered additive. + - Example: select all ValueRecorders whose name ends with ".duration". +* View - configures how the batching and aggregation should be done. + - 3 things can be specified: The aggregation (Sum, MinMaxSumCount, Histogram, etc), the "temporality" of the batching, + and a set of pre-defined labels to consider as the subset to be used for aggregations. + - Note: "temporality" can be one of "DELTA" and "CUMULATIVE" and specifies whether the values of the aggregation + are reset after a collection is done or not, respectively. + - If not all are specified, then the others should be considered to be requesting the default. 
+ - Examples: + - Use a MinMaxSumCount aggregation, and provide delta-style batching. + - Use a Histogram aggregation, and only use two labels "route" and "error" for aggregations. + - Use a quantile aggregation, and drop all labels when aggregating. + +In this proposal, there is only one View associated with each selector. + +As a concrete example, in Java, this might look something like this: + +```java + // get a handle to the MeterSdkProvider (note, this is concrete name of the default SDK class in java, not a general SDK) + MeterSdkProvider meterProvider = OpenTelemetrySdk.getMeterProvider(); + + // create a selector to select which instruments to customize: + InstrumentSelector instrumentSelector = InstrumentSelector.newBuilder() + .instrumentType(InstrumentType.COUNTER) + .build(); + + // create a configuration of how you want the metrics aggregated: + View view = + View.create(Aggregations.minMaxSumCount(), Temporality.DELTA); + + //register the configuration with the MeterSdkProvider + meterProvider.registerView(instrumentSelector, view); +``` + +## Internal details + +This OTEP does not specify how this should be implemented in a particular language, only the functionality that is desired. + +A prototype with a partial implementation of this proposal in Java is available in PR form [here](https://github.com/open-telemetry/opentelemetry-java/pull/1412) + +## Trade-offs and mitigations + +This does not intend to deliver a full "Views" API, although it is the basis for one. The goal here is +simply to allow configuration of the batching and aggregation by operators and exporter authors. + +This does not intend to specify the exact interface for providing these configurations, nor does it +consider a non-programmatic configuration option. + +## Prior art and alternatives + +* Prior Art is probably mostly in the [OpenCensus Views](https://opencensus.io/stats/view/) system. +* Another [OTEP](https://github.com/open-telemetry/oteps/pull/89) attempted to address building a Views API. + +## Open questions (to be resolved in an official specification) + +1. Should custom aggregations be allowable for all instruments? How should an SDK respond to a request for a non-supported aggregation? +2. Should the requesting of DELTA vs. CUMULATIVE be only available via an exporter-only API, rather than generally available to all operators? +3. Is regex-based name matching too broad and dangerous? Would the alternative (having to know the exact name of all instruments to configure) be too onerous? +4. Is there anything in this proposal that would make implementing a full Views API (i.e. having multiple, named aggregations per instrument) difficult? +5. How should an exporter interact with the SDK for which it is configured, in order to change aggregation settings? +6. Should the first implementation include label reduction, or should that be done in a follow-up OTEP/spec? +7. Does this support disabling an aggregation altogether, and if so, what is the interface for that? +8. What is the precedence of selectors, if more than one selector can apply to a given Instrument? + +## Future possibilities + +What are some future changes that this proposal would enable? + +- A full-blown views API, which would allow multiple "views" per instrument. It's unclear how an exporter would specify which one it wanted, or if it would all the generated metrics. +- Additional non-programmatic configuration options. 
diff --git a/oteps/metrics/0131-otlp-export-behavior.md b/oteps/metrics/0131-otlp-export-behavior.md
new file mode 100644
index 00000000000..29aeedeb5b8
--- /dev/null
+++ b/oteps/metrics/0131-otlp-export-behavior.md
@@ -0,0 +1,43 @@
+# OTLP Exporters Configurable Export Behavior
+
+ Add support for configurable export behavior in OTLP exporters.
+
+ The required behaviors are: 1) exporting cumulative values since start time by default, and 2) exporting delta values per collection interval when configured.
+
+## Motivation
+
+1. **Export behavior should be configurable**: Metric backends such as Prometheus, Cortex and other backends supporting Prometheus time-series that ingest data from the Prometheus remote write API require cumulative values for cumulative metrics and additive metrics, per collection interval. In order to export metrics generated by the SDK using the Collector, incoming values from the SDK should be cumulative values. Note that, in comparison, backends like Statsd expect delta values for each collection interval. To support different backend requirements, OTLP metric export behavior needs to be configurable, with cumulative values exported as a default. See discussion in [#731](https://github.com/open-telemetry/opentelemetry-specification/issues/731).
+2. **Cumulative export should be the default behavior since it is more reliable**: Cumulative export also addresses the problem of missing delta values for an UpDownCounter. The final consumer of the UpDownCounter metrics is almost always interested in the cumulative value. If the Metrics SDK exports deltas and allows the consumer to aggregate cumulative values, then any deltas lost in-transit will lead to inaccurate final values. This loss may impact the condition on which an alert is fired or not. On the other hand, exporting cumulative values guarantees only resolution is lost, but the value received by the final consumer will be correct eventually.
+   1. *Note:* The [Metrics SIG](https://docs.google.com/document/d/1LfDVyBJlIewwm3a0JtDtEjkusZjzQE3IAix8b0Fxy3Y/edit#heading=h.fxqkpi2ya3br) *July 23 and July 30 meetings concluded that cumulative export behavior is more reliable.* For example, Bogdan Drutu in [#725](https://github.com/open-telemetry/opentelemetry-specification/issues/725) notes “When exporting delta values of an UpdownCounter instrument, the export pipeline becomes a single point of failure for the alerts, any dropped "delta" will influence the "current" value of the metric in an undefined way."
+
+## Explanation
+
+In order to support Prometheus backends using cumulative values as well as other backends that use delta values, the SDK needs to be configurable and support an OTLP exporter which handles both cumulative values by default and delta values for export. The implication is that the OTLP metric protocol should support both cumulative and delta reporting strategies.
+
+Users should be allowed to declare an environment variable or configuration field that determines this setting for OTLP exporters.
+
+## Internal details
+
+OTLP exporters can report the behavior they need to the Metrics SDK. The SDK can merge the previous state of metrics with the current value and return the appropriate values to the exporter.
+
+Configurable export behavior is already coded in the Metrics Processor component in the [Go SDK](https://github.com/open-telemetry/opentelemetry-go/pull/840). However, this functionality is hardcoded today and would need to be rewritten to handle user-defined configuration.
See the OTLP metrics definition in [PR #193](https://github.com/open-telemetry/opentelemetry-proto/pull/193), which supports both export behaviors.
+
+## Trade-offs and mitigations
+
+ High memory usage: To support cumulative exports, the SDK needs to maintain state for each cumulative metric. This means users with high-cardinality metrics can experience high memory usage.
+
+The high-cardinality metrics use case could be addressed by adding the metrics aggregation processor in the Collector. This would enable the Collector, when configured as an Agent, to support converting delta OTLP to Cumulative OTLP. This functionality requires a single agent for each metric-generating client so that all delta values of a metric are converted by the same Collector instance.
+
+## Prior art and alternatives
+
+A discussed solution is to convert deltas to cumulative in the Collector both as an agent and as a standalone service. However, supporting conversion in the Collector when it is a standalone service requires implementation of a routing mechanism across all Collector instances to ensure delta values of the same cumulative metric are aggregated by the same Collector instance.
+
+## Open questions
+
+As stated in the previous section, delta to cumulative conversion in the Collector is needed to support Prometheus type backends. This may be necessary in the Collector in the future because the Collector may also accept metrics from other sources that report delta values. On the other hand, if sources are reporting cumulative values, cumulative to delta conversion is needed to support Statsd type backends.
+
+The future implementation for conversions in the Collector is still under discussion. There is a proposal to add a [Metric Aggregation Processor](https://github.com/open-telemetry/opentelemetry-collector/issues/1422) in the Collector which recommends a solution for delta to cumulative conversion.
+
+## Future possibilities
+
+A future improvement that could be considered is to support a dynamic configuration from a configuration server that determines the appropriate export strategy of OTLP clients at startup.
diff --git a/oteps/metrics/0146-metrics-prototype-scenarios.md b/oteps/metrics/0146-metrics-prototype-scenarios.md
new file mode 100644
index 00000000000..1c112b111b6
--- /dev/null
+++ b/oteps/metrics/0146-metrics-prototype-scenarios.md
@@ -0,0 +1,266 @@
+# Scenarios for Metrics API/SDK Prototyping
+
+With the stable release of the tracing specification, the OpenTelemetry
+community is willing to spend more energy on metrics API/SDK. The goal is to get
+the metrics API/SDK specification to
+[`Experimental`](../../specification/versioning-and-stability.md#experimental)
+state by end of 5/2021, and make it
+[`Stable`](../../specification/versioning-and-stability.md#stable)
+before end of 2021:
+
+* By end of 5/31/2021, we should have good confidence that we can recommend
+  that language client owners work on a metrics preview release. This means
+  starting from 6/1/2021 the specification should not have major surprises or
+  big changes. We will then start recommending that client maintainers
+  implement it. We might introduce additional features but there should be a
+  high bar.
+
+* By end of 9/30/2021, we should mark the metrics API/SDK specification as
+  [`Feature-freeze`](../../specification/document-status.md#feature-freeze),
+  and focus on bug fixing or editorial changes.
+ +* By end of 11/30/2021, we want to have a stable release of metrics API/SDK + specification, with multiple language SIGs providing RC (release candidate) or + [stable](../../specification/versioning-and-stability.md#stable) + clients. + +In this document, we will focus on two scenarios that we use for prototyping +metrics API/SDK. The goal is to have two scenarios which clearly capture the +major requirements, so we can work with language client SIGs to prototype, +gather the learnings, determine the scopes and stages. Later the scenarios can +be used as examples and test cases for all the language clients. + +Here are the languages we've agreed to use during the prototyping: + +* C# +* Java +* Python + +In order to not undertake such an enormous task at once, we will need to have an incremental approach and divide the work into multiple +stages: + +1. Do the end-to-end prototype to get the overall understanding of the problem + domain. We should also clarify the scope and be able to articulate it + precisely during this stage, here are some examples: + + * Why do we want to introduce brand new metrics APIs versus taking a well + established API (e.g. Prometheus and Micrometer), what makes OpenTelemetry + metrics API different (e.g. Baggage)? + * Do we need to consider OpenCensus Stats API shim, or this is out of scope? + +2. Focus on a core subset of API, cover the end-to-end library instrumentation + scenario. At this stage we don't expect to cover all the APIs as some of them + might be very similar (e.g. if we know how to record an integer, we don't + have to work on float/double as we can add them later by replicating what + we've done for integer). + +3. Focus on a core subset of SDK. This would help us to get the end-to-end + application. + +4. Replicate stage 2 to cover the complete set of APIs. + +5. Replicate stage 3 to cover the complete set of SDKs. + +## Scenario 1: Grocery + +The **Grocery** scenario covers how a developer could use metrics API and SDK in +a final application. It is a self-contained application which covers: + +* How to instrument the code in a vendor agnostic way +* How to configure the SDK and exporter + +Considering there might be multiple grocery stores, the metrics we collect will +have the store name as a dimension - which is fairly static (not changing while +the store is running). + +The store has plenty supply of potato and tomato, with the following price: + +* Potato: $1.00 / ea +* Tomato: $3.00 / ea + +Each customer has a unique name (e.g. customerA, customerB), a customer could +come to the store multiple times. Here goes the Python snippet: + +```python +store = GroceryStore("Portland") +store.process_order("customerA", {"potato": 2, "tomato": 3}) +store.process_order("customerB", {"tomato": 10}) +store.process_order("customerC", {"potato": 2}) +store.process_order("customerA", {"tomato": 1}) +``` + +We will need the following metrics every minute: + +**Order info:** + +| Store | Customer | Number of Orders | Amount (USD) | +| -------- | --------- | ---------------- | ------------ | +| Portland | customerA | 2 | 14.00 | +| Portland | customerB | 1 | 30.00 | +| Portland | customerC | 1 | 2.00 | + +**Items sold:** + +| Store | Customer | Item | Count | +| -------- | --------- | ------ | ----- | +| Portland | customerA | potato | 2 | +| Portland | customerA | tomato | 4 | +| Portland | customerB | tomato | 10 | +| Portland | customerC | potato | 2 | + +Each customer may enter and exit a grocery store. 
+ +Here goes the Python snippet: + +```python +store = GroceryStore("Portland") +store.enter_customer("customerA", {"account_type": "restaurant"}) +store.enter_customer("customerB", {"account_type": "home cook"}) +store.exit_customer("customerB", {"account_type": "home cook"}) +store.exit_customer("customerA", {"account_type": "restaurant"}) +``` + +We will need the following metrics every minute: + +**Customers in store:** + +| Store | Account type | Count | +| -------- | ----------- | ----- | +| Portland | restaurant | 1 | +| Portland | home cook | 1 | + +## Scenario 2: HTTP Server + +The _HTTP Server_ scenario covers how a library developer X could use metrics +API to instrument a library, and how the application developer Y can configure +the library to use OpenTelemetry SDK in a final application. X and Y are working +for different companies and they don't communicate. The demo has two parts - the +library (HTTP lib and ClimateControl lib owned by X) and the server app (owned by Y): + +* How developer X could instrument the library code in a vendor agnostic way + * Performance is critical for X + * X doesn't know which metrics and which dimensions will Y pick + * X doesn't know the aggregation time window, nor the final destination of the + metrics + * X would like to provide some default recommendation (e.g. default + dimensions, aggregation time window, histogram buckets) so consumers of his + library can have a better onboarding experience. +* How developer Y could configure the SDK and exporter + * How should Y hook up the metrics SDK with the library + * How should Y configure the time window(s) and destination(s) + * How should Y pick the metrics and the dimensions + +### Library Requirements + +The library developer (developer X) will provide two libraries: + +* Server climate control library - a library which monitors and controls the + temperature and humidity of the server. +* HTTP server library - a library which provides HTTP service. + +Both libraries will provide out-of-box metrics, the metrics have two categories: + +* Push metrics - the value is reported (via the API) when it is available, and + collected (via the SDK) based on the ask from consumer(s). If there is no ask + from the consumer, the API will be no-op and the data will be dropped on the + floor. +* Pull metrics - the value is always available, and is only reported and + collected based on the ask from consumer(s). If there is no ask from the + consumer, the value will not be reported at all (e.g. there is no API call to + fetch the temperature unless someone is asking for the temperature). + +#### Server Climate Control Library + +Note: the **Host Name** should leverage [`OpenTelemetry +Resource`](../../specification/resource/sdk.md), +so it should be covered by the metrics SDK rather than API, and strictly +speaking it is not considered as a "dimension" from the SDK perspective. + +**Server temperature:** + +| Host Name | Temperature (F) | +| --------- | --------------- | +| MachineA | 65.3 | + +Note: Temperature may take on negative values. For this reason, you must choose the instrument carefully to ensure it is one that +can accept negative recordings and that they can be aggregated appropriately. 
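+
+A rough sketch of what the library's temperature callback could look like in
+Python (illustrative only; the `observer.observe(value, labels)` shape is an
+assumption for this sketch, not the prototype's actual API):
+
+```python
+# Illustrative sketch: report the server temperature from a callback invoked by
+# the SDK once per collection interval ("pull"-style). The instrument chosen
+# must accept negative values, e.g. a last-value/up-down style observer rather
+# than a monotonic sum observer.
+def read_temperature_fahrenheit():
+    # Placeholder for the real sensor read (e.g. a file under /sys or a driver call).
+    return 65.3
+
+def observe_temperature(observer):
+    # Host Name comes from the OpenTelemetry Resource, so no label is needed here.
+    observer.observe(read_temperature_fahrenheit(), {})
+```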
+ +**Server humidity:** + +| Host Name | Humidity (%) | +| --------- | ------------ | +| MachineA | 21 | + +**Server CPU usage:** + +| Host Name | CPU usage (seconds) | +| --------- | ------------------- | +| MachineA | 100.1 | + +**Server Memory usage:** + +| Host Name | Memory usage (bytes) | +| --------- | -------------------- | +| MachineA | 1000000000 | +| MachineA | 2000000000 | + +#### HTTP Server Library + +**Received HTTP requests:** + +Note: the **Client Type** is passed in via the [`OpenTelemetry +Baggage`](../../specification/baggage/api.md), +strictly speaking it is not part of the metrics API, but it is considered as a +"dimension" from the metrics SDK perspective. + +| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port | +| --------- | ---------- | ----------- | ----------- | --------- | ----------- | --------- | --------- | --------- | --------- | +| MachineA | 1234 | Android | GET | otel.org | 1.1 | 127.0.0.1 | 51327 | 127.0.0.1 | 80 | +| MachineA | 1234 | Android | POST | otel.org | 1.1 | 127.0.0.1 | 51328 | 127.0.0.1 | 80 | +| MachineA | 1234 | iOS | PUT | otel.org | 1.1 | 127.0.0.1 | 51329 | 127.0.0.1 | 80 | + +**HTTP server request duration:** + +Note: the server duration is only available for **finished HTTP requests**. + +| Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port | Duration (ms) | +| --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | ------------- | +| MachineA | 1234 | Android | GET | otel.org | 200 | 1.1 | 127.0.0.1 | 51327 | 127.0.0.1 | 80 | 8.5 | +| MachineA | 1234 | Android | POST | otel.org | 304 | 1.1 | 127.0.0.1 | 51328 | 127.0.0.1 | 80 | 100.0 | + +**HTTP active sessions:** + +| HTTP Host | HTTP flavor | Active sessions | +| --------- | ------------- | --------------- | +| otel.org | 1.1 | 17 | +| otel.org | 2.0 | 20 | + +### Application Requirements + +The application owner (developer Y) would only want the following metrics: + +* Server temperature - reported every 5 seconds +* Server humidity - reported every minute +* HTTP server request duration, reported every 5 seconds, with a subset of the + dimensions: + * Host Name + * HTTP Method + * HTTP Host + * HTTP Status Code + * Client Type + * 90%, 95%, 99% and 99.9% server duration +* HTTP request counters, reported every 5 seconds: + * Total number of received HTTP requests + * Total number of finished HTTP requests + * Number of currently-in-flight HTTP requests (concurrent HTTP requests) + + | Host Name | Process ID | HTTP Host | Received Requests | Finished Requests | Concurrent Requests | + | --------- | ---------- | --------- | ----------------- | ----------------- | ------------------- | + | MachineA | 1234 | otel.org | 630 | 601 | 29 | + | MachineA | 5678 | otel.org | 1005 | 1001 | 4 | +* Exception samples (exemplar) - in case HTTP 5xx happened, developer Y would + want to see a sample request with trace id, span id and all the dimensions + (IP, Port, etc.) 
+ + | Trace ID | Span ID | Host Name | Process ID | Client Type | HTTP Method | HTTP Host | HTTP Status Code | HTTP Flavor | Peer IP | Peer Port | Host IP | Host Port | Exception | + | -------------------------------- | ---------------- | --------- | ---------- | ----------- | ----------- | --------- | ---------------- | ----------- | --------- | --------- | --------- | --------- | -------------------- | + | 8389584945550f40820b96ce1ceb9299 | 745239d26e408342 | MachineA | 1234 | iOS | PUT | otel.org | 500 | 1.1 | 127.0.0.1 | 51329 | 127.0.0.1 | 80 | SocketException(...) | diff --git a/oteps/profiles/0212-profiling-vision.md b/oteps/profiles/0212-profiling-vision.md new file mode 100644 index 00000000000..783340856bc --- /dev/null +++ b/oteps/profiles/0212-profiling-vision.md @@ -0,0 +1,188 @@ +# Propose OpenTelemetry profiling vision + +The following are high-level items that define our long-term vision for +Profiling support in the OpenTelemetry project that we aspire to achieve. + +While this vision document reflects our current desires, it is meant to be a +guide towards a collectively agreed upon set of objectives rather than a +checklist of requirements. A group of OpenTelemetry community members have +participated in a series of bi-weekly meetings for 2 months. The group +represents a cross-section of industry and domain expertise, who have found +common cause in the creation of this document. It is our shared intention to +continue to ensure alignment moving forward. As our vision evolves and matures, +we intend to incorporate our learnings further to facilitate an optimal outcome. + +This document and efforts thus far are motivated by: + +- This [long-standing issue](https://github.com/open-telemetry/oteps/issues/139) + created in October 2020 +- A conversation about priorities at the in-person OpenTelemetry meeting at Kubecon EU + 2022 +- Increasing community interest in profiling as an observability signal + alongside logs, metrics, and traces + +## What is profiling + +While the terms "profile" and "profiling" can have slightly different meanings +depending on the context, for the purposes of this OTEP we are defining the two +terms as follows: + +- Profile: A collection of stack traces with some metric associated with each + stack trace, typically representing the number of times that stack trace was + encountered +- Profiling: The process of collecting profiles from a running program, + application, or the system + +## How profiling aligns with the OpenTelemetry vision + +The [OpenTelemetry +vision](https://opentelemetry.io/mission/#vision-mdash-the-world-we-imagine-for-otel-end-users) +states: + +_Effective observability is powerful because it enables developers to innovate +faster while maintaining high reliability. But effective observability +absolutely requires high-quality telemetry – and the performant, consistent +instrumentation that makes it possible._ + +While existing OpenTelemetry signals fit all of these criteria, until recently +no effort has been explicitly geared towards creating performant and consistent +instrumentation based upon profiling data. + +## Making a well-rounded observability suite by adding profiling + +Currently Logs, Metrics, and Traces are widely accepted as the main “pillars” of +observability, each providing a different set of data from which a user can +query to answer questions about their system/application. 
+ +Profiling data can help further this goal by answering certain questions about a +system or application which logs, metrics, and traces are less equipped to +answer. We aim to facilitate implementations capable of best-in-class support +for collecting, processing, and transporting this profiling data. + +Our goals for profiling align with those of OpenTelemetry as a whole: + +- **Profiling should be easy**: the nature of profiling offers fast + time-to-value by often being able to optionally drop in a minimal amount of + code and instantly have details about application resource utilization +- **Profiling should be universal**: currently profiling is slightly different + across different languages, but with a little effort the representation of + profiling data can be standardized in a way where not only are languages + consistent, but profiling data itself is also consistent with the other + observability signals as well +- **Profiling should be vendor neutral**: From one profiling agent, users should + be able to send data to whichever vendor they like (or no vendor at all) and + interoperate with other OSS projects + +## Current state of profilers + +As it currently stands, the method for collecting profiles for an application +and the format of the profiles collected varies greatly depending on several +factors such as: + +- Language (and language runtime) +- Profiler Type +- Data type being profiled (i.e. cpu, memory, etc) +- Availability or utilization of symbolic information + +A fairly comprehensive taxonomy of various profiling formats can be found on the +[profilerpedia website](https://profilerpedia.markhansen.co.nz/formats/). + +As a result of this variation, the tooling and collection of profiling data +lacks in exactly the areas in which OpenTelemetry has built as its core +engineering values: + +- Profiling currently lacks compatibility: Each vendor, open source project, and + language has different ways of collecting, sending, and storing profiling data + and often with no regard to linking to other signals +- Profiling currently lacks consistency: Currently profiling agents and formats + can change arbitrarily with no unified criteria for how to take end-users into + account + +## Making profiling compatible with other signals + +Profiles are particularly useful in the context of other signals. For example, +having a profile for a particular “slow” span in a trace yields more actionable +information than simply knowing that the span was slow. + +OpenTelemetry will define how profiles will be correlated with logs, traces, and +metrics and how this correlation information will be stored. + +Correlation will work across 2 major dimensions: + +- To correlate telemetry emitted for the same request (also known as request or + trace context correlation) +- To correlate telemetry emitted from the same source (also known as resource + context correlation) + +## Standardize profiling data model for industry-wide sharing and reuse + +We will design a profiling data model that will aim to represent the vast +majority of profiling data with the following goals in mind: + +- Profiling formats should be as compact as possible +- Profiling data should be transferred as efficiently as possible and the model + should be lossless with intentional bias for enabling efficient marshaling, + transcoding (to and from other formats), and analysis +- Profiling formats should be able to be unambiguously mapped to the + standardized data model (i.e. collapsed, pprof, JFR, etc.) 
+- Profiling formats should contain mechanisms for representing relationships + between other telemetry signals (i.e. linking call stacks with spans) + +## Supporting legacy profiling formats + +For existing profilers we will provide instructions on how these legacy formats +can emit profiles in a manner that makes them compatible with OpenTelemetry’s +approach and enables telemetry data correlation. + +Particularly for popular profilers such as the ones native to Golang and Java +(JFR) we will help to have them produce OpenTelemetry-compatible profiles with +minimal overhead. + +## Performance considerations + +Profiling agents can be architected in a variety of differing ways, with +reasonable trade offs made that may impact performance, completeness, accuracy +and so on. Similarly, the manner in which such a profiler might produce or +consume OpenTelemetry-compatible data could vary significantly. As such, in our +standardization effort it is not feasible to be prescriptive on the matter of +resource usage for profilers. + +However, the output of OpenTelemetry's standardization effort must take into +account that some existing profilers are designed to be low overhead and high +performance. For example, they may operate in a whole-datacenter, always-on +manner, and/or in environments where they must guarantee low CPU/RAM/network +usage. The OpenTelemetry standardisation effort should take this into account +and strive to produce a format that is usable by profilers of this nature +without sacrificing their performance guarantees. + +Similar to other OpenTelemetry signals, we target production environments. Thus, the +profiling signal must be implementable with low overhead and conforming to +OpenTelemetry-wide runtime overhead / intrusiveness and wire data size requirements. + +## Promoting cloud-native best practices with profiling + +The CNCF’s mission states: _Cloud native technologies empower organizations to +build and run scalable applications in modern, dynamic environments such as +public, private, and hybrid clouds_ + +We will have best-in-class support for profiles emitted in cloud native +environments (e.g. Kubernetes, serverless, etc), including legacy applications +running in those environments. As we aim to achieve this goal we will center our +efforts around making profiling applications resilient, manageable and +observable. This is in line with the Cloud Native Computing Foundation and +OpenTelemetry missions and will thus allow us to further expand and leverage +those communities to further the respective missions. + +## Profiling use cases + +- Tracking resource utilization of an application over time to understand how + code changes, hardware configuration changes, and ephemeral environmental + issues influence performance +- Understanding what code is responsible for consuming resources (i.e. CPU, Ram, + disk, network) +- Planning for resource allotment for a group of services running in production +- Comparing profiles of different versions of code to understand how code has + improved or degraded over time +- Detecting frequently used and "dead" code in production +- Breaking a trace span into code-level granularity (i.e. 
function call and line + of code) to understand the performance for that particular unit diff --git a/oteps/profiles/0239-profiles-data-model.md b/oteps/profiles/0239-profiles-data-model.md new file mode 100644 index 00000000000..dff615e8a20 --- /dev/null +++ b/oteps/profiles/0239-profiles-data-model.md @@ -0,0 +1,1603 @@ +# Profiles Data Format + +Introduces Data Model for Profiles signal to OpenTelemetry. + + +* [Motivation](#motivation) +* [Design Notes](#design-notes) + * [Design Goals](#design-goals) +* [Data Model](#data-model) + * [Relationships Diagram](#relationships-diagram) + * [Relationships With Other Signals](#relationships-with-other-signals) + * [From Profiles to Other Signals](#from-profiles-to-other-signals) + * [From Other Signals to Profiles](#from-other-signals-to-profiles) + * [Compatibility With Original pprof](#compatibility-with-original-pprof) + * [Proto Definition](#proto-definition) + * [Message Descriptions](#message-descriptions) + * [Message `ProfilesData`](#message-profilesdata) + * [Message `ResourceProfiles`](#message-resourceprofiles) + * [Message `ScopeProfiles`](#message-scopeprofiles) + * [Message `ProfileContainer`](#message-profilecontainer) + * [Message `Profile`](#message-profile) + * [Message `ValueType`](#message-valuetype) + * [Message `Sample`](#message-sample) + * [Message `AttributeUnit`](#message-attributeunit) + * [Message `Link`](#message-link) + * [Message `Location`](#message-location) + * [Message `Line`](#message-line) + * [Message `Mapping`](#message-mapping) + * [Message `Function`](#message-function) + * [Example Payloads](#example-payloads) + * [Simple Example](#simple-example) + * [Notable Differences Compared to Other Signals](#notable-differences-compared-to-other-signals) + * [Relationships Between Messages](#relationships-between-messages) + * [Relationship Between Samples and Locations](#relationship-between-samples-and-locations) +* [Trade-Offs and Mitigations](#trade-offs-and-mitigations) +* [Prior Art and Alternatives](#prior-art-and-alternatives) + * [Other Popular Formats](#other-popular-formats) + * [Folded Stacks](#folded-stacks) + * [Chromium's Trace Event Format](#chromiums-trace-event-format) + * [Linux perf.data](#linux-perfdata) + * [Java Flight Recorder (JFR)](#java-flight-recorder-jfr) + * [Alternative Representations](#alternative-representations) + * [Benchmarking](#benchmarking) + * ["average" Profile](#average-profile) + * ["average" Profile With Timestamps Added to Each Sample](#average-profile-with-timestamps-added-to-each-sample) + * ["ruby" Profile With Very Deep Stacktraces](#ruby-profile-with-very-deep-stacktraces) + * ["large" Profile](#large-profile) + * [Conclusions](#conclusions) + * [Semantic Conventions](#semantic-conventions) + * [Profile Types](#profile-types) + * [Profile Units](#profile-units) + * [Decision Log](#decision-log) +* [Open Questions](#open-questions) + * [Units in Attributes](#units-in-attributes) + * [Timestamps](#timestamps) + * [Repetition of Attribute Keys](#repetition-of-attribute-keys) + * [Locations Optimization](#locations-optimization) +* [Future Possibilities](#future-possibilities) + + +## Motivation + +This is a proposal of a data model and semantic conventions that allow to represent profiles coming from a variety of different applications or systems. Existing profiling formats can be unambiguously mapped to this data model. Reverse mapping from this data model is also possible to the extent that the target profiling format has equivalent capabilities. 
+
+The purpose of the data model is to establish a common understanding of what a profile is and of what data needs to be recorded, transferred, stored and interpreted by a profiling system.
+
+## Design Notes
+
+### Design Goals
+
+These goals are based on the vision set out in [Profiling Vision OTEP](./0212-profiling-vision.md):
+
+* Make profiling compatible with other signals.
+* Standardize the profiling data model for industry-wide sharing and reuse.
+* Profilers must be implementable with low overhead, conforming to OpenTelemetry-wide runtime overhead/intrusiveness and wire data size requirements.
+
+The last point is particularly important in the context of profiling. Profilers generate large amounts of data, and users of profiling technology are very sensitive to the overhead that profiling introduces. In the past, high overhead has been a blocker for wider adoption of continuous profiling and was one of the reasons why profiling was not used in production environments. Therefore, it is important to make sure that the overhead of handling the profiling data on the client side, as well as in intermediaries (e.g. the Collector), is minimal.
+
+## Data Model
+
+This section describes the protobuf messages that are used to represent profiles data.
+
+### Relationships Diagram
+
+The following diagram shows the relationships between the messages. Relationships between messages are represented either by embedding one message in another (red arrows), or by referencing a message by index in a lookup table (blue arrows). More on that in the [Relationships Between Messages](#relationships-between-messages) section below.
+
+In addition, the relationship between `samples` and `locations` is further optimized for better performance. More on that in the [Relationship Between Samples and Locations](#relationship-between-samples-and-locations) section below.
+
+![diagram of data relationships](./images/otep0239/profiles-data-model.png)
+
+### Relationships With Other Signals
+
+There are two types of relationships between profiles and other signals:
+
+* from other signals to profiles (e.g. from log records, exemplars or trace spans)
+* from profiles to other signals
+
+#### From Profiles to Other Signals
+
+[Link](#message-link) is a message that represents connections between profile [Samples](#message-sample) and trace spans. It uses `trace_id` and `span_id` as identifiers.
+
+Because other signals, such as logs or metrics, link to traces in the same way (via `trace_id` and `span_id`), the same information can be used to correlate profiles with those signals.
+
+#### From Other Signals to Profiles
+
+Other signals can use `profile_id` to reference a profile. For example, a log record can reference the profile that was being collected when the log record was generated by carrying `profile_id` as one of its attributes. This makes it possible to correlate logs with profiles.
+
+Additionally, `trace_id` and `span_id` can be used to reference groups of [Samples](#message-sample) (but not individual [Samples](#message-sample)) in a Profile, since [Samples](#message-sample) are linked to traces with these same identifiers using [Links](#message-link).
+
+The exact details of such linking are out of scope for this OTEP. It is expected that they will be defined in the Profiles part of the [opentelemetry-specification](https://github.com/open-telemetry/opentelemetry-specification).
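+
+The following non-normative sketch illustrates the identifiers involved (the linking mechanics themselves are left to the specification, as noted above). It uses the OpenTelemetry Python tracing API to obtain the `trace_id` and `span_id` that a [Link](#message-link) carries, and shows a hypothetical `profile_id` attribute that a log record could carry in the other direction:
+
+```python
+import os
+
+from opentelemetry import trace
+
+ctx = trace.get_current_span().get_span_context()
+
+# Profile -> trace: a Link on a Sample carries the 16-byte trace id and
+# 8-byte span id of the span that was active while the sample was taken.
+link = {
+    "trace_id": ctx.trace_id.to_bytes(16, "big"),
+    "span_id": ctx.span_id.to_bytes(8, "big"),
+}
+
+# Other signals -> profile: a log record could carry the id of the profile
+# that was being collected, so that backends can join the two signals.
+profile_id = os.urandom(16)  # stand-in for the id of the current profile
+log_attributes = {"profile_id": profile_id.hex()}
+```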
+
+### Compatibility With Original pprof
+
+The proposed data model is backward compatible with the original pprof format, in the sense that a pprof file generated by existing software can be parsed using the new proto. All fields in the original pprof are preserved, so no data is lost.
+
+It is not forward compatible, meaning that a pprof file generated by the new proto cannot be parsed by existing software. This is mainly due to the sharing of call stacks between samples and the new format for labels (more on these differences below).
+
+### Proto Definition
+
+The proto definition is based on the [pprof format](https://github.com/google/pprof/blob/main/proto/profile.proto).
+
+In the landscape of performance profiling tools, pprof's data format stands as a clear industry standard. Its evolution and enduring relevance reflect its effectiveness in addressing diverse and complex performance profiling needs. Major technology firms and open-source projects alike routinely employ pprof, underscoring its universal applicability and reliability.
+
+According to the [data from Profilerpedia](https://docs.google.com/spreadsheets/d/1UM-WFQhNf4GcyXmluSUGnMbOenvN-TqP2HQC9-Y50Lc/edit?usp=sharing), pprof is one of the most widely used formats: compared to other formats, it is supported by the largest number of profilers and UIs, and can be converted to and from the largest number of other formats.
+
+The original pprof data model was enhanced to manage profiling data more effectively within the scope of OpenTelemetry, and certain upgrades were made to overcome a few of the original format's constraints.
+
+Here's a [diff between the original and modified pprof protos](https://github.com/open-telemetry/opentelemetry-proto-profile/compare/2cf711b3cfcc1edd4e46f9b82d19d016d6d0aa2a...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2), and here is a list of the main differences between pprof and OTLP profiles:
+
+* Sharing of the call stacks between samples.
+* Sharing of labels (now called attributes) between samples.
+* Reuse of OpenTelemetry conventions and message types.
+* Semantic conventions for linking to other signals via `trace_id`s and `span_id`s.
+* First-class timestamp support.
+* Expanded metadata attach points (Sample / Location / Mapping).
+
+Below you will find the proto for the new Profiles signal. It is split into two parts: the first part is the OpenTelemetry-specific part, and the second part is the modified pprof proto. The intention is to make it easier to compare the modified pprof proto to the original pprof proto.
+ +OpenTelemetry specific part: + + +```proto +syntax = "proto3"; + +package opentelemetry.proto.profiles.v1; + +import "opentelemetry/proto/common/v1/common.proto"; +import "opentelemetry/proto/resource/v1/resource.proto"; + +import "opentelemetry/proto/profiles/v1/alternatives/pprofextended/pprofextended.proto"; + +option csharp_namespace = "OpenTelemetry.Proto.Profiles.V1"; +option java_multiple_files = true; +option java_package = "io.opentelemetry.proto.profiles.v1"; +option java_outer_classname = "ProfilesProto"; +option go_package = "go.opentelemetry.io/proto/otlp/profiles/v1"; + +// Relationships Diagram +// +// ┌──────────────────┐ LEGEND +// │ ProfilesData │ +// └──────────────────┘ ─────▶ embedded +// │ +// │ 1-n ─────▷ referenced by index +// ▼ +// ┌──────────────────┐ +// │ ResourceProfiles │ +// └──────────────────┘ +// │ +// │ 1-n +// ▼ +// ┌──────────────────┐ +// │ ScopeProfiles │ +// └──────────────────┘ +// │ +// │ 1-n +// ▼ +// ┌──────────────────┐ +// │ ProfileContainer │ +// └──────────────────┘ +// │ +// │ 1-1 +// ▼ +// ┌──────────────────┐ +// │ Profile │ +// └──────────────────┘ +// │ 1-n +// │ 1-n ┌───────────────────────────────────────┐ +// ▼ │ ▽ +// ┌──────────────────┐ 1-n ┌──────────────┐ ┌──────────┐ +// │ Sample │ ──────▷ │ KeyValue │ │ Link │ +// └──────────────────┘ └──────────────┘ └──────────┘ +// │ 1-n △ △ +// │ 1-n ┌─────────────────┘ │ 1-n +// ▽ │ │ +// ┌──────────────────┐ n-1 ┌──────────────┐ +// │ Location │ ──────▷ │ Mapping │ +// └──────────────────┘ └──────────────┘ +// │ +// │ 1-n +// ▼ +// ┌──────────────────┐ +// │ Line │ +// └──────────────────┘ +// │ +// │ 1-1 +// ▽ +// ┌──────────────────┐ +// │ Function │ +// └──────────────────┘ +// + +// ProfilesData represents the profiles data that can be stored in persistent storage, +// OR can be embedded by other protocols that transfer OTLP profiles data but do not +// implement the OTLP protocol. +// +// The main difference between this message and collector protocol is that +// in this message there will not be any "control" or "metadata" specific to +// OTLP protocol. +// +// When new fields are added into this message, the OTLP request MUST be updated +// as well. +message ProfilesData { + // An array of ResourceProfiles. + // For data coming from a single resource this array will typically contain + // one element. Intermediary nodes that receive data from multiple origins + // typically batch the data before forwarding further and in that case this + // array will contain multiple elements. + repeated ResourceProfiles resource_profiles = 1; +} + +// A collection of ScopeProfiles from a Resource. +message ResourceProfiles { + reserved 1000; + + // The resource for the profiles in this message. + // If this field is not set then no resource info is known. + opentelemetry.proto.resource.v1.Resource resource = 1; + + // A list of ScopeProfiles that originate from a resource. + repeated ScopeProfiles scope_profiles = 2; + + // This schema_url applies to the data in the "resource" field. It does not apply + // to the data in the "scope_profiles" field which have their own schema_url field. + string schema_url = 3; +} + +// A collection of Profiles produced by an InstrumentationScope. +message ScopeProfiles { + // The instrumentation scope information for the profiles in this message. + // Semantically when InstrumentationScope isn't set, it is equivalent with + // an empty instrumentation scope name (unknown). 
+ opentelemetry.proto.common.v1.InstrumentationScope scope = 1; + + // A list of ProfileContainers that originate from an instrumentation scope. + repeated ProfileContainer profiles = 2; + + // This schema_url applies to all profiles and profile events in the "profiles" field. + string schema_url = 3; +} + +// A ProfileContainer represents a single profile. It wraps pprof profile with OpenTelemetry specific metadata. +message ProfileContainer { + // A globally unique identifier for a profile. The ID is a 16-byte array. An ID with + // all zeroes is considered invalid. + // + // This field is required. + bytes profile_id = 1; + + // start_time_unix_nano is the start time of the profile. + // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970. + // + // This field is semantically required and it is expected that end_time >= start_time. + fixed64 start_time_unix_nano = 2; + + // end_time_unix_nano is the end time of the profile. + // Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970. + // + // This field is semantically required and it is expected that end_time >= start_time. + fixed64 end_time_unix_nano = 3; + + // attributes is a collection of key/value pairs. Note, global attributes + // like server name can be set using the resource API. Examples of attributes: + // + // "/http/user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" + // "/http/server_latency": 300 + // "abc.com/myattribute": true + // "abc.com/score": 10.239 + // + // The OpenTelemetry API specification further restricts the allowed value types: + // https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#attribute + // Attribute keys MUST be unique (it is not allowed to have more than one + // attribute with the same key). + repeated opentelemetry.proto.common.v1.KeyValue attributes = 4; + + // dropped_attributes_count is the number of attributes that were discarded. Attributes + // can be discarded because their keys are too long or because there are too many + // attributes. If this value is 0, then no attributes were dropped. + uint32 dropped_attributes_count = 5; + + // Specifies format of the original payload. Common values are defined in semantic conventions. [required if original_payload is present] + string original_payload_format = 6; + + // Original payload can be stored in this field. This can be useful for users who want to get the original payload. + // Formats such as JFR are highly extensible and can contain more information than what is defined in this spec. + // Inclusion of original payload should be configurable by the user. Default behavior should be to not include the original payload. + // If the original payload is in pprof format, it SHOULD not be included in this field. + // The field is optional, however if it is present `profile` MUST be present and contain the same profiling information. + bytes original_payload = 7; + + // This is a reference to a pprof profile. Required, even when original_payload is present. + opentelemetry.proto.profiles.v1.alternatives.pprofextended.Profile profile = 8; +} +``` + + + +Modified pprof: + + + +```proto +// Profile is a common stacktrace profile format. +// +// Measurements represented with this format should follow the +// following conventions: +// +// - Consumers should treat unset optional fields as if they had been +// set with their default value. 
+// +// - When possible, measurements should be stored in "unsampled" form +// that is most useful to humans. There should be enough +// information present to determine the original sampled values. +// +// - On-disk, the serialized proto must be gzip-compressed. +// +// - The profile is represented as a set of samples, where each sample +// references a sequence of locations, and where each location belongs +// to a mapping. +// - There is a N->1 relationship from sample.location_id entries to +// locations. For every sample.location_id entry there must be a +// unique Location with that index. +// - There is an optional N->1 relationship from locations to +// mappings. For every nonzero Location.mapping_id there must be a +// unique Mapping with that index. + +syntax = "proto3"; + +package opentelemetry.proto.profiles.v1.alternatives.pprofextended; + +import "opentelemetry/proto/common/v1/common.proto"; + +option csharp_namespace = "OpenTelemetry.Proto.Profiles.V1.Alternatives.PprofExtended"; +option go_package = "go.opentelemetry.io/proto/otlp/profiles/v1/alternatives/pprofextended"; + +// Represents a complete profile, including sample types, samples, +// mappings to binaries, locations, functions, string table, and additional metadata. +message Profile { + // A description of the samples associated with each Sample.value. + // For a cpu profile this might be: + // [["cpu","nanoseconds"]] or [["wall","seconds"]] or [["syscall","count"]] + // For a heap profile, this might be: + // [["allocations","count"], ["space","bytes"]], + // If one of the values represents the number of events represented + // by the sample, by convention it should be at index 0 and use + // sample_type.unit == "count". + repeated ValueType sample_type = 1; + // The set of samples recorded in this profile. + repeated Sample sample = 2; + // Mapping from address ranges to the image/binary/library mapped + // into that address range. mapping[0] will be the main binary. + repeated Mapping mapping = 3; + // Locations referenced by samples via location_indices. + repeated Location location = 4; + // Array of locations referenced by samples. + repeated int64 location_indices = 15; + // Functions referenced by locations. + repeated Function function = 5; + // Lookup table for attributes. + repeated opentelemetry.proto.common.v1.KeyValue attribute_table = 16; + // Represents a mapping between Attribute Keys and Units. + repeated AttributeUnit attribute_units = 17; + // Lookup table for links. + repeated Link link_table = 18; + // A common table for strings referenced by various messages. + // string_table[0] must always be "". + repeated string string_table = 6; + // frames with Function.function_name fully matching the following + // regexp will be dropped from the samples, along with their successors. + int64 drop_frames = 7; // Index into string table. + // frames with Function.function_name fully matching the following + // regexp will be kept, even if it matches drop_frames. + int64 keep_frames = 8; // Index into string table. + + // The following fields are informational, do not affect + // interpretation of results. + + // Time of collection (UTC) represented as nanoseconds past the epoch. + int64 time_nanos = 9; + // Duration of the profile, if a duration makes sense. + int64 duration_nanos = 10; + // The kind of events between sampled occurrences. + // e.g [ "cpu","cycles" ] or [ "heap","bytes" ] + ValueType period_type = 11; + // The number of events between sampled occurrences. 
+ int64 period = 12; + // Free-form text associated with the profile. The text is displayed as is + // to the user by the tools that read profiles (e.g. by pprof). This field + // should not be used to store any machine-readable information, it is only + // for human-friendly content. The profile must stay functional if this field + // is cleaned. + repeated int64 comment = 13; // Indices into string table. + // Index into the string table of the type of the preferred sample + // value. If unset, clients should default to the last sample value. + int64 default_sample_type = 14; +} + +// Represents a mapping between Attribute Keys and Units. +message AttributeUnit { + // Index into string table. + int64 attribute_key = 1; + // Index into string table. + int64 unit = 2; +} + +// A pointer from a profile Sample to a trace Span. +// Connects a profile sample to a trace span, identified by unique trace and span IDs. +message Link { + // A unique identifier of a trace that this linked span is part of. The ID is a + // 16-byte array. + bytes trace_id = 1; + + // A unique identifier for the linked span. The ID is an 8-byte array. + bytes span_id = 2; +} + +// Specifies the method of aggregating metric values, either DELTA (change since last report) +// or CUMULATIVE (total since a fixed start time). +enum AggregationTemporality { + /* UNSPECIFIED is the default AggregationTemporality, it MUST not be used. */ + AGGREGATION_TEMPORALITY_UNSPECIFIED = 0; + + /** DELTA is an AggregationTemporality for a profiler which reports + changes since last report time. Successive metrics contain aggregation of + values from continuous and non-overlapping intervals. + + The values for a DELTA metric are based only on the time interval + associated with one measurement cycle. There is no dependency on + previous measurements like is the case for CUMULATIVE metrics. + + For example, consider a system measuring the number of requests that + it receives and reports the sum of these requests every second as a + DELTA metric: + + 1. The system starts receiving at time=t_0. + 2. A request is received, the system measures 1 request. + 3. A request is received, the system measures 1 request. + 4. A request is received, the system measures 1 request. + 5. The 1 second collection cycle ends. A metric is exported for the + number of requests received over the interval of time t_0 to + t_0+1 with a value of 3. + 6. A request is received, the system measures 1 request. + 7. A request is received, the system measures 1 request. + 8. The 1 second collection cycle ends. A metric is exported for the + number of requests received over the interval of time t_0+1 to + t_0+2 with a value of 2. */ + AGGREGATION_TEMPORALITY_DELTA = 1; + + /** CUMULATIVE is an AggregationTemporality for a profiler which + reports changes since a fixed start time. This means that current values + of a CUMULATIVE metric depend on all previous measurements since the + start time. Because of this, the sender is required to retain this state + in some form. If this state is lost or invalidated, the CUMULATIVE metric + values MUST be reset and a new fixed start time following the last + reported measurement time sent MUST be used. + + For example, consider a system measuring the number of requests that + it receives and reports the sum of these requests every second as a + CUMULATIVE metric: + + 1. The system starts receiving at time=t_0. + 2. A request is received, the system measures 1 request. + 3. A request is received, the system measures 1 request. + 4. 
A request is received, the system measures 1 request. + 5. The 1 second collection cycle ends. A metric is exported for the + number of requests received over the interval of time t_0 to + t_0+1 with a value of 3. + 6. A request is received, the system measures 1 request. + 7. A request is received, the system measures 1 request. + 8. The 1 second collection cycle ends. A metric is exported for the + number of requests received over the interval of time t_0 to + t_0+2 with a value of 5. + 9. The system experiences a fault and loses state. + 10. The system recovers and resumes receiving at time=t_1. + 11. A request is received, the system measures 1 request. + 12. The 1 second collection cycle ends. A metric is exported for the + number of requests received over the interval of time t_1 to + t_0+1 with a value of 1. + + Note: Even though, when reporting changes since last report time, using + CUMULATIVE is valid, it is not recommended. */ + AGGREGATION_TEMPORALITY_CUMULATIVE = 2; +} + +// ValueType describes the type and units of a value, with an optional aggregation temporality. +message ValueType { + int64 type = 1; // Index into string table. + int64 unit = 2; // Index into string table. + + AggregationTemporality aggregation_temporality = 3; +} + +// Each Sample records values encountered in some program +// context. The program context is typically a stack trace, perhaps +// augmented with auxiliary information like the thread-id, some +// indicator of a higher level request being handled etc. +message Sample { + // The indices recorded here correspond to locations in Profile.location. + // The leaf is at location_index[0]. [deprecated, superseded by locations_start_index / locations_length] + repeated uint64 location_index = 1; + // locations_start_index along with locations_length refers to to a slice of locations in Profile.location. + // Supersedes location_index. + uint64 locations_start_index = 7; + // locations_length along with locations_start_index refers to a slice of locations in Profile.location. + // Supersedes location_index. + uint64 locations_length = 8; + // A 128bit id that uniquely identifies this stacktrace, globally. Index into string table. [optional] + uint32 stacktrace_id_index = 9; + // The type and unit of each value is defined by the corresponding + // entry in Profile.sample_type. All samples must have the same + // number of values, the same as the length of Profile.sample_type. + // When aggregating multiple samples into a single sample, the + // result has a list of values that is the element-wise sum of the + // lists of the originals. + repeated int64 value = 2; + // label includes additional context for this sample. It can include + // things like a thread id, allocation size, etc. + // + // NOTE: While possible, having multiple values for the same label key is + // strongly discouraged and should never be used. Most tools (e.g. pprof) do + // not have good (or any) support for multi-value labels. And an even more + // discouraged case is having a string label and a numeric label of the same + // name on a sample. Again, possible to express, but should not be used. + // [deprecated, superseded by attributes] + repeated Label label = 3; + // References to attributes in Profile.attribute_table. [optional] + repeated uint64 attributes = 10; + + // Reference to link in Profile.link_table. [optional] + uint64 link = 12; + + // Timestamps associated with Sample represented in ms. These timestamps are expected + // to fall within the Profile's time range. 
[optional] + repeated uint64 timestamps = 13; +} + +// Provides additional context for a sample, +// such as thread ID or allocation size, with optional units. [deprecated] +message Label { + int64 key = 1; // Index into string table + + // At most one of the following must be present + int64 str = 2; // Index into string table + int64 num = 3; + + // Should only be present when num is present. + // Specifies the units of num. + // Use arbitrary string (for example, "requests") as a custom count unit. + // If no unit is specified, consumer may apply heuristic to deduce the unit. + // Consumers may also interpret units like "bytes" and "kilobytes" as memory + // units and units like "seconds" and "nanoseconds" as time units, + // and apply appropriate unit conversions to these. + int64 num_unit = 4; // Index into string table +} + +// Indicates the semantics of the build_id field. +enum BuildIdKind { + // Linker-generated build ID, stored in the ELF binary notes. + BUILD_ID_LINKER = 0; + // Build ID based on the content hash of the binary. Currently no particular + // hashing approach is standardized, so a given producer needs to define it + // themselves and thus unlike BUILD_ID_LINKER this kind of hash is producer-specific. + // We may choose to provide a standardized stable hash recommendation later. + BUILD_ID_BINARY_HASH = 1; +} + +// Describes the mapping of a binary in memory, including its address range, +// file offset, and metadata like build ID +message Mapping { + // Unique nonzero id for the mapping. [deprecated] + uint64 id = 1; + // Address at which the binary (or DLL) is loaded into memory. + uint64 memory_start = 2; + // The limit of the address range occupied by this mapping. + uint64 memory_limit = 3; + // Offset in the binary that corresponds to the first mapped address. + uint64 file_offset = 4; + // The object this entry is loaded from. This can be a filename on + // disk for the main binary and shared libraries, or virtual + // abstractions like "[vdso]". + int64 filename = 5; // Index into string table + // A string that uniquely identifies a particular program version + // with high probability. E.g., for binaries generated by GNU tools, + // it could be the contents of the .note.gnu.build-id field. + int64 build_id = 6; // Index into string table + // Specifies the kind of build id. See BuildIdKind enum for more details [optional] + BuildIdKind build_id_kind = 11; + // References to attributes in Profile.attribute_table. [optional] + repeated uint64 attributes = 12; + // The following fields indicate the resolution of symbolic info. + bool has_functions = 7; + bool has_filenames = 8; + bool has_line_numbers = 9; + bool has_inline_frames = 10; +} + +// Describes function and line table debug information. +message Location { + // Unique nonzero id for the location. A profile could use + // instruction addresses or any integer sequence as ids. [deprecated] + uint64 id = 1; + // The index of the corresponding profile.Mapping for this location. + // It can be unset if the mapping is unknown or not applicable for + // this profile type. + uint64 mapping_index = 2; + // The instruction address for this location, if available. It + // should be within [Mapping.memory_start...Mapping.memory_limit] + // for the corresponding mapping. A non-leaf address may be in the + // middle of a call instruction. It is up to display tools to find + // the beginning of the instruction if necessary. 
+ uint64 address = 3; + // Multiple line indicates this location has inlined functions, + // where the last entry represents the caller into which the + // preceding entries were inlined. + // + // E.g., if memcpy() is inlined into printf: + // line[0].function_name == "memcpy" + // line[1].function_name == "printf" + repeated Line line = 4; + // Provides an indication that multiple symbols map to this location's + // address, for example due to identical code folding by the linker. In that + // case the line information above represents one of the multiple + // symbols. This field must be recomputed when the symbolization state of the + // profile changes. + bool is_folded = 5; + + // Type of frame (e.g. kernel, native, python, hotspot, php). Index into string table. + uint32 type_index = 6; + + // References to attributes in Profile.attribute_table. [optional] + repeated uint64 attributes = 7; +} + +// Details a specific line in a source code, linked to a function. +message Line { + // The index of the corresponding profile.Function for this line. + uint64 function_index = 1; + // Line number in source code. + int64 line = 2; + // Column number in source code. + int64 column = 3; +} + +// Describes a function, including its human-readable name, system name, +// source file, and starting line number in the source. +message Function { + // Unique nonzero id for the function. [deprecated] + uint64 id = 1; + // Name of the function, in human-readable form if available. + int64 name = 2; // Index into string table + // Name of the function, as identified by the system. + // For instance, it can be a C++ mangled name. + int64 system_name = 3; // Index into string table + // Source file containing the function. + int64 filename = 4; // Index into string table + // Line number in source file. + int64 start_line = 5; +} +``` + + + +### Message Descriptions + +These are detailed descriptions of protobuf messages that are used to represent profiling data. + + +#### Message `ProfilesData` + +ProfilesData represents the profiles data that can be stored in persistent storage, +OR can be embedded by other protocols that transfer OTLP profiles data but do not +implement the OTLP protocol. +The main difference between this message and collector protocol is that +in this message there will not be any "control" or "metadata" specific to +OTLP protocol. +When new fields are added into this message, the OTLP request MUST be updated +as well. + +#### Message `ResourceProfiles` + +A collection of ScopeProfiles from a Resource. + +
+Field Descriptions + +##### Field `resource` + +The resource for the profiles in this message. +If this field is not set then no resource info is known. + +##### Field `scope_profiles` + +A list of ScopeProfiles that originate from a resource. + +##### Field `schema_url` + +This schema_url applies to the data in the "resource" field. It does not apply +to the data in the "scope_profiles" field which have their own schema_url field. +
+ +#### Message `ScopeProfiles` + +A collection of Profiles produced by an InstrumentationScope. + +
+Field Descriptions + +##### Field `scope` + +The instrumentation scope information for the profiles in this message. +Semantically when InstrumentationScope isn't set, it is equivalent with +an empty instrumentation scope name (unknown). + +##### Field `profiles` + +A list of ProfileContainers that originate from an instrumentation scope. + +##### Field `schema_url` + +This schema_url applies to all profiles and profile events in the "profiles" field. +
+ +#### Message `ProfileContainer` + +A ProfileContainer represents a single profile. It wraps pprof profile with OpenTelemetry specific metadata. + +
+Field Descriptions + +##### Field `profile_id` + +A globally unique identifier for a profile. The ID is a 16-byte array. An ID with +all zeroes is considered invalid. +This field is required. + +##### Field `start_time_unix_nano` + +start_time_unix_nano is the start time of the profile. +Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970. +This field is semantically required and it is expected that end_time >= start_time. + +##### Field `end_time_unix_nano` + +end_time_unix_nano is the end time of the profile. +Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970. +This field is semantically required and it is expected that end_time >= start_time. + +##### Field `attributes` + +attributes is a collection of key/value pairs. Note, global attributes +like server name can be set using the resource API. + +##### Field `dropped_attributes_count` + +dropped_attributes_count is the number of attributes that were discarded. Attributes +can be discarded because their keys are too long or because there are too many +attributes. If this value is 0, then no attributes were dropped. + +##### Field `original_payload_format` + +Specifies format of the original payload. Common values are defined in semantic conventions. [required if original_payload is present] + +##### Field `original_payload` + +Original payload can be stored in this field. This can be useful for users who want to get the original payload. +Formats such as JFR are highly extensible and can contain more information than what is defined in this spec. +Inclusion of original payload should be configurable by the user. Default behavior should be to not include the original payload. +If the original payload is in pprof format, it SHOULD not be included in this field. +The field is optional, however if it is present `profile` MUST be present and contain the same profiling information. + +##### Field `profile` + +This is a reference to a pprof profile. Required, even when original_payload is present. +
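+
+As an illustration only, the minimal required fields of a ProfileContainer could be populated as follows; this is a non-normative sketch using a plain Python dictionary and made-up values rather than the generated protobuf classes:
+
+```python
+import os
+import time
+
+start = time.time_ns()
+profile_container = {
+    "profile_id": os.urandom(16),   # 16 bytes; an all-zero id is invalid
+    "start_time_unix_nano": start,
+    "end_time_unix_nano": start,    # must be >= start_time_unix_nano
+    "attributes": [],               # key/value pairs; keys must be unique
+    "dropped_attributes_count": 0,
+    "profile": {},                  # the embedded (extended pprof) Profile, always required
+}
+```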
+ +#### Message `Profile` + +Profile is a common stacktrace profile format. +Measurements represented with this format should follow the +following conventions: + +- Consumers should treat unset optional fields as if they had been + set with their default value. +- When possible, measurements should be stored in "unsampled" form + that is most useful to humans. There should be enough + information present to determine the original sampled values. +- On-disk, the serialized proto must be gzip-compressed. +- The profile is represented as a set of samples, where each sample + references a sequence of locations, and where each location belongs + to a mapping. +- There is a N->1 relationship from sample.location_id entries to + locations. For every sample.location_id entry there must be a + unique Location with that index. +- There is an optional N->1 relationship from locations to + mappings. For every nonzero Location.mapping_id there must be a + unique Mapping with that index. + +Represents a complete profile, including sample types, samples, +mappings to binaries, locations, functions, string table, and additional metadata. + +
+Field Descriptions
+
+##### Field `sample_type`
+
+A description of the samples associated with each Sample.value.
+For a cpu profile this might be:
+[["cpu","nanoseconds"]] or [["wall","seconds"]] or [["syscall","count"]]
+For a heap profile, this might be:
+[["allocations","count"], ["space","bytes"]].
+If one of the values represents the number of events represented
+by the sample, by convention it should be at index 0 and use
+sample_type.unit == "count".
+
+##### Field `sample`
+
+The set of samples recorded in this profile.
+
+##### Field `mapping`
+
+Mapping from address ranges to the image/binary/library mapped
+into that address range. mapping[0] will be the main binary.
+
+##### Field `location`
+
+Locations referenced by samples via location_indices.
+
+##### Field `location_indices`
+
+Array of location indices referenced by samples.
+
+##### Field `function`
+
+Functions referenced by locations.
+
+##### Field `attribute_table`
+
+Lookup table for attributes.
+
+##### Field `attribute_units`
+
+Represents a mapping between Attribute Keys and Units.
+
+##### Field `link_table`
+
+Lookup table for links.
+
+##### Field `string_table`
+
+A common table for strings referenced by various messages.
+string_table[0] must always be "".
+
+##### Field `drop_frames`
+
+Frames with Function.function_name fully matching this regexp will be dropped
+from the samples, along with their successors. Index into string table.
+
+##### Field `keep_frames`
+
+Frames with Function.function_name fully matching this regexp will be kept,
+even if they match drop_frames. Index into string table.
+
+##### Field `time_nanos`
+
+Time of collection (UTC) represented as nanoseconds past the epoch. This field
+and the ones that follow are informational and do not affect the interpretation
+of results.
+
+##### Field `duration_nanos`
+
+Duration of the profile, if a duration makes sense.
+
+##### Field `period_type`
+
+The kind of events between sampled occurrences.
+e.g. [ "cpu","cycles" ] or [ "heap","bytes" ]
+
+##### Field `period`
+
+The number of events between sampled occurrences.
+
+##### Field `comment`
+
+Free-form text associated with the profile. The text is displayed as is
+to the user by the tools that read profiles (e.g. by pprof). This field
+should not be used to store any machine-readable information, it is only
+for human-friendly content. The profile must stay functional if this field
+is cleaned. Indices into string table.
+
+##### Field `default_sample_type`
+
+Index into the string table of the type of the preferred sample
+value. If unset, clients should default to the last sample value.
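+
+Many of the fields above are indices into `string_table`. The following non-normative sketch shows how a consumer resolves them, using a plain-Python rendering of a decoded Profile:
+
+```python
+# Non-normative sketch: resolving string-table indices in a decoded Profile.
+profile = {
+    "string_table": ["", "cpu", "nanoseconds"],  # string_table[0] must be ""
+    "sample_type": [{"type": 1, "unit": 2}],
+}
+
+for value_type in profile["sample_type"]:
+    type_name = profile["string_table"][value_type["type"]]  # "cpu"
+    unit_name = profile["string_table"][value_type["unit"]]  # "nanoseconds"
+    print(f"sample values are {type_name} in {unit_name}")
+```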
+ +#### Message `ValueType` + +ValueType describes the type and units of a value, with an optional aggregation temporality. + +
+Field Descriptions
+
+##### Field `type`
+
+Index into string table.
+
+##### Field `unit`
+
+Index into string table.
+
+##### Field `aggregation_temporality`
+
+The aggregation temporality of the values, either DELTA or CUMULATIVE. See the
+`AggregationTemporality` enum for details.
+ +#### Message `Sample` + +Each Sample records values encountered in some program +context. The program context is typically a stack trace, perhaps +augmented with auxiliary information like the thread-id, some +indicator of a higher level request being handled etc. + +
+Field Descriptions
+
+##### Field `location_index`
+
+The indices recorded here correspond to locations in Profile.location.
+The leaf is at location_index[0]. [deprecated, superseded by locations_start_index / locations_length]
+
+##### Field `locations_start_index`
+
+locations_start_index along with locations_length refers to a slice of locations in Profile.location.
+Supersedes location_index.
+
+##### Field `locations_length`
+
+locations_length along with locations_start_index refers to a slice of locations in Profile.location.
+Supersedes location_index.
+
+##### Field `stacktrace_id_index`
+
+A 128-bit id that uniquely identifies this stacktrace, globally. Index into string table. [optional]
+
+##### Field `value`
+
+The type and unit of each value are defined by the corresponding
+entry in Profile.sample_type. All samples must have the same
+number of values, the same as the length of Profile.sample_type.
+When aggregating multiple samples into a single sample, the
+result has a list of values that is the element-wise sum of the
+lists of the originals.
+
+##### Field `label`
+
+label includes additional context for this sample. It can include
+things like a thread id, allocation size, etc.
+NOTE: While possible, having multiple values for the same label key is
+strongly discouraged and should never be used. Most tools (e.g. pprof) do
+not have good (or any) support for multi-value labels. An even more
+discouraged case is having a string label and a numeric label of the same
+name on a sample. Again, possible to express, but should not be used.
+[deprecated, superseded by attributes]
+
+##### Field `attributes`
+
+References to attributes in Profile.attribute_table. [optional]
+
+##### Field `link`
+
+Reference to a link in Profile.link_table. [optional]
+
+##### Field `timestamps`
+
+Timestamps associated with the Sample, represented in ms. These timestamps are expected
+to fall within the Profile's time range. [optional]
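+
+To make the `locations_start_index` / `locations_length` indirection concrete, the following non-normative sketch reconstructs a sample's stack trace, reading the slice as a range over `location_indices` (consistent with the Simple Example later in this document):
+
+```python
+# Non-normative sketch: reconstructing a sample's stack trace from the
+# normalized lookup tables. `profile` mirrors the Simple Example payload.
+profile = {
+    "string_table": ["", "foo", "bar", "baz"],
+    "function": [{"name": 1}, {"name": 2}, {"name": 3}],
+    "location": [
+        {"line": [{"function_index": 0}]},
+        {"line": [{"function_index": 1}]},
+        {"line": [{"function_index": 2}]},
+    ],
+    "location_indices": [0, 1, 2],
+    "sample": [{"locations_start_index": 0, "locations_length": 3, "value": [100]}],
+}
+
+
+def stack_for(sample, profile):
+    # Frames are returned in the order they are stored in the profile.
+    start = sample["locations_start_index"]
+    length = sample["locations_length"]
+    frames = []
+    for loc_ref in profile["location_indices"][start:start + length]:
+        location = profile["location"][loc_ref]
+        for line in location["line"]:
+            function = profile["function"][line["function_index"]]
+            frames.append(profile["string_table"][function["name"]])
+    return frames
+
+
+print(stack_for(profile["sample"][0], profile))  # ['foo', 'bar', 'baz']
+```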
+ +#### Message `AttributeUnit` + +Represents a mapping between Attribute Keys and Units. + +
+Field Descriptions + +##### Field `attribute_key` + +Index into string table. + +##### Field `unit` + +Index into string table. +
+ +#### Message `Link` + +A pointer from a profile Sample to a trace Span. +Connects a profile sample to a trace span, identified by unique trace and span IDs. + +
+Field Descriptions + +##### Field `trace_id` + +A unique identifier of a trace that this linked span is part of. The ID is a +16-byte array. + +##### Field `span_id` + +A unique identifier for the linked span. The ID is an 8-byte array. +
+ +#### Message `Location` + +Describes function and line table debug information. + +
+Field Descriptions + +##### Field `id` + +Unique nonzero id for the location. A profile could use +instruction addresses or any integer sequence as ids. [deprecated] + +##### Field `mapping_index` + +The index of the corresponding profile.Mapping for this location. +It can be unset if the mapping is unknown or not applicable for +this profile type. + +##### Field `address` + +The instruction address for this location, if available. It +should be within [Mapping.memory_start...Mapping.memory_limit] +for the corresponding mapping. A non-leaf address may be in the +middle of a call instruction. It is up to display tools to find +the beginning of the instruction if necessary. + +##### Field `line` + +Multiple line indicates this location has inlined functions, +where the last entry represents the caller into which the +preceding entries were inlined. +E.g., if memcpy() is inlined into printf: +line[0].function_name == "memcpy" +line[1].function_name == "printf" + +##### Field `is_folded` + +Provides an indication that multiple symbols map to this location's +address, for example due to identical code folding by the linker. In that +case the line information above represents one of the multiple +symbols. This field must be recomputed when the symbolization state of the +profile changes. + +##### Field `type_index` + +Type of frame (e.g. kernel, native, python, hotspot, php). Index into string table. + +##### Field `attributes` + +References to attributes in Profile.attribute_table. [optional] +
+ +#### Message `Line` + +Details a specific line in a source code, linked to a function. + +
+Field Descriptions + +##### Field `function_index` + +The index of the corresponding profile.Function for this line. + +##### Field `line` + +Line number in source code. + +##### Field `column` + +Column number in source code. +
+ +#### Message `Mapping` + +Describes the mapping of a binary in memory, including its address range, +file offset, and metadata like build ID + +
+Field Descriptions
+
+##### Field `id`
+
+Unique nonzero id for the mapping. [deprecated]
+
+##### Field `memory_start`
+
+Address at which the binary (or DLL) is loaded into memory.
+
+##### Field `memory_limit`
+
+The limit of the address range occupied by this mapping.
+
+##### Field `file_offset`
+
+Offset in the binary that corresponds to the first mapped address.
+
+##### Field `filename`
+
+The object this entry is loaded from. This can be a filename on
+disk for the main binary and shared libraries, or virtual
+abstractions like "[vdso]". Index into string table.
+
+##### Field `build_id`
+
+A string that uniquely identifies a particular program version
+with high probability. E.g., for binaries generated by GNU tools,
+it could be the contents of the .note.gnu.build-id field. Index into string table.
+
+##### Field `build_id_kind`
+
+Specifies the kind of build id. See the BuildIdKind enum for more details. [optional]
+
+##### Field `attributes`
+
+References to attributes in Profile.attribute_table. [optional]
+
+##### Field `has_functions`
+
+Together with `has_filenames`, `has_line_numbers` and `has_inline_frames`,
+indicates the resolution of symbolic info for this mapping.
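+
+As a non-normative illustration of how these fields are typically used during symbolization, the following sketch (with made-up values) checks whether a sampled instruction address belongs to a mapping and translates it into a file-relative offset:
+
+```python
+mapping = {
+    "memory_start": 0x400000,  # load address of the mapped binary
+    "memory_limit": 0x500000,  # end of the mapped address range
+    "file_offset": 0x0,        # file offset of the first mapped address
+    "filename": 5,             # index into string_table, e.g. "/usr/bin/app"
+}
+
+
+def file_offset_for(address, mapping):
+    """Translate a sampled address into an offset within the mapped file."""
+    if not (mapping["memory_start"] <= address < mapping["memory_limit"]):
+        return None  # the address does not belong to this mapping
+    return mapping["file_offset"] + (address - mapping["memory_start"])
+
+
+print(hex(file_offset_for(0x401234, mapping)))  # 0x1234
+```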
+ +#### Message `Function` + +Describes a function, including its human-readable name, system name, +source file, and starting line number in the source. + +
+Field Descriptions
+
+##### Field `id`
+
+Unique nonzero id for the function. [deprecated]
+
+##### Field `name`
+
+Index into string table
+Name of the function, in human-readable form if available.
+
+##### Field `system_name`
+
+Index into string table
+Name of the function, as identified by the system.
+For instance, it can be a C++ mangled name.
+
+##### Field `filename`
+
+Index into string table
+Source file containing the function.
+
+##### Field `start_line`
+
+Line number in source file.
+
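+For example, a fully populated function entry might look like this (the string table indices and values are hypothetical; YAML form for legibility):
+
+```yaml
+- name: 1        # e.g. "github.com/acme/app.handleRequest"
+  system_name: 2 # e.g. a mangled or runtime-internal name
+  filename: 3    # e.g. "app/handler.go"
+  start_line: 17 # line at which the function starts in that file
+```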
+ + + +### Example Payloads + +#### Simple Example + +Considering the following example presented in a modified folded format: + +``` +foo;bar;baz 100 region=us,trace_id=0x01020304010203040102030401020304,span_id=0x9999999999999999 1687841528000000 +foo;bar 200 region=us +``` + +It represents 2 samples: + +* one for stacktrace `foo;bar;baz` with value `100`, attributes `region=us`, linked to trace `0x01020304010203040102030401020304` and span `0x9999999999999999`, and timestamp `1687841528000000` +* one for stacktrace `foo;bar` with value `200`, attributes `region=us`, no link, no timestamp + +The resulting profile in OTLP format would look like this (converted to YAML format for legibility): + +```yaml +resource_profiles: + - resource: + attributes: null + schema_url: todo + scope_profiles: + - profiles: + - profile_id: 0x0102030405060708090a0b0c0d0e0f10 + start_time_unix_nano: 1687841520000000 + end_time_unix_nano: 1687841530000000 + profile: + sample_type: + type: 4 + unit: 5 + aggregation_temporality: 1 + attribute_table: + - key: trace_id + value: + Value: + bytes_value: 0x01020304010203040102030401020304 + - key: span_id + value: + Value: + bytes_value: 0x9999999999999999 + - key: region + value: + Value: + string_value: us + function: + - name: 1 + - name: 2 + - name: 3 + location: + - line: + - function_index: 0 + - line: + - function_index: 1 + - line: + - function_index: 2 + location_indices: + - 0 + - 1 + - 2 + sample: + - locations_start_index: 0 + locations_length: 3 + timestamps: + - 1687841528000000 + value: + - 100 + - locations_start_index: 0 + locations_length: 2 + value: + - 200 + string_table: + - "" + - foo + - bar + - baz + - cpu + - samples +``` + +### Notable Differences Compared to Other Signals + +Due to the increased performance requirements associated with profiles signal, here are some notable differences between profiles signal and other signals. + +#### Relationships Between Messages + +There are two main ways relationships between messages are represented: + +* by embedding a message into another message (standard protobuf way) +* by referencing a message by index (similar to how it's done in pprof) + +Profiling signal is different from most other ones in that we use the referencing technique a lot to represent relationships between messages where there is a lot of duplication happening. This allows to reduce the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such payload. + +This example illustrates the conceptual difference between the two approaches. Note that this example is simplified for clarity and provided in YAML format for legibility: + +```yaml +# denormalized +samples: +- stacktrace: + - foo + - bar + value: 100 + attribute_set: + endpoint: "/v1/users" +- stacktrace: + - foo + - bar + - baz + value: 200 + attribute_set: + endpoint: "/v1/users" + +# normalized +attribute_sets: +- endpoint: "/v1/users" +samples: +- stacktrace: + - foo + - bar + value: 100 + attribute_set_index: 0 +- stacktrace: + - foo + - bar + - baz + value: 200 + attribute_set_index: 0 + +``` + +Explanation: because multiple samples have the same attributes, we can store them in a separate table and reference them by index. This reduces the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such payload. + +Benchmarking shows that this approach is significantly more efficient in terms of CPU utilization, memory consumption and size of the resulting protobuf payload. 
See [Prior art and alternatives](#prior-art-and-alternatives) for more details.
+
+#### Relationship Between Samples and Locations
+
+The relationship between Samples and Locations uses the referencing technique described above. In addition, an extra optimization is applied to further reduce the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such a payload. The technique is based on the fact that many samples share the same locations.
+
+Considering the following example presented in a folded format:
+
+```
+foo;bar;baz 100
+abc;def 200
+foo;bar 300
+```
+
+It represents 3 samples:
+
+* one for stacktrace `foo;bar;baz` with value `100`.
+* one for stacktrace `abc;def` with value `200`.
+* one for stacktrace `foo;bar` with value `300`. Note that 2 of the locations are shared with the first sample.
+
+By storing locations in a separate table and referring to start_index+length in that table, we can reduce the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such a payload. With this approach we can also take advantage of the fact that many samples share the same locations. Below is a representation of the resulting protobuf payload (in YAML format for legibility):
+
+```yaml
+sample:
+  - locations_start_index: 0
+    locations_length: 3
+    value:
+      - 100
+  - locations_start_index: 3
+    locations_length: 2
+    value:
+      - 200
+  - locations_start_index: 0
+    locations_length: 2
+    value:
+      - 300
+location_indices:
+  - 0 # foo
+  - 1 # bar
+  - 2 # baz
+  - 3 # abc
+  - 4 # def
+location:
+  - line:
+      - function_index: 0 # foo
+  - line:
+      - function_index: 1 # bar
+  - line:
+      - function_index: 2 # baz
+  - line:
+      - function_index: 3 # abc
+  - line:
+      - function_index: 4 # def
+function:
+  - name: 1 # foo
+  - name: 2 # bar
+  - name: 3 # baz
+  - name: 4 # abc
+  - name: 5 # def
+string_table:
+  - ""
+  - foo
+  - bar
+  - baz
+  - abc
+  - def
+```
+
+Benchmarking shows that this approach is significantly more efficient in terms of CPU utilization, memory consumption and size of the resulting protobuf payload. See [Prior art and alternatives](#prior-art-and-alternatives) for more details.
+
+## Trade-Offs and Mitigations
+
+The biggest trade-off is between the performance characteristics of the format and its simplicity. The emphasis was placed on performance, which resulted in a cognitively more complex format.
+
+The authors believe the complexity is justified for the following reasons:
+
+* as presented in the [Design Goals](#design-goals) section, the performance characteristics of the format are very important for the profiling signal
+* the format is not intended to be used directly by end users, but rather by developers of profiling systems, who are accustomed to, and expected to be able to handle, this kind of complexity. It is no more complex than other existing formats
+
+Alternative formats that are simpler to understand were considered, but they were not as efficient in terms of CPU utilization, memory consumption and size of the resulting protobuf payload. See the [next chapter, Prior art and alternatives](#prior-art-and-alternatives) for more details.
+
+## Prior Art and Alternatives
+
+This section describes other existing popular formats and alternative representations that were considered in the process of designing this data model.
+
+### Other Popular Formats
+
+Many other popular formats were considered as part of the process of designing this format.
The popularity was assessed based on [data from Profilerpedia website](https://docs.google.com/spreadsheets/d/1UM-WFQhNf4GcyXmluSUGnMbOenvN-TqP2HQC9-Y50Lc/edit?usp=sharing). This chapter describes the most notable formats that were considered. + +#### Folded Stacks + +The [Folded Stacks representation](https://profilerpedia.markhansen.co.nz/formats/folded-stacks/), which is rooted in a straightforward text-based format, presents its own set of limitations. The main one is its inefficiency in handling large datasets, as it can struggle with both the complexity and the volume of data. The format's definition also reveals shortcomings, such as the absence of standardized, machine-readable attributes, resulting in varying interpretations and implementations by different profiling tools and analysis software. + +#### Chromium's Trace Event Format + +The [Chromium Trace Event Format](https://profilerpedia.markhansen.co.nz/formats/trace-event-format/), which employs JSON as its foundation, exhibits certain limitations. Notably, it does not excel in terms of payload size efficiency and lacks robust support for aggregation. Additionally, its specification demonstrates weaknesses, as it lacks machine-readable attributes, leading to distinct interpretations by various implementations, such as Perfetto and Catapult. + +#### Linux perf.data + +The [Linux perf.data format](https://profilerpedia.markhansen.co.nz/formats/linux-perf-data/), primarily aimed at low-level data collection, offers insights at a granularity that may not be suitable for high-level analysis. As it contains events generated by Performance Monitoring Units (PMUs) along with metadata, many of its fields find relevance primarily in data collected at the kernel level. + +#### Java Flight Recorder (JFR) + +[Java Flight Recorder](https://profilerpedia.markhansen.co.nz/formats/jfr/) (JFR) may not be the ideal choice for profiling applications outside the Java ecosystem. Its specialization in Java profiling limits its applicability in environments that rely on other programming languages, rendering it unsuitable for non-Java applications. + +### Alternative Representations + +In the process of refining the data model, multiple alternative representations were considered, including: + +* `pprof` representation is data in original pprof format. +* `denormalized` representation, where all messages are embedded and no references by index are used. This is the simplest representation, but it is also the least efficient (by a huge margin) in terms of CPU utilization, memory consumption and size of the resulting protobuf payload. +* `normalized` representation, where messages that repeat often are stored in separate tables and are referenced by indices. See [this chapter](#relationships-between-messages) for more details. This technique reduces the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such payload. +* `arrays` representation, which is based on `normalized` representation, but uses arrays of integers instead of arrays of structures to represent messages. It further reduces the number of allocations, and the size of the resulting protobuf payload. +* `pprofextended` is a modified `pprof` representation. It is the one presented in this OTEP. + +You can find exact proto definitions for each one [here](https://github.com/open-telemetry/opentelemetry-proto-profile/commit/622c1658673283102a9429109185615bfcfaa78e#diff-a21ad1b0e4735fa9b5085cf46abe16f6c13d1710fd255b15b28adb2493a129bfR1). 
+
+These alternative representations helped us narrow down various techniques for representing profiling data. It was found that all of the representations above are less performant compared to the data model presented in this OTEP. More on that in the next section, [Benchmarking](#benchmarking).
+
+### Benchmarking
+
+The benchmarking was done using Go benchmarking tools.
+As part of the process of designing the data model we ran many benchmarks. All benchmarks follow the same 3-step process:
+
+* Get a profile in pprof format
+* Convert it into a profile in some other format, serialize it to bytes, gzip the result
+* Measure key performance indicators
+
+All benchmarks measured a few key indicators:
+
+* `bytes` — size of the payload after conversion and serialization
+* `gzipped_bytes` — size of the payload after gzip compression
+* `retained_objects` — number of Go runtime objects created and retained after conversion
+* `unique_label_sets` — number of unique label sets in the source pprof file
+* `bytes_allocated` — number of bytes allocated by the Go runtime during conversion
+* `allocs` — number of allocations made by the Go runtime during conversion
+
+`gzipped_bytes` is an important metric since this is a cost center for network traffic.
+
+The `bytes`, `retained_objects`, `bytes_allocated`, and `allocs` metrics are important because they directly affect memory usage as well as garbage collection overhead on data producers and on intermediaries (such as the collector).
+
+The [Benchmarking results](https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit#gid=0) spreadsheet shows the most recent benchmarking results as well as the history of previous benchmarking runs. [Here](https://github.com/petethepig/opentelemetry-collector/pull/1#:~:text=in%20text%20form-,To%20run%20benchmarks%3A,-Clone%20this%20repo) are instructions on how to run the benchmarks. Here's a rough history of the benchmarking our group has done:
+
+* "23 July 2023", "24 Aug 2023" — show differences between the "pprof", "denormalized" and "normalized" representations
+* "04 Oct 2023" — introduces the "pprofextended" representation and compares it to "pprof" and "arrays". Shows that the "pprofextended" version is overall better than "pprof", but not as good as "arrays"
+* "06 Oct 2023" vs "After Stacktrace removal (Oct 6 2023)" — shows the difference between representing stacktraces as a separate struct vs. a separate array of integers. Shows a massive reduction in retained_objects. Demonstrates that after the Stacktrace struct removal the "pprofextended" representation is better than the "arrays" representation
+* "Attribute Representations (Oct 13 2023)" — focuses on differences between attribute representations. Shows that having a lookup table for attributes is optimal compared to other representations
+
+Below you can see the benchmarking results for the most notable example profiles:
+
+#### "average" Profile
+
+The source for this benchmark is a single 10-second pprof profile collected from a simple Go program. It represents a typical profile collected from a running application.
+ +|name|bytes|gzipped_bytes|retained_objects|unique_label_sets|bytes_allocated|allocs| +|---|---|---|---|---|---|---| +|pprof|7,974|3,772|653|1|876,968|824| +|denormalized|83,204|3,844|3,166|1|1,027,424|3,191| +|normalized|7,940|3,397|753|1|906,848|1,815| +|arrays|7,487|3,276|586|1|922,948|2,391| +|pprofextended|7,695|3,347|654|1|899,400|779| + +#### "average" Profile With Timestamps Added to Each Sample + +The source is the same as in the previous example, but this time there were timestamps added to each sample in the profile. + +|name|bytes|gzipped_bytes|retained_objects|unique_label_sets|bytes_allocated|allocs| +|---|---|---|---|---|---|---| +|pprof|9,516|3,787|968|1|898,696|1,568| +|denormalized|121,396|4,233|4,536|1|1,126,280|4,894| +|normalized|9,277|3,394|900|1|925,904|2,632| +|arrays|8,877|3,309|588|1|946,468|3,306| +|pprofextended|8,863|3,387|806|1|919,904|1,476| + +#### "ruby" Profile With Very Deep Stacktraces + +The source for this test is an aggregated pprof profile collected from a Ruby application that has very deep stacktraces. + +|name|bytes|gzipped_bytes|retained_objects|unique_label_sets|bytes_allocated|allocs| +|---|---|---|---|---|---|---| +|pprof|1,869,549|115,289|19,759|1|14,488,578|42,359| +|denormalized|163,107,501|4,716,428|3,840,093|1|319,473,752|3,844,625| +|normalized|1,931,909|130,565|46,890|1|315,003,144|1,725,508| +|arrays|1,868,982|120,298|23,483|1|314,117,160|1,689,537| +|pprofextended|841,957|94,852|19,759|1|20,719,752|33,410| + +#### "large" Profile + +The source for this test is an aggregated pprof profile collected from a Go application over a long period of time (24 hours). + +|name|bytes|gzipped_bytes|retained_objects|unique_label_sets|bytes_allocated|allocs| +|---|---|---|---|---|---|---| +|pprof|2,874,764|1,110,109|350,659|27|27,230,584|470,033| +|denormalized|87,887,253|6,890,103|2,287,303|27|190,243,856|2,325,604| +|normalized|2,528,337|953,211|333,565|27|46,449,824|1,274,000| +|arrays|2,251,355|999,310|213,018|27|60,971,752|1,904,756| +|pprofextended|2,398,961|872,059|274,140|27|38,874,712|353,083| + +#### Conclusions + +After running many benchmarks and analyzing the results, we came to the following conclusions: + +* `pprof` representation is good but lacks deeper integration with OpenTelemetry standards and could be improved in terms of performance. +* `denormalized` representation is significantly more expensive in terms of CPU utilization, memory consumption and size of the resulting protobuf payload compared to `normalized` representation. It is not suitable for production use. +* `normalized` representation is much better than `denormalized` one +* `arrays` representation is generally better than `normalized` one, but introduces significant changes to the data model and is not as easy to understand +* `pprofextended` (the representation that is used in this OTEP) is the perfect mix of performance and simplicity. It is significantly better than `normalized` representation in terms of CPU utilization, memory consumption and size of the resulting protobuf payload, but it is also more similar to original pprof and easier to understand and implement than `arrays` representation. + +### Semantic Conventions + +We plan to leverage OTEL Semantic Conventions for various attributes and enums such as profile types or units. Here's a non-exhaustive list of semantic conventions that are used in data model. It is expected to be polished and extended in the future. + +#### Profile Types + +Here's a list of possible profile types. 
It is not exhaustive, and it is expected that more profile types will be added in the future:
+
+* `cpu`
+* `wall`
+* `goroutines`
+* `alloc_objects`
+* `alloc_space`
+* `inuse_objects`
+* `inuse_space`
+* `mutex_contentions`
+* `mutex_delay`
+* `block_contentions`
+* `block_delay`
+
+#### Profile Units
+
+Here's a list of possible profile units. It is not exhaustive, and it is expected that more units will be added in the future. [UCUM](https://ucum.org/) will be used as a reference for some of these units:
+
+* `bytes`
+* `samples`
+* `ns`
+* `ms`
+* `count`
+
+### Decision Log
+
+There were many other alternatives considered during the design process. See the [Decision Log](https://github.com/open-telemetry/opentelemetry-proto-profile/blob/54bba7a86d839b9d29488de8e22d8c567d283e7b/opentelemetry/proto/profiles/v1/decision-log.md#L0-L1) for more information about the various decisions that were made during the design process.
+
+## Open Questions
+
+Some minor details about the data model are still being discussed and will be fleshed out in future OTEPs. We intend to finalize these details after doing experiments with early versions of working client + collector + backend implementations and getting feedback from the community. The goal of this OTEP is to provide a solid foundation for these experiments and [more](#future-possibilities).
+
+Here's a list of open questions:
+
+### Units in Attributes
+
+The original pprof format allows units to be specified for attributes. The current data model supports a similar concept via the use of the string table (see `attribute_units` in the `Sample` message). It might be a good idea to have the units specified directly in the `KeyValue` message. However, such a change would require changes in virtually all signals and it is not clear if it is worth it. We intend to research this question in the future and, if it is worth it, we will submit a separate OTEP to make this change.
+
+### Timestamps
+
+Although there's support for timestamps in the data model, it is not clear how they should be used, and therefore we expect to make changes to this aspect of the data model in the future after we do more experiments.
+
+### Repetition of Attribute Keys
+
+The original pprof format allows for efficient encoding of repeating attribute keys. For example, if 3 attributes have the same key (e.g. `pid: 1`, `pid: 2`, `pid: 3`), the key is only stored once in the string table. The current data model doesn't support this optimization. We intend to research whether this optimization is truly needed and, if it is, it will be added to the data model in the future.
+
+### Locations Optimization
+
+The [Relationship Between Samples and Locations](#relationship-between-samples-and-locations) section describes the technique that is used to reduce the size of the resulting protobuf payload and the number of objects that need to be allocated to parse such a payload. However, there are concerns that this technique makes the data model more complex, in particular:
+
+* it requires somewhat custom memory management, which means things like array-out-of-bounds errors can go unnoticed
+* it makes it harder to implement a `Merge` function for 2 or more profiles
+
+We intend to research this question in the future while we experiment with early versions of working client + collector + backend implementations. If it turns out not to be worth it, we will submit a separate OTEP to remove this optimization.
+ +## Future Possibilities + +This OTEP enables us to start working on various parts of [OTEL Specification](https://github.com/open-telemetry/opentelemetry-specification): + +* Profiles Data Model +* Profiles API +* Profiles SDK + +That in turn would enable us to start working on: + +* Profiles support in [OTEL Collector](https://github.com/open-telemetry/opentelemetry-collector) +* Client SDK implementations in various languages (e.g Go and Java) diff --git a/oteps/profiles/images/otep0239/profiles-data-model.png b/oteps/profiles/images/otep0239/profiles-data-model.png new file mode 100644 index 00000000000..3c9ac90a2bf Binary files /dev/null and b/oteps/profiles/images/otep0239/profiles-data-model.png differ diff --git a/oteps/trace/0002-remove-spandata.md b/oteps/trace/0002-remove-spandata.md new file mode 100644 index 00000000000..1a7c9b5b61e --- /dev/null +++ b/oteps/trace/0002-remove-spandata.md @@ -0,0 +1,51 @@ +# Remove SpanData + +Remove and replace SpanData by adding span start and end options. + +## Motivation + +SpanData represents an immutable span object, creating a fairly large API for all of the fields (12 to be exact). It exposes what feels like an SDK concern and implementation detail to the API surface. As a user, this is another API I need to learn how to use, and ID generation might also need to be exposed. As an implementer, it is a new data type that needs to be supported. The primary motivation for removing SpanData revolves around the desire to reduce the size of the tracing API. + +## Explanation + +SpanData has a couple of use cases. + +The first use case revolves around creating a span synchronously but needing to change the start time to a more accurate timestamp. For example, in an HTTP server, you might record the time the first byte was received, parse the headers, determine the span name, and then create the span. The moment the span was created isn't representative of when the request actually began, so the time the first byte was received would become the span's start time. Since the current API doesn't allow start timestamps, you'd need to create a SpanData object. The big downside is that you don't end up with an active span object. + +The second use case comes from the need to construct and report out of band spans, meaning that you're creating "custom" spans for an operation you don't own. One good example of this is a span sink that takes in structured logs that contain correlation IDs and a duration (e.g. from splunk) and converts them to spans for your tracing system. Another example is running a sidecar on an HAProxy machine, tailing the request logs, and creating spans. SpanData allows you to report the out of band reporting case, whereas you can’t with the current Span API as you cannot set the start and end timestamp. + +I'd like to propose getting rid of SpanData and `tracer.recordSpanData()` and replacing it by allowing `tracer.startSpan()` to accept a start timestamp option and `span.end()` to accept end timestamp option. This reduces the API surface, consolidating on a single span type. Options would meet the requirements for out of band reporting. + +## Internal details + +`startSpan()` would change so you can include an optional start timestamp, span ID, and resource. When you have a span sink, out of band spans may have different resources than the tracer they are being reported to, so you want to pass an explicit resource. For `span.end()` you would have an optional end timestamp. 
The exact implementation would be language-specific: some languages would use an options pattern with function overloading or variadic parameters, others would add these options to the span builder.
+
+## Trade-offs and mitigations
+
+If the underlying SDK automatically adds tags to spans such as thread-id, stacktrace, and cpu-usage when a span is started, they would be incorrect for out of band spans, as the tracer would not know the difference between in and out of band spans. This can be mitigated by indicating that the span is out of band to prevent attaching incorrect information, possibly with an `isOutOfBand()` option on `startSpan()`.
+
+## Prior art and alternatives
+
+The OpenTracing specification for `tracer.startSpan()` includes an optional start timestamp and zero or more tags. It also calls out an optional end timestamp and bulk logging for `span.end()`.
+
+## Open questions
+
+There also seems to be some hidden dependency between SpanData and the sampler API. For example, given a complete SpanData object with a start and end timestamp, I imagine there's a use case where the sampler can look at that and decide "this took a long time" and sample it. Is this a real use case? Is there a requirement to be able to provide complete span objects to the sampler?
+
+## Future Work
+
+We might want to include attributes as a start option to give the underlying sampler more information to sample with. We also might want to include optional events, which would allow bulk adding of events with explicit timestamps.
+
+We will also want to ensure, assuming the span or subtrace is being created in the same process, that the timestamps use the same precision and are monotonic.
+
+## Related Issues
+
+Removing SpanData would resolve [open-telemetry/opentelemetry-specification#71](https://github.com/open-telemetry/opentelemetry-specification/issues/71).
+
+Options would solve [open-telemetry/opentelemetry-specification#139](https://github.com/open-telemetry/opentelemetry-specification/issues/139).
+
+By removing SpanData, [open-telemetry/opentelemetry-specification#92](https://github.com/open-telemetry/opentelemetry-specification/issues/92) can be resolved and closed.
+
+[open-telemetry/opentelemetry-specification#68](https://github.com/open-telemetry/opentelemetry-specification/issues/68) can be closed. An optional resource can provide a different resource for out of band spans, otherwise the tracer can provide the default resource.
+
+[open-telemetry/opentelemetry-specification#60](https://github.com/open-telemetry/opentelemetry-specification/issues/60) can be closed due to removal of SpanData.
diff --git a/oteps/trace/0006-sampling.md b/oteps/trace/0006-sampling.md
new file mode 100644
index 00000000000..92ec4c9573d
--- /dev/null
+++ b/oteps/trace/0006-sampling.md
@@ -0,0 +1,346 @@
+# Sampling API
+
+## TL;DR
+
+This section tries to summarize all the changes proposed in this RFC:
+
+ 1. Move the `Sampler` interface from the API to the SDK package. Apply some minor changes to the
+    `Sampler` API.
+ 2. Add the capability to record `Attributes` that can be used for the sampling decision during
+    `Span` creation time.
+ 3. Remove `addLink` APIs from the `Span` interface, and allow recording links only during the span
+    construction time.
+ +## Motivation + +Different users of OpenTelemetry, ranging from library developers, packaged infrastructure binary +developers, application developers, operators, and telemetry system owners, have separate use cases +for OpenTelemetry that have gotten muddled in the design of the original Sampling API. Thus, we need +to clarify what APIs each should be able to depend upon, and how they will configure sampling and +OpenTelemetry according to their needs. + +``` + + +----------+ +-----------+ + grpc | Library | | | + Django | Devs +---------->| OTel API | + Express | | +------>| | + +----------+ | +--->+-----------+ +---------+ + | | ^ | OTel | + | | | +->| Proxy +---+ + | | | | | | | + +----------+ | | +-----+-----+------------+ | +---------+ | + | | | | | | OTel Wire | | | + Hbase | Infra | | | | | Export |+-+ v + Envoy | Binary +---+ | | OTel | | | +----v-----+ + | Devs | | | SDK +------------+ | | | + +----------+---------->| | | +---------->| Backend | + +------>| | Custom | +---------->| | + | | | | Export | | +----------+ + +----------+ | | | | |+-+ ^ + | +---+ | +-----------+------------+ | + | App +------+ ^ ^ | + | Devs + | | +------------+-+ + | | | | | | + +----------+ +---+----+ +----------+ Telemetry | + | SRE | | Owner | + | | | | + +--------+ +--------------+ + Lightstep + Honeycomb + +``` + +## Explanation + +We outline five different use cases (who may be overlapping sets of people), and how they should +interact with OpenTelemetry: + +### Library developer + +Examples: gRPC, Express, Django developers. + +* They must only depend upon the OpenTelemetry API and not upon the SDK. + * For testing only they may depend on the SDK with InMemoryExporter. +* They are shipping source code that will be linked into others' applications. +* They have no explicit runtime control over the application. +* They know some signal about what traces may be interesting (e.g. unusual control plane requests) + or uninteresting (e.g. health-checks), but have to write fully generically. + +**Solution:** + +* For the moment, the OpenTelemetry API will not offer any `SamplingHint` functionality for the last use case. +This is intentional to avoid premature optimizations, and it is based on the fact that changing an API is +backwards incompatible compared to adding a new API. + +### Infrastructure package/binary developer + +Examples: HBase, Envoy developers. + +* They are shipping self-contained binaries that may accept YAML or similar run-time configuration, + but are not expected to support extensibility/plugins beyond the default OpenTelemetry SDK, + OpenTelemetry SDKTracer, and OpenTelemetry wire format exporter. +* They may have their own recommendations for sampling rates, but don't run the binaries in + production, only provide packaged binaries. So their sampling rate configs, and sampling strategies + need to be a finite "built in" set from OpenTelemetry's SDK. +* They need to deal with upstream sampling decisions made by services that call them. + +**Solution:** + +* Allow different sampling strategies by default in OpenTelemetry SDK, all configurable easily via + YAML or feature flags. See [default samplers](#default-samplers). + +### Application developer + +These are the folks we've been thinking the most about for OpenTelemetry in general. + +* They have full control over the OpenTelemetry implementation or SDK configuration. When using the + SDK they can configure custom exporters, custom code/samplers, etc. 
+* They can choose to implement runtime configuration via a variety of means (e.g. baking in feature + flags, reading YAML files, etc.), or even configure the library in code. +* They make heavy usage of OpenTelemetry for instrumenting application-specific behavior, beyond + what may be provided by the libraries they use such as gRPC, Django, etc. + +**Solution:** + +* Allow application developers to link in custom samplers or write their own when using the + official SDK. + * These might include dynamic per-field sampling to achieve a target rate + (e.g. ) +* Sampling decisions are made within the start Span operation, after attributes relevant to the + span have been added to the Span start operation but before a concrete Span object exists (so that + either a NoOpSpan can be made, or an actual Span instance can be produced depending upon the + sampler's decision). +* Span.IsRecording() needs to be present to allow costly span attribute/log computation to be + skipped if the span is a NoOp span. + +### Application operator + +Often the same people as the application developers, but not necessarily + +* They care about adjusting sampling rates and strategies to meet operational needs, debugging, + and cost. + +**Solution:** + +* Use config files or feature flags written by the application developers to control the + application sampling logic. +* Use the config files to configure libraries and infrastructure package behavior. + +### Telemetry infrastructure owner + +They are the people who provide an implementation for the OpenTelemetry API by using the SDK with +custom `Exporter`s, `Sampler`s, hooks, etc. or by writing a custom implementation, as well as +running the infrastructure for collecting exported traces. + +* They care about a variety of things, including efficiency, cost effectiveness, and being able to + gather spans in a way that makes sense for them. + +**Solution:** + +* Infrastructure owners receive information attached to the span, after sampling hooks have already + been run. + +## Internal details + +In Dapper based systems (or systems without a deferred sampling decision) all exported spans are +stored to the backend, thus some of these systems usually don't scale to a high volume of traces, +or the cost to store all the Spans may be too high. In order to support this use-case and to +ensure the quality of the data we send, OpenTelemetry needs to natively support sampling with some +requirements: + +* Send as many complete traces as possible. Sending just a subset of the spans from a trace is + less useful because in this case the interaction between the spans may be missing. +* Allow application operator to configure the sampling frequency. + +For new modern systems that need to collect all the Spans and later may or may not make a deferred +sampling decision, OpenTelemetry needs to natively support a way to configure the library to +collect and export all the Spans. This is possible (even though OpenTelemetry supports sampling) by +setting a default config to always collect all the spans. + +### Sampling flags + +OpenTelemetry API has two flags/properties: + +* `RecordEvents` + * This property is exposed in the `Span` interface (e.g. `Span.isRecordingEvents()`). + * If `true` the current `Span` records tracing events (attributes, events, status, etc.), + otherwise all tracing events are dropped. + * Users can use this property to determine if expensive trace events can be avoided. 
+* `SampledFlag` + * This flag is propagated via the `TraceOptions` to the child Spans (e.g. + `TraceOptions.isSampled()`). For more details see the w3c definition [here][trace-flags]. + * In Dapper based systems this is equivalent to `Span` being `sampled` and exported. + +The flag combination `SampledFlag == false` and `RecordEvents == true` means that the current `Span` +does record tracing events, but most likely the child `Span` will not. This combination is +necessary because: + +* Allow users to control recording for individual Spans. +* OpenCensus has this to support z-pages, so we need to keep backwards compatibility. + +The flag combination `SampledFlag == true` and `RecordEvents == false` can cause gaps in the +distributed trace, and because of this OpenTelemetry API should NOT allow this combination. + +It is safe to assume that users of the API should only access the `RecordEvents` property when +instrumenting code and never access `SampledFlag` unless used in context propagators. + +### Sampler interface + +The interface for the Sampler class that is available only in the OpenTelemetry SDK: + +* `TraceID` +* `SpanID` +* Parent `SpanContext` if any +* `Links` +* Span name +* `SpanKind` +* Initial set of `Attributes` for the `Span` being constructed + +It produces an output called `SamplingResult` that includes: + +* A `SamplingDecision` enum [`NOT_RECORD`, `RECORD`, `RECORD_AND_PROPAGATE`]. +* A set of span Attributes that will also be added to the `Span`. + * These attributes will be added after the initial set of `Attributes`. +* (under discussion in separate RFC) the SamplingRate float. + +### Default Samplers + +These are the default samplers implemented in the OpenTelemetry SDK: + +* ALWAYS_ON +* ALWAYS_OFF +* ALWAYS_PARENT + * Trust parent sampling decision (trusting and propagating parent `SampledFlag`). + * For root Spans (no parent available) returns `NOT_RECORD`. +* Probability + * Allows users to configure to ignore the parent `SampledFlag`. + * Allows users to configure if probability applies only for "root spans", "root spans and remote + parent", or "all spans". + * Default is to apply only for "root spans and remote parent". + * Remote parent property should be added to the SpanContext see specs [PR/216][specs-pr-216] + * Sample with 1/N probability + +**Root Span Decision:** + +|Sampler|RecordEvents|SampledFlag| +|---|---|---| +|ALWAYS_ON|`True`|`True`| +|ALWAYS_OFF|`False`|`False`| +|ALWAYS_PARENT|`False`|`False`| +|Probability|`Same as SampledFlag`|`Probability`| + +**Child Span Decision:** + +|Sampler|RecordEvents|SampledFlag| +|---|---|---| +|ALWAYS_ON|`True`|`True`| +|ALWAYS_OFF|`False`|`False`| +|ALWAYS_PARENT|`ParentSampledFlag`|`ParentSampledFlag`| +|Probability|`Same as SampledFlag`|`ParentSampledFlag OR Probability`| + +### Links + +This RFC proposes that Links will be recorded only during the start `Span` operation, because: + +* Link's `SampledFlag` can be used in the sampling decision. +* OpenTracing supports adding references only during the `Span` creation. +* OpenCensus supports adding links at any moment, but this was mostly used to record child Links +which are not supported in OpenTelemetry. +* Allowing links to be recorded after the sampling decision is made will cause samplers to not +work correctly and unexpected behaviors for sampling. + +### When does sampling happen + +The sampling decision will happen before a real `Span` object is returned to the user, because: + +* If child spans are created they need to know the 'SampledFlag'. 
+* If `SpanContext` is propagated on the wire the 'SampledFlag' needs to be set.
+* If the user records any tracing event the `Span` object needs to know if the data are kept or not.
+  It may be possible to always collect all the events until the sampling decision is made but this is
+  an important optimization.
+
+There are two important use-cases to be considered:
+
+* All information that may be used for sampling decisions is available at the moment when the
+  logical `Span` operation should start. This is the most common case.
+* Some information that may be used for the sampling decision is NOT available at the moment when the
+  logical `Span` operation should start (e.g. `http.route` may be determined later).
+
+The current [span creation logic][span-creation] facilitates the first use-case very well, but
+the second use-case requires users to record the logical `start_time` and collect all the
+information necessary to start the `Span` in custom objects, then when all the properties are
+available call the span creation API.
+
+The RFC proposes that we keep the current [span creation logic][span-creation] as it is and we will
+address the delayed sampling in a different RFC when that becomes a high priority.
+
+The SDK must call the `Sampler` every time a `Span` is created during the start span operation.
+
+**Alternatives considerations:**
+
+* We considered offering a delayed span construction mechanism:
+  * For languages where a `Builder` pattern is used to construct a `Span`, to allow users to
+    create a `Builder` where the start time of the Span is considered when the `Builder` is created.
+  * For languages where no intermediate object is used to construct a `Span`, to allow users, maybe
+    via a `StartSpanOption` object, to start a `Span`. The `StartSpanOption` allows users to set all
+    the start `Span` properties.
+  * Pros:
+    * Would resolve the second use-case posted above.
+  * Cons:
+    * We could not identify many real examples of the second use-case and decided to
+      postpone the decision to avoid premature decisions.
+* We considered, instead of requiring that the sampling decision happen before the `Span` is
+  created, adding an explicit `MakeSamplingDecision(SamplingHint)` on the `Span`. Attempts to create
+  a child `Span`, or to access the `SpanContext`, would fail if `MakeSamplingDecision()` had not yet
+  been run.
+  * Pros:
+    * Simplifies the case when all the attributes that may be used for sampling are not available
+      when the logical `Span` operation should start.
+  * Cons:
+    * The most common case would have required an extra API call.
+    * Error-prone: users may forget to call the extra API.
+    * Unexpected and hard-to-find errors if the user tries to create a child `Span` before calling
+      MakeSamplingDecision().
+* We considered allowing the sampling decision to be arbitrarily delayed, but guaranteed before
+  any child `Span` is created, or `SpanContext` is accessed, or before `Span.end()` finishes.
+  * Pros:
+    * Similar and smaller API that supports both use-cases defined above.
+  * Cons:
+    * If `SamplingHint` also needs to be recorded in a delayed fashion then an extra API on Span is
+      required to set this.
+    * Does not allow the optimization of not recording tracing events; all tracing events MUST be
+      recorded before the sampling decision is made.
+
+## Prior art and alternatives
+
+Prior art for Zipkin, and other Dapper based systems: all client-side sampling decisions are made at
+head. Thus, we need to retain compatibility with this.
+ +## Open questions + +This RFC does not necessarily resolve the question of how to propagate sampling rate values between +different spans and processes. A separate RFC will be opened to cover this case. + +## Future possibilities + +In the future, we propose that library developers may be able to defer the decision on whether to +recommend the trace be sampled or not sampled until mid-way through execution; + +## Related Issues + +* [opentelemetry-specification/189](https://github.com/open-telemetry/opentelemetry-specification/issues/189) +* [opentelemetry-specification/187](https://github.com/open-telemetry/opentelemetry-specification/issues/187) +* [opentelemetry-specification/164](https://github.com/open-telemetry/opentelemetry-specification/issues/164) +* [opentelemetry-specification/125](https://github.com/open-telemetry/opentelemetry-specification/issues/125) +* [opentelemetry-specification/87](https://github.com/open-telemetry/opentelemetry-specification/issues/87) +* [opentelemetry-specification/66](https://github.com/open-telemetry/opentelemetry-specification/issues/66) +* [opentelemetry-specification/65](https://github.com/open-telemetry/opentelemetry-specification/issues/65) +* [opentelemetry-specification/53](https://github.com/open-telemetry/opentelemetry-specification/issues/53) +* [opentelemetry-specification/33](https://github.com/open-telemetry/opentelemetry-specification/issues/33) +* [opentelemetry-specification/32](https://github.com/open-telemetry/opentelemetry-specification/issues/32) +* [opentelemetry-specification/31](https://github.com/open-telemetry/opentelemetry-specification/issues/31) + +[trace-flags]: https://github.com/w3c/trace-context/blob/master/spec/20-http_request_header_format.md#trace-flags +[specs-pr-216]: https://github.com/open-telemetry/opentelemetry-specification/pull/216 +[span-creation]: https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/trace/api.md#span-creation diff --git a/oteps/trace/0059-otlp-trace-data-format.md b/oteps/trace/0059-otlp-trace-data-format.md new file mode 100644 index 00000000000..eb72afb8630 --- /dev/null +++ b/oteps/trace/0059-otlp-trace-data-format.md @@ -0,0 +1,371 @@ +# OTLP Trace Data Format + +**Author**: Tigran Najaryan, Splunk + +OTLP Trace Data Format specification describes the structure of the trace data that is transported by OpenTelemetry Protocol (RFC0035). + +## Motivation + +This document is a continuation of OpenTelemetry Protocol RFC0035 and is necessary part of OTLP specification. + +## Explanation + +OTLP Trace Data Format is primarily inherited from OpenCensus protocol. Several changes are introduced with the goal of more efficient serialization. Notable differences from OpenCensus protocol are: + +1. Removed `Node` as a concept. +2. Extended `Resource` to better describe the source of the telemetry data. +3. Replaced attribute maps by lists of key/value pairs. +4. Eliminated unnecessary additional nesting in various values. + +Changes 1-2 are conceptual, changes 3-4 improve performance. + +## Internal details + +This section specifies data format in Protocol Buffers. + +### Resource + +```protobuf +// Resource information. This describes the source of telemetry data. +message Resource { + // labels is a collection of attributes that describe the resource. 
See OpenTelemetry + // specification semantic conventions for standardized label names: + // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/data-resource-semantic-conventions.md + repeated AttributeKeyValue labels = 1; + + // dropped_labels_count is the number of dropped labels. If the value is 0, then + // no labels were dropped. + int32 dropped_labels_count = 2; +} +``` + +### Span + +```protobuf +// Span represents a single operation within a trace. Spans can be +// nested to form a trace tree. Spans may also be linked to other spans +// from the same or different trace and form graphs. Often, a trace +// contains a root span that describes the end-to-end latency, and one +// or more subspans for its sub-operations. A trace can also contain +// multiple root spans, or none at all. Spans do not need to be +// contiguous - there may be gaps or overlaps between spans in a trace. +// +// The next field id is 18. +message Span { + // trace_id is the unique identifier of a trace. All spans from the same trace share + // the same `trace_id`. The ID is a 16-byte array. An ID with all zeroes + // is considered invalid. + // + // This field is semantically required. If empty or invalid trace_id is received: + // - The receiver MAY reject the invalid data and respond with the appropriate error + // code to the sender. + // - The receiver MAY accept the invalid data and attempt to correct it. + bytes trace_id = 1; + + // span_id is a unique identifier for a span within a trace, assigned when the span + // is created. The ID is an 8-byte array. An ID with all zeroes is considered + // invalid. + // + // This field is semantically required. If empty or invalid span_id is received: + // - The receiver MAY reject the invalid data and respond with the appropriate error + // code to the sender. + // - The receiver MAY accept the invalid data and attempt to correct it. + bytes span_id = 2; + + // TraceStateEntry is the entry that is repeated in tracestate field (see below). + message TraceStateEntry { + // key must begin with a lowercase letter, and can only contain + // lowercase letters 'a'-'z', digits '0'-'9', underscores '_', dashes + // '-', asterisks '*', and forward slashes '/'. + string key = 1; + + // value is opaque string up to 256 characters printable ASCII + // RFC0020 characters (i.e., the range 0x20 to 0x7E) except ',' and '='. + // Note that this also excludes tabs, newlines, carriage returns, etc. + string value = 2; + } + + // tracestate conveys information about request position in multiple distributed tracing graphs. + // It is a collection of TracestateEntry with a maximum of 32 members in the collection. + // + // See the https://github.com/w3c/distributed-tracing for more details about this field. + repeated TraceStateEntry tracestate = 3; + + // parent_span_id is the `span_id` of this span's parent span. If this is a root span, then this + // field must be omitted. The ID is an 8-byte array. + bytes parent_span_id = 4; + + // resource that is associated with this span. Optional. If not set, this span + // should be part of a ResourceSpans message that does include the resource information, + // unless resource information is unknown. + Resource resource = 5; + + // name describes the span's operation. + // + // For example, the name can be a qualified method name or a file name + // and a line number where the operation is called. A best practice is to use + // the same display name at the same call point in an application. 
+ // This makes it easier to correlate spans in different traces. + // + // This field is semantically required to be set to non-empty string. + // + // This field is required. + string name = 6; + + // SpanKind is the type of span. Can be used to specify additional relationships between spans + // in addition to a parent/child relationship. + enum SpanKind { + // Unspecified. Do NOT use as default. + // Implementations MAY assume SpanKind to be INTERNAL when receiving UNSPECIFIED. + SPAN_KIND_UNSPECIFIED = 0; + + // Indicates that the span represents an internal operation within an application, + // as opposed to an operations happening at the boundaries. Default value. + INTERNAL = 1; + + // Indicates that the span covers server-side handling of an RPC or other + // remote network request. + SERVER = 2; + + // Indicates that the span describes a request to some remote service. + CLIENT = 3; + + // Indicates that the span describes a producer sending a message to a broker. + // Unlike CLIENT and SERVER, there is often no direct critical path latency relationship + // between producer and consumer spans. A PRODUCER span ends when the message was accepted + // by the broker while the logical processing of the message might span a much longer time. + PRODUCER = 4; + + // Indicates that the span describes consumer receiving a message from a broker. + // Like the PRODUCER kind, there is often no direct critical path latency relationship + // between producer and consumer spans. + CONSUMER = 5; + } + + // kind field distinguishes between spans generated in a particular context. For example, + // two spans with the same name may be distinguished using `CLIENT` (caller) + // and `SERVER` (callee) to identify network latency associated with the span. + SpanKind kind = 7; + + // start_time_unixnano is the start time of the span. On the client side, this is the time + // kept by the local machine where the span execution starts. On the server side, this + // is the time when the server's application handler starts running. + // + // This field is semantically required and it is expected that end_time >= start_time. + // + // This field is required. + int64 start_time_unixnano = 8; + + // end_time_unixnano is the end time of the span. On the client side, this is the time + // kept by the local machine where the span execution ends. On the server side, this + // is the time when the server application handler stops running. + // + // This field is semantically required and it is expected that end_time >= start_time. + // + // This field is required. + int64 end_time_unixnano = 9; + + // attributes is a collection of key/value pairs. The value can be a string, + // an integer, a double or the Boolean values `true` or `false`. Note, global attributes + // like server name can be set using the resource API. Examples of attributes: + // + // "/http/user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36" + // "/http/server_latency": 300 + // "abc.com/myattribute": true + // "abc.com/score": 10.239 + repeated AttributeKeyValue attributes = 10; + + // dropped_attributes_count is the number of attributes that were discarded. Attributes + // can be discarded because their keys are too long or because there are too many + // attributes. If this value is 0, then no attributes were dropped. 
+ int32 dropped_attributes_count = 11; + + // TimedEvent is a time-stamped annotation of the span, consisting of either + // user-supplied key-value pairs, or details of a message sent/received between Spans. + message TimedEvent { + // time_unixnano is the time the event occurred. + int64 time_unixnano = 1; + + // name is a user-supplied description of the event. + string name = 2; + + // attributes is a collection of attribute key/value pairs on the event. + repeated AttributeKeyValue attributes = 3; + + // dropped_attributes_count is the number of dropped attributes. If the value is 0, + // then no attributes were dropped. + int32 dropped_attributes_count = 4; + } + + // timed_events is a collection of TimedEvent items. + repeated TimedEvent timed_events = 12; + + // dropped_timed_events_count is the number of dropped timed events. If the value is 0, + // then no events were dropped. + int32 dropped_timed_events_count = 13; + + // Link is a pointer from the current span to another span in the same trace or in a + // different trace. For example, this can be used in batching operations, + // where a single batch handler processes multiple requests from different + // traces or when the handler receives a request from a different project. + // See also Links specification: + // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/overview.md#links-between-spans + message Link { + // trace_id is a unique identifier of a trace that this linked span is part of. + // The ID is a 16-byte array. + bytes trace_id = 1; + + // span_id is a unique identifier for the linked span. The ID is an 8-byte array. + bytes span_id = 2; + + // tracestate is the trace state associated with the link. + repeated TraceStateEntry tracestate = 3; + + // attributes is a collection of attribute key/value pairs on the link. + repeated AttributeKeyValue attributes = 4; + + // dropped_attributes_count is the number of dropped attributes. If the value is 0, + // then no attributes were dropped. + int32 dropped_attributes_count = 5; + } + + // links is a collection of Links, which are references from this span to a span + // in the same or different trace. + repeated Link links = 14; + + // dropped_links_count is the number of dropped links after the maximum size was + // enforced. If this value is 0, then no links were dropped. + int32 dropped_links_count = 15; + + // status is an optional final status for this span. Semantically when status + // wasn't set it is means span ended without errors and assume Status.Ok (code = 0). + Status status = 16; + + // child_span_count is an optional number of local child spans that were generated while this + // span was active. If set, allows an implementation to detect missing child spans. + int32 child_span_count = 17; +} + +// The Status type defines a logical error model that is suitable for different +// programming environments, including REST APIs and RPC APIs. +message Status { + + // StatusCode mirrors the codes defined at + // https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/api-tracing.md#statuscanonicalcode + enum StatusCode { + Ok = 0; + Cancelled = 1; + UnknownError = 2; + InvalidArgument = 3; + DeadlineExceeded = 4; + NotFound = 5; + AlreadyExists = 6; + PermissionDenied = 7; + ResourceExhausted = 8; + FailedPrecondition = 9; + Aborted = 10; + OutOfRange = 11; + Unimplemented = 12; + InternalError = 13; + Unavailable = 14; + DataLoss = 15; + Unauthenticated = 16; + }; + + // The status code. 
This is optional field. It is safe to assume 0 (OK) + // when not set. + StatusCode code = 1; + + // A developer-facing human readable error message. + string message = 2; +} +``` + +### AttributeKeyValue + +```protobuf +// AttributeKeyValue is a key-value pair that is used to store Span attributes, Resource +// labels, etc. +message AttributeKeyValue { + // ValueType is the enumeration of possible types that value can have. + enum ValueType { + STRING = 0; + BOOL = 1; + INT64 = 2; + DOUBLE = 3; + }; + + // key part of the key-value pair. + string key = 1; + + // The type of the value. + ValueType type = 2; + + // Only one of the following fields is supposed to contain data (determined by `type` field value). + // This is deliberately not using Protobuf `oneof` for performance reasons (verified by benchmarks). + + // A string value. + string string_value = 3; + // A 64-bit signed integer. + int64 int64_value = 4; + // A Boolean value represented by `true` or `false`. + bool bool_value = 5; + // A double value. + double double_value = 6; +} +``` + +## Trade-offs and mitigations + +Timestamps were changed from google.protobuf.Timestamp to a int64 representation in Unix epoch nanoseconds. This change reduces the type-safety but benchmarks show that for small spans there is 15-20% encoding/decoding CPU speed gain. This is the right trade-off to make because encoding/decoding CPU consumption tends to dominate many workloads (particularly in OpenTelemetry Service). + +## Prior art and alternatives + +OpenCensus and Jaeger protocol buffer data schemas were used as the inspiration for this specification. OpenCensus was the starting point, Jaeger provided performance improvement ideas. + +## Open questions + +A follow up RFC is required to define the data format for metrics. + +One of the original aspiring goals for OTLP was to _"support very fast pass-through mode (when no modifications to the data are needed), fast augmenting or tagging of data and partial inspection of data"_. This particular goal was not met directly (although performance improvements over OpenCensus encoding make OTLP more suitable for these tasks). This goal remains a good direction of future research and improvement. + +## Appendix A - Benchmarking + +The following shows [benchmarking of encoding/decoding in Go](https://github.com/tigrannajaryan/exp-otelproto/) using various schemas. + +Legend: + +- OpenCensus - OpenCensus protocol schema. +- OTLP/AttrMap - OTLP schema using map for attributes. +- OTLP/AttrList - OTLP schema using list of key/values for attributes and with reduced nesting for values. +- OTLP/AttrList/TimeWrapped - Same as OTLP/AttrList, except using google.protobuf.Timestamp instead of int64 for timestamps. + +Suffixes: + +- Attributes - a span with 3 attributes. +- TimedEvent - a span with 3 timed events. 
+ +``` +BenchmarkEncode/OpenCensus/Attributes-8 10 605614915 ns/op +BenchmarkEncode/OpenCensus/TimedEvent-8 10 1025026687 ns/op +BenchmarkEncode/OTLP/AttrAsMap/Attributes-8 10 519539723 ns/op +BenchmarkEncode/OTLP/AttrAsMap/TimedEvent-8 10 841371163 ns/op +BenchmarkEncode/OTLP/AttrAsList/Attributes-8 50 128790429 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimedEvent-8 50 175874878 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/Attributes-8 50 153184772 ns/op +BenchmarkEncode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 30 232705272 ns/op +BenchmarkDecode/OpenCensus/Attributes-8 10 644103382 ns/op +BenchmarkDecode/OpenCensus/TimedEvent-8 5 1132059855 ns/op +BenchmarkDecode/OTLP/AttrAsMap/Attributes-8 10 529679038 ns/op +BenchmarkDecode/OTLP/AttrAsMap/TimedEvent-8 10 867364162 ns/op +BenchmarkDecode/OTLP/AttrAsList/Attributes-8 50 228834160 ns/op +BenchmarkDecode/OTLP/AttrAsList/TimedEvent-8 20 321160309 ns/op +BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/Attributes-8 30 277597851 ns/op +BenchmarkDecode/OTLP/AttrAsList/TimeWrapped/TimedEvent-8 20 443386880 ns/op +``` + +The benchmark encodes/decodes 1000 batches of 100 spans, each span containing 3 attributes or 3 timed events. The total uncompressed, encoded size of each batch is around 20KBytes. + +The results show OTLP/AttrList is 5-6 times faster than OpenCensus in encoding and about 3 times faster in decoding. + +Using google.protobuf.Timestamp instead of int64-encoded unix timestamp results in 1.18-1.32 times slower encoding and 1.21-1.38 times slower decoding (depending on what the span contains). diff --git a/oteps/trace/0136-error_flagging.md b/oteps/trace/0136-error_flagging.md new file mode 100644 index 00000000000..d55d928bf7b --- /dev/null +++ b/oteps/trace/0136-error_flagging.md @@ -0,0 +1,93 @@ +# Error Flagging with Status Codes + +This proposal reduces the number of status codes to three, adds a new field to identify status codes set by application developers and operators, and adds a mapping of semantic conventions to status codes. This clarifies how error reporting should work in OpenTelemetry. + +Note: The term **end user** in this document is defined as the application developers and operators of the system running OpenTelemetry. The term **instrumentation** refers to [instrumentation libraries](https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/glossary.md#instrumentation-library) for common code shared between different systems, such as web frameworks and database clients. + +## Motivation + +Error reporting is a fundamental use case for distributed tracing. While we prefer that error flagging occurs within analysis tools, and not within instrumentation, a number of currently supported analysis tools and protocols rely on the existence of an explicit error flag reported from instrumentation. In OpenTelemetry, the error flag is called "status codes". + +However, there is confusion over the mapping of semantic conventions to status codes, and concern over the subjective nature of errors. Which network failures count as an error? Are 404s an error? The answer is often dependent on the situation, but without even a baseline of suggested status codes for each convention, the instrumentation author is placed under the heavy burden of making the decision. Worse, the decisions will not be in sync across different instrumentation. + +There is one other missing piece, required for proper error flagging. 
Both application developers and operators have a deep understanding of what constitutes an error in their system. OpenTelemetry must provide a way for these users to control error flagging, and explicitly indicate that it is the end user setting the status code, and not instrumentation. In these specific cases, the error flagging is known to be correct: the end user has decided the status of the span, and they do not want another interpretation. + +While generic instrumentation can only provide a generic schema, end users are capable of making subjective decisions about their systems. And, as the end user, they should get to have the final call in what constitutes an error. In order to accomplish this, there must be a way to differentiate between errors flagged by instrumentation, and errors flagged by the end user. + +## Explanation + +The following changes add several missing features required for proper error reporting, and are completely backwards compatible with OpenTelemetry today. + +### Status Codes + +Currently, OpenTelemetry does not have a use case for differentiating between different types of errors. However, this use case may appear in the future. For now, we would like to reduce the number of status codes, and then add them back in as the need becomes clear. We would also like to differentiate between status codes which have not been +set, and an explicit OK status set by an end user. + +* `UNSET` is the default status code. +* `ERROR` represents all error types. +* `OK` represents a span which has been explicitly marked as being free of errors, and should not be counted against an error budget. Note that only end users should set this status. Instead, instrumentation should leave the status as `UNSET` for operations that do not generate an error. + +### `Status Source` + +A new Status Source field identifies the origin of the status code on the span. This is important, as statuses set by application developers and operators have been confirmed by the end user to be correct to the particular situation. Statuses set by instrumentation, on the other hand, are only following a generic schema. + +* `INSTRUMENTATION` is the default source. This is used for instrumentation contained within shared code, such as OSS libraries and frameworks. All instrumentation plugins shipped with OpenTelemetry use this status code. +* `USER` identifies statuses set by application developers or operators, either in application code or the collector. + +Analysis tools MAY disregard status codes, in favor of their own approach to error analysis. However, it is strongly suggested that analysis tools SHOULD pay attention to the status codes when set by `USER`, as it is a communication from the application developer or operator and contains valuable information. + +### Status Mapping Schema + +As part of the specification, OpenTelemetry provides a mapping of semantic conventions to status codes. This removes any ambiguity as to what OpenTelemetry ships with out of the box. + +Including the correct status codes as part of our semantic conventions will help ensure our instrumentation is consistent when errors relate to a cross-language concept, such as a database protocol. + +Please note that semantic conventions, and thus status mapping from conventions, are still a work in progress and will continue to change after GA. + +### Status Processor + +The collector will provide a processor and a configuration language to make adjustments to this status mapping schema. 
This provides the flexibility and customization needed for real world scenarios. + +### Convenience methods + +As a convenience, OpenTelemetry provides helper functions for adding semantic conventions and exceptions to a span. These helper functions will also set the correct status code. This simplifies the life of the instrumentation author, and helps ensure compliance and data quality. + +Note that these convenience methods simply wire together multiple API calls. They should live in a helper package, and should not be directly added to existing API interfaces. Given how many semantic conventions we have, there will be a pile of them. + +## Internal details + +This proposal is mostly backwards compatible with existing code, protocols, and the OpenTracing bridge. The only potential exception is the removal of status code enums from the current OTLP protocol, and the rewriting of the small number of instrumentation that were making use of them. + +## BUT ERRORS ARE SUBJECTIVE!! HOW CAN WE KNOW WHAT IS AN ERROR? WHO ARE WE TO DEFINE THIS? + +First of all, every tracing system to-date comes with a default set of errors. No system requires that end users start completely from scratch. So... be calm!! Have faith!! + +While flagging errors can be a subjective decision, it is true that many semantic conventions qualify as an error. By providing a default mapping of semantic conventions to errors, we ensure compatibility with existing analysis tools (e.g. Jaeger), and provide guidance to users and future implementers. + +Obviously, all systems are different, and users will want to adjust error reporting on a case by case basis. Unwanted errors may be suppressed, and additional errors may be added. The collector will provide a processor and a configuration language to make this a straightforward process. Working from a baseline of standard errors will provide a better experience than having to define a schema from scratch. + +Note that analysis tools MAY disregard Span Status, and do their own error analysis. There is no requirement that the status code is respected, even when Status Source is set. However, it is strongly suggested that analysis tools SHOULD pay attention to the status code when Status Source is set, as it represents a subjective decision made by either the operator or application developer. + +## Remind me why we need status codes again? + +Status codes provide a low overhead mechanism for checking if a span counts against an error budget, without having to scan every attribute and event. It is an inexpensive and low cardinality approach to track multiple types of error budgets. This reduces overhead and could be a benefit for many systems. + +However, adding in an existing set of error types without first clearly defining their use and how they might be set has caused confusion. If the status codes are not set consistently and correctly, then the resulting error budgeting will not be useful. So we are consolidating all error types into a single ERROR type, to avoid this situation. We may add more error types back in if we can agree on their use cases and a method for applying them consistently. + +## Open questions + +If we add error processing to the Collector, it is unclear what the overhead would be. + +It is also unclear what the cost is for backends to scan for errors on every span, without a hint from instrumentation that an error might be present. + +## Prior art and alternatives + +In OpenTracing, the lack of a Collector and status mapping schema proved to be unwieldy. 
It placed a burden on instrumentation plugin authors to set the error flag correctly, and led to an explosion of non-standardized configuration options in every plugin just to adjust the default error flagging. This in turn placed a configuration burden on application developers. + +An alternative is the `error.hint` proposal, paired with the removal of status code. This would work, but essentially provides the same mechanism provided in this proposal, only with a large number of breaking changes. It also does not address the need for user overrides. + +## Future Work + +The inclusion of status codes and status mappings help the OpenTelemetry community speak the same language in terms of error reporting. It lifts the burden on future analysis tools, and (when respected) it allows users to employ multiple analysis tools without having to synchronize an important form of configuration across multiple tools. + +In the future, OpenTelemetry may add a control plane which allows dynamic configuration of the status mapping schema. diff --git a/oteps/trace/0168-sampling-propagation.md b/oteps/trace/0168-sampling-propagation.md new file mode 100644 index 00000000000..2d8db37cef7 --- /dev/null +++ b/oteps/trace/0168-sampling-propagation.md @@ -0,0 +1,489 @@ +# Propagate parent sampling probability + +Use the W3C trace context to convey consistent parent sampling probability. + +## Motivation + +The parent sampling probability is the probability associated with +the start of a trace context that was used to determine whether the +W3C `sampled` flag is set, which determines whether child contexts +will be sampled by a `ParentBased` Sampler. It is useful to know the +parent sampling probability associated with a context in order to +build span-to-metrics pipelines when the built-in `ParentBased` +Sampler is used. Further motivation for supporting span-to-metrics +pipelines is presented in [OTEP +170](0170-sampling-probability.md). + +A consistent trace sampling decision is one that can be carried out at +any node in a trace, which supports collecting partial traces. +OpenTelemetry specifies a built-in `TraceIDRatioBased` Sampler that +aims to accomplish this goal but was left incomplete (see a +[TODO](../../specification/trace/sdk.md#traceidratiobased) +in the v1.0 Trace specification). + +We propose a Sampler option to propagate the necessary information +alongside the [W3C sampled flag](https://www.w3.org/TR/trace-context/#sampled-flag) +using `tracestate` with an `ot` vendor tag, which will require +(separately) [specifying how the OpenTelemetry project uses +`tracestate` +itself](https://github.com/open-telemetry/opentelemetry-specification/pull/1852). + +## Explanation + +Two pieces of information are needed to convey consistent parent sampling probability: + +1. p-value representing the parent sampling probability. +2. r-value representing the "randomness" as the source of consistent sampling decisions. + +This proposal uses 6 bits of information to propagate each of these +and does not depend on built-in TraceID randomness, which is not +sufficiently specified for probability sampling at this time. This +proposal closely follows [research by Otmar +Ertl](https://arxiv.org/pdf/2107.07703.pdf). + +### Adjusted count + +The concept of adjusted count is introduced in [OTEP +170](./0170-sampling-probability.md). 
Briefly, adjusted count is defined +in terms of the sampling probability, where: + +| Sampling probability | Adjusted count | Notes | +| -- | -- | -- | +| `probability` != 0 | `adjusted_count` = `1/probability` | For spans selected with non-zero probability, adjusted count is the inverse of their sampling probability. | +| `probability` == 0 | `adjusted_count` = 0 | For spans that were not selected by a probability sampler, adjusted count is zero. | + +The term is used to convey the representivity of an item that was (or +was not) selected by a probability sampler. Items that are not +selected by a probability sampler are logically assigned zero adjusted +count, such that if they are recorded for any other reason they do not +introduce bias in the estimated count of the total span population. + +### p-value + +To limit the cost of this extension and for statistical reasons +documented below, we propose to limit parent sampling probability +to powers of two. This limits the available parent sampling +probabilities to 1/2, 1/4, 1/8, and so on. We can compactly encode +these probabilities as small integer values using the base-2 logarithm +of the adjusted count. + +Using six bits of information we can convey known sampling rates as +small as 2**-62. The value 63 is reserved to mean sampling with +probability 0, which conveys an adjusted count of 0 for the associated +context. + +When propagated, the "p-value" as it is known will be interpreted as +shown in the following table. The p-value for known sampling +probabilities is the negative base-2 logarithm of the probability: + +| p-value | Parent Probability | +| ----- | ----------- | +| 0 | 1 | +| 1 | 1/2 | +| 2 | 1/4 | +| ... | ... | +| N | 2**-N | +| ... | ... | +| 61 | 2**-61 | +| 62 | 2**-62 | +| 63 | 0 | + +[As specified in OTEP 170 for the Trace data +model](0170-sampling-probability.md), +parent sampling probability can be stored in exported Span data to +enable span-to-metrics pipelines to be built. Because `tracestate` is +already encoded in the OpenTelemetry Span, this proposal is requires +no changes to the Span protocol. Accepting this proposal means the +p-value can be derived from `tracestate` when the parent sampling +probability is known. + +An unknown value for `p` cannot be propagated using `tracestate` +explicitly, simply omitting `p` conveys an unknown parent sampling +probability. + +### r-value + +With parent sampling probabilities limited to powers of two, the +amount of randomness needed per trace context is limited. A +consistent sampling decision is accomplished by propagating a specific +random variable known as the r-value. + +To develop an intuition for r-values, consider a scenario where every +bit of the `TraceID` is generated by a uniform random bit generator +(i.e., every bit is 0 or 1 with equal probability). An 128-bit +`TraceID` can therefore be treated as a 128-bit unsigned integer, +which can be mapped into a fraction with range [0, 1) by dividing by +2**128, a form known as the TraceID-ratio. Now, probability sampling +could be achieved by comparing the TraceID-ratio with the sampling +probability, setting the `sampled` flag when TraceID-ratio is less +than the sampling probability. + +It is easy to see that with sampling probability 1, all TraceIDs will +be accepted because TraceID ratios are exclusively less than 1. +Sampling with probability 50% will select TraceID ratios less than +0.5, which maps to all TraceIDs less than 2**127 or, equivalently, all +TraceIDs where the most significant bit is zero. 
By the same logic,
+sampling with probability 25% means accepting TraceIDs where the most
+significant two bits are zero. In general, sampling with exact
+probability `2**-S` is equivalent to selecting TraceIDs with `S`
+leading zeros in this example scenario.
+
+The r-value specified here directly describes the number of leading
+zeros in a random 62-bit string, specified in a way that does not
+require TraceID values to be constructed with random bits in specific
+positions or with hard requirements on their uniformity. In
+mathematical terms, the r-value is described by a truncated geometric
+distribution, listed below:
+
+| `r` value | Probability of `r` value | Implied sampling probabilities |
+| ---------------- | ------------------------ | ---------------------- |
+| 0 | 1/2 | 1 |
+| 1 | 1/4 | 1/2 and above |
+| 2 | 1/8 | 1/4 and above |
+| 3 | 1/16 | 1/8 and above |
+| ... | ... | ... |
+| 0 <= r <= 61 | 2**-(r+1) | 2**-r and above |
+| ... | ... | ... |
+| 59 | 2**-60 | 2**-59 and above |
+| 60 | 2**-61 | 2**-60 and above |
+| 61 | 2**-62 | 2**-61 and above |
+| 62 | 2**-62 | 2**-62 and above |
+
+Such a random variable `r` can be generated using efficient
+instructions on modern computer architectures, for example, we may
+compute the number of leading zeros using hardware support:
+
+```golang
+import (
+	"math/rand"
+	"math/bits"
+)
+
+func nextRValueLeading() int {
+	x := uint64(rand.Int63()) // 63 least-significant bits are random
+	y := x << 1 | 0x3         // 62 most-significant bits are random
+	return bits.LeadingZeros64(y)
+}
+```
+
+Or we may compute the number of trailing zeros instead, for example
+(not using special instructions):
+
+```golang
+import (
+	"math/rand"
+)
+
+func nextRValueTrailing() int {
+	x := uint64(rand.Int63())
+	for r := 0; r < 62; r++ {
+		if x & 0x1 == 0x1 {
+			return r
+		}
+		x = x >> 1
+	}
+	return 62
+}
+```
+
+More examples for calculating r-values are shown
+[here](https://gist.github.com/jmacd/79c38c1056035c52f6fff7b7fc071274).
+For example, the value 3 means there were three leading zeros and
+corresponds with being sampled at probabilities 1-in-1 through 1-in-8
+but not at probabilities 1-in-16 and smaller.
+
+### Proposed `tracestate` syntax
+
+The consistent sampling r-value (`r`) and the parent sampling
+probability p-value (`p`) will be propagated using two bytes of base16
+content for each of the two fields, as follows:
+
+```
+tracestate: ot=p:PP;r:RR
+```
+
+where `PP` are two bytes of base16 p-value and `RR` are two bytes of
+base16 r-value. These values are omitted when they are unknown.
+
+This proposal should be taken as a recommendation and will be modified
+to [match whatever format OpenTelemetry specifies for its
+`tracestate`](https://github.com/open-telemetry/opentelemetry-specification/pull/1852).
+The choice of base16 encoding is therefore just a recommendation,
+chosen because `traceparent` uses base16 encoding.
+
+### Examples
+
+The following `tracestate` value is accompanied by `sampled=true`:
+
+```
+tracestate: ot=r:0a;p:03
+```
+
+and translates to
+
+```
+base16(p-value) = 03 // 1-in-8 parent sampling probability
+base16(r-value) = 0a // qualifies for 1-in-1024 or greater probability consistent sampling
+```
+
+A `ParentBased` Sampler will include `ot=r:0a;p:03` in the stored
+`TraceState` field, allowing consumers to count it with an adjusted
+count of 8 spans. The `sampled=true` flag remains set.
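+
+For illustration, a consumer could recover the p-value and the
+corresponding adjusted count from such a `tracestate` entry with a few
+lines of code. This is a minimal sketch that assumes the base16
+recommendation above; the helper names are hypothetical and not part
+of this proposal:
+
+```golang
+import (
+	"strconv"
+	"strings"
+)
+
+// parseOTField extracts one two-byte base16 field (e.g., "p" or "r")
+// from the value of the "ot" tracestate entry, such as "r:0a;p:03".
+// The boolean result is false when the field is absent or malformed.
+func parseOTField(otValue, field string) (int, bool) {
+	for _, kv := range strings.Split(otValue, ";") {
+		parts := strings.SplitN(kv, ":", 2)
+		if len(parts) == 2 && parts[0] == field {
+			v, err := strconv.ParseInt(parts[1], 16, 32)
+			if err != nil {
+				return 0, false
+			}
+			return int(v), true
+		}
+	}
+	return 0, false
+}
+
+// adjustedCount maps a p-value to its adjusted count: 2**p for p in
+// [0, 62], and 0 for the reserved value 63 (zero probability).
+func adjustedCount(p int) uint64 {
+	if p == 63 {
+		return 0
+	}
+	return uint64(1) << uint(p)
+}
+```
+
+With the example above, `parseOTField("r:0a;p:03", "p")` returns 3 and
+`adjustedCount(3)` returns 8.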
+ +A `TraceIDRatioBased` Sampler configured with probability 2**-10 or +greater will enable `sampled=true` and convey a new parent sampling +probability via `tracestate: ot=r:0a;p:0a`. + +A `TraceIDRatioBased` Sampler configured with probability 2**-11 or +smaller will set `sampled=false` and remove `p` from the tracestate, +setting `tracestate: ot=r:0a`. + +## Internal details + +The reasoning behind restricting the set of sampling rates is that it: + +- Lowers the cost of propagating parent sampling probability +- Limits the number of random bits required +- Avoids floating-point to integer rounding errors +- Makes math involving partial traces tractable. + +[An algorithm for making statistical inference from partially-sampled +traces has been published](https://arxiv.org/pdf/2107.07703.pdf) that +explains how to work with a limited number of power-of-2 sampling rates. + +### Behavior of the `TraceIDRatioBased` Sampler + +The Sampler MUST be configured with a power-of-two probability +expressed as `2**-s` with s being an integer in the range [0, 62] +except for the special case of zero probability (in which case `p=63` +is used). + +If the context is a new root, the initial `tracestate` must be created +with randomness value `r`, as described above, in the range [0, 62]. +If the context is not a new root, output a new `tracestate` with the +same `r` value as the parent context. + +In both cases, set the sampled bit if the outgoing `p` is less than or +equal to the outgoing `r` (i.e., `p <= r`). + +When sampled, in both cases, the context's p-value `p` is set to the +value of `s` in the range [0, 62]. If the sampling probability is +zero (the special case where `s` is undefined), use `p=63` the +specified value for zero probability. + +If the context is not a new root and the incoming context's r-value +is not set, the implementation SHOULD notify the user of an error +condition and follow the incoming context's `sampled` flag. + +### Behavior of the `ParentBased` sampler + +The `ParentBased` sampler is unmodified by this proposal. It honors +the W3C `sampled` flag and copies the incoming `tracestate` keys to +the child context. If the incoming context has known parent sampling +probability, so does the Span. + +The span's parent sampling probability is known when both `p` and `r` +are defined in the `ot` sub-key of `tracestate`. When `r` or `p` are +not defined, the span's parent sampling probability is unknown. + +### Behavior of the `AlwaysOn` Sampler + +The `AlwaysOn` Sampler behaves the same as `TraceIDRatioBased` with +100% sampling probability (i.e., `p=1`). + +### Behavior of the `AlwaysOff` Sampler + +The `AlwaysOff` Sampler behaves the same as `TraceIDRatioBased` with +zero probability (i.e., `p=63`). + +## Worked 3-bit example + +The behavior of these tables can be verified by hand using a smaller +example. The following table shows how these equations work where +`r`, `p`, and `s` are limited to 3 bits instead of 6 bits. + +Values of `p` are interpreted as follows: + +| `p` value | Adjusted count | +| ----- | ----- | +| 0 | 1 | +| 1 | 2 | +| 2 | 4 | +| 3 | 8 | +| 4 | 16 | +| 5 | 32 | +| 6 | 64 | +| 7 | 0 | + +Note there are only seven known non-zero values for the adjusted count +(`p`) ranging from 1 to 64. Thus there are seven defined values of `r` +and `s`. 
The following table shows `r` and the corresponding +selection probability, along with the calculated adjusted count for +each `s`: + +| `r` value | probability of `r` | `s=0` | `s=1` | `s=2` | `s=3` | `s=4` | `s=5` | `s=6` | +| -- | -- | -- | -- | -- | -- | -- | -- | -- | +| 0 | 1/2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | +| 1 | 1/4 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | +| 2 | 1/8 | 1 | 2 | 4 | 0 | 0 | 0 | 0 | +| 3 | 1/16 | 1 | 2 | 4 | 8 | 0 | 0 | 0 | +| 4 | 1/32 | 1 | 2 | 4 | 8 | 16 | 0 | 0 | +| 5 | 1/64 | 1 | 2 | 4 | 8 | 16 | 32 | 0 | +| 6 | 1/64 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | + +Notice that the sum of `r` probability times adjusted count in each of +the `s=*` columns equals 1. For example, in the `s=4` column we have +`0*1/2 + 0*1/4 + 0*1/8 + 0*1/16 + 16*1/32 + 16*1/64 + 16*1/64 = 1/2 + +1/4 + 1/4 = 1`. In the `s=2` column we have `0*1/2 + 0*1/4 + 4*1/8 + +4*1/16 + 4*1/32 + 4*1/64 + 4*1/64 = 1/2 + 1/4 + 1/8 + 1/16 + 1/16 = 1`. +We conclude that when `r` is chosen with the given probabilities, +any choice of `s` produces one expected span. + +## Invariant checking + +The following table summarizes how the three Sampler cases behave with +respect to the incoming and outgoing values for `p`, `r`, and +`sampled`: + +| Sampler | Incoming `r` | Incoming `p` | Incoming `sampled` | Outgoing `r` | Outgoing `p` | Outgoing `sampled` | +| -- | -- | -- | -- | -- | -- | -- | +| Parent | unused | expected | respected | checked and passed through | checked and passed through | checked and passed through | +| TraceIDRatio(Non-Root) | used | unused | ignored | checked and passed through | set to `s` | set to `p <= r` | +| TraceIDRatio(Root) | n/a | n/a | n/a | random variable | set to `s` | set to `p <= r` | + +There are several cases where the resulting span's parent sampling +probability is unknown: + +| Sampler | Unknown condition | +| -- | -- | +| Parent | no incoming `p` | +| TraceIDRatio(Non-Root) | no incoming `r` | +| TraceIDRatio(Root) | none | + +The inputs are recognized as out-of-range as follows: + +| Range invariate | Remedy | +| -- | -- | +| `p < 0` | drop `p` from tracestate | +| `p > 63` | drop `p` from tracestate | +| `r < 0` | drop `r` and `p` from tracestate | +| `r > 62` | drop `r` and `p` from tracestate | + +There are cases where the combination of `p` and `r` and `sampled` are +inconsistent with each other. The `sampled` flag is equivalent to the +expression `p <= r`. When the invariant `sampled <=> p <= r` is +violated, the `ParentBased` sampler MUST correct the propagated values +as discussed below. + +The violation is always addressed by honoring the `sampled` flag and +correcting `p` to either 63 (for zero adjusted count) or unset (for +unknown parent sampling probability). + +If `sampled` is false and the invariant is violated, drop `p` from the +outgoing context to convey unknown parent sampling probability. + +The case where `sampled` is true with `p=63` indicating 0% probability +may by regarded as a special case to allow zero adjusted count +sampling, which permits non-probabilistic sampling to take place in +the presence of probability sampling. Set `p` to 63. + +If `sampled` is true with `p<63` (but `p>r`), drop `p` from the +outgoing context to convey unknown parent sampling probability. + +## Prototype + +[This proposal has been prototyped in the OTel-Go +SDK.](https://github.com/open-telemetry/opentelemetry-go/pull/2177) No +changes in the OTel-Go Tracing SDK's `Sampler` or `tracestate` APIs +were needed. 
+ +## Trade-offs and mitigations + +### Naming question + +This proposal changes the logic of the `TraceIDRatioBased` sampler, +currently part of the OpenTelemetry specification, in a way that makes +the name no longer meaningful. The proposed sampler may be named +`ConsistentSampler` and the existing `TraceIDRatioBased` sampler can +be deprecated. + +Many SDKs already implement the `TraceIDRatioBased` sampler and it has +been used for probability sampling at trace roots with arbitrary +(i.e., not power-of-two) probabilities. Because of this, we may keep +the current (under-specified) `TraceIDRatioBased` sampler and rename +it `ProbabilitySampler` to convey that it does behave in a specified +way with respect to the bits of the TraceID. + +### Not using TraceID randomness + +It would be possible, if TraceID were specified to have at least 62 +uniform random bits, to compute the randomness value described above +as the number of leading zeros among those 62 random bits. + +However, this would require modifying the W3C traceparent specification, +therefore we do not propose to use bits of the TraceID. + +See [W3C +trace context issue 467](https://github.com/w3c/trace-context/issues/467). + +### Not using TraceID hashing + +It would be possible to make a consistent sampling decision by hashing +the TraceID, but we feel such an approach is not sufficient for making +unbiased sampling decisions. It is seen as a relatively difficult +task to define and specify a good enough hashing function, much less +to have it implemented in multiple languages. + +Hashing is also computationally expensive. This proposal uses extra +data to avoid the computational cost of hashing TraceIDs. + +### Restriction to power-of-two + +Restricting parent sampling probabilities to powers of two does not limit tail +Samplers from using arbitrary probabilities. The companion [OTEP +170](https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md) has discussed +the use of a `sampler.adjusted_count` attribute that would not be +limited to power-of-two values. Discussion about how to represent the +effective adjusted count for tail-sampled Spans belongs in [OTEP +170](https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md), not this OTEP. + +Restricting parent sampling probabilities to powers of two does not limit +Samplers from using arbitrary effective probabilities over a period of +time. For example, a typical trace sampling rate of 5% (i.e., 1 in +20) can be accomplished by choosing 1/16 sampling 60% of the time and +1/32 sampling 40% of the time: + +``` +1/16 * 0.6 + 1/32 * 0.4 = 0.05 +``` + +### Propagating `p` when unsampled + +Consistent trace sampling requires the `r` value to be propagated even +when the span itself is not sampled. It is not necessary, however, to +propagate the `p` value when the context is not sampled, since +`ParentBased` samplers will not change the decision. Although one +use-case was docmented in Google's early Dapper system (known as +"inflationary sampling", see +[OTEP 170](https://github.com/open-telemetry/oteps/blob/main/text/trace/0170-sampling-probability.md#dappers-inflationary-sampler)), the same effect can +be achieved using a consistent sampling decision in this framework. + +### Default behavior + +In order for consistent trace sampling decisions to be made, the `r` +value MUST be set at the root of the trace. This behavior could be +opt-in or opt-out. 
If opt-in, users would have to enable the setting +of `r` and the setting and propagating of `p` in the tracestate. If +opt-out, users would have to disable these features to turn them off. +The cost and convenience of Sampling features depend on this choice. + +This author's recommendation is that these behaviors be opt-in at +first in order to demonstrate their usefulness. If it proves +successful, an on-by-default approach could be proposed using a +modified W3C trace context `traceparent`, as this would allow p-values +to be propagated cheaply. + +See [W3C issue trace context issue +463](https://github.com/w3c/trace-context/issues/463) which is about +propagating sampling probability in the `traceparent` header, which +makes it cheap enough to have on-by-default. diff --git a/oteps/trace/0170-sampling-probability.md b/oteps/trace/0170-sampling-probability.md new file mode 100644 index 00000000000..8f325323291 --- /dev/null +++ b/oteps/trace/0170-sampling-probability.md @@ -0,0 +1,881 @@ +# Probability sampling of telemetry events + + + +- [Motivation](#motivation) +- [Examples](#examples) + * [Span sampling](#span-sampling) + + [Sample spans to Counter Metric](#sample-spans-to-counter-metric) + + [Sample spans to Histogram Metric](#sample-spans-to-histogram-metric) + + [Sample span rate limiting](#sample-span-rate-limiting) +- [Explanation](#explanation) + * [Model and terminology](#model-and-terminology) + + [Sampling without replacement](#sampling-without-replacement) + + [Adjusted sample count](#adjusted-sample-count) + + [Sampling and variance](#sampling-and-variance) + * [Conveying the sampling probability](#conveying-the-sampling-probability) + + [Encoding adjusted count](#encoding-adjusted-count) + + [Encoding inclusion probability](#encoding-inclusion-probability) + + [Encoding base-2 logarithm of adjusted count](#encoding-base-2-logarithm-of-adjusted-count) + + [Multiply the adjusted count into the data](#multiply-the-adjusted-count-into-the-data) + * [Trace Sampling](#trace-sampling) + + [Counting child spans using root span adjusted counts](#counting-child-spans-using-root-span-adjusted-counts) + + [Using parent sampling probability to count all spans](#using-parent-sampling-probability-to-count-all-spans) + + [Parent sampling for traces](#parent-sampling-for-traces) + - [`Parent` Sampler](#parent-sampler) + - [`TraceIDRatio` Sampler](#traceidratio-sampler) + - [Dapper's "Inflationary" Sampler](#dappers-inflationary-sampler) +- [Proposed `Span` protocol](#proposed-span-protocol) + * [Span data model changes](#span-data-model-changes) + * [Proposed `Sampler` composition rules](#proposed-sampler-composition-rules) + + [Composing two consistent probability samplers](#composing-two-consistent-probability-samplers) + + [Composing a probability sampler and a non-probability sampler](#composing-a-probability-sampler-and-a-non-probability-sampler) + + [Composition rules summary](#composition-rules-summary) + * [Proposed `Sampler` interface changes](#proposed-sampler-interface-changes) +- [Recommended reading](#recommended-reading) +- [Acknowledgements](#acknowledgements) + + + +Objective: Specify a foundation for sampling techniques in OpenTelemetry. + +## Motivation + +Probability sampling allows consumers of sampled telemetry data to +collect a fraction of telemetry events and use them to estimate total +quantities about the population of events, such as the total rate of +events with a particular attribute. 
+
+These techniques enable reducing the cost of telemetry collection,
+both for producers (i.e., SDKs) and for processors (i.e., Collectors),
+without losing the ability to (at least coarsely) monitor the whole
+system.
+
+Sampling builds on results from probability theory. Estimates drawn
+from probability samples are *random variables* that are expected to
+equal their true value. When all outcomes of the sampling logic are
+equally likely, meaning all the potential combinations of items used
+to compute a sample are equally likely, we say the sample is
+*unbiased*.
+
+Unbiased samples can be used for after-the-fact analysis. We can
+answer questions such as "what fraction of events had property X?"
+using the fraction of events in the sample that have property X.
+
+This document outlines how producers and consumers of sample telemetry
+data can convey estimates about the total count of telemetry events,
+without conveying information about how the sample was computed, using
+a quantity known as **adjusted count**. In common language, a
+"one-in-N" sampling scheme emits events with adjusted count equal to
+N. Adjusted count is the expected value of the number of events in
+the population represented by an individual sample event.
+
+## Examples
+
+These examples use an attribute named `sampler.adjusted_count` to
+convey sampling probability. Consumers of spans, metrics, and logs
+annotated with adjusted counts are able to calculate accurate
+statistics about the whole population of events, without knowing
+details about the sampling configuration.
+
+The hypothetical `sampler.adjusted_count` attribute is used throughout
+these examples to demonstrate this concept, although the proposal
+below for OpenTelemetry `Span` messages introduces a dedicated field
+with specific interpretation for conveying parent sampling probability.
+
+### Span sampling
+
+Example use-cases for probability sampling of spans generally involve
+generating metrics from spans.
+
+#### Sample spans to Counter Metric
+
+In this example, an OpenTelemetry SDK for tracing is configured with a
+`SpanProcessor` that counts sample spans as they are processed based
+on their adjusted counts. The SDK could be used to monitor request
+rates using Prometheus, for example.
+
+For every complete sample span it receives, the example
+`SpanProcessor` will synthesize metric data as though a Counter named
+`S_count` corresponding to a span named `S` had been incremented once
+per original span. Using the adjusted count of sampled spans instead,
+the value of `S_count` is expected to equal the true number of spans.
+
+For every span it receives, this `SpanProcessor` will add the span's
+adjusted count to a corresponding metric Counter instrument. For
+example, using the OpenTelemetry Metrics API directly:
+
+```
+func (p *spanToMetricsProcessor) OnEnd(span trace.ReadOnlySpan) {
+    ctx := context.Background()
+    counter := p.meter.NewInt64Counter(span.Name() + "_count")
+    counter.Add(
+        ctx,
+        span.AdjustedCount(),
+        span.Attributes()...,
+    )
+}
+```
+
+#### Sample spans to Histogram Metric
+
+For every span it receives, the example processor will synthesize
+metric data as though a Histogram instrument named `S_duration` for a
+span named `S` had been observed once per original span.
+
+The OpenTelemetry Metric data model does not support histogram buckets
+with non-integer counts, which forces the use of integer adjusted
+counts here (i.e., 1-in-N sampling rates where N is an integer).
+
+Logically speaking, this processor will observe the span's duration
+*adjusted count* number of times for every sample span it receives.
+This example, therefore, uses a hypothetical `RecordMany()` method to
+capture multiple observations of a Histogram measurement at once:
+
+```
+    histogram := p.meter.NewFloat64Histogram(
+        span.Name() + "_duration",
+        metric.WithUnits("ms"),
+    )
+    histogram.RecordMany(
+        ctx,
+        span.Duration().Milliseconds(),
+        span.AdjustedCount(),
+        span.Attributes()...,
+    )
+```
+
+#### Sample span rate limiting
+
+A collector processor will introduce a slight delay in order to ensure
+it has received a complete frame of data, during which time it
+maintains a fixed-size buffer of complete input spans. If the number
+of spans received exceeds the size of the buffer before the end of the
+interval, it begins weighted sampling, using the adjusted count of each
+span as its input weight.
+
+This processor drops spans when the configured rate threshold is
+exceeded, otherwise it passes spans through with unmodified adjusted
+counts.
+
+When the interval expires and the sample frame is considered complete,
+the selected sample spans are output with possibly updated adjusted
+counts.
+
+## Explanation
+
+Consider a hypothetical telemetry signal in which a stream of
+data items is produced containing one or more associated numbers.
+Using the OpenTelemetry Metrics data model terminology, we have two
+scenarios in which sampling is common.
+
+1. *Counter events:* Each event represents a count, signifying the change in a sum.
+2. *Histogram events:* Each event represents an individual variable, signifying membership in a distribution.
+
+A Tracing Span event qualifies as both of these cases simultaneously.
+One span can be interpreted as at least one Counter event (e.g., one
+request, the number of bytes read) and at least one Histogram event
+(e.g., request latency, request size).
+
+In Metrics, [Statsd Counter and Histogram events meet this definition](https://github.com/statsd/statsd/blob/master/docs/metric_types.md#sampling).
+
+In both cases, the goal in sampling is to estimate the count of events
+in the whole population, meaning all the events, using only the events
+that were selected in the sample.
+
+### Model and terminology
+
+This model is meant to apply in telemetry collection situations where
+individual events at an API boundary are sampled for collection. Once
+the process of sampling individual API-level events is understood, we
+will learn to apply these techniques for sampling aggregated data.
+
+In sampling, the term *sampling design* refers to how sampling
+probability is decided and the term *sample frame* refers to how
+events are organized into discrete populations. The design of a
+sampling strategy dictates how the population is framed.
+
+For example, a simple design uses uniform probability, and a simple
+framing technique is to collect one sample per distinct span name per
+hour. A different sample framing could collect one sample across all
+span names every 10 minutes.
+
+After executing a sampling design over a frame, each item selected in
+the sample will have a known *inclusion probability*, which determines
+how likely the item was to be selected. Implicitly, all the items
+that were not selected for the sample have zero inclusion probability.
+ +Descriptive words that are often used to describe sampling designs: + +- *Fixed*: the sampling design is the same from one frame to the next +- *Adaptive*: the sampling design changes from one frame to the next based on the observed data +- *Equal-Probability*: the sampling design uses a single inclusion probability per frame +- *Unequal-Probability*: the sampling design uses multiple inclusion probabilities per frame +- *Reservoir*: the sampling design uses fixed space, has fixed-size output. + +Our goal is to support flexibility in choosing sampling designs for +producers of telemetry data, while allowing consumers of sampled +telemetry data to be agnostic to the sampling design used. + +#### Sampling without replacement + +We are interested in the common case in telemetry collection, where +sampling is performed while processing a stream of events and each +event is considered just once. Sampling designs of this form are +referred to as *sampling without replacement*. Unless stated +otherwise, "sampling" in telemetry collection always refers to +sampling without replacement. + +After executing a given sampling design over a complete frame of data, +the result is a set of selected sample events, each having known and +non-zero inclusion probability. There are several other quantities of +interest, after calculating a sample from a sample frame. + +- *Sample size*: the number of events with non-zero inclusion probability +- *True population total*: the exact number of events in the frame, which may be unknown +- *Estimated population total*: the estimated number of events in the frame, which is computed from the sample. + +The sample size is always known after it is calculated, but the size +may or may not be known ahead of time, depending on the design. +Probabilistic sampling schemes require that the estimated population +total equals the expected value of the true population total. + +#### Adjusted sample count + +Following the model above, every event defines the notion of an +*adjusted count*. + +- *Adjusted count* is zero if the event was not selected for the sample +- *Adjusted count* is the reciprocal of its inclusion probability, otherwise. + +The adjusted count of an event represents the expected contribution to +the estimated population total of a sample frame represented by the +individual event. + +The use of a reciprocal inclusion probability matches our intuition +for probabilities. Items selected with "one-out-of-N" probability of +inclusion count for N each, approximately speaking. + +This intuition is backed up with statistics. This equation is known +as the Horvitz-Thompson estimator of the population total, a +general-purpose statistical "estimator" that applies to all *without +replacement* sampling designs. + +Assuming sample data is correctly computed, the consumer of sample +data can treat every sample event as though an identical copy of +itself has occurred *adjusted count* times. Every sample event is +representative for adjusted count many copies of itself. + +There is one essential requirement for this to work. The selection +procedure must be *statistically unbiased*, a term meaning that the +process is required to give equal consideration to all possible +outcomes. + +#### Sampling and variance + +The use of unbiased sampling outlined above makes it possible to +estimate the population total for arbitrary subsets of the sample, as +every individual sample has been independently assigned an adjusted +count. 
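+
+As a concrete illustration of the estimator described above, the
+population total is estimated by summing adjusted counts over the
+sampled events. The following is a minimal sketch; the `SampledEvent`
+type and its fields are hypothetical:
+
+```golang
+// SampledEvent is a hypothetical sampled telemetry event that carries
+// the inclusion probability in effect when it was selected.
+type SampledEvent struct {
+	InclusionProbability float64 // 0 < probability <= 1 for selected events
+}
+
+// AdjustedCount is the reciprocal of the inclusion probability.
+func (e SampledEvent) AdjustedCount() float64 {
+	return 1 / e.InclusionProbability
+}
+
+// EstimatePopulationTotal computes the Horvitz-Thompson estimate of
+// the number of events in the frame from the selected sample alone.
+// Unselected events have zero adjusted count and never appear here.
+func EstimatePopulationTotal(sample []SampledEvent) float64 {
+	var total float64
+	for _, e := range sample {
+		total += e.AdjustedCount()
+	}
+	return total
+}
+```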
+
+There is a natural relationship between statistical bias and variance.
+Approximate counting comes with variance, a fact which can be
+controlled for by the sample size. Variance is unavoidable in an
+unbiased sample, but variance diminishes with increasing sample size.
+
+Although this makes it sound like small sample sizes are a problem,
+due to expected high variance, this is just a limitation of the
+technique. When variance is high, use a larger sample size.
+
+An easy approach for lowering variance is to aggregate sample frames
+together across time, which generally increases the size of the
+subpopulations being counted. For example, although the estimates for
+the rate of spans by distinct name drawn from a one-minute sample may
+have high variance, combining an hour of one-minute sample frames into
+an aggregate data set is guaranteed to lower variance (assuming the
+number of span names stays fixed). It must, because the data remains
+unbiased, so more data results in lower variance.
+
+### Conveying the sampling probability
+
+Some possibilities for encoding the adjusted count or inclusion
+probability are discussed below, depending on the circumstances and
+the protocol. Here, the focus is on how to count sampled telemetry
+events in general, not a specific kind of event. As we shall see in
+the following section, tracing comes with additional complications.
+
+There are several ways of encoding this adjusted count or inclusion
+probability:
+
+- as a dedicated field in an OTLP protobuf message
+- as a non-descriptive Attribute in an OTLP Span, Metric, or Log
+- without any dedicated field.
+
+#### Encoding adjusted count
+
+We can encode the adjusted count directly as a floating point or
+integer number in the range [0, +Inf). This is a conceptually easy
+way to understand sampling because larger numbers mean greater
+representivity.
+
+Note that it is possible, given this description, to produce adjusted
+counts that are not integers. Adjusted counts are an approximation,
+and the expected value of an integer can be a fractional count.
+Floating-point adjusted counts can be avoided with the use of
+integer-reciprocal inclusion probabilities.
+
+#### Encoding inclusion probability
+
+We can encode the inclusion probability directly as a floating point
+number in the range [0, 1). This is typical of the Statsd format,
+where each line includes an optional probability. In this context,
+the probability is also commonly referred to as a "sampling rate". In
+this case, smaller numbers mean greater representivity.
+
+#### Encoding base-2 logarithm of adjusted count
+
+We can encode the base-2 logarithm of adjusted count (i.e., negative
+base-2 logarithm of inclusion probability). By using an integer
+field, restricting adjusted counts and inclusion probabilities to
+powers of two, this allows the use of small non-negative integers to
+encode the adjusted count. In this case, larger numbers mean
+exponentially greater representivity.
+
+#### Multiply the adjusted count into the data
+
+When the data itself carries counts, such as for the Metrics Sum and
+Histogram points, the adjusted count can be multiplied into the data.
+
+This technique is less desirable because, while it preserves the
+expected value of the count or sum, the data loses information about
+variance. This may also lead to rounding errors, when adjusted counts
+are not integer valued.
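+
+The first three encodings above carry the same information and can be
+converted into one another; the following is a minimal sketch of the
+conversions (assuming power-of-two probabilities for the logarithmic
+form; the function names are hypothetical):
+
+```golang
+import "math"
+
+// adjustedCountFromProbability converts an inclusion probability into
+// an adjusted count (the first encoding).
+func adjustedCountFromProbability(probability float64) float64 {
+	return 1 / probability
+}
+
+// log2AdjustedCount converts a power-of-two inclusion probability
+// 2**-s into the small non-negative integer s (the third encoding).
+func log2AdjustedCount(probability float64) int {
+	return int(math.Round(-math.Log2(probability)))
+}
+
+// probabilityFromLog2 inverts the base-2 logarithm encoding.
+func probabilityFromLog2(s int) float64 {
+	return math.Pow(2, -float64(s))
+}
+```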
+ +### Trace Sampling + +Sampling techniques are always about lowering the cost of data +collection and analysis, but in trace collection and analysis +specifically, approaches can be categorized by whether they reduce +Tracer overhead. Tracer overhead is reduced by not recording spans +for unsampled traces and requires making the sampling decision at the +time a new span context is created, sometimes before all of its +attributes are known. + +Traces are said to be complete when the all spans that were part of +the trace are collected. When sampling is applied to reduce Tracer +overhead, there is generally an expectation that complete traces will +still be produced. Sampling techniques that lower Tracer overhead and +produce complete traces are known as *Head trace sampling* techniques. + +The decision to produce and collect a sample trace has to be made when +the root span starts, to avoid incomplete traces. Then, assuming +complete traces can be collected, the adjusted count of the root span +determines an adjusted count for every span in the trace. + +#### Counting child spans using root span adjusted counts + +The adjusted count of a root span determines the adjusted count of +each of its children based on the following logic: + +- The root span is considered representative of `adjusted_count` many + identical root spans, because it was selected using unbiased sampling +- Context propagation conveys *causation*, the fact the one span produces + another +- A root span causes each of the child spans in its trace to be produced +- A sampled root span represents `adjusted_count` many traces, representing + the cause of `adjusted_count` many occurrences per child span in the + sampled trace. + +Using this reasoning, we can define a sample collected from all root +spans in the system, which allows estimating the count of all spans in +the population. Take a simple probability sample of root spans: + +1. In the `Sampler` decision for root spans, use the initial span properties + to determine the inclusion probability `P` +2. Make a pseudo-random selection with probability `P`, if true return + `RECORD_AND_SAMPLE` (so that the W3C Trace Context `is-sampled` + flag is set in all child contexts) +3. Encode a span adjusted count attribute equal to `1/P` on the root span +4. Collect all spans where the W3C Trace Context `is-sampled` flag is set. + +After collecting all sampled spans, locate the root span for each. +Apply the root span's adjusted count to every child in the associated +trace. The sum of adjusted counts on all sampled spans is expected to +equal the population total number of spans. + +Now, having stored the sample spans with their adjusted counts, and +assuming the source of randomness is good, we can extrapolate counts +for the population using arbitrary queries over the sampled spans. +Sampled spans can be translated into approximate metrics over the +population of spans, after their adjusted counts are known. + +The cost of this analysis, using only the root span's adjusted count, +is that all root spans have to be collected before we can count +non-root spans. The cost of indexing and looking up the root span +adjusted counts makes this analysis relatively expensive to perform in +real time. 
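+
+The following is a minimal sketch of this root-based analysis,
+estimating the total span count from the collected sample; the types
+and field names are hypothetical:
+
+```golang
+// CollectedSpan is a hypothetical exported span carrying just the
+// fields needed for this analysis.
+type CollectedSpan struct {
+	TraceID       string
+	ParentSpanID  string  // empty for root spans
+	AdjustedCount float64 // recorded on root spans in this scheme
+}
+
+// EstimateTotalSpans indexes the root spans by TraceID, then applies
+// each root's adjusted count to every span in the associated trace.
+func EstimateTotalSpans(spans []CollectedSpan) float64 {
+	rootCount := make(map[string]float64)
+	for _, s := range spans {
+		if s.ParentSpanID == "" {
+			rootCount[s.TraceID] = s.AdjustedCount
+		}
+	}
+	var total float64
+	for _, s := range spans {
+		total += rootCount[s.TraceID]
+	}
+	return total
+}
+```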
+ +#### Using parent sampling probability to count all spans + +If the W3C `is-sampled` flag will be used to determine whether +`RECORD_AND_SAMPLE` is returned in a Sampler, then in order to count +sample spans without first locating the root span requires propagating +information about the parent sampling probability through the +context. Using the parent sampling probability, instead of the +root, allows individual spans in a trace to control the sampling +probability of their descendents in a sub-trace that use `ParentBased` +sampler. Such techniques are referred to as *parent sampling* +techniques. + +Parent sampling probability may be thought of as the probability +of causing a child span to be a sampled. Propagators that maintain +this variable MUST obey the rules of conditional probability. In this +model, the adjusted count of each span depends on the adjusted count +of its parent, not of the root in a trace. Still, the sum of adjusted +counts of all sampled spans is expected to equal the population total +number of spans. + +This applies to other forms of telemetry that happen (i.e., are +caused) within a context carrying parent sampling probability. For +example, we may record log events and metrics exemplars with adjusted +counts equal to the inverse of the current parent sampling probability +when they are produced. + +This technique allows translating spans and logs to metrics without +first locating their root span, a significant performance advantage +compared with first collecting and indexing root spans. + +Several parent sampling techniques are discussed in the following +sections and evaluated in terms of their ability to meet all of the +following criteria: + +- Reduces Tracer overhead +- Produces complete traces +- Spans are countable. + +#### Parent sampling for traces + +Details about Sampler implementations that meet +the requirements stated above. + +##### `Parent` Sampler + +The `Parent` Sampler ensures complete traces, provided all spans are +successfully recorded. A downside of `Parent` sampling is that it +takes away control over Tracer overhead from non-roots in the trace. +To support real-time span-to-metrics applications, this Sampler +requires propagating the sampling probability or adjusted count of +the context in effect when starting child spans. This is expanded +upon in [OTEP 168 (WIP)](https://github.com/open-telemetry/oteps/pull/168). + +When propagating parent sampling probability, spans recorded by the +`Parent` sampler could encode the adjusted count in the corresponding +`SpanData` using a Span attribute named `sampler.adjusted_count`. + +##### `TraceIDRatio` Sampler + +The OpenTelemetry tracing specification includes a built-in Sampler +designed for probability sampling using a deterministic sampling +decision based on the TraceID. This Sampler was not finished before +the OpenTelemetry version 1.0 specification was released; it was left +in place, with [a TODO and the recommendation to use it only for trace +roots](https://github.com/open-telemetry/opentelemetry-specification/issues/1413). +[OTEP 135 proposed a solution](https://github.com/open-telemetry/oteps/pull/135). + +The goal of the `TraceIDRatio` Sampler is to coordinate the tracing +decision, but give each service control over Tracer overhead. Each +service sets its sampling probability independently, and the +coordinated decision ensures that some traces will be complete. +Traces are complete when the TraceID ratio falls below the minimum +Sampler probability across the whole trace. 
Techniques have been +developed for [analysis of partial traces that are compatible with +TraceID ratio sampling](https://arxiv.org/pdf/2107.07703.pdf). + +The `TraceIDRatio` Sampler has another difficulty with testing for +completeness. It is impossible to know whether there are missing leaf +spans in a trace without using external information. One approach, +[lost in the transition from OpenCensus to OpenTelemetry is to count +the number of children of each +span](https://github.com/open-telemetry/opentelemetry-specification/issues/355). + +Lacking the number of expected children, we require a way to know the +minimum Sampler probability across traces to ensure they are complete. + +To count TraceIDRatio-sampled spans, each span could encode its +adjusted count in the corresponding `SpanData` using a Span attribute +named `sampler.adjusted_count`. + +##### Dapper's "Inflationary" Sampler + +Google's [Dapper](https://research.google/pubs/pub36356/) tracing +system describes the use of sampling to control the cost of trace +collection at scale. Dapper's early Sampler algorithm, referred to as +an "inflationary" approach (although not published in the paper), is +reproduced here. + +This kind of Sampler allows non-root spans in a trace to raise the +probability of tracing, using a conditional probability formula shown +below. Traces produced in this way are complete sub-trees, not +necessarily complete. This technique is successful especially in +systems where a high-throughput service on occasion calls a +low-throughput service. Low-throughput services are meant to inflate +their sampling probability. + +The use of this technique requires propagating the parent inclusion +probability (as discussed for the `Parent` sampler) of the incoming +Context and whether it was sampled, in order to calculate the +probability of starting to sample a new "sub-root" in the trace. + +Using standard notation for conditional probability, `P(x)` indicates +the probability of `x` being true, and `P(x|y)` indicates the +probability of `x` being true given that `y` is true. The axioms of +probability establish that: + +``` +P(x)=P(x|y)*P(y)+P(x|not y)*P(not y) +``` + +The variables are: + +- **`H`**: The parent inclusion probability of the parent context that + is in effect, independent of whether the parent context was sampled +- **`I`**: The inflationary sampling probability for the span being + started. +- **`D`**: The decision probability for whether to start a new sub-root. + +This Sampler cannot lower sampling probability, so if the new span is +started with `H >= I` or when the context is already sampled, no new +sampling decisions are made. If the incoming context is already +sampled, the adjusted count of the new span is `1/H`. + +Assuming `H < I` and the incoming context was not sampled, we have the +following probability equations: + +``` +P(span sampled) = I +P(parent sampled) = H +P(span sampled | parent sampled) = 1 +P(span sampled | parent not sampled) = D +``` + +Using the formula above, + +``` +I = 1*H + D*(1-H) +``` + +solve for D: + +``` +D = (I - H) / (1 - H) +``` + +Now the Sampler makes a decision with probability `D`. Whether the +decision is true or false, propagate `I` as the new parent inclusion +probability. If the decision is true, begin recording a sub-rooted +trace with adjusted count `1/I`. 
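+
+To make the arithmetic concrete, the following is a minimal sketch of
+the decision logic described above (not Dapper's actual
+implementation); `h` is the incoming parent inclusion probability and
+`i` is the configured inflationary probability:
+
+```golang
+import "math/rand"
+
+// inflationaryDecision returns whether to sample the new span and, if
+// sampled, the adjusted count to record. In every case the caller
+// propagates i as the new parent inclusion probability.
+func inflationaryDecision(h, i float64, parentSampled bool) (sampled bool, adjustedCount float64) {
+	if parentSampled {
+		// The context is already sampled: keep the decision, with
+		// adjusted count 1/H.
+		return true, 1 / h
+	}
+	if h >= i {
+		// This sampler cannot lower sampling probability: no new
+		// sampling decision is made.
+		return false, 0
+	}
+	// P(span sampled | parent not sampled) = D = (I - H) / (1 - H).
+	d := (i - h) / (1 - h)
+	if rand.Float64() < d {
+		// Begin a sub-rooted trace with adjusted count 1/I.
+		return true, 1 / i
+	}
+	return false, 0
+}
+```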

## Proposed `Span` protocol

Following the proposal for propagating consistent parent sampling
probability developed in [OTEP
168](https://github.com/open-telemetry/oteps/pull/168), this proposal
is limited to adding a field to encode the parent sampling probability.
The OTEP 168 proposal for propagation limits parent sampling
probabilities to powers of two, hence we are able to encode the
corresponding adjusted count using a small non-negative integer.

The OpenTelemetry Span protocol already includes the Span's
`tracestate`, which allows consumers to calculate the adjusted count
of the span by applying the rules specified in that proposal to derive
the parent sampling probability.

The OTEP 168 proposal for propagating parent sampling probability uses 6
bits of information, with 63 ordinary values and a special zero value.
When `tracestate` is empty, the `ot` subkey cannot be found, or the
`p` value cannot be determined, the parent sampling probability is
considered unknown.

| Value | Parent Adjusted Count |
| ----- | --------------------- |
| 0     | 1                     |
| 1     | 2                     |
| 2     | 4                     |
| 3     | 8                     |
| 4     | 16                    |
| ...   | ...                   |
| X     | 2**X                  |
| ...   | ...                   |
| 62    | 2**62                 |
| 63    | 0                     |

Combined with the proposal for propagating parent sampling probability
in OTEP 168, the result is that Sampling can be enabled in an
up-to-date system and all Spans, roots and children alike, will have a
known adjusted count. Consumers of a stream of Span data with the
OTEP 168 `tracestate` value can approximately and accurately count
Spans by their adjusted count.

### Span data model changes

Addition to the Span data model:

```
### Definitions Used in this Document

#### Sampler

A Sampler provides configurable logic, used by the SDK, for selecting
which Spans are "recorded" and/or "sampled" in a tracing client
library. To "record" a span means to build a representation of it in
the client's memory, which makes it eligible for being exported. To
"sample" a span implies setting the W3C `sampled` flag and recording
the span for export.

OpenTelemetry supports spans that are "recorded" and not "sampled"
for "live" observability of spans (e.g., z-pages).

The Sampler interface and the built-in Samplers defined by OpenTelemetry
must be capable of deciding immediately whether to sample the child
context. The term "sampling" may be used in a more general sense.
For example, a reservoir sampling scheme limits the rate of sample items
selected over a period of time, but such a scheme necessarily defers its
decision making. Thus "Sampling" may be applied anywhere on a collection
path, whereas the "Sampler" API is restricted to logic that can immediately
decide to sample a trace inside an OpenTelemetry SDK.

#### Parent-based sampling

A Sampler that makes its decision to sample based on the W3C `sampled`
flag is said to use parent-based sampling.

#### Parent sampling

In a tracing context, Parent sampling refers to the initial decision to
sample a span or a trace, which determines the W3C `sampled` flag of
the child context. The OpenTelemetry tracing data model currently
supports only parent sampling.

#### Probability sampler

A probability Sampler is a Sampler that knows immediately, for each
of its decisions, the probability that the span had of being selected.

Sampling probability is defined as a number less than or equal to 1
and greater than 0 (i.e., `0 < probability <= 1`). The case of 0
probability is treated as a special, non-probabilistic case.

#### Consistent probability sampler

A consistent probability sampler is a Sampler that supports independent
sampling decisions at each span in a trace while maintaining that
traces will be complete with probability equal to the minimum sampling
probability across the trace. Consistent probability sampling requires that
for any span in a given trace, if a Sampler with lesser sampling probability
selects the span for sampling, then the span would also be selected by a
Sampler configured with greater sampling probability.

In OpenTelemetry, consistent probability samplers are limited to
power-of-two probabilities. OpenTelemetry consistent probability sampling
is defined in terms of a "p-value" and an "r-value", both of which are
propagated via the context to assist in making consistent sampling decisions.

#### Always-on sampler

An always-on sampler is another name for a consistent probability
sampler with probability equal to one.

#### Always-off sampler

An always-off Sampler has the effect of disabling a span completely,
effectively excluding it from the population. This is not defined as
a probability sampler with zero probability, because these spans are
effectively uncountable.

#### Non-probability sampler

A non-probability sampler is a Sampler that makes its decisions not
based on chance, but instead uses arbitrary logic and internal state
to make its decisions. Because OpenTelemetry specifies the use of
consistent probability samplers, any sampler other than a parent-based
sampler that does not meet all the requirements for consistent probability
sampling is termed a non-probability sampler.

#### Adjusted count

Adjusted count is defined as a measure of representivity, the number
of spans in the population that are represented by the individually
sampled span. Span-to-metrics pipelines may be built by adding the
adjusted count of each sample span to a counter of matching spans,
observing the duration of each sample span in a histogram adjusted
count many times, and so on.

An adjusted count of 1 means one-to-one sampling was in effect.
Adjusted counts greater than 1 indicate the use of a probability
sampler. Adjusted counts are unknown when using a non-probability
sampler.

Zero adjusted count is defined in a way to support composition of
probability and non-probability samplers. In effect, spans that are
"recorded" but not "sampled" have an adjusted count of zero.

#### Unbiased probability sampling

The statistical term "unbiased" is a requirement applied to the
adjusted count of a span, which states that the expected value of the
sum of adjusted counts across all exported spans MUST equal the true
number of spans in the population. The statistical bias of the estimated
span count in the population, a measure of the difference between an
estimate and its true value, should equal zero. Moreover, this
requirement must be true for all subsets of the span population for a
sampler to be considered an unbiased probability sampler.

It is easier to define probability sampling by what it is not. Here
are several samplers that should be categorized as non-probability
samplers because they cannot record unbiased adjusted counts:

- A traditional form of "leaky-bucket" sampler applies a rate limit to
  the starting of new sampled traces. When the configured limit is
  not exceeded, all spans pass through with adjusted count 1.
  When
  the configured rate limit is exceeded, it is impossible to set an
  adjusted count without introducing bias, because future arrivals are
  not known.
- An "every-N" sampler records spans on a regular interval, but instead
  of making a probabilistic decision it makes an exact decision
  (e.g., every 10,000 spans). This sampler knows the representivity
  of the spans it samples, but the selection process is biased.
- An "at least once per time period" sampler remembers the last time
  each distinct span name exported a span. When a span occurs after
  more than the specified interval, it samples one (e.g., to ensure
  that receivers know about these spans). This sampler introduces
  bias because spans that happen between the intervals do not receive
  consideration.
- The "always off" sampler is biased by definition. Since it exports
  no spans, the sum of adjusted counts is always zero.
```

### Proposed `Sampler` composition rules

When combining multiple Samplers, the natural outcome is that a span
will be recorded and sampled if any one of the Samplers says to record
or sample the span. Combining Samplers in a way that preserves
adjusted counts requires first classifying Samplers into one of the
following categories:

1. Parent-based (`ParentBased`)
2. Known non-zero probability (`TraceIDRatio`, `AlwaysOn`)
3. Non-probability based (`AlwaysOff`, all other Samplers)

The Parent-based sampler always reduces to one of the other two at
runtime, based on whether or not the parent context includes a known
parent probability.

The following rules for combining Sampler decisions from each of these
categories may be used to construct composite samplers.

#### Composing two consistent probability samplers

When two consistent probability samplers are used, the Sampler with
the larger probability by definition includes every span the smaller
probability sampler would select. The result is a consistent sampler
with the minimum p-value.

#### Composing a probability sampler and a non-probability sampler

When a probability sampler is composed with a non-probability sampler,
the effect is to change an unknown probability into a known
probability. When the probability sampler selects the span, its
adjusted count will be used. When the probability sampler does not
select a span, a zero adjusted count will be used.

The use of a zero adjusted count allows recording spans that an unbiased
probability sampler did not select, allowing those spans to be
received at the backend without introducing statistical bias.

#### Composition rules summary

To create a composite Sampler, first express the result of each
Sampler in terms of the p-value and the `sampled` flag. Note that
p-values fall into three categories:

1. Unknown p-value indicates unknown adjusted count
2. Known non-zero p-value (in the range `[0,62]`) indicates known non-zero adjusted count
3. Known zero p-value (`p=63`) indicates known zero adjusted count

While non-probability samplers always return unknown `p` and may set
`sampled=true` or `sampled=false`, a probability sampler is restricted
to returning either `p∈[0,62]` with `sampled=true` or to returning
`p=63` with `sampled=false`. No individual sampler can return `p=63`
with `sampled=true`, but this condition MAY result from composition of
`p=63` and unknown `p`.

A composite sampler decision can be computed as follows.

Although unknown `p` is never encoded in `tracestate`, for the purpose
of composition we assign unknowns `p=64`, which is 1 beyond the range
of the 6 bits that represent known p-values. The assignment of `p=64`
simplifies the formulas below.

By following these simple rules, any number of consistent probability
samplers and non-probability samplers can be combined. Starting with
`p=64` representing unknown and `sampled=false`, update the composite
p-value to the minimum of the prior composite p-value and the
individual sampler p-value, and update the composite `sampled` flag to
the logical OR of the prior composite flag and the individual sampler
flag.

```
p_out = min(p_in, p_sampler)
sampled_out = logicalOR(sampled_in, sampled_sampler)

When the value is >=0 and <63, the adjusted
count of the Span is 2**value, representing power-of-two
probabilities between 1 and 2**-62.

The corresponding `SamplerResult` field SHOULD be named
`log_adjusted_count` because it carries the newly-created
span and child context's adjusted count and is expressed as
the logarithm of adjusted count for spans selected by a
probability Sampler.
```

See [OTEP 168](https://github.com/open-telemetry/oteps/pull/168) for
details on how each of the built-in Samplers is expected to set
`tracestate` for conveying sampling probabilities.

## Recommended reading

[Sampling, 3rd Edition, by Steven
K. Thompson](https://www.wiley.com/en-us/Sampling%2C+3rd+Edition-p-9780470402313).

[A Generalization of Sampling Without Replacement From a Finite Universe](https://www.jstor.org/stable/2280784), JSTOR (1952).

[Priority sampling for estimation of arbitrary subset sums](https://dl.acm.org/doi/abs/10.1145/1314690.1314696).

[Stream sampling for variance-optimal estimation of subset sums](https://arxiv.org/abs/0803.0473).

[Estimation from Partially Sampled Distributed Traces](https://arxiv.org/pdf/2107.07703.pdf), 2021 Dynatrace Research report, Otmar Ertl.

## Acknowledgements

Thanks to [Neena Dugar](https://github.com/neena) and [Alex
Kehlenbeck](https://github.com/akehlenbeck) for their help
reconstructing the Dapper Sampler algorithm.

diff --git a/oteps/trace/0173-messaging-semantic-conventions.md b/oteps/trace/0173-messaging-semantic-conventions.md
new file mode 100644
index 00000000000..091cfcd28b2
--- /dev/null
+++ b/oteps/trace/0173-messaging-semantic-conventions.md
@@ -0,0 +1,264 @@
# Scenarios for Tracing semantic conventions for messaging

This document aims to capture scenarios and a road map, both of which will
serve as a basis for [stabilizing](../../specification/versioning-and-stability.md#stable)
the [existing semantic conventions for messaging](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/messaging),
which are currently in an [experimental](../../specification/versioning-and-stability.md#experimental)
state. The goal is to declare messaging semantic conventions stable before the
end of 2021.

## Motivation

Many observability scenarios involve messaging systems, event streaming, or
event-driven architectures. For Distributed Tracing to be useful across the
entire scenario, having good observability for messaging or eventing operations
is critical. To achieve this, OpenTelemetry must provide stable conventions and
guidelines for instrumenting those operations. Popular messaging systems that
should be supported include Kafka, RabbitMQ, Apache RocketMQ, Azure Event Hubs
and Service Bus, Amazon SQS, SNS, and Kinesis.
+ +Bringing the existing experimental semantic conventions for messaging to a +stable state is a crucial step for users and instrumentation authors, as it +allows them to rely on [stability guarantees](../../specification/versioning-and-stability.md#not-defined-semantic-conventions-stability), +and thus to ship and use stable instrumentation. + +## Roadmap + +1. This OTEP, consisting of scenarios and a proposed roadmap, is approved and + merged. +2. [Stability guarantees](../../specification/versioning-and-stability.md#not-defined-semantic-conventions-stability) + for semantic conventions are approved and merged. This is not strictly related + to semantic conventions for messaging but is a prerequisite for stabilizing any + semantic conventions. +3. OTEPs proposing guidance for general instrumentation problems that also + pertain to messaging are approved and merged. Those general instrumentation + problems include retries and instrumentation layers. +4. An OTEP proposing a set of attributes and conventions covering the scenarios + in this document is approved and merged. +5. Proposed specification changes are verified by prototypes for the scenarios + and examples below. +6. The [specification for messaging semantic conventions for tracing](https://github.com/open-telemetry/semantic-conventions/tree/main/docs) + are updated according to the OTEP mentioned above and are declared + [stable](../../specification/versioning-and-stability.md#stable). + +The steps in the roadmap don't necessarily need to happen in the given order, +some steps can be worked on in parallel. + +## Terminology + +The terminology used in this document is based on the [CloudEvents specification](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md). +CloudEvents is hosted by the CNCF and provides a specification for describing +event data in common formats to provide interoperability across services, +platforms and systems. + +### Message + +A "message" is a transport envelope for the transfer of information. The +information is a combination of a payload and metadata. Metadata can be +directed at consumers or at intermediaries on the message path. Messages are +transferred via one or more intermediaries. Messages are uniquely +identifiable. + +In the strict sense, a _message_ is a payload that is sent to a specific +destination, whereas an _event_ is a signal emitted by a component upon +reaching a given state. This document is agnostic of those differences and uses +the term "message" in a wider sense to cover both concepts. + +### Producer + +The "producer" is a specific instance, process or device that creates and +publishes a message. "Publishing" is the process of sending a message or batch +to the intermediary or consumer. + +### Consumer + +A "consumer" receives the message and acts upon it. It uses the context and +data to execute some logic, which might lead to the occurrence of new events. + +The consumer receives, processes, and settles a message. "Receiving" is the +process of obtaining a message from the intermediary, "processing" is the +process of acting on the information a message contains, "settling" is the +process of notifying an intermediary that a message was processed successfully. + +### Intermediary + +An "intermediary" receives a message to forward it to the next receiver, which +might be another intermediary or a consumer. 
+ +## Scenarios + +Producing and consuming a message involves five stages: + +``` +PRODUCER + +Create + | CONSUMER + v +--------------+ +Publish -> | INTERMEDIARY | -> Receive + +--------------+ | + ^ v + . Process + . | + . v + . . . . . . Settle +``` + +1. The producer creates a message. +2. The producer publishes the message to an intermediary. +3. The consumer receives the message from an intermediary. +4. The consumer processes the message. +5. The consumer settles the message by notifying the intermediary that the + message was processed. In some cases (fire-and-forget), the settlement stage + does not exist. + +The messaging semantic conventions need to define how to model those stages in +traces, how to propagate context, and how to enrich traces with attributes. +Failures and retries need to be handled in all stages that interface with the +intermediary (publish, receive and settle) and will be covered by general +instrumentation guidance. + +Based on this model, the following scenarios capture major requirements and +can be used for prototyping, as examples, and as test cases. + +### Individual settlement + +Individual settlement systems imply independent logical message flows. A single +message is created and published in the same context, and it's delivered, +consumed, and settled as a single entity. Each message needs to be settled +individually. Usually, settlement information is stored by the intermediary, not +by the consumer. + +Transport batching can be treated as a special case: messages can be +transported together as an optimization, but are produced and consumed +individually. + +As the diagram below shows, each message can be settled individually, +regardless of the position of the message in the stream or queue. In contrast +to checkpoint-based settlement, settlement information is related to individual +messages and not to the overall message stream. + +``` ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ +|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ + Settled Settled Settled +``` + +#### Examples + +1. The following configurations should be instrumented and tested for RabbitMQ + or a similar messaging system: + + * 1 producer, 1 queue, 2 consumers + * 1 producer, fanout exchange to 2 queues, 2 consumers + * 2 producers, fanout exchange to 2 queues, 2 consumers + + Each of the producers continuously produces messages. + +### Checkpoint-based settlement + +Messages are processed as a stream and settled by moving a checkpoint. A +checkpoint points to a position of the stream up to which messages were +processed and settled. Messages cannot be settled individually, instead, the +checkpoint needs to be forwarded. Usually, the consumer is responsible for +storing checkpointing information, not the intermediary. + +Checkpoint-based settlement systems are designed to efficiently receive and +settle batches of messages. However, it is not possible to settle messages +independent of their position in the stream (e. g., if message B is located at +a later position in the stream than message A, then message B cannot be settled +without also settling message A). + +As the diagram below shows, messages cannot be settled individually. Instead, +settlement information is related to the overall ordered message stream. 
+ +``` + Checkpoint + | + v ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ +|Message A| |Message B| |Message C| |Message D| |Message E| |Message F| ++---------+ +---------+ +---------+ +---------+ +---------+ +---------+ + <--- Settled +``` + +#### Examples + +1. The following configurations should be instrumented and tested for Kafka or + a similar messaging system: + + * 1 producer, 2 consumers in the same consumer group + * 1 producer, 2 consumers in different consumer groups + * 2 producers, 2 consumers in the same consumer group + + Each of the producers produces a continuous stream of messages. + +## Open questions + +The following areas are considered out-of-scope of a first stable release of +semantic conventions for messaging. While not being explicitly considered for +a first stable release, it is important to ensure that this first stable +release can serve as a solid foundation for further improvements in these areas. + +### Sampling + +The current experimental semantic conventions rely heavily on span links as +a way to correlate spans. This is necessary, as several traces are needed to +model the complete path that a message takes through the system. With the currently +available sampling capabilities of OpenTelemetry, it is not possible to ensure +that a set of linked traces is sampled. As a result, it is unlikely to sample a +set of traces that covers the complete path a message takes. + +Solving this problem requires a solution for sampling based on span links, +which is not in scope for this OTEP. + +However, having a too high number of span links in a single trace or having too +many traces linked together can make the visualization and analysis of traces +inefficient. This problem is not related to sampling and needs to be addressed +by the semantic conventions. + +### Instrumenting intermediaries + +Instrumenting intermediaries can be valuable for debugging configuration or +performance issues, or for detecting specific intermediary failures. + +Stable semantic conventions for instrumenting intermediaries can be provided at +a future point in time, but are not in scope for this OTEP. The messaging +semantic conventions this document refers to need to provide instrumentation +that works well without the need to have intermediaries instrumented. + +### Metrics + +Messaging semantic conventions for tracing and for metrics overlap and should +be as consistent as possible. However, semantic conventions for metrics will be +handled separately and are not in scope for this OTEP. + +### Asynchronous message passing in the wider sense + +Asynchronous message passing in the wider sense is a communication method +wherein the system puts a message in a queue or channel and does not require an +immediate response to continue processing. This can range from utilizing a +simple queue implementation to a full-fledged messaging system. + +Messaging semantic conventions are intended for systems that fit into one of +the [scenarios laid out in the previous section](#scenarios), which cover a +significant part of asynchronous message passing applications. However, there +are low-level patterns of asynchronous message passing that don't fit in any of +those scenarios, e. g. channels in Go, or message passing in Erlang. Those +might be covered by a different set of semantic conventions in the future. 

There also exist several frameworks for queuing and executing
background jobs; often those frameworks utilize patterns of
asynchronous message passing to queue jobs. Those frameworks might
utilize messaging semantic conventions if they fit in any of the
[scenarios laid out in the previous section](#scenarios), but
otherwise targeting those various frameworks is not an explicit goal
for these conventions. Those frameworks might be covered by
[semantic conventions for "jobs"](https://github.com/open-telemetry/opentelemetry-specification/pull/1582)
in the future.

## Further reading

* [CloudEvents](https://github.com/cloudevents/spec/blob/v1.0.1/spec.md)
* [Message-Driven (in contrast to Event-Driven)](https://www.reactivemanifesto.org/glossary#Message-Driven)
* [Asynchronous message passing](https://en.wikipedia.org/wiki/Message_passing#Asynchronous_message_passing)
* [Existing semantic conventions for messaging](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/messaging)

diff --git a/oteps/trace/0174-http-semantic-conventions.md b/oteps/trace/0174-http-semantic-conventions.md
new file mode 100644
index 00000000000..f2f34b94874
--- /dev/null
+++ b/oteps/trace/0174-http-semantic-conventions.md
@@ -0,0 +1,186 @@
# Scenarios and Open Questions for Tracing semantic conventions for HTTP

This document aims to capture scenarios/open questions and a road map, both of
which will serve as a basis for [stabilizing](../../specification/versioning-and-stability.md#stable)
the [existing semantic conventions for HTTP](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/http),
which are currently in an [experimental](../../specification/versioning-and-stability.md#experimental)
state. The goal is to declare HTTP semantic conventions stable before the
end of Q1 2022.

## Motivation

Most observability scenarios involve HTTP communication. For Distributed Tracing
to be useful across the entire scenario, having good observability for
HTTP is critical. To achieve this, OpenTelemetry must provide stable conventions
and guidelines for instrumenting HTTP communication.

Bringing the existing experimental semantic conventions for HTTP to a
stable state is a crucial step for users and instrumentation authors, as it
allows them to rely on [stability guarantees](../../specification/versioning-and-stability.md#not-defined-semantic-conventions-stability),
and thus to ship and use stable instrumentation.

> NOTE. This OTEP captures the scope of changes that should be made to the existing
> experimental semantic conventions for HTTP, but does not propose solutions.

## Roadmap for v1.0

1. This OTEP, consisting of scenarios/open questions and a proposed roadmap, is
   approved and merged.
2. [Stability guarantees](../../specification/versioning-and-stability.md#not-defined-semantic-conventions-stability)
   for semantic conventions are approved and merged. This is not strictly related
   to semantic conventions for HTTP but is a prerequisite for stabilizing any
   semantic conventions.
3. Separate PRs addressing the scenarios and open questions listed in this
   document are approved and merged.
4. Proposed specification changes are verified by prototypes for the scenarios
   and examples below.
5. The [specification for HTTP semantic conventions for tracing](https://github.com/open-telemetry/semantic-conventions/tree/main/docs/http)
   is updated according to this OTEP and is declared
   [stable](../../specification/versioning-and-stability.md#stable).

The steps in the roadmap don't necessarily need to happen in the given order;
some steps can be worked on in parallel.

## Scope for v1.0: scenarios and open questions

> NOTE. The scope defined here is subject to discussion and can be adjusted
> until this OTEP is merged.

Scenarios and open questions mentioned below must be addressed via separate PRs.

### Error status defaults

4xx responses no longer cause an error span status in the case of
`SpanKind.SERVER`. It seems reasonable to define the same/similar behavior
for `SpanKind.CLIENT`.

### Required attribute sets

> At least one of the following sets of attributes is required:
>
> * `http.url`
> * `http.scheme`, `http.host`, `http.target`
> * `http.scheme`, `net.peer.name`, `net.peer.port`, `http.target`
> * `http.scheme`, `net.peer.ip`, `net.peer.port`, `http.target`

As a result, users that write queries against raw data or Zipkin/Jaeger don't
have a consistent story across instrumentations and languages; e.g., they'd need to
write queries like
`select * where (getPath(http.url) == "/a/b" || getPath(http.target) == "/a/b")`

Related issue: [open-telemetry/opentelemetry-specification#2114](https://github.com/open-telemetry/opentelemetry-specification/issues/2114).

### Retries and redirects

Should each try/redirect request have a unique context, so that it is traceable
and can be unambiguously referenced when asking for support from the downstream
service (which implies a span per call)?

Redirects: users may need observability into which server hop had an error or took
too long. E.g., was the 500/timeout from the final destination or from a proxy?

Related issues: [open-telemetry/opentelemetry-specification#1747](https://github.com/open-telemetry/opentelemetry-specification/issues/1747),
[open-telemetry/opentelemetry-specification#729](https://github.com/open-telemetry/opentelemetry-specification/issues/729).

PR addressing this scenario: [open-telemetry/opentelemetry-specification#2078](https://github.com/open-telemetry/opentelemetry-specification/pull/2078).

### Context propagation

How to propagate context between tries? Should it be cleaned up before making
a call when instances of client HTTP requests are reused?

## Scope for vNext: scenarios and open questions

### Error status configuration

In many cases, 4xx error criteria depend on the app (e.g., for 404/409). As an
end user, I might want the ability to override existing defaults and
define which HTTP status codes count as errors.

### Optional attributes

As a library owner, I don't understand the benefits of optional attributes:
they create overhead, they don't seem to be generically useful (e.g. flavor),
and are inconsistent across languages/libraries unless unified.

Related issue: [open-telemetry/opentelemetry-specification#2114](https://github.com/open-telemetry/opentelemetry-specification/issues/2114).

### Security concerns

Some attributes can contain potentially sensitive information. Most likely, by
default web frameworks/HTTP clients should not expose that. For example,
`http.target` has a query string that may contain credentials.

> NOTE. We didn't omit security concerns from v1.0 on purpose; it's just not
> something we've fleshed out so far.

### Sampling for noop case

To make instrumentation efficient in the noop case, it might be useful to have a
hint (e.g., `GlobalOTel.isEnabled()`) that an SDK is present and configured,
checked before creating pre-sampling attributes.
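
As a purely illustrative sketch of that hint, instrumentation could guard
pre-sampling attribute collection as shown below. The `instrumentation_enabled()`
helper is hypothetical; OpenTelemetry does not define such an API today.

```python
from opentelemetry import trace

tracer = trace.get_tracer("example-http-instrumentation")


def instrumentation_enabled() -> bool:
    # Hypothetical hint discussed above; a real implementation would check
    # that an SDK is installed and configured rather than the default
    # no-op provider.
    return True


def handle_request(request):
    attributes = {}
    if instrumentation_enabled():
        # Only pay the cost of collecting pre-sampling attributes when
        # telemetry is actually going to be recorded.
        attributes = {"http.method": request.method, "http.url": str(request.url)}
    with tracer.start_as_current_span(
        "HTTP GET", kind=trace.SpanKind.SERVER, attributes=attributes
    ):
        ...  # handle the request
```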

### Long-polling and streaming

Are there any specifics for these scenarios, e.g. from a span duration or status
code perspective? How to model multiple requests within the same logical
session?

### HTTP/2, gRPC, WebSockets

Anything we can do better here? In many cases the connection has application
lifetime and messages are independent - can we explain to users how to do manual
tracing for individual messages? Do span events per message make sense at all?
We need some real-life experience/expertise here.

### Request/Response body capturing

> NOTE. This is technically out-of-scope, but we should have an idea how to let
> users do it.

There is a lot of user feedback that users want it, but:

* We can't read the body in generic instrumentation.
* We can let users collect it themselves.
* Attaching it to the server span is trivial.
* Spec for the client: we should have an approach that lets users unambiguously
  associate the body with the HTTP client span (e.g. an outer manual span that
  wraps the HTTP call and response reading and has an event/log with the body).
* Reading/writing the body may happen outside of the HTTP client API (e.g. through
  network streams) – how can users track that too?

Related issue: [open-telemetry/opentelemetry-specification#1284](https://github.com/open-telemetry/opentelemetry-specification/issues/1284).

## Out of scope

The HTTP protocol is widely used within many different platforms and systems,
which brings a lot of intersections with the transmission protocol layer and the
application layer. However, for the HTTP Semantic Conventions specification we want
to be strictly focused on HTTP-specific aspects of distributed tracing to keep
the specification clear. Therefore, the following scenarios, including but not
limited to, are considered out of scope for this workgroup:

* Batch operations.
* Fan-in and fan-out operations (e.g., GraphQL).
* Hedging policies. Hedging enables aggressively sending multiple copies of a
  single request without waiting for a response. Hedged RPCs may be executed
  multiple times on the server side, typically by different backends.
* HTTP as a transport layer for other systems (e.g., a messaging system built on
  top of HTTP).

To address these scenarios, we might want to work with the OpenTelemetry community
to build instrumentation guidelines going forward.

## General OpenTelemetry open questions

Several general OpenTelemetry open questions exist today that will most likely
affect the way the scenarios and open questions above are addressed:

* What does a config language look like for overriding certain defaults?
  For example, which HTTP status codes count as errors?
* How to handle additional levels of detail for spans, such as retries and
  redirects?
  Should it even be designed as levels of detail or as layers reflecting logical
  or physical interactions/transactions?
* What is the data model for links? What would a reasonable storage
  implementation look like?
diff --git a/oteps/trace/0205-messaging-semantic-conventions-context-propagation.md b/oteps/trace/0205-messaging-semantic-conventions-context-propagation.md new file mode 100644 index 00000000000..4a18ff9d5a9 --- /dev/null +++ b/oteps/trace/0205-messaging-semantic-conventions-context-propagation.md @@ -0,0 +1,177 @@ +# Context propagation requirements for messaging semantic conventions + +The [existing messaging semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.11.0/specification/trace/semantic_conventions/messaging.md) +implicitly impose certain requirements on context propagation mechanisms used. +This document proposes a way to make these requirements explicit. + +This OTEP is based on [OTEP 0173](0173-messaging-semantic-conventions.md), +which defines basic terms and describes messaging scenarios that should be +supported by messaging semantic conventions. + +* [Terminology](#terminology) +* [Motivation](#motivation) + - [Example](#example) +* [Proposed addition to the messaging semantic conventions](#proposed-addition-to-the-messaging-semantic-conventions) + - [Context propagation](#context-propagation) + - [Requirements](#requirements) +* [Future possibilities](#future-possibilities) + - [Transport context propagation](#transport-context-propagation) + - [Standards for context propagation](#standards-for-context-propagation) + +## Terminology + +For terms used in this document, refer to [OTEP 173](0173-messaging-semantic-conventions.md#terminology). + +## Motivation + +The current [messaging semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.11.0/specification/trace/semantic_conventions/messaging.md) +provide a list of [examples](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.11.0/specification/trace/semantic_conventions/messaging.md#examples). +Those examples illustrate how producer and consumer spans can be correlated by +parent/child relationships or links. All the examples assume that context +information for a given message is propagated from the producer to the +consumer. + +However, this is not a trivial assumption, and it is not easily accommodated by +existing established context propagation mechanisms. Those mechanisms propagate +context on a per-request basis, whereas the messaging semantic conventions +assume that context is propagated on a per-message basis. This means, that +although several requests might be involved in the processing of a single +message (publishing a message, fetching a message, potentially multiple times by +multiple consumers), it is assumed that all components have access to the same +per-message context information that allows correlating all the stages of +processing a message. + +To achieve this desired outcome, a context needs to be attached to a message, +and intermediaries must not alter the context attached to the message. _This +requirement should be documented, as it is an important factor in deciding how +to propagate context for message scenarios and how to standardize context +propagation for existing message protocols._ + +The additions proposed in this document neither break nor invalidate any of +the existing semantic conventions for messaging, but rather make an implicit +requirement explicit. + +### Example + +Many intermediaries (message brokers) offer REST APIs for publishing and +fetching messages. 
A producer can publish a message by sending an HTTP request +to the intermediary and a consumer can pull a message by sending an HTTP request +to the intermediary: + +``` + +----------+ + | Producer | + +----------+ + | + | HTTP POST (publishing a message) + v ++--------------+ +| Intermediary | ++--------------+ + ^ + | HTTP GET (fetching a message) + | + +----------+ + | Consumer | + +----------+ +``` + +Existing semantic conventions suppose that the consumer can use context +information from the producer trace to create links or parent/child +relationships between consumer and producer traces. For this to be possible, +context information from the producer needs to be propagated to the consumer. +In the example outlined above, the consumers sends an HTTP GET request to the +intermediary to fetch a message, the message is returned as part of the +response. Via this HTTP request, context information can be propagated from the +consumer to the intermediary, but not from the intermediary to the consumer. +The consumer can obtain the necessary producer context information only if it +is propagated as part of the message itself, independent of HTTP context +propagation. + +For correlating producer and consumer traces without special intermediary +instrumentation it is thus necessary to attach a producer context to the +message so it can be extracted and used by the consumer, regardless of the +contexts that are propagated on HTTP requests for publishing and fetching the +message. + +Although OpenTelemetry semantic conventions cannot specify the exact mechanisms +to achieve this for every intermediary and every protocol, this requirement +must be clearly formulated, so that it can be implemented by protocols and +instrumentations. + +## Proposed addition to the messaging semantic conventions + +### Context propagation + +A message may pass many different components and layers in one or more +intermediaries when it is propagated from the producer to the consumer. It +cannot be assumed, and in many cases, it is not even desired, that all those +components and layers are instrumented and propagate context according to +OpenTelemetry requirements. + +A _message creation context_ allows correlating the producer with the +consumer(s) of a message, regardless of intermediary instrumentation. The +message creation context is created by the producer and should be propagated to +the consumer(s). It should not be altered by intermediaries. This context +helps to model dependencies between producers and consumers, regardless of the +underlying messaging transport mechanism and its instrumentation. + +Instrumentors are required to instrument producer and consumer applications +so that context is attached to messages and extracted from messages in a +coordinated way. Future versions of these conventions might recommend [context propagation according to certain industry standards](#standards-for-context-propagation). +If the message creation context cannot be attached to the message and +propagated, consumer traces cannot be directly correlated to producer traces. + +### Requirements + +A producer SHOULD attach a message creation context to each message. The message creation context +SHOULD be attached in a way so that it is not possible to be changed by intermediaries. + +## Future possibilities + +### Transport context propagation + +A message creation context can be attached to a message, while different +contexts are propagated with requests that publish and fetch a message. 
When +coming up with conventions and guidance for intermediary instrumentation, it +will be beneficial to clearly outline those two layers of context propagation +and build conventions for intermediary instrumentation on top of this outline: + +1. The _message context layer_ allows correlating the producer with the + consumers of a message, regardless of intermediary instrumentation. The + creation context is created by the producer and must be propagated to the + consumers. It must not be altered by intermediaries. + + This layer helps to model dependencies between producers and consumers, + regardless of the underlying messaging transport mechanism and its + instrumentation. +2. An additional _transport context layer_ allows correlating the producer and + the consumer with an intermediary. It also allows correlating multiple + intermediaries among each other. The transport context can be changed by + intermediaries, according to intermediary instrumentations. + + This layer helps to gain insights into details of the message transport. + +This would keep the existing correlation between producers and consumers intact +while allowing intermediaries to use the transport context to correlate +intermediary instrumentation with existing producer and consumer +instrumentations. + +### Standards for context propagation + +Currently, instrumentation authors have to decide how to attach and extract +context from messages to fulfil the [requirements for context propagation](#context-propagation). +While preserving the freedom for instrumentation authors to choose how to +propagate context, in the future these conventions should list recommended ways +of how to propagate context using well-established messaging protocols. + +There are several work-in-progress efforts to standardize context propagation for different +messaging protocols and scenarios: + +* [AMQP](https://w3c.github.io/trace-context-amqp/) +* [MQTT](https://w3c.github.io/trace-context-mqtt/) +* [CloudEvents via HTTP](https://github.com/cloudevents/spec/blob/v1.0.1/extensions/distributed-tracing.md) + +Once the standards reach a stable state and define how the message creation +context and transport context are represented, these semantic conventions will +give a clear and stable recommendation for each protocol and scenario. diff --git a/oteps/trace/0220-messaging-semantic-conventions-span-structure.md b/oteps/trace/0220-messaging-semantic-conventions-span-structure.md new file mode 100644 index 00000000000..48a1c2087f2 --- /dev/null +++ b/oteps/trace/0220-messaging-semantic-conventions-span-structure.md @@ -0,0 +1,679 @@ +# Span structure for messaging scenarios + +This OTEP aims at defining consistent conventions about what spans to create +for messaging scenarios, and at defining how those spans relate to each other. + +This OTEP is based on [OTEP 0173](0173-messaging-semantic-conventions.md), +which defines basic terms and describes messaging scenarios that should be +supported by messaging semantic conventions. It also relies on context +propagation requirements put forth in the existing [semantic conventions](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#context-propagation) +and detailed in [OTEP 0205](0205-messaging-semantic-conventions-context-propagation.md). 
+ +* [Terminology](#terminology) +* [Motivation](#motivation) +* [Stages of producing and consuming messages](#stages-of-producing-and-consuming-messages) +* [Trace structure](#trace-structure) +* [Proposed changes and additions to the messaging semantic conventions](#proposed-changes-and-additions-to-the-messaging-semantic-conventions) + - [Operation name](#operation-name) + - [Span kind](#span-kind) + - [Span relationships](#span-relationships) +* [Open issues](#open-issues) +* [Future possibilities](#future-possibilities) + - [Intermediary instrumentation](#intermediary-instrumentation) +* [Prior art](#prior-art) +* [Examples](#examples) + +## Terminology + +For terms used in this document, refer to [OTEP 173](0173-messaging-semantic-conventions.md#terminology). + +## Motivation + +Tracking the path of an individual message through a distributed system poses +several challenges. Messaging systems allow for asynchronous workflows, which +means that the stages of producing and consuming a message can be separated by +a considerable time gap (this can be minutes, hours, or days). Furthermore, one +cannot rely on consistent instrumentation across all parts of the system that +touch a message. Correlating producer and consumer stages are expected even when +the intermediary forwarding the message between them is not instrumented. +Finally, batching of messages can happen in many different parts of a message +processing workflow, be it batch publishing, batch receiving, batch processing, +or batch settling. + +Despite all those challenges, requirements for instrumentation of messaging +scenarios are high. Besides correlating spans that model the different +processing stages of a message, it should also be possible to determine the +end-to-end latency of processing a message. If intermediaries are not +instrumented, this shouldn't impact the correlation of producer and consumer +stages. If, on the other hand, intermediaries are instrumented, spans from +intermediary instrumentation should seamlessly integrate with producer and +consumer instrumentation. This integration should not require any changes in +producer or consumer instrumentation, and it should not cause any changes to +the relationships of producer and consumer spans. Furthermore, it should be +possible to provide tracing instrumentation as an out-of-the-box experience +from messaging SDKs, without requiring any additional custom instrumentation +from the user. + +This OTEP aims at proposing consistent guidelines for creating spans that model +the stages of the messaging flow, and correlating those in a way so that the +requirements sketched above can be met in a consistent way across messaging +scenarios and different messaging systems. + +## Stages of producing and consuming messages + +As previously described in [OTEP 173](https://github.com/open-telemetry/oteps/blob/main/text/trace/0173-messaging-semantic-conventions.md#scenarios.), +producing and consuming a message involves five stages: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + CR[Create] --> PU[Publish] + end + subgraph INTERMEDIARY + direction TB + end + subgraph CONSUMER + direction TB + RE[Receive] --> PR[Process] + PR --> SE[Settle] + end + PU --> INTERMEDIARY + INTERMEDIARY --> RE + SE -..- INTERMEDIARY +``` + +1. The producer creates a message. +2. The producer publishes the message to an intermediary. +3. The consumer receives the message from an intermediary. +4. The consumer processes the message. +5. 
The consumer settles the message by notifying the intermediary that the + message was processed. In some cases (fire-and-forget scenarios, or when + settlement happens on the broker), the settlement stage does not exist. + +The semantic conventions below define how to model those stages with spans. + +## Trace structure + +### Producer + +Producers are responsible for injecting a creation context into a message. +Subsequent consumers will use this context to link consumer traces to producer +traces. Ideally, each message gets a unique and distinct creation context +assigned. + +However, as a context must refer to a span this would require the +creation of a distinct span for each message, which is not feasible in all +scenarios. In certain batching scenarios where many messages are created and +published in large batches, creating a span for every single message would +obfuscate traces and is not desirable. Thus instrumentation libraries and +auto-instrumentation should default to creating a unique and distinct context +per message, but may support configuration or other ways to change this default +behavior. The latter can help to reduce the number of spans and to avoid overly +verbose traces. + +For each producer scenario, a "Publish" span needs to be created. This span +measures the duration of the call or operation that provides messages for +sending or publishing to an intermediary. This call or operation (and the +related "Publish" span) can either refer to a single message or to a batch of +multiple messages. + +There are four different scenarios for injecting a creation context into a message: + +1. A user provides custom creation contexts for the messages that are published, which don't refer + to any spans described in this document. This provides flexibility to users + to model custom scenarios. In this case, no other additional spans besides the "Publish" + span should be created. The "Publish" span should link to the provided + creation contexts. +2. If no custom creation context is provided for a message, it is recommended + to create a "Create" span for every single message and inject its context + into the message. "Create" spans can be created during the "Publish" operation + as children of the "Publish" span. +3. As a variation of the scenario above, "Create" spans can be created + independently of the "Publish" operation, e. g. in cases where messages are + created before they are passed to a "Publish" operation. In this case, the + "Publish" span should link to the "Create" spans. +4. For single-message scenarios or when large number of spans are a problem, + the context of the "Publish" span can be injected into the message, thus + acting as the creation context. In this case, no other spans besides the + "Publish" span should be created. + +### Consumer + +Existing semantic conventions [prescribe the use of "Process" spans](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#span-name) +for correlating producer with consumer traces. However, for many use cases, it +is not possible to rely on the presence of "Process" spans: there are cases +where a dedicated processing operation cannot be identified, or where +processing happens in a different trace than receiving or delivering. +Furthermore, processing operations are not often covered by messaging libraries +and SDKs, but take place in application code. 
Consistently creating spans for +"Processing" operations would require either effort from the application owner +to correctly instrument those operations, or additional capabilities of +messaging libraries and SDKs (e. g. hooks for processing callbacks, which can +then be instrumented by the libraries or SDKs). + +While it is possible to create "Process" spans and correlate those with +consumer traces in certain cases, this is not something that can be generally +required. Therefore, it is more feasible to require the creation of "Deliver" +spans (for push-based APIs) or "Receive" spans (for pull-based APIs) to +correlate producer with consumer traces. + +#### Instrumenting push-based scenarios + +In push-based consumer scenarios, the delivery of messages is not initiated by +the application code. Instead, callbacks or handlers are registered and then +called by messaging SDKs to forward messages to the application. + +A "Deliver" span covers the call of such a callback or handler and should link +to the creation context of all messages that are forwarded via the respective +call. + +#### Instrumenting pull-based scenarios + +In pull-based consumer scenarios, the message is actively requested by the +application code. This usually involves a blocking call, which returns zero or +more messages on completion. + +A "Receive" span covers such calls and should link to the creation context of +all messages that are forwarded via the respective call. To achieve this in an +idiomatic manner, it must be possible to add links after span creation, which +is not currently supported (see [open issues](#open-issues)). + +#### General considerations for both push-based and pull-based scenarios + +The operations modelled by "Deliver" or "Receive" spans do not strictly refer +to receiving the message from intermediaries, but instead refer to the +application receiving messages for processing. If messages are fetched from the +intermediary and forwarded to the application in one go, the whole operation +might be covered by a "Deliver" or "Receive" span. However, libraries or SDKs +might pre-fetch messages from intermediaries and cache those messages, and only +forward messages to the application at a later time. In this case, the +operation of pre-fetching and caching should not be covered by the "Deliver" or +"Receive" spans. + +Operations covered by "Deliver" or "Receive" can forward zero messages (e. g. +to notify the application that no message is available for processing), one +message, or multiple messages (a batch of messages). "Deliver" and "Receive" +spans should link to the creation context of the messages forwarded, thus those +spans can link to zero, one, or multiple producer spans. + +For single-message scenarios, and if the "Deliver" or "Receive" spans would be +root spans of a new trace, the creation context may also be used as a parent on +those operations in addition to being added as a link. Keeping single-messages +operations in the same trace can greatly improve the user experience. + +#### Settlement + +Messages can be settled in a variety of different ways: + +* The intermediary settles the messages as it is sent to the consumer. No + settlement operations happen on the consumer. +* The consumer settles a message without awaiting an acknowledgment from the + intermediary. +* The consumer settles a message and awaits an acknowledgment from the + intermediary. This involves a round-trip exchange between the consumer and + the intermediary. 
+ +Settlement operations on the consumer can either be triggered manually by the +user, or can be triggered automatically by messaging SDKs based on return +values of callbacks. + +A "Settle" span should be created for every settlement operation that happens +on the consumer (at-least-once and exactly-once). SDKs will, in some cases, +auto-settle messages in push-scenarios when messages are delivered via +callbacks. + +"Settle" spans should link to creation context of the messages that are +settled, when possible. + +No settlement span should be created for settlement scenarios that do not +involve any settlement operations on the consumer side. + +## Proposed changes and additions to the messaging semantic conventions + +This section contains a concise and normative definition of what was outlined +in the [Trace structure](#trace-structure) section. The following subsections +are supposed to be merged into the semantic conventions, whereas the detailed +description in the [Trace structure](#trace-structure) section in this OTEP will +serve as explanation and future reference. + +### Operation name + +The following operations related to messages are covered by these semantic +conventions: + +| Operation name | Description | +|----------------|-------------| +| `publish` | One or more messages are provided for publishing to an intermediary. | +| `create` | A message is created. | +| `receive` | One or more messages are requested by a consumer. | +| `deliver` | One or more messages are passed to a consumer. | +| `settle` | One or more messages are settled. | + +For further details about each of those operations refer to the [section about trace structure](#trace-structure). + +### Span kind + +[Span kinds](../../specification/trace/api.md#spankind) +SHOULD be set according to the following table, based on the operation a span describes. + +| Operation name | Span kind| +|----------------|-------------| +| `publish` | `PRODUCER`, if no `create` spans are present. | +| `create` | `PRODUCER` | +| `receive` | `CONSUMER` | +| `deliver` | `CONSUMER` | +| `settle` | (see below) | + +The kind of `settle` spans should be set according to the [generic specification about span kinds](../../specification/trace/api.md#spankind), +e. g. it should be set to `CLIENT` if the `settle` spans models a synchronous call +to the intermediary. + +Setting span kinds according to this table ensures that span links between +consumers and producers always exist between a `PRODUCER` span on the producer +side and a `CONSUMER` span on the consumer side. This allows analysis tools to +interpret linked traces without the need for additional semantic hints. + +### Span relationships + +#### Producer spans + +"Publish" spans SHOULD be created for operations of providing messages for +sending or publishing to an intermediary. A single "Publish" span can account +for a single message, or for multiple messages (in the case of providing +messages in batches). "Create" spans MAY be created. A single "Create" span +SHOULD account only for a single message. "Create" spans SHOULD either be +children or links of the related "Publish" span. + +If a "Create" span exists for a message, its context SHOULD be injected into +the message. If no "Create" span exists, the context of the related "Publish" +span SHOULD be injected into the message. 
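
As a non-normative illustration of the rule above (the case where no "Create"
span exists and the "Publish" span's context is injected into the message), the
following sketch uses the OpenTelemetry Python API; `producer_client`, `topic`,
and `message.headers` are assumed stand-ins for a concrete messaging SDK.

```python
from opentelemetry import propagate, trace

tracer = trace.get_tracer("example-messaging-instrumentation")


def publish(producer_client, topic, message):
    # Single-message case: the "Publish" span's context acts as the creation
    # context, so no separate "Create" span is started.
    with tracer.start_as_current_span(
        f"{topic} publish", kind=trace.SpanKind.PRODUCER
    ):
        # Inject the current context into the message headers so that
        # consumers can later link to it.
        propagate.inject(message.headers)
        producer_client.send(topic, message)
```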
+
+#### Consumer spans
+
+##### Push-based scenarios
+
+"Deliver" spans SHOULD be created for operations of passing messages to the
+application when those operations are not initiated by the application
+code.
+
+##### Pull-based scenarios
+
+"Receive" spans SHOULD be created for operations of passing messages to the
+application when those operations are initiated by the application code.
+
+##### General considerations
+
+"Deliver" or "Receive" spans MUST NOT be created for messages which are not
+forwarded to the caller, but are pre-fetched or cached by messaging
+libraries or SDKs.
+
+A single "Deliver" or "Receive" span can account for a single message, for
+multiple messages (in case messages are passed for processing as batches), or
+for no message at all (if it is signalled that no messages were received). For
+each message it accounts for, the "Deliver" or "Receive" span SHOULD link to
+the message's creation context. In addition, if possible, the creation
+context MAY be set as a parent of the "Deliver" or "Receive" span.
+
+#### Settlement spans
+
+"Settle" spans SHOULD be created for every manually or automatically triggered
+settlement operation. A single "Settle" span can account for a single message
+or for multiple messages (in case messages are passed for settling as batches).
+For each message it accounts for, the "Settle" span MAY link to the creation
+context of the message.
+
+## Open issues
+
+Fully integrating the changes proposed in this document into the messaging
+semantic conventions requires some additions and clarifications in the
+specification, which are listed in this section:
+
+* [open-telemetry/opentelemetry-specification#454](https://github.com/open-telemetry/opentelemetry-specification/issues/454)
+  To instrument pull-based "Receive" operations as described in this document,
+  it is necessary to add links to spans after those spans were created. The
+  reason is that not all messages are present at the start of a "Receive"
+  operation, so links to the related contexts cannot be added at the start
+  of the span.
+* [open-telemetry/opentelemetry-specification#2176](https://github.com/open-telemetry/opentelemetry-specification/issues/2176)
+  When consuming a message with no attached creation context as part of a
+  batch, it would still be useful to capture related message-specific
+  attributes as part of a link which points to an invalid context. However,
+  according to the specification, links pointing to an invalid context may be
+  ignored. To consistently report message-specific attributes on links, links
+  to invalid contexts should be allowed and supported.
+* [open-telemetry/opentelemetry-specification#3172](https://github.com/open-telemetry/opentelemetry-specification/issues/3172)
+  Currently, the specification is unclear about whether relationships between
+  producer and consumer spans can be modelled via links; the wording suggests
+  that it should be a parent/child relationship. The wording in the specification
+  needs to make it clear that this can be a link too.
+* This OTEP allows the creation of parent/child relationships between producer
+  and consumer spans in addition to the required creation of links. However,
+  in some instances, adding this parent/child relationship might lead to
+  undesired consequences, e.g. very large traces in scenarios where batches are
+  published. Some further attention needs to be paid to those scenarios when
+  the changes proposed in this OTEP are merged into the semantic conventions.
+
+## Future possibilities
+
+### Intermediary instrumentation
+
+While intermediary instrumentation is not directly covered by the conventions
+in this document, it is certainly necessary to keep the proposed conventions
+extensible so that intermediary instrumentation can be easily added in a way
+that integrates well with producer and consumer instrumentation.
+
+The diagram below gives an example of how intermediary instrumentation can be
+added. The fact that producers and consumers are consistently correlated by
+links across all scenarios provides maximal flexibility for adding intermediary
+instrumentation.
+
+```mermaid
+flowchart LR;
+  subgraph PRODUCER
+    direction TB
+    PM1[Publish m1]
+    PM2[Publish m2]
+  end
+  subgraph CONSUMER
+    direction TB
+    D[Deliver]-.-PRM1[Process m1]
+    D-.-PRM2[Process m2]
+  end
+  PM1-. link .-D;
+  PM2-. link .-D;
+  PM1-- parent -->INTERMEDIARY;
+  PM2-- parent -->INTERMEDIARY;
+  INTERMEDIARY-- parent -->D;
+
+  classDef normal fill:green
+  class PM1,PM2,D normal
+  classDef additional opacity:0.4
+  class INTERMEDIARY,PRM1,PRM2 additional
+  linkStyle 0,1,4,5,6 opacity:0.4
+  linkStyle 2,3 color:green,stroke:green
+```
+
+### Instrumentation of "Process" operations
+
+This OTEP focuses on a consistent set of conventions that can be applied across
+all messaging scenarios, which in one form or another cover "Publish" and/or
+"Create", "Deliver" or "Receive", and "Settle" operations. Those operations
+share common characteristics across all messaging scenarios.
+
+Characteristics of "Process" operations, on the other hand, vary considerably
+across messaging scenarios. Furthermore, it is often hard or even impossible to
+provide auto-instrumentation for such operations. For those reasons,
+conventions for "Process" operations were declared out of scope for this
+OTEP.
+
+However, many parties expressed interest in also achieving some consistency
+for the instrumentation of "Process" operations. Therefore,
+[#3395](https://github.com/open-telemetry/opentelemetry-specification/issues/3395)
+covers the effort to define conventions for "Process" operations, which will
+build on the foundation that this OTEP lays.
+
+## Prior art
+
+The existing semantic conventions for messaging contain a [list of examples](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/messaging/messaging-spans.md#examples),
+each specifying the spans with their attributes and relationships that should
+be created for a given messaging scenario.
+
+Many users writing instrumentation for messaging systems expressed confusion
+about those examples. The relationships between spans defined in the examples
+don't follow a well-documented and consistent pattern, which creates confusion
+for users whose use cases don't fit any of the given examples. Instrumentors
+should be able to rely on a consistent set of conventions, as opposed to
+deducing conventions from a set of examples.
+
+## Examples
+
+This section contains a list of examples illustrating the use of the
+conventions outlined above. Green boxes denote spans that are required to
+exist in order to conform to those conventions. Other boxes denote spans that
+are neither required nor covered by the conventions, but are hopefully helpful
+in understanding how messaging spans can be integrated into an overall trace
+flow. Solid arrows denote parent/child relationships; dotted arrows denote
+link relationships.
+ +### Push-based scenarios + +A producer creates and publishes a single message, the single message is delivered to a consumer: + +```mermaid +flowchart LR; + subgraph PRODUCER + PM1[Publish m1] + end + subgraph CONSUMER + DM1[Deliver m1] + end + PM1-. link .-DM1; + + classDef normal fill:green + class PM1,DM1 normal + linkStyle 0 color:green,stroke:green +``` + +When consuming a single message, the "Deliver" spans can be parented to the creation context: + +```mermaid +flowchart LR; + subgraph PRODUCER + PM1[Publish m1] + end + subgraph CONSUMER + DM1[Deliver m1] + end + PM1-. link .-DM1; + PM1-- parent -->DM1; + + classDef normal fill:green + class PM1,DM1 normal + linkStyle 0,1 color:green,stroke:green +``` + +It is recommended to add spans for settlement operations on the consumer side. +Those spans can either be created manually or via auto-instrumentation: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + PM1[Publish m1] + end + subgraph CONSUMER + direction TB + DM1[Deliver m1]-->S1[Settle m1] + end + PM1-. link .-DM1; + PM1-. link .-S1; + + classDef normal fill:green + class PM1,DM1,S1 normal + linkStyle 0,1,2 color:green,stroke:green +``` + +A producer publishes a batch of messages, single messages are delivered to +consumers. "Create" spans are created as part of the "Publish" operation: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + P[Publish]-->C1[Create m1] + P-->C2[Create m2] + end + subgraph CONSUMER + direction TB + DM1[Deliver m1] + DM2[Deliver m2] + end + C1-. link .-DM1; + C2-. link .-DM2; + + classDef normal fill:green + class P,C1,C2,DM1,DM2 normal + linkStyle 0,1,2,3 color:green,stroke:green +``` + +When consuming a single message, the "Deliver" spans can be parented to the creation context: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + P[Publish]-->C1[Create m1] + P-->C2[Create m2] + end + subgraph CONSUMER + direction TB + DM1[Deliver m1] + DM2[Deliver m2] + end + C1-. link .-DM1; + C2-. link .-DM2; + C1-- parent -->DM1; + C2-- parent -->DM2; + + classDef normal fill:green + class P,C1,C2,DM1,DM2 normal + linkStyle 0,1,2,3,4,5 color:green,stroke:green +``` + +A producer creates and publishes a single message, it is delivered as part of a +batch of messages to a consumer: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + PM1[Publish m1] + PM2[Publish m2] + end + subgraph CONSUMER + direction TB + D[Deliver]-.-PRM1[Process m1] + D-.-PRM2[Process m2] + end + PM1-. link .-D; + PM2-. link .-D; + + classDef normal fill:green + class PM1,PM2,D normal + classDef additional opacity:0.4 + class PRM1,PRM2 additional + linkStyle 0,1 opacity:0.4 + linkStyle 2,3 color:green,stroke:green +``` + +### Pull-based scenarios + +A producer creates and publishes a single message, the single message is +delivered to a consumer. "Create" spans are created independently of the +"Publish" operation: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + A[Ambient]-- parent -->CM1[Create m1] + A-- parent -->CM2[Create m2] + A-- parent -->P[Publish] + end + subgraph CONSUMER + direction TB + RM1[Receive m1] + RM2[Receive m2] + end + CM1-. link .-RM1; + CM2-. link .-RM2; + CM1-. link .-P; + CM2-. 
link .-P; + + classDef normal fill:green + class CM1,CM2,P,RM1,RM2 normal + classDef additional opacity:0.4 + class A additional + linkStyle 0,1,2 opacity:0.4 + linkStyle 3,4,5,6 color:green,stroke:green +``` + +"Create" spans are created as part of the "Publish" operation: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + P[Publish]-- parent -->CM1[Create m1] + P-- parent -->CM2[Create m2] + end + subgraph CONSUMER + direction TB + RM1[Receive m1] + RM2[Receive m2] + end + CM1-. link .-RM1; + CM2-. link .-RM2; + + classDef normal fill:green + class P,CM1,CM2,RM1,RM2 normal + linkStyle 0,1,2,3 color:green,stroke:green +``` + +A producer creates and publishes a single message, it is delivered as part of a +batch of messages to a consumer. "Process" spans for single messages can be +created, but are not covered by these conventions: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + PM1[Publish m1] + PM2[Publish m2] + end + subgraph CONSUMER + direction TB + A[Ambient]-- parent -->R[Receive] + A-.-PRM1[Process m1] + A-.-PRM2[Process m2] + end + PM1-. link .-R; + PM2-. link .-R; + + classDef normal fill:green + class PM1,PM2,R normal + classDef additional opacity:0.4 + class A,PRM1,PRM2 additional + linkStyle 0,1,2 opacity:0.4 + linkStyle 3,4 color:green,stroke:green +``` + +It is recommended to add spans for settlement operations. Those spans can +either be created manually or via auto-instrumentation: + +```mermaid +flowchart LR; + subgraph PRODUCER + direction TB + PM1[Publish m1] + PM2[Publish m2] + end + subgraph CONSUMER + direction TB + A[Ambient]-- parent -->R[Receive] + A-- parent -->SM1[Settle m1] + A-- parent -->SM2[Settle m2] + end + PM1-. link .-R; + PM2-. link .-R; + PM1-. link .-SM1; + PM2-. link .-SM2; + + classDef normal fill:green + class PM1,PM2,SM1,SM2,R normal + classDef additional opacity:0.4 + class A additional + linkStyle 0,1,2 opacity:0.4 + linkStyle 3,4,5,6 color:green,stroke:green +``` diff --git a/oteps/trace/0235-sampling-threshold-in-trace-state.md b/oteps/trace/0235-sampling-threshold-in-trace-state.md new file mode 100644 index 00000000000..cec19441f6f --- /dev/null +++ b/oteps/trace/0235-sampling-threshold-in-trace-state.md @@ -0,0 +1,184 @@ +# Sampling Threshold Propagation in TraceState + +## Motivation + +Sampling is a broad topic; here it refers to the independent decisions made at points in a distributed tracing system of whether to collect a span or not. Multiple sampling decisions can be made before a span is finally consumed. When sampling is to be performed at multiple points in the process, the only way to reason about it effectively is to make sure that the sampling decisions are **consistent**. +In this context, consistency means that a positive sampling decision made for a particular span with probability p1 implies a positive sampling decision for any span belonging to the same trace, if it is made with probability p2 >= p1. + +## Explanation + +The existing, experimental [specification for probability sampling using TraceState](../../specification/trace/tracestate-probability-sampling.md) is limited to powers-of-two probabilities, and is designed to work without making assumptions about TraceID randomness. +This system can only achieve non-power-of-two sampling using interpolation between powers of two, which is unnecessarily restrictive. +In existing sampling systems, sampling probabilities like 1%, 10%, and 75% are common, and it should be possible to express these without interpolation. 
+There is also a need for consistent sampling in the collection path (outside of the head-sampling paths), and using the inherent randomness in the trace ID is a less expensive solution than referencing a custom `r-value` from the tracestate in every span.
+This proposal introduces a new value with the key `th` as an alternative to the `p` value in the previous specification.
+The `p` value is limited to powers of two, while the `th` value in this proposal supports a large range of values.
+This proposal still allows randomness to be expressed explicitly, as the previous specification does with its `r` key; to distinguish the two cases, this proposal uses the key `rv`.
+
+In the general case, in order to make consistent sampling decisions across the entire path of the trace, two values MUST be present in the `SpanContext`:
+
+1. A _random_ (or pseudo-random) 56-bit value, called `R` below.
+2. A 56-bit _rejection threshold_ (or just "threshold") as expressed in the TraceState, called `T` below. `T` represents the maximum threshold that was applied in all previous consistent sampling stages. If the current sampling stage applies a greater-valued threshold than any stage before, it MUST update (increase) the threshold correspondingly.
+
+One way to think about the _rejection threshold_ is that it is the number of spans that would be discarded out of 2^56 considered spans. This means that spans where `R >= T` will be sampled.
+
+Here is an example involving three participants `A`, `B`, and `C`:
+
+`A` -> `B` -> `C`
+
+where -> indicates a parent -> child relationship.
+
+`A` uses consistent probability sampling with a sampling probability of 0.25 (this corresponds to a rejection probability of 0.75).
+`B` uses consistent probability sampling with a sampling probability of 0.5.
+`C` uses a parent-based sampler.
+
+When `A` samples a span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0xc0_0000_0000_0000`.
+When `A` does not sample a span, its outgoing traceparent will have the 'sampled' flag UNSET but the 'th' in its outgoing tracestate will still be set to `0xc0_0000_0000_0000`.
+When `B` samples a span, its outgoing traceparent will have the 'sampled' flag SET and the 'th' in its outgoing tracestate will be set to `0x80_0000_0000_0000`.
+Since `C` is a parent-based sampler, it samples a span purely based on its parent (`B` in this case), using the sampled flag to make the decision. Its outgoing 'th' value will continue to reflect what it got from `B` (`0x80_0000_0000_0000`), which is useful for understanding its adjusted count.
+
+This design requires that as a given span progresses along its collection path, `th` is non-decreasing (and, in particular, must be increased at stages that apply lower sampling probabilities).
+It does not, however, restrict a span's initial `th` in any way (e.g., relating it to that of its parent, if it has one).
+It is acceptable for `B` to have a lesser initial `th` than `A` has. It would not be acceptable for a later-stage sampler to decrease `A`'s `th`.
+
+The system has the following invariant:
+
+`(R >= T) = sampled flag`
+
+The sampling decision is propagated with the following algorithm (sketched in code below):
+
+* If the `th` key is not specified, this implies that non-probabilistic sampling may be taking place.
+* Else derive `T` by parsing the `th` key as a hex value as described below.
+* If `T` is 0, Always Sample.
+* Compare the 56 bits of `T` with the 56 bits of `R`. If `T > R`, then do not sample.
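+
+A non-normative sketch of this algorithm in Python (following the style of the
+[Algorithms](#algorithms) section below; the already-parsed `tracestate`
+dictionary and the 56-bit integer `r` are assumed inputs):
+
+```py
+def should_sample(tracestate, r):
+    tvalue = tracestate.get("th")
+    if tvalue is None:
+        return None  # no 'th': non-probabilistic sampling may be taking place
+    # pad to 14 hex digits and parse as the 56-bit rejection threshold T
+    t = int((tvalue + "00000000000000")[:14], 16)
+    if t == 0:
+        return True  # zero rejection threshold: Always Sample
+    return r >= t  # do not sample if T > R
+```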
+
+The `R` value MUST be derived as follows:
+
+* If the key `rv` is present in the Tracestate header, then `R = rv`.
+* Else if the Random Trace ID Flag is `true` in the traceparent header, then `R` is the lowest-order 56 bits of the trace-id.
+* Else `R` MUST be generated as a random value in the range `[0, (2**56)-1]` and added to the Tracestate header with key `rv`.
+
+The preferred way to propagate the `R` value is as the lowest 56 bits of the trace-id.
+If these bits are in fact random, the `random` trace-flag SHOULD be set as specified in [the W3C trace context specification](https://w3c.github.io/trace-context/#trace-id).
+There are circumstances where trace-id randomness is inadequate (for example, sampling a group of traces together); in these cases, an `rv` value is required.
+
+The value of the `rv` and `th` keys MUST be expressed as up to 14 hexadecimal digits from the set `[0-9a-f]`. For `th` keys only, trailing zeros (but not leading zeros) may be omitted. `rv` keys MUST always be exactly 14 hex digits.
+
+Examples:
+
+- `th` value is missing: non-probabilistic sampling may be taking place.
+- `th=4` -- equivalent to `th=40000000000000`, which is a 25% rejection threshold, corresponding to a 75% sampling probability.
+- `th=c` -- equivalent to `th=c0000000000000`, which is a rejection threshold of 75%, corresponding to a sampling probability of 25%.
+- `th=08` -- equivalent to `th=08000000000000`, which is a rejection threshold of 3.125%, corresponding to a sampling probability of 96.875%.
+- `th=0` -- equivalent to `th=00000000000000`, which is a 0% rejection threshold, which means Always Sample.
+
+The `T` value MUST be derived as follows:
+
+* If the `th` key is not present in the Tracestate header, then non-probabilistic sampling may be in use.
+* Else the value corresponding to the `th` key should be interpreted as described above.
+
+Sampling decisions MUST be propagated by setting the value of the `th` key in the Tracestate header according to the above.
+
+## Initializing and updating T and R values
+
+There are two categories of sampler:
+
+- **Head samplers:** Implementations of [`Sampler`](../../specification/trace/sdk.md#sampler), called by a `Tracer` during span creation.
+- **Downstream samplers:** Any component that, given an ended Span, decides whether to drop or forward ("sample") it on to the next component in the system. Also known as "collection-path samplers" or "sampling processors". _Tail samplers_ are a special class of downstream samplers that buffer the spans in a trace and select a sampling probability for the trace as a whole using data from any span in the buffered trace.
+
+This section defines behavior for each kind of sampler.
+
+### Head samplers
+
+A head sampler is responsible for computing the `rv` and `th` values in a new span's initial [`TraceState`](../../specification/trace/api.md#tracestate). Notable inputs to that computation include the parent span's trace state (if a parent span exists) and the new span's trace ID.
+
+First, a consistent `Sampler` decides which sampling probability to use. The sampler MAY select any value of T. If a valid `SpanContext` is provided in the call to `ShouldSample` (indicating that the span being created will be a child span),
+
+- Choosing a T greater than the parent span's is expected to result in partial traces (the parent may be sampled but its child, the current span, dropped).
+
+- Choosing a T less than or equal to the parent span's is expected to result in complete traces (this is the definition of consistent probability sampling).
+
+For the output TraceState,
+
+- The `th` key MUST be defined with a value corresponding to the sampling probability the sampler actually used.
+- The `rv` value, if present on the input TraceState, MUST be defined and equal to the parent span's `rv`. Otherwise, `rv` MUST be defined if and only if the effective R was _generated_ during the decision, per the "derive R" algorithm given earlier.
+
+TODO: For _new_ spans, `ShouldSample` doesn't currently have a way to know the new Span's `TraceFlags`, so it can't determine whether the Random Trace ID Flag is set, and in turn can't execute the "derive R" algorithm. Maybe it should take `TraceFlags` as an additional parameter, just like it takes `TraceId`?
+
+### Downstream samplers
+
+A downstream sampler, in contrast, may output a given ended Span with a _modified_ trace state, complying with the following rules:
+
+- If the chosen sampling probability is 1, the sampler MUST NOT modify any existing `th`, nor set any `th`.
+- Otherwise, the chosen sampling probability is in `(0, 1)`. In this case the sampler MUST output the span with a `th` equal to `max(input th, chosen th)`. In other words, `th` MUST NOT be decreased (as it is not possible to retroactively adjust an earlier stage's sampling probability), and it MUST be increased if a lower sampling probability was used. This represents the common case where a downstream sampler is reducing span throughput in the system.
+
+## Visual
+
+![Sampling decision flow](../img/0235-sampling-threshold-calculation.png)
+
+## Algorithms
+
+The `th` and `rv` values may be represented and manipulated in a variety of forms depending on the capabilities of the processor and the needs of the implementation. As 56-bit values, they are compatible with byte arrays and 64-bit integers, and can also be manipulated with 64-bit floating point with a negligible loss of precision.
+
+The following examples are in Python3. They are intended as examples only, for clarity, and not as a suggested implementation.
+
+### Converting t-value to a 56-bit integer threshold
+
+To convert a t-value string to a 56-bit integer threshold, pad it on the right with 0s so that it is 14 digits in length, and then parse it as a hexadecimal value.
+
+```py
+padded = (tvalue + "00000000000000")[:14]
+threshold = int('0x' + padded, 16)
+```
+
+### Converting integer threshold to a t-value
+
+To convert a 56-bit integer threshold value to the t-value representation, format it as a 14-digit hexadecimal value (without a leading '0x'), optionally with trailing zeros omitted:
+
+```py
+# pad to 14 hex digits, then drop trailing zeros,
+# keeping at least one digit for the zero threshold
+tvalue = format(threshold, '014x').rstrip('0') or '0'
+```
+
+### Testing rv vs threshold
+
+Given rv and threshold as 64-bit integers, a sample should be taken if rv is greater than or equal to the threshold.
+
+```py
+should_sample = (rv >= threshold)
+```
+
+### Converting threshold to a sampling probability
+
+The sampling probability is a value from 0.0 to 1.0, which can be calculated using floating point by dividing by 2^56:
+
+```py
+# embedded _ in numbers for clarity (permitted by Python3)
+maxth = 0x100_0000_0000_0000  # 2^56
+prob = float(maxth - threshold) / maxth
+```
+
+### Converting threshold to an adjusted count (sampling rate)
+
+The adjusted count indicates the approximate quantity of items from the population that this sample represents. It is equal to `1/probability`. It is not defined for spans that were obtained via non-probabilistic sampling (a sampled span with no `th` value).
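+
+Continuing the Python examples above, the adjusted count can be computed
+directly from the integer threshold in the same way as the sampling
+probability:
+
+```py
+maxth = 0x100_0000_0000_0000  # 2^56
+adjusted_count = maxth / (maxth - threshold)  # equals 1/prob
+```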
+
+## Trade-offs and mitigations
+
+This proposal is the result of long discussions within the Sampling SIG over what is required and the various alternative forms of expressing it. [This issue](https://github.com/open-telemetry/opentelemetry-specification/issues/3602) exhaustively covers the various formats that were discussed and their pros and cons. The format proposed here is the outcome of that discussion.
+
+## Prior art and alternatives
+
+The existing specification for `r-value` and `p-value` attempted to solve this problem, but it was limited to powers of 2, which is inadequate.
+
+## Open questions
+
+This specification leaves room for different implementation options. For example, comparing hex strings or converting them to numeric format are both viable alternatives for handling the threshold.
+
+We also know that some implementations prefer to use a sampling probability (in the range 0-1.0) or a sampling rate (1/probability); this design permits conversion to and from these formats without loss, up to at least 6 decimal digits of precision.
+
+## Future possibilities
+
+This permits sampling systems to propagate consistent sampling information downstream, where it can be compensated for.
+For example, this will enable the tail-sampling processor in the OTel Collector to propagate its sampling decisions to backends in a standard way.
+This permits backend systems to use the effective sampling probability in data presentations.