Add update interval exponential backoff on source collector failure (#92)

* Disable source collectors on error by default

* Do not disable source indefinitely

* Use exponential backoff for update intervals

* Update changelog

* Rename update interval increase/reset methods

* Add field defaults

* Fix tests

* Refactor validation, add max_backoff

* Update sample config

* Update changelog
pederhan authored Jan 13, 2025
1 parent f213300 commit 18349a7
Showing 7 changed files with 305 additions and 77 deletions.
16 changes: 15 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,21 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

<!-- ## [Unreleased] -->
## Unreleased

### Added

- Default value for source collector config `source_collectors.<name>.error_duration` is now computed from `round(error_tolerance * update_interval + (update_interval*0.9))`
- New failure handling strategies for source collectors, which can be set using `disable_duration` for each source collector.
  - `disable_duration == 0` (default): Use exponential backoff to increase the update interval on error. The update interval is reset to the original value on success.
  - `disable_duration > 0`: Disable the source collector for a set duration.
  - `disable_duration < 0`: Never disable, never increase the update interval.
- `exit_on_error` takes precedence over `disable_duration`. If `exit_on_error` is set to `true`, the source collector will exit on error regardless of the `disable_duration` setting.
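
The `error_duration` default formula listed above can be checked by hand. The following is a minimal sketch that applies the formula literally; the function name `default_error_duration` is an assumption made for illustration and is not taken from the project. Note that the test snapshot in this commit shows `error_duration` as 9999 when `error_tolerance` is 0, so the actual model appears to treat that case differently. This sketch does not reproduce that.

```python
def default_error_duration(error_tolerance: int, update_interval: int) -> int:
    """Apply the changelog formula as written, with no special cases.

    Hypothetical helper: the name and signature are illustrative only.
    """
    return round(error_tolerance * update_interval + (update_interval * 0.9))


# 5 errors at a 60 second interval: 5 * 60 + 54 = 354 seconds
print(default_error_duration(error_tolerance=5, update_interval=60))  # 354
```

With these inputs the result (354) is larger than `error_tolerance * update_interval` (300), which is the minimum the validation error message in the tests asks for.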

### Changed

- The default value of `exit_on_error` for source collectors is now `false`.
- The default value of `disable_duration` for source collectors is now `0`. This means that the source collector will use exponential backoff to increase the update interval on error.

## [0.2.0](https://github.com/unioslo/zabbix-auto-config/releases/tag/zac-v0.2.0)

36 changes: 32 additions & 4 deletions README.md
@@ -46,7 +46,7 @@ For automatic linking in templates you could create the templates:

## Database

The application requires a PostgreSQL database to store the state of the collected hosts. The database can be created with the following command:
The application requires a PostgreSQL database to store the state of the collected hosts. The database can be created with the following command from your local machine:

```bash
PGPASSWORD=secret psql -h localhost -U postgres -p 5432 -U zabbix << EOF
@@ -61,7 +61,9 @@ CREATE TABLE hosts_source (
EOF
```

Replace login credentials with your own when running against a different database. This is a one-time procedure per environment.
If running from inside a dev container, replace the host (`-h`) with the container name of the database container (default: `db`).

This is a one-time procedure per environment.

## Application

@@ -249,7 +251,11 @@ The following configurations options are available:

### Optional configuration (error handling)

If `error_tolerance` number of errors occur within `error_duration` seconds, the collector is disabled. Source collectors do not tolerate errors by default and must opt-in to this behavior by setting `error_tolerance` and `error_duration` to non-zero values. If `exit_on_error` is set to `true`, the application will exit. Otherwise, the collector will be disabled for `disable_duration` seconds.
If `error_tolerance` number of errors occur within `error_duration` seconds, the collector is disabled for a given duration. This is an opt-in feature per source collector.

By default, source collectors are never disabled, and instead increase their update intervals using an exponential backoff strategy on each successive error. See the `disable_duration` option for more information.



#### error_tolerance

@@ -271,7 +277,29 @@ If `error_tolerance` is set, but `error_duration` is not, the application will s

#### disable_duration

`disable_duration` (default: 3600) is the duration in seconds to disable collector for. If set to 0, the collector is disabled indefinitely, requiring a restart of the application to re-enable it.
`disable_duration` (default: 0) is the duration in seconds to disable the collector for. The following disable modes are supported:

- `disable_duration` > 0: Hard disable for `disable_duration` seconds after `error_tolerance` failures
- `disable_duration` = 0: Increase collection interval using exponential backoff after each failure instead of disabling source.
- `disable_duration` < 0: No disable mechanism (always try at fixed interval)

They are described in more detail below:

##### Hard disable

When `disable_duration` is greater than 0, the collector is disabled for `disable_duration` seconds after `error_tolerance` failures within `error_duration` seconds. The collector will not be called during this period. After the `disable_duration` has passed, the collector will be re-enabled and the error count will be reset.
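
The error tally behind hard disable can be pictured as a sliding window over error timestamps. The sketch below is an illustration under stated assumptions, not the project's implementation: the class name `ErrorTally` and its methods are hypothetical, and it assumes that reaching `error_tolerance` errors inside the window (rather than exceeding it) triggers the disable, following the sample config comment "5 errors or more".

```python
from collections import deque


class ErrorTally:
    """Sliding-window error counter (hypothetical, for illustration only)."""

    def __init__(self, error_tolerance: int, error_duration: float) -> None:
        self.error_tolerance = error_tolerance
        self.error_duration = error_duration
        self.errors = deque()  # timestamps of recent errors, oldest first

    def record_error(self, now: float) -> bool:
        """Record an error at time `now`; return True if the threshold is reached."""
        self.errors.append(now)
        # Discard errors that have fallen out of the `error_duration` window.
        while self.errors and now - self.errors[0] > self.error_duration:
            self.errors.popleft()
        # Assumption: reaching `error_tolerance` errors triggers the disable.
        return len(self.errors) >= self.error_tolerance

    def reset(self) -> None:
        """Clear the tally, as described for when the collector is re-enabled."""
        self.errors.clear()


tally = ErrorTally(error_tolerance=5, error_duration=600)
results = [tally.record_error(now=t) for t in (0, 60, 120, 180, 240)]
print(results)  # [False, False, False, False, True]
```

Errors that are spread further apart than `error_duration` never accumulate, so an occasional failure does not disable the collector.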

##### Exponential backoff

When `disable_duration` is set to 0, the collector will not be disabled, but the update interval will be increased by a factor of `backoff_factor` after each failure. The update interval will be reset to the original value after a successful collection. This mode is useful for sources that are expected to be temporarily unavailable at times.

##### No disable

When `disable_duration` is less than 0, the collector will not be disabled, and the update interval will not be increased. This mode is useful when using sources that are frequently unavailable, but are not critical to the operation of the application.
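
The three modes, together with the `exit_on_error` precedence stated in the changelog, amount to a small decision function. The following is a sketch only: `FailureStrategy` with the members `EXIT`, `DISABLE` and `BACKOFF` and the method `supports_error_tolerance()` appear in this commit's tests, but the member values, the fourth member name `NONE`, and the function `pick_strategy` are assumptions made for illustration.

```python
from enum import Enum


class FailureStrategy(Enum):
    EXIT = "exit"
    DISABLE = "disable"
    BACKOFF = "backoff"
    NONE = "none"  # hypothetical name for the "no disable" mode

    def supports_error_tolerance(self) -> bool:
        # Per the test docstring, only EXIT and DISABLE support error tolerance.
        return self in (FailureStrategy.EXIT, FailureStrategy.DISABLE)


def pick_strategy(exit_on_error: bool, disable_duration: int) -> FailureStrategy:
    """Map the two settings to a strategy (hypothetical helper)."""
    if exit_on_error:
        # exit_on_error takes precedence over disable_duration.
        return FailureStrategy.EXIT
    if disable_duration > 0:
        return FailureStrategy.DISABLE
    if disable_duration == 0:
        return FailureStrategy.BACKOFF
    return FailureStrategy.NONE


print(pick_strategy(exit_on_error=False, disable_duration=0).name)  # BACKOFF
print(pick_strategy(exit_on_error=True, disable_duration=3600).name)  # EXIT
```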

#### backoff_factor

`backoff_factor` (default: 1.5) is the factor by which the update interval is increased after each failure when `disable_duration` is set to 0. The update interval is reset to the original value after a successful collection.
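
To make the growth concrete, here is a minimal sketch of how the interval could evolve with the default factor. It is an illustration, not the project's code: the class `BackoffInterval` and its method names are hypothetical, and the cap uses the `max_backoff` option (3600 in the test snapshot) from the sample config in this commit.

```python
class BackoffInterval:
    """Tracks an update interval with exponential backoff (hypothetical helper)."""

    def __init__(
        self,
        update_interval: float,
        backoff_factor: float = 1.5,
        max_backoff: float = 3600,
    ) -> None:
        self.base = float(update_interval)
        self.current = float(update_interval)
        self.backoff_factor = backoff_factor
        self.max_backoff = float(max_backoff)

    def on_failure(self) -> float:
        """Multiply the interval by the factor, capped at max_backoff."""
        self.current = min(self.current * self.backoff_factor, self.max_backoff)
        return self.current

    def on_success(self) -> float:
        """Reset the interval to its original value."""
        self.current = self.base
        return self.current


interval = BackoffInterval(update_interval=60)
print([interval.on_failure() for _ in range(3)])  # [90.0, 135.0, 202.5]
print(interval.on_success())  # 60.0
```

With a 60 second interval and a factor of 1.5, three consecutive failures stretch the wait to a little over three minutes, and a single success brings it back to 60 seconds.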

### Keyword arguments

68 changes: 53 additions & 15 deletions config.sample.toml
@@ -98,29 +98,67 @@ module_name = "mysource"
# How often to run the source collector in seconds
update_interval = 60

# Any other options are passed as keyword arguments to the source collector's
# `collect()` function
kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
another_kwarg = "value2" # We can pass an arbitrary number of kwargs to the source module


# We can define multiple sources using the same module as long
# as their config entries have different names
[source_collectors.othersource] # different name
module_name = "mysource" # same module as above
update_interval = 60

# By default, the application applies an exponential backoff to sources
# that fail to collect data due to network issues or other problems.
# The backoff factor is multiplied by the update interval to determine
# how long to wait before retrying the source.
# The default backoff factor is 1.5. Backoff is disabled if the factor is 1.
backoff_factor = 2 # Increase the backoff factor for this source

# We can limit how long the backoff time can grow to prevent a source
# from waiting too long between retries.
max_backoff = 3600 # Maximum backoff time in seconds


[source_collectors.error_tolerance_source]
module_name = "mysource" # re-using same module
update_interval = 60

# Error tolerance settings
#
# We can define a custom error tolerance for each source collector.
# By setting an error tolerance, exponential backoff is disabled for the source
# and the source will keep retrying at the same interval until it succeeds
# or hits the error tolerance.

# By setting `error_tolerance` and `error_duration` we can control how many
# errors within a certain timespan are tolerated before the source is disabled
# for a certain duration.

# How many errors to tolerate before disabling the source
error_tolerance = 5 # Tolerate 5 errors within `error_duration` seconds

# How long an error should be kept in the error tally before discarding it
error_duration = 360 # should be greater than update_interval
# In this case, we consider 5 errors or more within 10 minutes as a failure
error_duration = 600 # should be greater than update_interval

# Duration to disable source if error threshold is reached
# If this is set to 0, error tolerance is disabled, and the source will
# go back to using exponential backoff as its retry strategy.
disable_duration = 3600 # time in seconds (1 hour)

# Exit the application if the source fails
# If true, the application will exit if the source fails
# If false, the source will be disabled for `disable_duration` seconds
exit_on_error = false # Disable source if it fails

# How long to wait before reactivating a disabled source
disable_duration = 3600 # Time in seconds to wait before reactivating a disabled source

# Any other options are passed as keyword arguments to the source collector's
# `collect()` function
kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
another_kwarg = "value2" # We can pass an arbitrary number of kwargs to the source module


[source_collectors.othersource]
module_name = "mysource"
[source_collectors.no_error_handling_source]
module_name = "mysource" # re-using same module
update_interval = 60
error_tolerance = 0 # no tolerance for errors (default)
exit_on_error = true # exit application if source fails (default)
source = "other" # extra kwarg used in mysource module

# If disable_duration is set to a negative value, the source uses neither
# exponential backoff nor error tolerance. It will keep retrying at the
# same pace no matter how many errors it encounters.
disable_duration = -1
69 changes: 45 additions & 24 deletions tests/test_config.py
@@ -1,7 +1,5 @@
from __future__ import annotations

import logging

import pytest
import tomli
from hypothesis import given
@@ -22,10 +20,9 @@ def test_config_extra_field(sample_config: str, caplog: pytest.LogCaptureFixture
    config["foo"] = "bar"
    models.Settings(**config)
    assert len(caplog.records) == 1
    record = caplog.records[0]
    assert record.levelname == "WARNING"
    assert record.levelno == logging.WARNING
    assert "'foo'" in record.message
    assert caplog.record_tuples == snapshot(
        [("root", 30, "Settings: Got unknown config field 'foo'.")]
    )


def test_config_extra_field_allowed(
@@ -50,12 +47,23 @@ def test_sourcecollectorsettings_defaults():
        module_name="foo",
        update_interval=60,
    )
    assert settings.module_name == "foo"
    assert settings.update_interval == 60
    assert settings.error_duration > 0
    assert settings.error_tolerance == 0
    assert settings.exit_on_error is True
    assert settings.disable_duration == 3600

    # Default strategy should be to use exponential backoff
    assert settings.failure_strategy == models.FailureStrategy.BACKOFF

    # Snapshot of values
    assert settings.model_dump() == snapshot(
        {
            "module_name": "foo",
            "update_interval": 60,
            "error_tolerance": 0,
            "error_duration": 9999,
            "exit_on_error": False,
            "disable_duration": 0,
            "backoff_factor": 1.5,
            "max_backoff": 3600,
        }
    )


def test_sourcecollectorsettings_no_tolerance() -> None:
@@ -72,7 +80,7 @@ def test_sourcecollectorsettings_no_tolerance() -> None:
        error_tolerance=0,
        error_duration=0,
    )
    assert settings.error_tolerance == 0
    assert settings.error_tolerance == snapshot(0)
    assert settings.error_duration == snapshot(9999)


@@ -106,6 +114,7 @@ def test_sourcecollectorsettings_no_error_duration_fuzz(
    update_interval: int, error_tolerance: int
):
    """Test model with a variety of update intervals and error tolerances"""
    # We only check that instantiating the model does not raise an exception
    models.SourceCollectorSettings(
        module_name="foo",
        update_interval=update_interval,
@@ -123,13 +132,20 @@ def test_sourcecollectorsettings_duration_too_short():
            error_tolerance=5,
            error_duration=180,
        )
    errors = exc_info.value.errors()
    assert len(errors) == 1
    error = errors[0]
    assert "greater than 300" in error["msg"]
    assert error["type"] == "value_error"
    assert error["msg"] == snapshot(
        "Value error, Invalid value for error_duration (180). It should be greater than 300: error_tolerance (5) * update_interval (60)"
    errors = exc_info.value.errors(include_url=False, include_context=False)
    assert len(errors) == snapshot(1)
    assert errors[0] == snapshot(
        {
            "type": "value_error",
            "loc": (),
            "msg": "Value error, Invalid value for error_duration (180). It should be greater than 300: error_tolerance (5) * update_interval (60)",
            "input": {
                "module_name": "foo",
                "update_interval": 60,
                "error_tolerance": 5,
                "error_duration": 180,
            },
        }
    )


@@ -142,8 +158,13 @@ def test_sourcecollectorsettings_duration_negative():
            error_tolerance=5,
            error_duration=-1,
        )
    errors = exc_info.value.errors()
    errors = exc_info.value.errors(include_url=False, include_context=False)
    assert len(errors) == 1
    error = errors[0]
    assert error["loc"] == ("error_duration",)
    assert error["type"] == "greater_than_equal"
    assert errors[0] == snapshot(
        {
            "type": "greater_than_equal",
            "loc": ("error_duration",),
            "msg": "Input should be greater than or equal to 0",
            "input": -1,
        }
    )
9 changes: 9 additions & 0 deletions tests/test_models.py
@@ -210,3 +210,12 @@ def test_zabbix_settings_timeout(timeout: int, expect: Optional[int]) -> None:
        timeout=timeout,
    )
    assert settings.timeout == expect


def test_failure_strategy_supports_error_tolerance() -> None:
    """Test that only EXIT and DISABLE support error tolerance."""
    for strategy in models.FailureStrategy:
        if strategy in (models.FailureStrategy.EXIT, models.FailureStrategy.DISABLE):
            assert strategy.supports_error_tolerance()
        else:
            assert not strategy.supports_error_tolerance()