Add update interval exponential backoff on source collector failure (#92)

* Disable source collectors on error by default
* Do not disable source indefinitely
* Use exponential backoff for update intervals
* Update changelog
* Rename update interval increase/reset methods
* Add field defaults
* Fix tests
* Refactor validation, add max_backoff
* Update sample config
* Update changelog
unioslo · Jan 13, 2025 · commit 18349a7 · 1 parent f213300

Showing 7 changed files with 305 additions and 77 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,7 +5,21 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-<!-- ## [Unreleased] -->
+## Unreleased
+
+### Added
+
+- Default value for source collector config `source_collectors.<name>.error_duration` is now computed from `round(error_tolerance * update_interval + (update_interval*0.9))`
+- New failure handling strategies for source collectors, which can be set using `disable_duration` for each source collector.
+  - `disable_duration == 0` (default): Use exponential backoff to increase the update interval on error. The update interval is reset to the original value on success.
+  - `disable_duration > 0`: Disable the source collector for a set duration.
+  - `disable_duration < 0`: Never disable, never increase the update interval.
+  - `exit_on_error` takes precedence over `disable_duration`. If `exit_on_error` is set to `true`, the source collector will exit on error regardless of the `disable_duration` setting.
+
+### Changed
+
+- The default value of `exit_on_error` for source collectors is now `false`.
+- The default value of `disable_duration` for source collectors is now `0`. This means that the source collector will use exponential backoff to increase the update interval on error.
 
 ## [0.2.0](https://github.com/unioslo/zabbix-auto-config/releases/tag/zac-v0.2.0)
 

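The default `error_duration` formula from the changelog entry above can be checked with a short worked example. This is a minimal sketch: the function name `default_error_duration` is hypothetical, not taken from the project. The test snapshot in this commit shows `error_duration` as 9999 when `error_tolerance` is 0, so the formula evidently does not cover that case, and the sketch ignores it.

```python
def default_error_duration(error_tolerance: int, update_interval: int) -> int:
    """Changelog formula: the tolerated errors' worth of intervals,
    plus 90% of one extra interval as headroom.

    Hypothetical helper for illustration; not the project's own code.
    """
    return round(error_tolerance * update_interval + (update_interval * 0.9))


# 5 * 60 + 60 * 0.9 = 300 + 54 = 354
print(default_error_duration(error_tolerance=5, update_interval=60))  # prints 354
```

With these inputs the result (354) is greater than `error_tolerance * update_interval` (300), which is the lower bound that the validation message in `tests/test_config.py` requires.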
diff --git a/README.md b/README.md
@@ -46,7 +46,7 @@ For automatic linking in templates you could create the templates:
 
 ## Database
 
-The application requires a PostgreSQL database to store the state of the collected hosts. The database can be created with the following command:
+The application requires a PostgreSQL database to store the state of the collected hosts. The database can be created with the following command from your local machine:
 
 ```bash
 PGPASSWORD=secret psql -h localhost -U postgres -p 5432 -U zabbix << EOF
@@ -61,7 +61,9 @@ CREATE TABLE hosts_source (
 EOF
 ```
 
-Replace login credentials with your own when running against a different database. This is a one-time procedure per environment.
+If running from inside a dev container, replace the host (`-h`) with the container name of the database container (default: `db`).
+
+This is a one-time procedure per environment.
 
 ## Application
 
@@ -249,7 +251,11 @@ The following configurations options are available:
 
 ### Optional configuration (error handling)
 
-If `error_tolerance` number of errors occur within `error_duration` seconds, the collector is disabled. Source collectors do not tolerate errors by default and must opt-in to this behavior by setting `error_tolerance` and `error_duration` to non-zero values. If `exit_on_error` is set to `true`, the application will exit. Otherwise, the collector will be disabled for `disable_duration` seconds.
+If `error_tolerance` number of errors occur within `error_duration` seconds, the collector is disabled for a given duration. This is an opt-in feature per source collector.
+
+By default, source collectors are never disabled, and instead increase their update intervals using an exponential backoff strategy on each successive error. See the `disable_duration` option for more information.
+
+
 
 #### error_tolerance
 
@@ -271,7 +277,29 @@ If `error_tolerance` is set, but `error_duration` is not, the application will s
 
 #### disable_duration
 
-`disable_duration` (default: 3600) is the duration in seconds to disable collector for. If set to 0, the collector is disabled indefinitely, requiring a restart of the application to re-enable it.
+`disable_duration` (default: 0) controls how a source collector is handled after it fails. The following modes are supported:
+
+- `disable_duration` > 0: Hard disable for `disable_duration` seconds after `error_tolerance` failures.
+- `disable_duration` = 0: Increase the collection interval using exponential backoff after each failure instead of disabling the source.
+- `disable_duration` < 0: No disable mechanism (always retry at a fixed interval).
+
+They are described in more detail below:
+
+##### Hard disable
+
+When `disable_duration` is greater than 0, the collector is disabled for `disable_duration` seconds after `error_tolerance` failures within `error_duration` seconds. The collector will not be called during this period. After the `disable_duration` has passed, the collector will be re-enabled and the error count will be reset.
+
+##### Exponential backoff
+
+When `disable_duration` is set to 0, the collector will not be disabled, but the update interval will be increased by a factor of `backoff_factor` after each failure. The update interval will be reset to the original value after a successful collection. This mode is useful for sources that are expected to be temporarily unavailable at times.
+
+##### No disable
+
+When `disable_duration` is less than 0, the collector will not be disabled, and the update interval will not be increased. This mode is useful when using sources that are frequently unavailable, but are not critical to the operation of the application.
+
+#### backoff_factor
+
+`backoff_factor` (default: 1.5) is the factor by which the update interval is increased after each failure when `disable_duration` is set to 0. The update interval is reset to the original value after a successful collection.
 
 ### Keyword arguments
 

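The exponential backoff mode described in the README changes above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the class `BackoffInterval` and its method names are hypothetical, and treating `max_backoff` as a cap on the interval itself is inferred from the sample config comment ("Maximum backoff time in seconds").

```python
from dataclasses import dataclass, field


@dataclass
class BackoffInterval:
    """Hypothetical helper that tracks one collector's update interval."""

    base: float = 60.0            # update_interval
    backoff_factor: float = 1.5   # README default
    max_backoff: float = 3600.0   # value from the test snapshot
    current: float = field(init=False)

    def __post_init__(self) -> None:
        self.current = self.base

    def on_failure(self) -> float:
        # Multiply the interval by the factor, but never exceed max_backoff.
        self.current = min(self.current * self.backoff_factor, self.max_backoff)
        return self.current

    def on_success(self) -> float:
        # A successful collection resets the interval to its original value.
        self.current = self.base
        return self.current


interval = BackoffInterval()
print([interval.on_failure() for _ in range(3)])  # prints [90.0, 135.0, 202.5]
print(interval.on_success())                      # prints 60.0
```

With a `backoff_factor` of 1 the interval never grows, which matches the sample config's note that backoff is disabled in that case.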
diff --git a/config.sample.toml b/config.sample.toml
@@ -98,29 +98,67 @@ module_name = "mysource"
 # How often to run the source collector in seconds
 update_interval = 60
 
+# Any other options are passed as keyword arguments to the source collector's
+# `collect()` function
+kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
+another_kwarg = "value2"         # We can pass an arbitrary number of kwargs to the source module
+
+
+# We can define multiple sources using the same module as long
+# as their config entries have different names
+[source_collectors.othersource] # different name
+module_name = "mysource" # same module as above
+update_interval = 60
+
+# By default, the application applies an exponential backoff to sources
+# that fail to collect data due to network issues or other problems.
+# The backoff factor is multiplied by the update interval to determine
+# how long to wait before retrying the source.
+# The default backoff factor is 1.5. Backoff is disabled if the factor is 1.
+backoff_factor = 2 # Increase the backoff factor for this source
+
+# We can limit how long the backoff time can grow to prevent a source
+# from waiting too long between retries.
+max_backoff = 3600 # Maximum backoff time in seconds
+
+
+[source_collectors.error_tolerance_source]
+module_name = "mysource" # re-using same module
+update_interval = 60
+
+# Error tolerance settings
+#
+# We can define a custom error tolerance for each source collector.
+# By setting an error tolerance, exponential backoff is disabled for the source
+# and the source will keep retrying at the same interval until it succeeds
+# or hits the error tolerance.
+
+# By setting `error_tolerance` and `error_duration` we can control how many
+# errors within a certain timespan are tolerated before the source is disabled
+# for a certain duration.
+
 # How many errors to tolerate before disabling the source
 error_tolerance = 5 # Tolerate 5 errors within `error_duration` seconds
 
 # How long an error should be kept in the error tally before discarding it
-error_duration = 360 # should be greater than update_interval
+# In this case, we consider 5 errors or more within 10 minutes as a failure
+error_duration = 600 # should be greater than update_interval
+
+# Duration to disable source if error threshold is reached
+# If this is set to 0, error tolerance is disabled, and the source will
+# go back to using exponential backoff as its retry strategy.
+disable_duration = 3600 # time in seconds (1 hour)
 
 # Exit the application if the source fails
 # If true, the application will exit if the source fails
 # If false, the source will be disabled for `disable_duration` seconds
 exit_on_error = false # Disable source if it fails
 
-# How long to wait before reactivating a disabled source
-disable_duration = 3600 # Time in seconds to wait before reactivating a disabled source
-
-# Any other options are passed as keyword arguments to the source collector's
-# `collect()` function
-kwarg_passed_to_source = "value" # extra fields are passed to the source module as kwargs
-another_kwarg = "value2"         # We can pass an arbitrary number of kwargs to the source module
-
-
-[source_collectors.othersource]
-module_name = "mysource"
+[source_collectors.no_error_handling_source]
+module_name = "mysource" # re-using same module
 update_interval = 60
-error_tolerance = 0      # no tolerance for errors (default)
-exit_on_error = true     # exit application if source fails (default)
-source = "other"         # extra kwarg used in mysource module
+
+# If disable_duration is set to a negative value, the source uses neither
+# exponential backoff nor error tolerance. It will keep retrying at the
+# same pace no matter how many errors it encounters.
+disable_duration = -1
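
The error tally described in the sample config comments (errors are kept for `error_duration` seconds and the source is disabled once `error_tolerance` is reached) can be sketched as a sliding window. The class `ErrorTally` and its method are hypothetical names, and the `>=` threshold is an assumption based on the comment "5 errors or more within 10 minutes"; the project's actual bookkeeping may differ.

```python
from collections import deque
from typing import Deque


class ErrorTally:
    """Hypothetical sliding-window error counter for one source collector."""

    def __init__(self, error_tolerance: int, error_duration: float) -> None:
        self.error_tolerance = error_tolerance
        self.error_duration = error_duration
        self._timestamps: Deque[float] = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now` (seconds).

        Returns True when the number of errors still inside the window has
        reached the tolerance (assumed comparison: >=).
        """
        self._timestamps.append(now)
        # Discard errors that are older than error_duration seconds.
        while self._timestamps and now - self._timestamps[0] > self.error_duration:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.error_tolerance


# One failure per 60-second update interval, as in the sample config.
tally = ErrorTally(error_tolerance=5, error_duration=600)
print([tally.record_error(t) for t in (0, 60, 120, 180, 240)])
# prints [False, False, False, False, True]
```

Passing the timestamp in explicitly keeps the sketch deterministic; a real collector would presumably read a clock instead.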
diff --git a/tests/test_config.py b/tests/test_config.py
@@ -1,7 +1,5 @@
 from __future__ import annotations
 
-import logging
-
 import pytest
 import tomli
 from hypothesis import given
@@ -22,10 +20,9 @@ def test_config_extra_field(sample_config: str, caplog: pytest.LogCaptureFixture
     config["foo"] = "bar"
     models.Settings(**config)
     assert len(caplog.records) == 1
-    record = caplog.records[0]
-    assert record.levelname == "WARNING"
-    assert record.levelno == logging.WARNING
-    assert "'foo'" in record.message
+    assert caplog.record_tuples == snapshot(
+        [("root", 30, "Settings: Got unknown config field 'foo'.")]
+    )
 
 
 def test_config_extra_field_allowed(
@@ -50,12 +47,23 @@ def test_sourcecollectorsettings_defaults():
         module_name="foo",
         update_interval=60,
     )
-    assert settings.module_name == "foo"
-    assert settings.update_interval == 60
-    assert settings.error_duration > 0
-    assert settings.error_tolerance == 0
-    assert settings.exit_on_error is True
-    assert settings.disable_duration == 3600
+
+    # Default strategy should be to use exponential backoff
+    assert settings.failure_strategy == models.FailureStrategy.BACKOFF
+
+    # Snapshot of values
+    assert settings.model_dump() == snapshot(
+        {
+            "module_name": "foo",
+            "update_interval": 60,
+            "error_tolerance": 0,
+            "error_duration": 9999,
+            "exit_on_error": False,
+            "disable_duration": 0,
+            "backoff_factor": 1.5,
+            "max_backoff": 3600,
+        }
+    )
 
 
 def test_sourcecollectorsettings_no_tolerance() -> None:
@@ -72,7 +80,7 @@ def test_sourcecollectorsettings_no_tolerance() -> None:
         error_tolerance=0,
         error_duration=0,
     )
-    assert settings.error_tolerance == 0
+    assert settings.error_tolerance == snapshot(0)
     assert settings.error_duration == snapshot(9999)
 
 
@@ -106,6 +114,7 @@ def test_sourcecollectorsettings_no_error_duration_fuzz(
     update_interval: int, error_tolerance: int
 ):
     """Test model with a variety of update intervals and error tolerances"""
+    # We only check that instantiating the model does not raise an exception
     models.SourceCollectorSettings(
         module_name="foo",
         update_interval=update_interval,
@@ -123,13 +132,20 @@ def test_sourcecollectorsettings_duration_too_short():
             error_tolerance=5,
             error_duration=180,
         )
-    errors = exc_info.value.errors()
-    assert len(errors) == 1
-    error = errors[0]
-    assert "greater than 300" in error["msg"]
-    assert error["type"] == "value_error"
-    assert error["msg"] == snapshot(
-        "Value error, Invalid value for error_duration (180). It should be greater than 300: error_tolerance (5) * update_interval (60)"
+    errors = exc_info.value.errors(include_url=False, include_context=False)
+    assert len(errors) == snapshot(1)
+    assert errors[0] == snapshot(
+        {
+            "type": "value_error",
+            "loc": (),
+            "msg": "Value error, Invalid value for error_duration (180). It should be greater than 300: error_tolerance (5) * update_interval (60)",
+            "input": {
+                "module_name": "foo",
+                "update_interval": 60,
+                "error_tolerance": 5,
+                "error_duration": 180,
+            },
+        }
     )
 
 
@@ -142,8 +158,13 @@ def test_sourcecollectorsettings_duration_negative():
             error_tolerance=5,
             error_duration=-1,
         )
-    errors = exc_info.value.errors()
+    errors = exc_info.value.errors(include_url=False, include_context=False)
     assert len(errors) == 1
-    error = errors[0]
-    assert error["loc"] == ("error_duration",)
-    assert error["type"] == "greater_than_equal"
+    assert errors[0] == snapshot(
+        {
+            "type": "greater_than_equal",
+            "loc": ("error_duration",),
+            "msg": "Input should be greater than or equal to 0",
+            "input": -1,
+        }
+    )
diff --git a/tests/test_models.py b/tests/test_models.py
@@ -210,3 +210,12 @@ def test_zabbix_settings_timeout(timeout: int, expect: Optional[int]) -> None:
         timeout=timeout,
     )
     assert settings.timeout == expect
+
+
+def test_failure_strategy_supports_error_tolerance() -> None:
+    """Test that only EXIT and DISABLE support error tolerance."""
+    for strategy in models.FailureStrategy:
+        if strategy in (models.FailureStrategy.EXIT, models.FailureStrategy.DISABLE):
+            assert strategy.supports_error_tolerance()
+        else:
+            assert not strategy.supports_error_tolerance()
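
For orientation, the strategy selection that the changelog describes and that the test above exercises might look roughly like this. It is a sketch, not the code in `models.py`: the member names `EXIT`, `DISABLE` and `BACKOFF` and the method `supports_error_tolerance()` appear in the tests, while the member `NONE`, the enum values and the function `select_strategy` are assumptions made for illustration.

```python
from enum import Enum


class FailureStrategy(Enum):
    EXIT = "exit"
    DISABLE = "disable"
    BACKOFF = "backoff"
    NONE = "none"  # hypothetical name for the "never disable" mode

    def supports_error_tolerance(self) -> bool:
        # Per the test docstring, only EXIT and DISABLE use error tolerance.
        return self in (FailureStrategy.EXIT, FailureStrategy.DISABLE)


def select_strategy(exit_on_error: bool, disable_duration: int) -> FailureStrategy:
    """Hypothetical mapping from config values to a strategy."""
    if exit_on_error:
        # exit_on_error takes precedence over disable_duration.
        return FailureStrategy.EXIT
    if disable_duration > 0:
        return FailureStrategy.DISABLE
    if disable_duration == 0:
        return FailureStrategy.BACKOFF
    return FailureStrategy.NONE


print(select_strategy(exit_on_error=False, disable_duration=0))
# prints FailureStrategy.BACKOFF
```

The new defaults (`exit_on_error = false`, `disable_duration = 0`) map to `BACKOFF`, in line with the assertion in `test_sourcecollectorsettings_defaults`.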