Investigate on & Improve Latency Metrics #46

Open
y-eight opened this issue Dec 18, 2023 · 1 comment
Labels
area/checks Issues/PRs related to Checks bug Something isn't working refactoring Refactoring of existing code

Comments

y-eight (Member) commented Dec 18, 2023

Problem to investigate & solve

Currently, 3 different latency metrics are available.

  • Counter
  • Latency time
  • Histogram

If the health check fails internally, the latency is reported as 0, and so is the status code.

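This might be OK for the counter and the latency metric, but it is probably not good practice for the histogram. Each failed check is observed as a latency of 0 seconds, and because histogram buckets are cumulative, it is counted in every bucket.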

Example with 2 errors and 308 total requests. The 2 failed checks, recorded as 0 seconds, account for the count of 2 in every bucket up to le="0.25":

# HELP sparrow_latency_duration Latency of targets in seconds
# TYPE sparrow_latency_duration histogram
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.005"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.01"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.025"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.05"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.1"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.25"} 2
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="0.5"} 288
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="1"} 307
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="2.5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="5"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="10"} 308
sparrow_latency_duration_bucket{target="https://gitlab.devops.telekom.de",le="+Inf"} 308
sparrow_latency_duration_sum{target="https://gitlab.devops.telekom.de"} 120.39378972299998
sparrow_latency_duration_count{target="https://gitlab.devops.telekom.de"} 308
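
To make the effect easier to reproduce, here is a minimal Python sketch of how cumulative buckets behave. It is not Sparrow's code: the `cumulative_buckets` helper and the sample latencies are hypothetical, and only the bucket bounds and the `_sum`/`_count` values are taken from the output above. The second part assumes the 2 failed checks contributed 0 seconds to the sum.

```python
import math

# Bucket upper bounds as they appear in the exposition output above.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


def cumulative_buckets(observations, bounds=BOUNDS):
    """For every upper bound, count the observations less than or equal to it."""
    return {le: sum(1 for value in observations if value <= le) for le in bounds}


# Hypothetical sample: three successful checks and two failed checks recorded as 0.
latencies = [0.31, 0.42, 0.8, 0.0, 0.0]
buckets = cumulative_buckets(latencies)
for le, count in buckets.items():
    print(f'le="{le}" {count}')

# Effect on the average latency, using the sum and count from the output above.
# Assumption: the 2 failed checks added 0 seconds to the sum.
total_seconds = 120.39378972299998
mean_with_failures = total_seconds / 308
mean_successes_only = total_seconds / (308 - 2)
print(mean_with_failures, mean_successes_only)
```

The two zero-valued observations appear in every bucket, including the smallest one, and they also pull the average latency down slightly.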

As @puffitos stated in #45, we should probably solve this with labelling or another set of metrics, e.g. a label for the check's state.
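
A rough sketch of the labelling idea, in plain Python rather than a real Prometheus client: the `LabelledHistogram` class, the `state` label, and its values `ok` and `failed` are all hypothetical names chosen for illustration. The point is only that failed checks end up in their own series, so queries on the `ok` series are not affected by zero-valued observations.

```python
from collections import defaultdict

# Same bucket bounds as in the example output above.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, float("inf")]


class LabelledHistogram:
    """Toy cumulative histogram with one series per (target, state) label pair."""

    def __init__(self, bounds=BOUNDS):
        self.bounds = list(bounds)
        self.buckets = defaultdict(lambda: [0] * len(self.bounds))

    def observe(self, target, state, seconds):
        # Increment every bucket whose upper bound covers the observation.
        series = self.buckets[(target, state)]
        for index, le in enumerate(self.bounds):
            if seconds <= le:
                series[index] += 1


target = "https://gitlab.devops.telekom.de"
histogram = LabelledHistogram()
for seconds in (0.31, 0.42, 0.8):  # hypothetical successful checks
    histogram.observe(target, "ok", seconds)
for _ in range(2):  # failed checks are still recorded as 0 seconds
    histogram.observe(target, "failed", 0.0)

ok_series = histogram.buckets[(target, "ok")]
failed_series = histogram.buckets[(target, "failed")]
print(ok_series)
print(failed_series)
```

One trade-off to keep in mind: every additional label value creates a full extra set of bucket series per target.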

@y-eight y-eight added the refactoring Refactoring of existing code label Dec 18, 2023
@lvlcn-t lvlcn-t added the area/checks Issues/PRs related to Checks label Jan 3, 2024
@niklastreml niklastreml added the bug Something isn't working label Jan 12, 2024
niklastreml (Collaborator) commented

Maybe we should create an extra metric for failed requests and record the failed requests there instead of in the histogram. This would fix the issue with the buckets filling up and also provide an easy way to monitor failed requests.
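
A minimal sketch of that suggestion, again in plain Python and not based on Sparrow's actual implementation: the `CheckMetrics` class and the `failed_total` counter are hypothetical. Failed checks only increment the counter, while successful checks are the only ones observed by the histogram.

```python
# Same bucket bounds as in the example output above.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, float("inf")]


class CheckMetrics:
    """Toy metrics holder: a latency histogram plus a separate failure counter."""

    def __init__(self, bounds=BOUNDS):
        self.bounds = list(bounds)
        self.buckets = [0] * len(self.bounds)
        self.latency_sum = 0.0
        self.latency_count = 0
        self.failed_total = 0  # hypothetical counter for failed checks

    def record(self, seconds, failed):
        if failed:
            # Failed checks never reach the histogram.
            self.failed_total += 1
            return
        self.latency_sum += seconds
        self.latency_count += 1
        for index, le in enumerate(self.bounds):
            if seconds <= le:
                self.buckets[index] += 1


metrics = CheckMetrics()
# Hypothetical results as (latency in seconds, failed) pairs.
results = [(0.31, False), (0.42, False), (0.0, True), (0.8, False), (0.0, True)]
for seconds, failed in results:
    metrics.record(seconds, failed)

print(metrics.buckets, metrics.latency_count, metrics.failed_total)
```

With this split, the histogram count reflects successful checks only, and the failure counter can be alerted on directly.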

@niklastreml niklastreml self-assigned this Jan 12, 2024

3 participants