Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No more sending on closed channels during checkscontroller shutdown #245

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

puffitos
Copy link
Collaborator

Motivation

Fixes #244

Changes

The change is just one LoC: return a copy of the checks slice when calling Iter(), which TBH is a safer and simpler way to handle this slice (long live functional programming).

The rest is just some light housekeeping:

  • Adding context to the logs
  • Adding a "sparrow was shutdown" log message to know that everything went well after shutting down the sparrow; currently nothing is being shown as feedback to the user outside an exit code of 0.

Tests done

The manual test case mentioned in the issue was carried out; started the sparrow and sent a keyboard interrupt. The error doesn't occur anymore:

time=17:07:25 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:185 msg="Shutting down sparrow"
time=17:07:25 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:154 msg="Shut down signal received"
time=17:07:25 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/remote/gitlab/gitlab.go:335 msg="Deleting file from gitlab" file="{AuthorEmail:dev.telekom.de@sparrow AuthorName:dev.telekom.de CommitMessage:Unregistering global target Content:{Url: LastSeen:0001-01-01 00:00:00 +0000 UTC} Name:dev.telekom.de.json}"
time=17:07:25 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:82 msg="Context canceled" err="context canceled"
time=17:07:25 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=latency error="context canceled"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:172 msg="Successfully unregistered as global target"
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:177 msg="Stopping gitlab reconciliation routine"
time=17:07:26 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=17:07:26 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:130 msg="Non-recoverable error in sparrow component" error="failed serving API: context canceled"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:134 msg="Sparrow was shut down"

Process finished with the exit code 0

Additionally, I've added a new unit test case (which does fail in the main branch ATM). The test actually tests the shutdown function of the ChecksController, which we had unfortunately neglected.

  • Unit tests succeeded
  • E2E tests succeeded

TODO

  • I've assigned this PR to myself
  • I've labeled this PR correctly

First metric announces registration state.
When teh sparrow check controller would shut down, it would iterate over
the actual slice of the checks, shutdown each check and then procceed to
delete said check from the slice.

Since the shutting down procedure is instant, there was a race condition
that would delete a wrong check from the slice and then the same
shutting down loop would try and shutdown the same check again.

Just returning a copy of the slice resolves this problem, as the
iteration is now done on the copy only. A more sophisticated deletion
routine for the checks slice could be another way to handle this, but it
would probably increase the complexity of the checks and
checkscontroller structs.
@puffitos puffitos self-assigned this Dec 25, 2024
@puffitos puffitos added bug Something isn't working area/checks-controller Issues/PRs related to the ChecksController labels Dec 25, 2024
Copy link
Member

@y-eight y-eight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -53,5 +54,5 @@ func (c *Checks) Delete(check checks.Check) {
func (c *Checks) Iter() []checks.Check {
c.mu.RLock()
defer c.mu.RUnlock()
return c.checks
return slices.Clone(c.checks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK will this be a problem. When slices.Clone(c.checks) is called, it creates a copy of the checks. Any updates applied to these cloned instances will not affect the original running instances. This means that the original checks will never receive the updated configurations, which can prevent them from functioning correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write some tests to replicate the problem. And then we can talk about how to better fix this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news: the tests show that the change works as intended. Or at least I think they do :)

The shallow copy still provides a slice with the references intact. Each check instance will run its UpdateConfig and the actual check in the checks field of the controller will get updated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense, but let me give it a try in a manual e2e test just to make sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it was the next thing on my list. Let's sit down after you're done with your exams and talk about your e2e branch 🌿

This test proves that the shallow clone works as intended and returns a clone of the slice, where the original references can still be used and updated.
This should fix the bodyclose linting remarks

Signed-off-by: Bruno Bressi <[email protected]>
@puffitos puffitos requested review from lvlcn-t and y-eight January 23, 2025 22:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks-controller Issues/PRs related to the ChecksController bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Sparrow's shutdown sends signal on closed channel
3 participants