Lighthouse agent is slow to process current EndpointSlices update events from broker during resync after restart #1706

Open
t0lya opened this issue Jan 16, 2025 · 13 comments
Labels: bug (Something isn't working), need-info (Need more info from the reporter)

@t0lya

t0lya commented Jan 16, 2025

What happened:
We have a fleet of 40 clusters with service discovery enabled using Submariner. We rely on Submariner to sync endpointslices across the fleet. We currently have over 1500 endpointslices in our broker cluster.

We have noticed an issue where a cluster in our fleet can stop syncing endpointslices from the broker for up to 10 minutes. This happens when the submariner-lighthouse-agent pod restarts on the cluster. We have seen pod restarts due to preemption by the Kubernetes scheduler and due to nodes getting tainted and drained for maintenance. We fixed the former by increasing the submariner-lighthouse-agent scheduling priority, but node draining still affects submariner-lighthouse-agent availability.

When the agent pod restarts, it resyncs all endpointslices from the broker to the cluster (processing appears to happen in endpointslice name/namespace alphabetical order). Even if endpointslices are updated on the broker during this resync, those update events do not get processed until the processing loop reaches them in alphabetical order. Due to the volume of endpointslices in our broker (> 1500), endpointslices at the end of the resync list get processed only after about 10 minutes.

What you expected to happen:

Is it possible to improve the availability of submariner-lighthouse-agent so it can tolerate occasional random pod restarts? One solution is to increase the agent replica count and add leader election, so the agent keeps running even when one of the pods goes down. We can discuss better alternative solutions.

How to reproduce it (as minimally and precisely as possible):

  1. Create 2 clusters and join them to the broker.
  2. Create 1500 headless services/service exports in cluster A. This should create 1500 endpointslices in the broker.
  3. Restart the lighthouse agent in cluster B.
  4. Restart a deployment for a headless service in cluster A to trigger a pod IP change in its endpointslice.
  5. Observe that the endpointslice in the broker cluster gets updated immediately, but the endpointslice in cluster B takes some time to receive the latest changes.

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Submariner Helm chart
  • Others:
@t0lya t0lya added the bug Something isn't working label Jan 16, 2025
@tpantelis
Contributor

What version of Submariner are you using?

@leasonliu

What version of Submariner are you using?

0.17.3

@tpantelis
Contributor

This looks similar to #1623, which eliminated the periodic resync, but all EndpointSlices are still processed on startup and they're all added to the queue quickly, which kicks in the rate limiter.

@t0lya
Author

t0lya commented Jan 17, 2025

Is it possible to prioritize processing current events over resynced events? From the logs I see that many of the resynced events are no-ops; only a few of them are actually missed events that require processing. Current events, on the other hand, should be processed as soon as possible; why do they get processed only after all the resynced events are done?

One possible idea is partitioned rate limiting: one partition for ongoing events and one for resync events. That way there is always dedicated capacity for processing current events.

@tpantelis
Contributor

Is it possible to prioritize processing current events over resynced events? From the logs I see that many of the resynced events are no-ops; only a few of them are actually missed events that require processing. Current events, on the other hand, should be processed as soon as possible; why do they get processed only after all the resynced events are done?

One possible idea is partitioned rate limiting: one partition for ongoing events and one for resync events. That way there is always dedicated capacity for processing current events.

We don't have control over the event processing ordering - that's handled by the K8s informer. On startup, it lists all current resources and notifies the event handler before starting the watch for new events.

We could just remove the rate limiting altogether, or at least the BucketRateLimiter, which I think is the limiting factor here. You could try removing the BucketRateLimiter from here and see if that alleviates the problem. I recall from a previous issue you reported that you changed some code and rebuilt the lighthouse agent.
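
For anyone who wants to try this, a minimal sketch of what a work-queue rate limiter without the BucketRateLimiter could look like (the limiter values and queue name below are illustrative, not the ones admiral actually uses):

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newRateLimiterWithoutBucket builds a rate limiter that only applies
// per-item exponential backoff on failures, with no overall token-bucket
// throttling across items (i.e. the BucketRateLimiter is dropped).
func newRateLimiterWithoutBucket() workqueue.RateLimiter {
	return workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second)
}

func main() {
	// Hypothetical queue name; new items are only delayed when they individually fail.
	queue := workqueue.NewNamedRateLimitingQueue(newRateLimiterWithoutBucket(), "endpointslice-sync")
	defer queue.ShutDown()
	_ = queue
}
```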

@t0lya
Author

t0lya commented Jan 17, 2025

Thank you, we can try adjusting the rate limiting settings on our end.

Also, do you know if there is ongoing work to make the submariner-lighthouse-agent deployment run multiple replicas with leader election? Or is this something we need to try ourselves?

@tpantelis
Contributor

Thank you, we can try adjusting the rate limiting settings on our end.

Also, do you know if there is ongoing work to make the submariner-lighthouse-agent deployment run multiple replicas with leader election? Or is this something we need to try ourselves?

The current developers have no plans for that.
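
For reference, if someone does want to experiment with multiple replicas on their own, a minimal sketch of leader election using client-go's leaderelection package could look like the following. The lease name, namespace, and identity are hypothetical, not anything Submariner ships:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical Lease name/namespace; each replica uses its pod name as its identity.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "lighthouse-agent-leader", Namespace: "submariner-operator"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected replica would start the syncers here.
			},
			OnStoppedLeading: func() {
				// Stop syncing (or exit) so another replica can take over.
				os.Exit(0)
			},
		},
	})
}
```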

@t0lya
Author

t0lya commented Jan 17, 2025

Sorry, one more question. Why is there only one goroutine processing events from the queue? Is there some concurrency concern with using multiple goroutines to process endpointslices in parallel?

https://github.com/submariner-io/admiral/blob/6dee2d4b7e49d5047c81484eeacd12d7d25bea51/pkg/workqueue/queue.go#L83-L88

The bucket rate limiter is configured with 500 tokens and refills at 10 tokens per second, so it should only delay 1500 events by about 100 seconds ((1500 − 500) / 10 = 100 s). I think the 10-minute latency we are seeing is mostly from slow processing of events, not from rate limiting. Maybe processing events in parallel would improve latency; we can explore that in our setup.
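
For illustration, a sketch of what running several workers against a plain client-go rate-limiting work queue could look like (the worker count and sync function are hypothetical stand-ins, not admiral's actual code):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

func runWorkers(queue workqueue.RateLimitingInterface, workers int, stopCh <-chan struct{}) {
	for i := 0; i < workers; i++ {
		// Each goroutine pulls from the same queue; the work queue guarantees
		// that a given key is never processed by two workers at the same time.
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stopCh)
	}
}

func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := syncKey(key.(string)); err != nil {
		queue.AddRateLimited(key) // retry with backoff
		return true
	}
	queue.Forget(key)
	return true
}

// syncKey is a stand-in for the real per-EndpointSlice sync logic.
func syncKey(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "example")
	stopCh := make(chan struct{})
	runWorkers(queue, 4, stopCh)

	queue.Add("default/my-endpointslice")
	time.Sleep(time.Second)
	queue.ShutDown()
}
```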

@tpantelis
Contributor

Sorry, one more question. Why is there only one goroutine processing events from the queue? Is there some concurrency concern with using multiple goroutines to process endpointslices in parallel?

https://github.com/submariner-io/admiral/blob/6dee2d4b7e49d5047c81484eeacd12d7d25bea51/pkg/workqueue/queue.go#L83-L88

The bucket rate limiter is configured with 500 tokens and refills at 10 tokens per second, so it should only delay 1500 events by about 100 seconds ((1500 − 500) / 10 = 100 s). I think the 10-minute latency we are seeing is mostly from slow processing of events, not from rate limiting. Maybe processing events in parallel would improve latency; we can explore that in our setup.

The actual processing of the events in our code on sync from the broker is negligible (probably on the order of a few ms at most). The wildcard is accessing the API server after transformation to update the local resource. If there is significant total latency processing a single event, I'm sure the vast majority is spent accessing the API server, and that's out of our hands, in which case I don't think parallelizing the processing is really going to help much with throughput. In fact, it may make things worse by further overloading the API server. Then again, who knows; you can certainly experiment with it.

I still suggest removing the bucket rate limiter and seeing if that helps enough.

@vthapar
Contributor

vthapar commented Jan 20, 2025

@tpantelis There could be some value in providing an easier, user-configurable way to tweak rate limiters, retry timers, etc. As different users come up with more diverse deployments, a one-size-fits-all approach will not work well. Maybe a configmap to tweak such params?
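
As a rough sketch of that idea, the tuning knobs could be read from environment variables populated from a ConfigMap and fed into the usual composite rate limiter (all names and defaults below are hypothetical, not existing Submariner settings):

```go
package main

import (
	"os"
	"strconv"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// envInt reads an integer tuning knob from the environment (which a Deployment
// could populate from a ConfigMap), falling back to a default.
func envInt(name string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil {
		return v
	}
	return def
}

// newTunableRateLimiter builds the usual composite limiter, but with the
// bucket QPS/burst and per-item backoff taken from configuration.
func newTunableRateLimiter() workqueue.RateLimiter {
	qps := envInt("SYNC_RATE_LIMIT_QPS", 10)
	burst := envInt("SYNC_RATE_LIMIT_BURST", 500)
	maxBackoffSecs := envInt("SYNC_MAX_BACKOFF_SECONDS", 30)

	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, time.Duration(maxBackoffSecs)*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
	)
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(newTunableRateLimiter(), "endpointslice-sync")
	defer queue.ShutDown()
	_ = queue
}
```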

@tpantelis
Contributor

@tpantelis There could be some value in providing an easier, user-configurable way to tweak rate limiters, retry timers, etc. As different users come up with more diverse deployments, a one-size-fits-all approach will not work well. Maybe a configmap to tweak such params?

Perhaps, but I think we should just remove the bucket rate limiter. I think we get enough overall throttling via the single worker.

@tpantelis
Contributor

So, looking at this in more detail: with 1500 EPSes taking 10 minutes total to process (minus the 1+ minute for the rate limiter), that works out to about 300-400 ms per EPS on average. For broker syncing, the processing once de-queued mainly entails:

  1. Convert from Unstructured to EndpointSlice
  2. Transform the object (onRemoteEndpointSlice)
  3. Update the local resource copy (Distribute)
    a) Convert from EndpointSlice to Unstructured
    b) Read the existing resource
    c) Update the resource if changed
  4. Call OnSuccessfulSyncFromBroker (enqueueForConflictCheck)

Step 4 has an artificial wait, up to 100 ms in the worst case, for the synced object to get into the local cache. This should only come into play for newly created EPSes, and it is only done for ClusterIP services; it sounds like yours are headless.

The rest are (or should be) negligible except for 3b and 3c, which involve requests to the API server. 3c shouldn't occur if nothing in the EPS actually changed. So it seems that reading the existing resource is taking most of the time, unless I'm missing something. I don't know whether ~300 ms latency is normal, but parallelizing would probably help at least somewhat.
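
To make 3b/3c concrete, the read-then-update-if-changed pattern against the API server looks roughly like this with the dynamic client (a simplified sketch, not the actual Distribute implementation):

```go
package example

import (
	"context"
	"reflect"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var endpointSliceGVR = schema.GroupVersionResource{
	Group:    "discovery.k8s.io",
	Version:  "v1",
	Resource: "endpointslices",
}

// distribute mirrors steps 3b/3c: read the existing local copy (one API round
// trip), then update only if something actually changed (a second round trip
// that is skipped when the content is identical).
func distribute(ctx context.Context, client dynamic.Interface, desired *unstructured.Unstructured) error {
	ns := desired.GetNamespace()

	existing, err := client.Resource(endpointSliceGVR).Namespace(ns).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = client.Resource(endpointSliceGVR).Namespace(ns).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Skip the update round trip when nothing changed (the common case on resync).
	if reflect.DeepEqual(existing.Object["endpoints"], desired.Object["endpoints"]) &&
		reflect.DeepEqual(existing.Object["ports"], desired.Object["ports"]) {
		return nil
	}

	desired.SetResourceVersion(existing.GetResourceVersion())
	_, err = client.Resource(endpointSliceGVR).Namespace(ns).Update(ctx, desired, metav1.UpdateOptions{})
	return err
}
```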

@github-project-automation github-project-automation bot moved this to Backlog in Backlog Jan 21, 2025
@dfarrell07 dfarrell07 added the need-info Need more info from the reporter label Jan 21, 2025
@tpantelis
Contributor

tpantelis commented Jan 21, 2025

It is possible to hook a backend priority queue into the client-go rate-limiting work queue using the Queue interface added by this PR. There's example code for creating a PriorityQueue based on the container/heap package. I'll look into this when I get some cycles.
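
As a rough illustration of the container/heap approach (independent of whatever the PR's Queue interface actually looks like), a priority queue that favors live watch events over initial-list events could be as simple as:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a queued key with a priority; lower values are served first,
// so live watch events could use 0 and initial-list (resync) events 1.
type item struct {
	key      string
	priority int
}

// priorityQueue implements heap.Interface over a slice of items.
type priorityQueue []item

func (pq priorityQueue) Len() int           { return len(pq) }
func (pq priorityQueue) Less(i, j int) bool { return pq[i].priority < pq[j].priority }
func (pq priorityQueue) Swap(i, j int)      { pq[i], pq[j] = pq[j], pq[i] }
func (pq *priorityQueue) Push(x any)        { *pq = append(*pq, x.(item)) }

func (pq *priorityQueue) Pop() any {
	old := *pq
	n := len(old)
	it := old[n-1]
	*pq = old[:n-1]
	return it
}

func main() {
	pq := &priorityQueue{}
	heap.Init(pq)
	heap.Push(pq, item{key: "resync/eps-0001", priority: 1})
	heap.Push(pq, item{key: "resync/eps-0002", priority: 1})
	heap.Push(pq, item{key: "live/eps-1499", priority: 0}) // a live update jumps the line

	for pq.Len() > 0 {
		fmt.Println(heap.Pop(pq).(item).key) // prints the live key first, then the resync keys
	}
}
```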
