Lighthouse agent is slow to process current EndpointSlices update events from broker during resync after restart #1706

Open
t0lya opened this issue Jan 16, 2025 · 13 comments
Labels: bug (Something isn't working), need-info (Need more info from the reporter)

@t0lya

t0lya commented Jan 16, 2025

What happened:
We have a fleet of 40 clusters with service discovery enabled using Submariner. We rely on Submariner to sync endpointslices across the fleet. We currently have over 1500 endpointslices in our broker cluster.

We have noticed an issue where a cluster in our fleet can stop syncing endpointslices from the broker for up to 10 minutes. This happens when the submariner-lighthouse-agent pod restarts on the cluster. We have seen pod restarts due to preemption by the Kubernetes scheduler and due to nodes getting tainted and drained for maintenance. We fixed the former by increasing the submariner-lighthouse-agent scheduling priority, but node draining still affects submariner-lighthouse-agent availability.

When the agent pod restarts, it resyncs all endpointslices from the broker to the cluster (processing appears to happen in endpointslice name/namespace alphabetical order). Even if endpointslices are updated on the broker during this resync, those update events do not get processed until the processing loop reaches them in alphabetical order. Due to the volume of endpointslices in our broker (> 1500), endpointslices at the end of the resync list get processed only after about 10 minutes.

What you expected to happen:

Is it possible to improve the availability of submariner-lighthouse-agent so it can tolerate occasional random pod restarts? One solution is to increase the agent replica count and add leader election, so the agent keeps running even when one of the pods goes down. We can discuss better alternative solutions.

How to reproduce it (as minimally and precisely as possible):

  1. Create 2 clusters and join them to the broker.
  2. Create 1500 headless services/service exports in cluster A. This should create 1500 endpointslices in the broker.
  3. Restart the lighthouse agent in cluster B.
  4. Restart a deployment for a headless service in cluster A to trigger a pod IP change in its endpointslice.
  5. Observe that the endpointslice in the broker cluster gets updated immediately, but the endpointslice in cluster B takes some time to receive the latest changes.

Anything else we need to know?:

Environment: Linux

  • Diagnose information (use subctl diagnose all):
  • Gather information (use subctl gather):
  • Cloud provider or hardware configuration: Azure Kubernetes
  • Install tools: Submariner Helm chart
  • Others:
@t0lya t0lya added the bug Something isn't working label Jan 16, 2025
@tpantelis
Contributor

What version of Submariner are you using?

@leasonliu

What version of Submariner are you using?

0.17.3

@tpantelis
Contributor

This looks similar to #1623, which eliminated the periodic resync, but all EndpointSlices are still processed on startup and they're all added to the queue quickly, which kicks in the rate limiter.

@t0lya
Author

t0lya commented Jan 17, 2025

Is it possible to prioritize processing current events over resynced events? From the logs I see that many of the resynced events are no-ops; only a few of them are actually missed events that require processing. Current events, on the other hand, should be processed as soon as possible; why do they get processed only after all the resynced events are done?

One possible idea is partitioned rate limiting: one partition for ongoing events and one for resync events. That way there is always dedicated capacity for processing current events.

@tpantelis
Contributor

Is it possible to prioritize processing current events over resynced events? From the logs I see that many of the resynced events are no-ops; only a few of them are actually missed events that require processing. Current events, on the other hand, should be processed as soon as possible; why do they get processed only after all the resynced events are done?

One possible idea is partitioned rate limiting: one partition for ongoing events and one for resync events. That way there is always dedicated capacity for processing current events.

We don't have control over the event processing ordering - that's handled by the K8s informer. On startup, it lists all current resources and notifies the event handler before starting the watch for new events.

We could just remove the rate limiting altogether, or at least the BucketRateLimiter, which I think is the limiting factor here. You could try removing the BucketRateLimiter from here and see if that alleviates the problem. I recall from a previous issue you reported that you changed some code and rebuilt the lighthouse agent.
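
For anyone who wants to try this, a minimal sketch of what a work-queue rate limiter without the BucketRateLimiter could look like (the limiter values and queue name below are illustrative, not the ones admiral actually uses):

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newRateLimiterWithoutBucket builds a rate limiter that only applies
// per-item exponential backoff on failures, with no overall token-bucket
// throttling across items (i.e. the BucketRateLimiter is dropped).
func newRateLimiterWithoutBucket() workqueue.RateLimiter {
	return workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 30*time.Second)
}

func main() {
	// Hypothetical queue name; new items are only delayed when they individually fail.
	queue := workqueue.NewNamedRateLimitingQueue(newRateLimiterWithoutBucket(), "endpointslice-sync")
	defer queue.ShutDown()
	_ = queue
}
```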

@t0lya
Author

t0lya commented Jan 17, 2025

Thank you, we can try adjusting the rate limiting settings on our end.

Also, do you know if there is ongoing work to make the submariner-lighthouse-agent deployment run multiple replicas with leader election? Or is this something we need to try ourselves?

@tpantelis
Contributor

Thank you, we can try adjusting the rate limiting settings on our end.

Also, do you know if there is ongoing work to make the submariner-lighthouse-agent deployment run multiple replicas with leader election? Or is this something we need to try ourselves?

The current developers have no plans for that.
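
For reference, if someone does want to experiment with multiple replicas on their own, a minimal sketch of leader election using client-go's leaderelection package could look like the following. The lease name, namespace, and identity are hypothetical, not anything Submariner ships:

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical Lease name/namespace; each replica uses its pod name as its identity.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "lighthouse-agent-leader", Namespace: "submariner-operator"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected replica would start the syncers here.
			},
			OnStoppedLeading: func() {
				// Stop syncing (or exit) so another replica can take over.
				os.Exit(0)
			},
		},
	})
}
```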

@t0lya
Author

t0lya commented Jan 17, 2025

Sorry, one more question. Why is there only one goroutine processing events from the queue? Is there some concurrency concern with using multiple goroutines to process endpointslices in parallel?

https://github.com/submariner-io/admiral/blob/6dee2d4b7e49d5047c81484eeacd12d7d25bea51/pkg/workqueue/queue.go#L83-L88

The bucket rate limiter is configured with 500 tokens and refills at 10 tokens per second, so it should only delay 1500 events by about 100 seconds ((1500 − 500) / 10 = 100 s). I think the 10-minute latency we are seeing is mostly from slow processing of events, not from rate limiting. Maybe processing events in parallel would improve latency; we can explore that in our setup.
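
For illustration, a sketch of what running several workers against a plain client-go rate-limiting work queue could look like (the worker count and sync function are hypothetical stand-ins, not admiral's actual code):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

func runWorkers(queue workqueue.RateLimitingInterface, workers int, stopCh <-chan struct{}) {
	for i := 0; i < workers; i++ {
		// Each goroutine pulls from the same queue; the work queue guarantees
		// that a given key is never processed by two workers at the same time.
		go wait.Until(func() {
			for processNextItem(queue) {
			}
		}, time.Second, stopCh)
	}
}

func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(key)

	if err := syncKey(key.(string)); err != nil {
		queue.AddRateLimited(key) // retry with backoff
		return true
	}
	queue.Forget(key)
	return true
}

// syncKey is a stand-in for the real per-EndpointSlice sync logic.
func syncKey(key string) error {
	fmt.Println("syncing", key)
	return nil
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "example")
	stopCh := make(chan struct{})
	runWorkers(queue, 4, stopCh)

	queue.Add("default/my-endpointslice")
	time.Sleep(time.Second)
	queue.ShutDown()
}
```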

@tpantelis
Contributor

Sorry, one more question. Why is there only one goroutine processing events from the queue? Is there some concurrency concern with using multiple goroutines to process endpointslices in parallel?

https://github.com/submariner-io/admiral/blob/6dee2d4b7e49d5047c81484eeacd12d7d25bea51/pkg/workqueue/queue.go#L83-L88

The bucket rate limiter is configured with 500 tokens and refills at 10 tokens per second, so it should only delay 1500 events by about 100 seconds ((1500 − 500) / 10 = 100 s). I think the 10-minute latency we are seeing is mostly from slow processing of events, not from rate limiting. Maybe processing events in parallel would improve latency; we can explore that in our setup.

The actual processing of the events in our code on sync from the broker is negligible (probably on the order of a few ms at most). The wildcard is accessing the API server after transformation to update the local resource. If there is significant total latency processing a single event, I'm sure the vast majority is spent accessing the API server, and that's out of our hands, in which case I don't think parallelizing the processing is really going to help much with throughput. In fact, it may make things worse by further overloading the API server. Then again, who knows; you can certainly experiment with it.

I still suggest removing the bucket rate limiter and seeing if that helps enough.

@vthapar
Contributor

vthapar commented Jan 20, 2025

@tpantelis There could be some value in providing an easier, user-configurable way to tweak rate limiters, retry timers, etc. As different users come up with more diverse deployments, a one-size-fits-all approach will not work well. Maybe a configmap to tweak such params?
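
As a rough sketch of that idea, the tuning knobs could be read from environment variables populated from a ConfigMap and fed into the usual composite rate limiter (all names and defaults below are hypothetical, not existing Submariner settings):

```go
package main

import (
	"os"
	"strconv"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// envInt reads an integer tuning knob from the environment (which a Deployment
// could populate from a ConfigMap), falling back to a default.
func envInt(name string, def int) int {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil {
		return v
	}
	return def
}

// newTunableRateLimiter builds the usual composite limiter, but with the
// bucket QPS/burst and per-item backoff taken from configuration.
func newTunableRateLimiter() workqueue.RateLimiter {
	qps := envInt("SYNC_RATE_LIMIT_QPS", 10)
	burst := envInt("SYNC_RATE_LIMIT_BURST", 500)
	maxBackoffSecs := envInt("SYNC_MAX_BACKOFF_SECONDS", 30)

	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, time.Duration(maxBackoffSecs)*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
	)
}

func main() {
	queue := workqueue.NewNamedRateLimitingQueue(newTunableRateLimiter(), "endpointslice-sync")
	defer queue.ShutDown()
	_ = queue
}
```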

@tpantelis
Contributor

@tpantelis There could be some value in providing an easier, user-configurable way to tweak rate limiters, retry timers, etc. As different users come up with more diverse deployments, a one-size-fits-all approach will not work well. Maybe a configmap to tweak such params?

Perhaps, but I think we should just remove the bucket rate limiter. I think we get enough overall throttling via the single worker.

@tpantelis
Contributor

So, looking at this in more detail: with 1500 EPSes taking 10 minutes total to process (minus the 1+ minute for the rate limiter), that works out to about 300-400 ms per EPS on average. For broker syncing, the processing once de-queued mainly entails:

  1. Convert from Unstructured to EndpointSlice
  2. Transform the object (onRemoteEndpointSlice)
  3. Update the local resource copy (Distribute)
    a) Convert from EndpointSlice to Unstructured
    b) Read the existing resource
    c) Update the resource if changed
  4. Call OnSuccessfulSyncFromBroker (enqueueForConflictCheck)

Step 4 has an artificial wait, up to 100 ms in the worst case, for the synced object to get into the local cache. This should only come into play for newly created EPSes, and it is only done for ClusterIP services; it sounds like yours are headless.

The rest are (or should be) negligible except for 3b and 3c, which involve requests to the API server. 3c shouldn't occur if nothing in the EPS actually changed. So it seems that reading the existing resource is taking most of the time, unless I'm missing something. I don't know whether ~300 ms latency is normal, but parallelizing would probably help at least somewhat.
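
To make 3b/3c concrete, the read-then-update-if-changed pattern against the API server looks roughly like this with the dynamic client (a simplified sketch, not the actual Distribute implementation):

```go
package example

import (
	"context"
	"reflect"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var endpointSliceGVR = schema.GroupVersionResource{
	Group:    "discovery.k8s.io",
	Version:  "v1",
	Resource: "endpointslices",
}

// distribute mirrors steps 3b/3c: read the existing local copy (one API round
// trip), then update only if something actually changed (a second round trip
// that is skipped when the content is identical).
func distribute(ctx context.Context, client dynamic.Interface, desired *unstructured.Unstructured) error {
	ns := desired.GetNamespace()

	existing, err := client.Resource(endpointSliceGVR).Namespace(ns).Get(ctx, desired.GetName(), metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		_, err = client.Resource(endpointSliceGVR).Namespace(ns).Create(ctx, desired, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}

	// Skip the update round trip when nothing changed (the common case on resync).
	if reflect.DeepEqual(existing.Object["endpoints"], desired.Object["endpoints"]) &&
		reflect.DeepEqual(existing.Object["ports"], desired.Object["ports"]) {
		return nil
	}

	desired.SetResourceVersion(existing.GetResourceVersion())
	_, err = client.Resource(endpointSliceGVR).Namespace(ns).Update(ctx, desired, metav1.UpdateOptions{})
	return err
}
```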

@github-project-automation github-project-automation bot moved this to Backlog in Backlog Jan 21, 2025
@dfarrell07 dfarrell07 added the need-info Need more info from the reporter label Jan 21, 2025
@tpantelis
Contributor

tpantelis commented Jan 21, 2025

It is possible to hook a backend priority queue into the client-go rate-limiting work queue using the Queue interface added by this PR. There's example code for creating a PriorityQueue based on the container/heap package. I'll look into this when I get some cycles.
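
As a rough illustration of the container/heap approach (independent of whatever the PR's Queue interface actually looks like), a priority queue that favors live watch events over initial-list events could be as simple as:

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a queued key with a priority; lower values are served first,
// so live watch events could use 0 and initial-list (resync) events 1.
type item struct {
	key      string
	priority int
}

// priorityQueue implements heap.Interface over a slice of items.
type priorityQueue []item

func (pq priorityQueue) Len() int           { return len(pq) }
func (pq priorityQueue) Less(i, j int) bool { return pq[i].priority < pq[j].priority }
func (pq priorityQueue) Swap(i, j int)      { pq[i], pq[j] = pq[j], pq[i] }
func (pq *priorityQueue) Push(x any)        { *pq = append(*pq, x.(item)) }

func (pq *priorityQueue) Pop() any {
	old := *pq
	n := len(old)
	it := old[n-1]
	*pq = old[:n-1]
	return it
}

func main() {
	pq := &priorityQueue{}
	heap.Init(pq)
	heap.Push(pq, item{key: "resync/eps-0001", priority: 1})
	heap.Push(pq, item{key: "resync/eps-0002", priority: 1})
	heap.Push(pq, item{key: "live/eps-1499", priority: 0}) // a live update jumps the line

	for pq.Len() > 0 {
		fmt.Println(heap.Pop(pq).(item).key) // prints the live key first, then the resync keys
	}
}
```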
