Excessive DNS Queries - linkerd2-proxy #13508
I've been digging more into the problem, and I think I may have a possible understanding of what's happening. We use CoreDNS both as the AWS EKS add-on and in node-local-dns. The CoreDNS cache plugin will, by default, serve cached answers with the remaining TTL, which counts down from the original value towards zero. Combined with the SRV record's TTL of 30 seconds, I suspect that this, together with linkerd-proxy's bad wait behaviour, is causing this. So to sum up, what seems to be happening is: the cached TTL counts down to zero, linkerd-proxy's sleep until expiry becomes a no-op, and it falls into a tight query loop.
The whole thundering-herd problem is made worse by every linkerd-proxy instance in the cluster synchronising on the TTL interval. Every pod tries to make the same request at exactly the same time. I can't test today, but I will test on Monday my time to see if disabling the cache stops the issue.
After some more testing today, I've managed to "resolve" the excessive DNS queries: changing the cache plugin's configuration stops them. This confirms for me that linkerd-proxy's tight loop is ultimately to blame - it should have some form of protection against a zero TTL value, but does not.

Additionally, IMO sleeping for the TTL interval is likely to cause excessive load by having all linkerd-proxy instances in the cluster synchronise their queries to happen at approximately the same time. At the very minimum, I think introducing random sleep jitter is necessary. This would stop all of the linkerd-proxy instances falling into lock-step and making DNS queries at the same instant. Given the TTL can be up to 30 seconds, introducing a random jitter of between at least 5 seconds and half the TTL value would make sense.
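To make the jitter idea concrete, here is a minimal sketch, not linkerd's actual code: the function name, the 5-second floor, and the use of the `rand` and `tokio` crates are assumptions for illustration only.

```rust
use std::time::Duration;

use rand::Rng;
use tokio::time::sleep;

/// Sleep for the record's TTL plus a random jitter so that many proxy
/// instances holding the same TTL do not all re-resolve at the same instant.
async fn sleep_with_jitter(ttl: Duration) {
    // Jitter between 5 seconds and half the TTL, as suggested above;
    // when the TTL is very short this collapses to a flat 5 seconds.
    let max_jitter = (ttl / 2).max(Duration::from_secs(5));
    let jitter_ms = rand::thread_rng().gen_range(5_000u64..=max_jitter.as_millis() as u64);
    sleep(ttl + Duration::from_millis(jitter_ms)).await;
}
```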
@willhughes-au thanks for doing the research on this. Looking at Envoy for inspiration, it looks like they ignore 0 TTL values and revert to a default. That's probably what we ought to do in our DNS implementation as well.
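A sketch of what that could look like, assuming a hypothetical `effective_ttl` helper and an arbitrary 30-second default (neither is taken from linkerd's or Envoy's actual source):

```rust
use std::time::Duration;

/// Fallback used when a DNS answer reports a TTL of zero; the 30-second
/// value is a placeholder, not something taken from linkerd or Envoy.
const DEFAULT_TTL: Duration = Duration::from_secs(30);

/// Ignore a zero TTL and substitute the default, so the refresh timer
/// never collapses to "expire immediately".
fn effective_ttl(ttl_secs: u32) -> Duration {
    if ttl_secs == 0 {
        DEFAULT_TTL
    } else {
        Duration::from_secs(u64::from(ttl_secs))
    }
}
```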
What is the issue?
In the course of debugging issues with DNS in our Kubernetes cluster, I noticed that our linkerd2 pods are each making hundreds of DNS queries within a 2-second period, approximately every 30 seconds.
The DNS queries are for two SRV records: `linkerd-policy.linkerd.svc.cluster.local` and `linkerd-dst-headless.linkerd.svc.cluster.local`.
We are observing approximately 95K queries for just these two records in the space of about 5 minutes across our cluster.
While trying to resolve the issue, I've been looking into the linkerd source to see if I can identify the cause.
I've found linkerd2-proxy's `async fn resolution(...)` does a loop to try and observe DNS changes. The `expiry.await` which it waits on is set from `async fn resolve_srv(...)` (when successfully resolving SRV), where it's using the `valid_until` value to set the sleep time.

It's here that my knowledge of Rust falls short. To me it appears that this lacks protection against a `valid_until` time that is now or a time in the past. It could be due to a parsing error for the TTL, or a timezone issue, or something else.

If I'm correct, then what I believe is happening is that linkerd2-proxy is trying to resolve the two mentioned addresses via SRV - it gets either a zero TTL or calculates a time in the past. The sleep therefore is a no-op, and it runs in a tight loop until the DNS server returns an error.
How can it be reproduced?
I'm not certain how it can be reproduced - for us it's happening on all of our clusters. I'm not certain when it began.
Logs, error output, etc
Not sure there's anything particularly useful. However, `dig` returns valid TTL values when run from within the same cluster and nodes.

output of `linkerd check -o short`
Environment
Possible solution
Setting a minimum sleep interval would add protection against tight loops.
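A minimal sketch of that protection, assuming a hypothetical helper around `tokio::time::sleep_until` and an arbitrary 5-second floor; the real fix would live in linkerd2-proxy's DNS code and may look different.

```rust
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

/// Lower bound on how soon the resolver may re-query, regardless of the
/// TTL in the response; the 5-second value is only an example.
const MIN_REFRESH: Duration = Duration::from_secs(5);

/// Wait until `valid_until`, but never less than `MIN_REFRESH` from now,
/// so a zero or already-expired TTL cannot turn the loop into a busy loop.
async fn sleep_until_expiry(valid_until: Instant) {
    let floor = Instant::now() + MIN_REFRESH;
    sleep_until(valid_until.max(floor)).await;
}
```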
Additional context
Happy to help with testing things out or providing more info.
Would you like to work on fixing this bug?
no