
Excessive DNS Queries - linkerd2-proxy #13508

Open
willhughes-au opened this issue Dec 19, 2024 · 3 comments

@willhughes-au

What is the issue?

In the course of debugging DNS issues in our Kubernetes cluster, I noticed that our linkerd2 pods are each making hundreds of DNS queries within a 2-second period, approximately every 30 seconds.

The DNS queries are for two SRV records: linkerd-policy.linkerd.svc.cluster.local and linkerd-dst-headless.linkerd.svc.cluster.local

We are observing approximately 95K queries for just these two records in the space of about 5 minutes across our cluster.

While trying to resolve the issue, I've been looking into the linkerd source to see if I can identify the cause.

I've found that linkerd2-proxy's async fn resolution(...) loops to observe DNS changes.

The expiry.await it waits on is set from async fn resolve_srv(...) (when SRV resolution succeeds), which uses the valid_until value to set the sleep time.

It's here that my knowledge of Rust falls short. To me it appears that this lacks protection against a valid_until time that is now or in the past, which could be due to a TTL parsing error, a timezone issue, or something else.

If I'm correct, what I believe is happening is that linkerd2-proxy tries to resolve the two addresses above via SRV and gets either a zero TTL or a calculated time in the past. The sleep is therefore a no-op, and the proxy runs in a tight loop until the DNS server returns an error.
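To illustrate the shape of the loop I'm describing, here's a minimal sketch with made-up types and a stub resolver (this is not the proxy's actual code, just the pattern):

```rust
use std::time::{Duration, Instant};
use tokio::time::{sleep, sleep_until};

// Made-up stand-in for the proxy's SRV resolution result.
struct SrvResult {
    addrs: Vec<std::net::SocketAddr>,
    valid_until: Instant, // now + the TTL reported on the SRV record
}

// Stub resolver: a zero TTL (e.g. a stale cache answer) makes
// valid_until equal to "now".
async fn resolve_srv(_name: &str) -> std::io::Result<SrvResult> {
    Ok(SrvResult {
        addrs: vec![],
        valid_until: Instant::now(),
    })
}

async fn resolution(name: &str) {
    loop {
        match resolve_srv(name).await {
            Ok(result) => {
                // ... publish result.addrs to watchers ...

                // If valid_until is "now" or already in the past, this
                // returns immediately and the loop re-resolves with no
                // delay at all.
                sleep_until(result.valid_until.into()).await;
            }
            Err(_) => {
                // Errors get a back-off; a zero TTL does not.
                sleep(Duration::from_secs(5)).await;
            }
        }
    }
}
```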

How can it be reproduced?

I'm not certain how it can be reproduced; for us it's happening on all of our clusters, and I'm not certain when it began.

Logs, error output, etc

I'm not sure there's anything particularly useful.

But dig does return valid TTL values when run from within the same cluster and on the same nodes:

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2860afa66b5e467a (echoed)
;; QUESTION SECTION:
;linkerd-policy.linkerd.svc.cluster.local. IN SRV

;; ANSWER SECTION:
linkerd-policy.linkerd.svc.cluster.local. 5 IN SRV 0 50 8090 172-30-90-196.linkerd-policy.linkerd.svc.cluster.local.
linkerd-policy.linkerd.svc.cluster.local. 5 IN SRV 0 50 8090 172-30-27-159.linkerd-policy.linkerd.svc.cluster.local.

;; ADDITIONAL SECTION:
172-30-27-159.linkerd-policy.linkerd.svc.cluster.local. 5 IN A 172.30.27.159
172-30-90-196.linkerd-policy.linkerd.svc.cluster.local. 5 IN A 172.30.90.196

;; Query time: 0 msec
;; SERVER: 169.254.20.10#53(169.254.20.10) (UDP)
;; WHEN: Thu Dec 19 23:12:25 UTC 2024
;; MSG SIZE  rcvd: 449

output of linkerd check -o short

$ ./linkerd2-cli-edge-24.11.8-linux-amd64 check -o short
linkerd-multicluster
--------------------
‼ Link and CLI versions match
            * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
    see https://linkerd.io/2/checks/#l5d-multicluster-links-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.31
  • Cluster Environment: Amazon EKS
  • Host OS: Amazon Linux 2023.6.20241212
  • Linkerd Version: edge-24.11.8

Possible solution

Setting a minimum sleep interval would add protection against tight loops.
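
For example, something along these lines; the MIN_REFRESH value and function name are made up for illustration:

```rust
use std::time::{Duration, Instant};

/// Assumed floor on how often the proxy should re-resolve, so that a
/// zero or already-expired TTL can't produce a no-op sleep.
const MIN_REFRESH: Duration = Duration::from_secs(5);

/// Never let the computed expiry fall before "now + MIN_REFRESH".
fn clamp_expiry(valid_until: Instant) -> Instant {
    valid_until.max(Instant::now() + MIN_REFRESH)
}
```

The refresh loop would then sleep until clamp_expiry(valid_until) instead of the raw expiry.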

Additional context

Happy to help with testing out or getting more info.

Would you like to work on fixing this bug?

no

@willhughes-au
Author

I've been digging more into the problem, and I think I have a possible explanation of what's happening.

We use CoreDNS both as the AWS EKS add-on and in node-local-dns.

The CoreDNS cache plugin, by default (or when set to immediate), serves the stale DNS result with a TTL of zero.

I suspect that this, combined with the SRV record's TTL of 30 seconds and linkerd-proxy's bad wait configuration, is what's causing the problem.

So to sum up, what seems to be happening is:

  • linkerd-proxy tries to resolve the host.
  • node-local-dns finds the entry in its cache, but it has expired, so it immediately returns the stale entry with a TTL of zero.
  • node-local-dns starts a request to resolve the entry upstream.
  • linkerd-proxy gets the zero-TTL response and immediately tries to resolve again; repeat from the top (the whole request took ~2 ms).
  • node-local-dns sees a thundering herd of resolve requests, causing some contention; after about 1 second it finally updates its cache with the new value.
  • linkerd-proxy finally sees a TTL of ~29s and chills out for the next 29 seconds.

The whole thundering-herd problem is made worse by every linkerd-proxy instance in the cluster synchronising on the TTL interval. Every pod tries to make the same request at exactly the same time.

I can't test today, but I will test on Monday my time to see if disabling the cache stops the issue.

@willhughes-au
Author

willhughes-au commented Dec 23, 2024

After some more testing today, I've managed to "resolve" the excessive DNS queries.

If I change the cache plugin to use keepttl, which retains the TTL originally received rather than internally decrementing the TTL of the cached item, the behavior goes away.
I don't like setting this option, though, as it can cause other unexpected behavior.
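
For reference, this is roughly the cache block I tested with (Corefile syntax from memory; check the cache plugin docs for the options available in your CoreDNS version):

```
cache 30 {
    keepttl
}
```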

This confirms for me that linkerd-proxy's tight loop is ultimately to blame - it should have some form of protection against a zero TTL value, but does not.

Additionally, IMO sleeping for the TTL interval is likely to cause excessive load by having all linkerd-proxy instances in the cluster synchronise their queries to happen at approximately the same time.

At the very minimum, I think introducing random sleep jitter is necessary. This would stop all of the linkerd-proxy instances from falling into lock-step and making DNS queries at the same instant. Given the TTL can be up to 30 seconds, a random jitter of between 5 seconds and half the TTL value would make sense.
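
As a rough sketch of what I mean, using the rand crate (the 5-second floor and half-TTL cap are just the numbers suggested above, not a tested policy):

```rust
use rand::Rng;
use std::time::{Duration, Instant};

/// Push the refresh time out by a random jitter so that proxies which
/// received the same TTL don't all re-resolve at the same instant.
fn jittered_expiry(valid_until: Instant) -> Instant {
    let ttl = valid_until.saturating_duration_since(Instant::now());
    let lo = Duration::from_secs(5);
    let hi = (ttl / 2).max(lo); // keep the range non-empty for small TTLs
    let jitter = rand::thread_rng().gen_range(lo..=hi);
    valid_until + jitter
}
```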

@olix0r
Member

olix0r commented Jan 9, 2025

@willhughes-au thanks for doing the research on this.

Looking at Envoy for inspiration, it looks like they ignore 0 TTL values and revert to a default. That's probably what we ought to do in our DNS implementation as well.
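
Sketching that idea (the 30-second default here is just an illustrative assumption, not a decided value):

```rust
use std::time::Duration;

/// Assumed fallback used when a response carries a zero TTL.
const DEFAULT_TTL: Duration = Duration::from_secs(30);

/// Treat a zero TTL as "use the default" instead of "refresh immediately".
fn effective_ttl(ttl: Duration) -> Duration {
    if ttl.is_zero() {
        DEFAULT_TTL
    } else {
        ttl
    }
}
```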
