
Excessive DNS Queries - linkerd2-proxy #13508

Open
willhughes-au opened this issue Dec 19, 2024 · 3 comments

@willhughes-au

What is the issue?

In the course of debugging DNS issues in our Kubernetes cluster, I noticed that our linkerd2 pods are each making hundreds of DNS queries within a 2-second period, approximately every 30 seconds.

The DNS queries are for two SRV records: linkerd-policy.linkerd.svc.cluster.local and linkerd-dst-headless.linkerd.svc.cluster.local

We are observing approximately 95K queries for just these two records in the space of about 5 minutes across our cluster.

While trying to resolve the issue, I've been looking into the linkerd source to see if I can identify the cause.

I've found that linkerd2-proxy's async fn resolution(...) loops to observe DNS changes.

The expiry.await it waits on is set from async fn resolve_srv(...) (when SRV resolution succeeds), which uses the valid_until value to set the sleep time.

It's here that my knowledge of Rust falls short. To me it appears that this lacks protection against a valid_until time that is now or in the past, which could be due to a TTL parsing error, a timezone issue, or something else.

If I'm correct, what I believe is happening is that linkerd2-proxy tries to resolve the two addresses above via SRV and gets either a zero TTL or a calculated time in the past. The sleep is therefore a no-op, and the proxy runs in a tight loop until the DNS server returns an error.
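To illustrate the shape of the loop I'm describing, here's a minimal sketch with made-up types and a stub resolver (this is not the proxy's actual code, just the pattern):

```rust
use std::time::{Duration, Instant};
use tokio::time::{sleep, sleep_until};

// Made-up stand-in for the proxy's SRV resolution result.
struct SrvResult {
    addrs: Vec<std::net::SocketAddr>,
    valid_until: Instant, // now + the TTL reported on the SRV record
}

// Stub resolver: a zero TTL (e.g. a stale cache answer) makes
// valid_until equal to "now".
async fn resolve_srv(_name: &str) -> std::io::Result<SrvResult> {
    Ok(SrvResult {
        addrs: vec![],
        valid_until: Instant::now(),
    })
}

async fn resolution(name: &str) {
    loop {
        match resolve_srv(name).await {
            Ok(result) => {
                // ... publish result.addrs to watchers ...

                // If valid_until is "now" or already in the past, this
                // returns immediately and the loop re-resolves with no
                // delay at all.
                sleep_until(result.valid_until.into()).await;
            }
            Err(_) => {
                // Errors get a back-off; a zero TTL does not.
                sleep(Duration::from_secs(5)).await;
            }
        }
    }
}
```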

How can it be reproduced?

I'm not certain how it can be reproduced; for us it's happening on all of our clusters, and I'm not certain when it began.

Logs, error output, etc

I'm not sure there's anything particularly useful.

But dig does return valid TTL values when run from within the same cluster and on the same nodes:

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; COOKIE: 2860afa66b5e467a (echoed)
;; QUESTION SECTION:
;linkerd-policy.linkerd.svc.cluster.local. IN SRV

;; ANSWER SECTION:
linkerd-policy.linkerd.svc.cluster.local. 5 IN SRV 0 50 8090 172-30-90-196.linkerd-policy.linkerd.svc.cluster.local.
linkerd-policy.linkerd.svc.cluster.local. 5 IN SRV 0 50 8090 172-30-27-159.linkerd-policy.linkerd.svc.cluster.local.

;; ADDITIONAL SECTION:
172-30-27-159.linkerd-policy.linkerd.svc.cluster.local. 5 IN A 172.30.27.159
172-30-90-196.linkerd-policy.linkerd.svc.cluster.local. 5 IN A 172.30.90.196

;; Query time: 0 msec
;; SERVER: 169.254.20.10#53(169.254.20.10) (UDP)
;; WHEN: Thu Dec 19 23:12:25 UTC 2024
;; MSG SIZE  rcvd: 449

output of linkerd check -o short

$ ./linkerd2-cli-edge-24.11.8-linux-amd64 check -o short
linkerd-multicluster
--------------------
‼ Link and CLI versions match
            * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
        * (other-env): unable to determine version
    see https://linkerd.io/2/checks/#l5d-multicluster-links-version for hints

Status check results are √

Environment

  • Kubernetes Version: 1.31
  • Cluster Environment: Amazon EKS
  • Host OS: Amazon Linux 2023.6.20241212
  • Linkerd Version: edge-24.11.8

Possible solution

Setting a minimum sleep interval would add protection against tight loops.
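
For example, something along these lines; the MIN_REFRESH value and function name are made up for illustration:

```rust
use std::time::{Duration, Instant};

/// Assumed floor on how often the proxy should re-resolve, so that a
/// zero or already-expired TTL can't produce a no-op sleep.
const MIN_REFRESH: Duration = Duration::from_secs(5);

/// Never let the computed expiry fall before "now + MIN_REFRESH".
fn clamp_expiry(valid_until: Instant) -> Instant {
    valid_until.max(Instant::now() + MIN_REFRESH)
}
```

The refresh loop would then sleep until clamp_expiry(valid_until) instead of the raw expiry.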

Additional context

Happy to help with testing out or getting more info.

Would you like to work on fixing this bug?

no

@willhughes-au
Author

I've been digging more into the problem, and I think I have a possible explanation of what's happening.

We use CoreDNS both as the AWS EKS add-on and in node-local-dns.

The CoreDNS cache plugin, by default (or when set to immediate), serves the stale DNS result with a TTL of zero.

I suspect that this, combined with the SRV record's TTL of 30 seconds and linkerd-proxy's bad wait configuration, is what's causing the problem.

So to sum up, what seems to be happening is:

  • linkerd-proxy tries to resolve the host.
  • node-local-dns finds the entry in its cache, but it has expired, so it immediately returns the stale entry with a TTL of zero.
  • node-local-dns starts a request to resolve the entry upstream.
  • linkerd-proxy gets the zero-TTL response and immediately tries to resolve again; repeat from the top (the whole request took ~2 ms).
  • node-local-dns sees a thundering herd of resolve requests, causing some contention; after about 1 second it finally updates its cache with the new value.
  • linkerd-proxy finally sees a TTL of ~29s and chills out for the next 29 seconds.

The whole thundering-herd problem is made worse by every linkerd-proxy instance in the cluster synchronising on the TTL interval. Every pod tries to make the same request at exactly the same time.

I can't test today, but I will test on Monday my time to see if disabling the cache stops the issue.

@willhughes-au
Author

willhughes-au commented Dec 23, 2024

After some more testing today, I've managed to "resolve" the excessive DNS queries.

If I change the cache plugin to use keepttl, which retains the TTL originally received rather than internally decrementing the TTL of the cached item, the behavior goes away.
I don't like setting this option, though, as it can cause other unexpected behavior.
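
For reference, this is roughly the cache block I tested with (Corefile syntax from memory; check the cache plugin docs for the options available in your CoreDNS version):

```
cache 30 {
    keepttl
}
```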

This confirms for me that linkerd-proxy's tight loop is ultimately to blame - it should have some form of protection against a zero TTL value, but does not.

Additionally, IMO sleeping for the TTL interval is likely to cause excessive load by having all linkerd-proxy instances in the cluster synchronise their queries to happen at approximately the same time.

At the very minimum, I think introducing random sleep jitter is necessary. This would stop all of the linkerd-proxy instances from falling into lock-step and making DNS queries at the same instant. Given the TTL can be up to 30 seconds, a random jitter of between 5 seconds and half the TTL value would make sense.
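
As a rough sketch of what I mean, using the rand crate (the 5-second floor and half-TTL cap are just the numbers suggested above, not a tested policy):

```rust
use rand::Rng;
use std::time::{Duration, Instant};

/// Push the refresh time out by a random jitter so that proxies which
/// received the same TTL don't all re-resolve at the same instant.
fn jittered_expiry(valid_until: Instant) -> Instant {
    let ttl = valid_until.saturating_duration_since(Instant::now());
    let lo = Duration::from_secs(5);
    let hi = (ttl / 2).max(lo); // keep the range non-empty for small TTLs
    let jitter = rand::thread_rng().gen_range(lo..=hi);
    valid_until + jitter
}
```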

@olix0r
Member

olix0r commented Jan 9, 2025

@willhughes-au thanks for doing the research on this.

Looking at Envoy for inspiration, it looks like they ignore 0 TTL values and revert to a default. That's probably what we ought to do in our DNS implementation as well.
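
Sketching that idea (the 30-second default here is just an illustrative assumption, not a decided value):

```rust
use std::time::Duration;

/// Assumed fallback used when a response carries a zero TTL.
const DEFAULT_TTL: Duration = Duration::from_secs(30);

/// Treat a zero TTL as "use the default" instead of "refresh immediately".
fn effective_ttl(ttl: Duration) -> Duration {
    if ttl.is_zero() {
        DEFAULT_TTL
    } else {
        ttl
    }
}
```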
