Massive Number of TCP Connections Remain Established #13426

Open
lazysteel93 opened this issue Dec 4, 2024 · 3 comments
lazysteel93 commented Dec 4, 2024

What is the issue?

We are running Linkerd enterprise-2.15.5-0.

A Deployment's Linkerd-meshed pods open TCP connections on port 80 to a Service with Pods running Varnish behind it.
On the outgoing pods' side we are seeing over 2000 established TCP connections; on occasion we've seen as many as 8000+, which caused the linkerd container to crash. Under normal operation we would expect this count to be two or three digits at most. This is only happening in a single k8s cluster, although all of our clusters run common code and are hosted in the same provider's cloud.

Linkerd does not appear to be closing these HTTP connections properly; the connections are, after all, going to the Service IP, which means they are handled by the Linkerd proxy.

The Java methods that we use on the outgoing pod are:

        HttpUtils http = new HttpUtils();
        http.setUrl(url);
        http.setMethod(HttpUtils.GET);
        http.execute();                           // issue the GET request
        byte[] buffer = http.getBodyAsBinary();   // read the response body
        http.close();                             // internally calls releaseConnection()

Note: the final line, http.close(), in fact calls releaseConnection().

We think this may be related to a similar open connection issue that we raised in the past: #9724

We are planning to try opaque-ports to confirm our theory that Linkerd is leaving the connections open.
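As a minimal sketch of what we plan to try (the Service name is a placeholder; port 80 is the port our app connects on), the port can be marked opaque with the standard Linkerd annotation:

    # Illustrative only: <varnish-service> stands in for the destination Service.
    # Marking port 80 as opaque makes the proxy forward it as raw TCP rather than HTTP.
    kubectl annotate svc <varnish-service> config.linkerd.io/opaque-ports="80"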

How can it be reproduced?

We do not have a way to reproduce outside of this k8s cluster.

Logs, error output, etc

Current connection counts on the affected outgoing pods are below; the IP address is that of the destination Service:

╰─◦  kubectl get pods -l app=<app label here> -oname | xargs -I% kubectl exec % -- /bin/sh -c "netstat -an | grep 10.39.250.148 | wc -l"
4722
4197
4795
4167
4670
5114
4722
4860
4410
4421
4733
4747
3870
3850
4205
4238
4275
4749
4223
4966
4589
4182
3740
4240
4977
3943
4481
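
For a per-state breakdown of the same connections (illustrative; assumes the same pod images with netstat available and that column 6 of netstat -an is the TCP state), the counting command can be extended with awk:

╰─◦  kubectl get pods -l app=<app label here> -oname | xargs -I% kubectl exec % -- /bin/sh -c "netstat -an | grep 10.39.250.148 | awk '{print \$6}' | sort | uniq -c"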

Output of linkerd check -o short:

╰─◦ linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-d89467495-8bxtv (enterprise-2.15.5-0)
	* linkerd-destination-d89467495-n6th5 (enterprise-2.15.5-0)
	* linkerd-destination-d89467495-v48q4 (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-k4nmc (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-msh5l (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-v49j5 (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-97bj9 (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-df5wl (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-nf8lh (enterprise-2.15.5-0)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-9979bc8f5-rdf7v (enterprise-2.15.5-0)
	* prometheus-fd6b67bcc-tmpfs (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-m7mg7 (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-nqzsl (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-qtms7 (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-996j6 (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-d8rcz (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-fszsv (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-nns8h (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-xnxn2 (enterprise-2.15.5-0)
	* web-9c7b78c9f-l49d2 (enterprise-2.15.5-0)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

Kubernetes version: v1.29.10
Provider: GKE
Host OS: Ubuntu Jammy
Linkerd Version: enterprise-2.15.5-0

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

lazysteel93 added the bug label Dec 4, 2024
olix0r (Member) commented Dec 4, 2024

If you can provide a workload manifest that reproduces this, we'd be happy to look at it.

> We do not have a way to reproduce outside of this k8s cluster.

If the same software does not reproduce the issue in other environments, then this may be a symptom of a misbehaving network stack. Linkerd 2.16 added inter-proxy keep-alive timeouts to help detect these states. I'm curious if you see the same behavior on a newer version of Linkerd.

lazysteel93 (Author) commented

> If you can provide a workload manifest that reproduces this, we'd be happy to look at it.

Unfortunately it's part of a microservices architecture that cannot be duplicated easily.

We're available for any troubleshooting that wouldn't be disruptive to this cluster, since it is a production environment. The service in question is critical to our business. We can provide logs or anything else that would help.

olix0r (Member) commented Dec 6, 2024

It might be helpful to dig more into the socket states of a workload. You can use something like the following to get a high-level snapshot:

:; kubectl debug -n "$ns" "$pod" --image=ghcr.io/linkerd/debug:edge-24.11.8 --attach -q -- ss -s                                               
Total: 11
TCP:   269 (estab 5, closed 261, orphaned 0, timewait 24)

Transport Total     IP        IPv6
RAW       0         0         0        
UDP       0         0         0        
TCP       8         8         0        
INET      8         8         0        
FRAG      0         0         0 
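
If the summary shows a large established or close-wait count, a follow-up (a sketch using the same debug image; 10.39.250.148 is the destination Service IP reported above) is to list the sockets toward the Service by state:

:; kubectl debug -n "$ns" "$pod" --image=ghcr.io/linkerd/debug:edge-24.11.8 --attach -q -- ss -tn state established dst 10.39.250.148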

Additionally, it would be useful to capture the proxy-metrics of a workload in this state:

:; linkerd diagnostics proxy-metrics -n "$ns" pod/"$pod" | grep ^tcp_open_connections
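
To see whether the proxy itself is holding these connections toward the Service, the same metrics can be filtered by direction (illustrative; assumes the standard direction label on tcp_open_connections):

:; linkerd diagnostics proxy-metrics -n "$ns" pod/"$pod" | grep ^tcp_open_connections | grep 'direction="outbound"'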
