Massive Number of TCP Connections Remain Established #13426

Open
lazysteel93 opened this issue Dec 4, 2024 · 3 comments
lazysteel93 commented Dec 4, 2024

What is the issue?

We are running Linkerd enterprise-2.15.5-0.

A Deployment's Linkerd-meshed pods open TCP connections on port 80 to a Service with Pods running Varnish behind it.
On the outgoing pods' side we are seeing over 2000 established TCP connections; on occasion we've seen as many as 8000+, which caused the linkerd container to crash. Under normal operation we would expect this count to be two or three digits at most. This is only happening in a single k8s cluster, although all of our clusters run common code and are hosted in the same provider's cloud.

Linkerd does not appear to be closing these HTTP connections properly; the connections are, after all, going to the Service IP, which means they are handled by the Linkerd proxy.

The Java methods that we use on the outgoing pod are:

        HttpUtils http = new HttpUtils();
        http.setUrl(url);
        http.setMethod(HttpUtils.GET);
        http.execute();                           // issue the GET request
        byte[] buffer = http.getBodyAsBinary();   // read the response body
        http.close();                             // internally calls releaseConnection()

Note: the final line, http.close(), in fact calls releaseConnection().

We think this may be related to a similar open connection issue that we raised in the past: #9724

We are planning to try opaque-ports to confirm our theory that Linkerd is leaving the connections open.
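As a minimal sketch of what we plan to try (the Service name is a placeholder; port 80 is the port our app connects on), the port can be marked opaque with the standard Linkerd annotation:

    # Illustrative only: <varnish-service> stands in for the destination Service.
    # Marking port 80 as opaque makes the proxy forward it as raw TCP rather than HTTP.
    kubectl annotate svc <varnish-service> config.linkerd.io/opaque-ports="80"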

How can it be reproduced?

We do not have a way to reproduce outside of this k8s cluster.

Logs, error output, etc

Current connection counts on the affected outgoing pods are below; the IP address is that of the destination Service:

╰─◦  kubectl get pods -l app=<app label here> -oname | xargs -I% kubectl exec % -- /bin/sh -c "netstat -an | grep 10.39.250.148 | wc -l"
4722
4197
4795
4167
4670
5114
4722
4860
4410
4421
4733
4747
3870
3850
4205
4238
4275
4749
4223
4966
4589
4182
3740
4240
4977
3943
4481
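
For a per-state breakdown of the same connections (illustrative; assumes the same pod images with netstat available and that column 6 of netstat -an is the TCP state), the counting command can be extended with awk:

╰─◦  kubectl get pods -l app=<app label here> -oname | xargs -I% kubectl exec % -- /bin/sh -c "netstat -an | grep 10.39.250.148 | awk '{print \$6}' | sort | uniq -c"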

Output of linkerd check -o short:

╰─◦ linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
    is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-d89467495-8bxtv (enterprise-2.15.5-0)
	* linkerd-destination-d89467495-n6th5 (enterprise-2.15.5-0)
	* linkerd-destination-d89467495-v48q4 (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-k4nmc (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-msh5l (enterprise-2.15.5-0)
	* linkerd-identity-d55486b6c-v49j5 (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-97bj9 (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-df5wl (enterprise-2.15.5-0)
	* linkerd-proxy-injector-ff7d897fc-nf8lh (enterprise-2.15.5-0)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-9979bc8f5-rdf7v (enterprise-2.15.5-0)
	* prometheus-fd6b67bcc-tmpfs (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-m7mg7 (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-nqzsl (enterprise-2.15.5-0)
	* tap-7ddb4f85b4-qtms7 (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-996j6 (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-d8rcz (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-fszsv (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-nns8h (enterprise-2.15.5-0)
	* tap-injector-b8fbc5b78-xnxn2 (enterprise-2.15.5-0)
	* web-9c7b78c9f-l49d2 (enterprise-2.15.5-0)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints

Status check results are √

Environment

Kubernetes version: v1.29.10
Provider: GKE
Host OS: Ubuntu Jammy
Linkerd Version: enterprise-2.15.5-0

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None

lazysteel93 added the bug label Dec 4, 2024
olix0r (Member) commented Dec 4, 2024

If you can provide a workload manifest that reproduces this, we'd be happy to look at it.

> We do not have a way to reproduce outside of this k8s cluster.

If the same software does not reproduce the issue in other environments, then this may be a symptom of a misbehaving network stack. Linkerd 2.16 added inter-proxy keep-alive timeouts to help detect these states. I'm curious if you see the same behavior on a newer version of Linkerd.

lazysteel93 (Author) commented

> If you can provide a workload manifest that reproduces this, we'd be happy to look at it.

Unfortunately it's part of a microservices architecture that cannot be duplicated easily.

We're available for any troubleshooting that wouldn't be disruptive to this cluster, since it is a production environment. The service in question is critical to our business. We can provide logs or anything else that would help.

olix0r (Member) commented Dec 6, 2024

It might be helpful to dig more into the socket states of a workload. You can use something like the following to get a high-level snapshot:

:; kubectl debug -n "$ns" "$pod" --image=ghcr.io/linkerd/debug:edge-24.11.8 --attach -q -- ss -s                                               
Total: 11
TCP:   269 (estab 5, closed 261, orphaned 0, timewait 24)

Transport Total     IP        IPv6
RAW       0         0         0        
UDP       0         0         0        
TCP       8         8         0        
INET      8         8         0        
FRAG      0         0         0 
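
If the summary shows a large established or close-wait count, a follow-up (a sketch using the same debug image; 10.39.250.148 is the destination Service IP reported above) is to list the sockets toward the Service by state:

:; kubectl debug -n "$ns" "$pod" --image=ghcr.io/linkerd/debug:edge-24.11.8 --attach -q -- ss -tn state established dst 10.39.250.148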

Additionally, it would be useful to capture the proxy-metrics of a workload in this state:

:; linkerd diagnostics proxy-metrics -n "$ns" pod/"$pod" | grep ^tcp_open_connections
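
To see whether the proxy itself is holding these connections toward the Service, the same metrics can be filtered by direction (illustrative; assumes the standard direction label on tcp_open_connections):

:; linkerd diagnostics proxy-metrics -n "$ns" pod/"$pod" | grep ^tcp_open_connections | grep 'direction="outbound"'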
