╰─◦ linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.15.5-0 but the latest enterprise version is 2.16.2-1
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-d89467495-8bxtv (enterprise-2.15.5-0)
* linkerd-destination-d89467495-n6th5 (enterprise-2.15.5-0)
* linkerd-destination-d89467495-v48q4 (enterprise-2.15.5-0)
* linkerd-identity-d55486b6c-k4nmc (enterprise-2.15.5-0)
* linkerd-identity-d55486b6c-msh5l (enterprise-2.15.5-0)
* linkerd-identity-d55486b6c-v49j5 (enterprise-2.15.5-0)
* linkerd-proxy-injector-ff7d897fc-97bj9 (enterprise-2.15.5-0)
* linkerd-proxy-injector-ff7d897fc-df5wl (enterprise-2.15.5-0)
* linkerd-proxy-injector-ff7d897fc-nf8lh (enterprise-2.15.5-0)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-9979bc8f5-rdf7v (enterprise-2.15.5-0)
* prometheus-fd6b67bcc-tmpfs (enterprise-2.15.5-0)
* tap-7ddb4f85b4-m7mg7 (enterprise-2.15.5-0)
* tap-7ddb4f85b4-nqzsl (enterprise-2.15.5-0)
* tap-7ddb4f85b4-qtms7 (enterprise-2.15.5-0)
* tap-injector-b8fbc5b78-996j6 (enterprise-2.15.5-0)
* tap-injector-b8fbc5b78-d8rcz (enterprise-2.15.5-0)
* tap-injector-b8fbc5b78-fszsv (enterprise-2.15.5-0)
* tap-injector-b8fbc5b78-nns8h (enterprise-2.15.5-0)
* tap-injector-b8fbc5b78-xnxn2 (enterprise-2.15.5-0)
* web-9c7b78c9f-l49d2 (enterprise-2.15.5-0)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
Status check results are √
If you can provide a workload manifest that reproduces this, we'd be happy to look at it.
> We do not have a way to reproduce outside of this k8s cluster.
If the same software does not reproduce the issue in other environments, then this may be a symptom of a misbehaving network stack. Linkerd 2.16 added inter-proxy keep-alive timeouts to help detect these states. I'm curious if you see the same behavior on a newer version of Linkerd.
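If a newer release does get rolled out, the proxy versions actually running in the data plane can be confirmed with the standard CLI command below; nothing here is specific to this issue:
linkerd version --proxy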
> If you can provide a workload manifest that reproduces this, we'd be happy to look at it.
Unfortunately it's part of a microservices architecture that cannot be duplicated easily.
We're available for any troubleshooting that wouldn't be disruptive to this cluster, since it is a production environment and the service in question is critical to our business. We can provide logs or anything else that would help.
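For example, one low-impact data point we could capture is the proxy's own metrics from an affected pod; the proxy exports a gauge of currently open connections. A sketch of what we'd run, where the namespace and pod name are placeholders:
linkerd diagnostics proxy-metrics -n <namespace> po/<affected-pod> | grep tcp_open_connections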
What is the issue?
We are running Linkerd enterprise-2.15.5-0.
A Deployment's Linkerd-meshed pods open TCP port 80 connections to a Service backed by Pods running Varnish.
We are seeing counts of established TCP connections on the outgoing pod's side of over 2000. On occasion we've seen as many as 8000+ connections, which caused the linkerd container to crash. Under normal operation we would expect this to be two or three digits at most. This is only happening in a single k8s cluster, although all of our clusters run common code and are in the same provider's cloud.
Linkerd does not appear to be closing these HTTP connections properly. The connections are, after all, going to the Service IP.
The Java methods that we use on the outgoing pod are:
[Java client snippet omitted]
^ note: the final line, http.close(), in fact calls releaseConnection().
We think this may be related to a similar open connection issue that we raised in the past: #9724
We are planning to try opaque-ports to confirm our theory that Linkerd is leaving the connections open.
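As a sketch of what we have in mind (the Service name and namespace are placeholders; 80 is the Varnish port described above), marking the port as opaque on the destination Service would look something like:
kubectl annotate service <varnish-service> -n <namespace> config.linkerd.io/opaque-ports=80
That should make the proxies forward this traffic as raw TCP instead of treating it as HTTP/1, which would tell us whether Linkerd's HTTP handling is what's leaving the connections open.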
How can it be reproduced?
We do not have a way to reproduce outside of this k8s cluster.
Logs, error output, etc
Current connections on the affected outgoing pods are below. The IP address is that of the destination Service:
[connection listing omitted]
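For reference, a count like this can be gathered from inside the pod with something along these lines, where the namespace, pod, container name, and Service IP are placeholders (and ss is assumed to be available in the image):
kubectl exec -n <namespace> <affected-pod> -c <app-container> -- ss -tan | grep '<service-ip>:80' | grep -c ESTAB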
Output of linkerd check -o short is included above.
Environment
Kubernetes version: v1.29.10
Provider: GKE
Host OS: Ubuntu Jammy
Linkerd Version: enterprise-2.15.5-0
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None