Network connectivity blocked when connecting to an upstream that uses a server-first protocol #13364
Comments
Some additional testing that did NOT fix the problem:
The only thing that appears to fix the problem is removing Vault from the service mesh by disabling linkerd proxy injection.
Note that there are no NetworkPolicy resources or Linkerd CRs installed in the cluster at all that should be impacting this network connection.
Adding Oddly, adding
Update: Appears to impact any pod trying to communicate with NATS as long as the pod is in the service mesh and
Given the above, I have created a simplified scenario that demonstrates the problem:
Here are the logs from the sidecar when running
Hope that helps narrow the scope of what might potentially be going wrong.
I do believe that the request is reaching the NATS server, but weirdly the only logs that get generated even in the highest possible log level are:
All of this said, I still think this is a bug in linkerd (vs in NATS) b/c I do not believe that adding the linkerd-proxy sidecar to a pod should prevent network connectivity via server-first protocols such as NATS (regardless of whether the upstream is in the mesh or not).
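To make the server-first pattern concrete, here is a minimal Python sketch, not taken from NATS or linkerd: a local server that sends a greeting as soon as a client connects, and a client that sends nothing until it has read that greeting. `GREETING`, `client_speaks_first`, `serve_once`, `read_greeting`, and `demo` are hypothetical names, and the greeting bytes are a stand-in rather than the real NATS handshake. The `client_speaks_first` helper assumes an intermediary that waits briefly for client bytes before deciding how to treat the connection; with a server-first protocol that wait can only end in a timeout.

```python
import socket
import threading

# Hypothetical stand-in for a server-first greeting (not the real NATS handshake).
GREETING = b"HELLO\r\n"


def client_speaks_first(conn, wait=0.2):
    """Return True if the client sends any bytes within `wait` seconds."""
    conn.settimeout(wait)
    try:
        # Peek so that any bytes stay in the buffer for the real handler.
        return bool(conn.recv(1, socket.MSG_PEEK))
    except socket.timeout:
        return False
    finally:
        conn.settimeout(None)


def serve_once(listener, result):
    """Accept one connection, record who spoke first, then send the greeting."""
    conn, _ = listener.accept()
    with conn:
        result["client_first"] = client_speaks_first(conn)
        conn.sendall(GREETING)  # the server speaks first


def read_greeting(host, port, timeout=2.0):
    """Connect and read until the greeting line ends, without sending anything."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        data = b""
        while not data.endswith(b"\r\n"):
            chunk = sock.recv(1024)
            if not chunk:
                break
            data += chunk
        return data


def demo():
    """Run the exchange on loopback and return (greeting, client_spoke_first)."""
    result = {}
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(
            target=serve_once, args=(listener, result), daemon=True
        )
        server.start()
        greeting = read_greeting("127.0.0.1", port)
        server.join(timeout=2.0)
    return greeting, result["client_first"]


if __name__ == "__main__":
    print(demo())
```

In this sketch the wait for client bytes never succeeds, because the client is itself waiting for the server. Anything in the path that needs client bytes before forwarding the connection would delay or block the greeting in the same way.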
Here are the linkerd-debug logs from the client pod that get generated when running
From this, it looks pretty clear that the NATS server is behaving according to its spec: when a client opens a TCP connection, immediately send an However, one thing that sticks out to me is that the linkerd proxy is sending a
If you configure NATS to be marked as opaque, things should work.
How did you try to configure opaque ports?
@olix0r Both I can confirm that I verified opaque ports were enabled via the logs. However, let me reiterate a key point: if the client has a linkerd sidecar but the NATS server does not, the problem persists. The only way I am able to resolve the problem was either (a) removing the linkerd sidecar from the client pods or (b) by setting Based on the above packet capture, it seems the linkerd sidecar on the client side is sending a
There is some documentation on this (that has been updated relatively recently) which may help clarify things. If your server is not meshed and the client is meshed, it's critical that the Service that clients connect to is annotated with an opaque ports configuration. Let's confirm that behavior:
    metadata:
      annotations:
        config.linkerd.io/opaque-ports: "4222,6222,8222"
Once that annotation is set, you can confirm the policy controller is providing the proper protocol configuration:
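For the annotation step above, a minimal Python sketch of how the same Service metadata could be generated and checked programmatically. Only the annotation key and the port list come from this thread; `opaque_ports_patch` and `is_marked_opaque` are hypothetical helpers, and the check assumes the annotation value is a plain comma-separated list of single ports.

```python
import json

# Annotation key as used in this thread.
OPAQUE_PORTS_ANNOTATION = "config.linkerd.io/opaque-ports"


def opaque_ports_patch(ports):
    """Build a merge-patch body that sets the opaque-ports annotation."""
    value = ",".join(str(p) for p in ports)
    return {"metadata": {"annotations": {OPAQUE_PORTS_ANNOTATION: value}}}


def is_marked_opaque(service, port):
    """Check whether `port` appears in the annotation of a Service manifest dict.

    Assumes a comma-separated list of single ports; anything else is not handled.
    """
    annotations = (service.get("metadata") or {}).get("annotations") or {}
    raw = annotations.get(OPAQUE_PORTS_ANNOTATION, "")
    listed = {p.strip() for p in raw.split(",") if p.strip()}
    return str(port) in listed


if __name__ == "__main__":
    patch = opaque_ports_patch([4222, 6222, 8222])
    print(json.dumps(patch))
    print(is_marked_opaque(patch, 4222))
```

The printed JSON could then be passed to something like `kubectl patch service <service-name> --type merge -p '<json>'`, where `<service-name>` is a placeholder for the Service that clients use to reach NATS.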
What is the issue?
UPDATE: This appears to affect any service trying to communicate with NATS over a linkerd mesh. As long as the pod initiating the request to NATS is in linkerd mesh and config.linkerd.io/skip-outbound-ports is NOT set to 4222, no network connection can be established with an upstream NATS server. Ideally, we would like to mark 4222 opaque rather than skipping it; however, opaque ports do not fix this issue.
I have deployed Hashicorp Vault with an external plugin integration.
The external plugin runs as a forked process inside the main vault container:
When the Vault pod has a linkerd proxy injected, the external plugin (/plugins/vault-plugin-secrets-nats) starts having connection timeout errors. The plugin works fine without linkerd.
How can it be reproduced?
edge-24.11.2 with native sidecars enabled and using the linkerd CNI
Logs, error output, etc
Vault - Connectivity Failure Logs
Vault - linkerd-proxy sidecar logs
Upstream Target - Linkerd proxy logs
output of
linkerd check -o short
Environment
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None