
Servers with a gRPC proxyProtocol and HTTPRoutes cannot be upgraded past edge-24.7.1 #13507

Open · andrew-gropyus opened this issue Dec 19, 2024 · 2 comments

andrew-gropyus (Contributor) commented Dec 19, 2024

What is the issue?

We are running edge 24.5.3 and upgrading to 24.11.8.

We use a default deny policy.

We currently run a number of gRPC servers (Servers with proxyProtocol: "gRPC") combined with HTTPRoutes and AuthorizationPolicies for fine-grained authorization.

In 24.5.3, when building the server model internally, Linkerd collected the HTTPRoutes for a gRPC server.

In #12785 (released in edge-24.7.1) this behavior was changed to use gRPC routes for a gRPC server:

-            ProxyProtocol::Grpc => Some(proto::proxy_protocol::Kind::Http2(
-                proto::proxy_protocol::Http2 {
-                    routes: http::to_route_list(&srv.http_routes, cluster_networks),
+            ProxyProtocol::Grpc => Some(proto::proxy_protocol::Kind::Grpc(
+                proto::proxy_protocol::Grpc {
+                    routes: grpc::to_route_list(&srv.grpc_routes, cluster_networks),
+                },
+            )),

When a gRPC server does not have any gRPC routes attached, it falls back to a default gRPC route.

The same PR also added support for gRPC routes as target refs of authorization policies. Before this PR, an authorization policy whose targetRef pointed to a gRPC route would be rejected by the policy admission controller.

As I understand it, this means that the following combination of components cannot be upgraded without downtime:

  • default deny policy
  • servers with proxy protocol grpc
  • http routes
  • authorization policies

How can it be reproduced?

Before upgrading to 24.7.1, create a Server, HTTPRoute, and AuthorizationPolicy like the following:

apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  name: api-grpc
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: api
  port: grpc
  proxyProtocol: gRPC
---
apiVersion: policy.linkerd.io/v1beta3
kind: HTTPRoute
metadata:
  name: api-grpc-health
spec:
  parentRefs:
    - group: policy.linkerd.io
      kind: Server
      name: api-grpc
  rules:
    - matches:
        - path:
            value: "/grpc.health.v1.Health/Check"
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: api-grpc-health-allow-anonymous
spec:
  targetRef:
    group: policy.linkerd.io
    kind: HTTPRoute
    name: api-grpc-health
  requiredAuthenticationRefs: []

Then upgrade to or past 24.7.1.

Note that after the upgrade the server no longer detects the HTTPRoute, and instead relies on a default route with a default policy for authorizing requests.
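For reference, a migration to gRPC routes would replace the HTTPRoute above with a Gateway API GRPCRoute and retarget the AuthorizationPolicy at it. This is only a sketch; the exact apiVersion for GRPCRoute depends on the Gateway API / Linkerd release installed:

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GRPCRoute
metadata:
  name: api-grpc-health
spec:
  parentRefs:
    - group: policy.linkerd.io
      kind: Server
      name: api-grpc
  rules:
    - matches:
        - method:
            service: grpc.health.v1.Health
            method: Check
---
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: api-grpc-health-allow-anonymous
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: GRPCRoute
    name: api-grpc-health
  requiredAuthenticationRefs: []
```

The catch is that these resources only take effect once the cluster is already on 24.7.1 or later (before that, the admission controller rejects an AuthorizationPolicy targeting a GRPCRoute), so they cannot simply be applied ahead of the upgrade.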

Logs, error output, etc

No logs

output of linkerd check -o short

Only up-to-date warnings

Environment

  • Kubernetes Version: 1.30.5
  • Kubernetes Environment: AKS
  • Host OS: Ubuntu 22.04
  • Linkerd version: edge-24.5.3

Possible solution

  • When building the internal server model for gRPC servers, allow falling back to HTTPRoutes if no gRPC routes are present.
  • Switch the proxy protocol from gRPC to HTTP/2. This would work with the HTTPRoutes but would likely affect the metrics being collected; I have not looked deeply into this.
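For the second option, the only manifest change is the Server's proxyProtocol field. A sketch, reusing the Server from the reproduction above (HTTP/2 is one of the values the Server CRD accepts):

```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
  name: api-grpc
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: api
  port: grpc
  proxyProtocol: HTTP/2  # previously gRPC; HTTPRoutes attach to HTTP/2 servers
```

With this change the existing HTTPRoutes keep matching, at the cost of the gRPC-specific metrics treatment mentioned above.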

Additional context

We are looking for an online migration path; we would rather not bring services down while switching them from HTTPRoutes to gRPC routes.

Would you like to work on fixing this bug?

None

andrew-gropyus (Author) commented:
We are also experimenting with changing the proxy protocol for gRPC servers to HTTP/2, although we are not fully sure of the implications.

From what we understand, load balancing HTTP/2 should behave the same as load balancing gRPC. There are some differences in metrics: in particular, the grpc_status_code label would not be published. As far as we understand, the classification label would work the same, since gRPC error status codes map to HTTP 4xx or 5xx status codes.

adleong (Member) commented Jan 9, 2025

Thanks for reporting this @andrew-gropyus. We're looking into this so that we can advise you on the best way to do this upgrade without downtime.

adleong self-assigned this Jan 9, 2025