MGMT-19120: Use service net to connect to hosted API server #7090

Open
wants to merge 1 commit into
base: master

jhernand
Contributor

@jhernand jhernand commented Dec 13, 2024

There are several situations where assisted service needs to connect to the API server of a spoke cluster. To do so it uses the kubeconfig generated during the installation, which usually contains the external URL of the API server. As a result, the cluster where assisted service runs needs to be configured with a proxy that allows that connection. For HyperShift clusters this can be avoided: because the API server runs as a pod in the same cluster, assisted service can instead connect via the service network, using the kube-apiserver.my-cluster.svc host name. Doing so reduces the number of round trips and the potential for proxy configuration issues. To achieve that, this patch changes the spoke client factory so that it checks whether the cluster is a HyperShift cluster and, if so, replaces the API server URL with https://kube-apiserver.my-cluster.svc:6443.
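
As a rough illustration of the URL replacement described above, here is a minimal sketch using only the Go standard library. The function name `serviceNetworkURL` and the `namespace` parameter are hypothetical and not taken from the patch; the sketch assumes the hosted control plane namespace is already known and that the port is 6443, as in the example URL.

```go
package main

import (
	"fmt"
	"net/url"
)

// serviceNetworkURL is a hypothetical helper: it takes the API server URL
// found in a kubeconfig and returns the equivalent service network URL.
// The namespace is assumed to be known by the caller; the port 6443 comes
// from the example in the description.
func serviceNetworkURL(server, namespace string) (string, error) {
	parsed, err := url.Parse(server)
	if err != nil {
		return "", fmt.Errorf("parsing API server URL %q: %w", server, err)
	}
	parsed.Scheme = "https"
	parsed.Host = fmt.Sprintf("kube-apiserver.%s.svc:6443", namespace)
	return parsed.String(), nil
}

func main() {
	result, err := serviceNetworkURL("https://api.my-cluster.example.com:6443", "my-cluster")
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}
```

The actual patch works on the kubeconfig and the spoke client factory; this sketch only shows the host substitution step in isolation.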

List all the issues related to this PR

https://issues.redhat.com/browse/MGMT-19120

  • New Feature
  • Enhancement
  • Bug fix
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Automation (CI, tools, etc)
  • Cloud
  • Operator Managed Deployments
  • None

How was this code tested?

  • assisted-test-infra environment
  • dev-scripts environment
  • Reviewer's test appreciated
  • Waiting for CI to do a full test run
  • Manual (Elaborate on how it was tested)
  • No tests needed

Checklist

  • Title and description added to both, commit and PR.
  • Relevant issues have been associated (see CONTRIBUTING guide)
  • This change does not require a documentation update (docstring, docs, README, etc)
  • Does this change include unit-tests (note that code changes require unit-tests)

Reviewers Checklist

  • Are the title and description (in both PR and commit) meaningful and clear?
  • Is there a bug required (and linked) for this change?
  • Should this PR be backported?

@jhernand jhernand changed the title Use service network to talk to hosted api server MGMT-19120: Use service net to connect to hosted API server Dec 13, 2024
@openshift-ci-robot

openshift-ci-robot commented Dec 13, 2024

@jhernand: This pull request references MGMT-19120 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Dec 13, 2024
@jhernand jhernand marked this pull request as draft December 13, 2024 20:18
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Dec 13, 2024

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhernand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 13, 2024
@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch from 63c3673 to 390c90d Compare December 16, 2024 16:06
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 16, 2024
@gamli75
Contributor

gamli75 commented Dec 16, 2024

@eranco74 can you review this PR?

@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch from 390c90d to 3e7178a Compare December 16, 2024 19:39
@jhernand jhernand mentioned this pull request Dec 17, 2024
20 tasks
@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch 2 times, most recently from 4cd5768 to 92b8bac Compare December 18, 2024 11:58
@openshift-ci openshift-ci bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 18, 2024
@jhernand jhernand marked this pull request as ready for review December 18, 2024 15:54
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 18, 2024

codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 82.35294% with 21 lines in your changes missing coverage. Please review.

Project coverage is 67.63%. Comparing base (8607a87) to head (4d0cd06).
Report is 19 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/spoke_k8s_client/factory.go | 82.24% | 15 Missing and 4 partials ⚠️ |
| ...nal/controller/controllers/bmh_agent_controller.go | 80.00% | 1 Missing ⚠️ |
| internal/spoke_k8s_client/spoke_k8s_client.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #7090      +/-   ##
==========================================
+ Coverage   67.52%   67.63%   +0.10%     
==========================================
  Files         296      296              
  Lines       40088    40158      +70     
==========================================
+ Hits        27071    27160      +89     
+ Misses      10574    10548      -26     
- Partials     2443     2450       +7     
| Files with missing lines | Coverage Δ |
|---|---|
| ...nternal/controller/controllers/agent_controller.go | 76.43% <100.00%> (ø) |
| ...oller/controllers/clusterdeployments_controller.go | 72.79% <100.00%> (ø) |
| ...rollers/hypershiftagentserviceconfig_controller.go | 76.58% <100.00%> (ø) |
| ...ernal/controller/controllers/spoke_client_cache.go | 85.18% <100.00%> (ø) |
| ...nal/controller/controllers/bmh_agent_controller.go | 77.11% <80.00%> (ø) |
| internal/spoke_k8s_client/spoke_k8s_client.go | 35.29% <0.00%> (+35.29%) ⬆️ |
| internal/spoke_k8s_client/factory.go | 80.00% <82.24%> (+35.55%) ⬆️ |

... and 2 files with indirect coverage changes

@jhernand
Contributor Author

/retest-required

@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch from 92b8bac to 7b9e4dc Compare December 19, 2024 08:27
}

// SetHubClient sets the client that will be used to call the API of the hub cluster. This is mandatory.
func (b *SpokeK8sClientFactoryBuilder) SetHubClient(value ctrlclient.Client) *SpokeK8sClientFactoryBuilder {
Contributor

@tsorya tsorya Dec 19, 2024

Sorry, my fault, I missed
https://github.com/openshift/assisted-service/pull/7090/files#diff-c444f711e9191b53952edb65bfd8c644419fc7695c62611dc0fb304b4fb197d6R625

Though it seems like this is a mandatory parameter and we will get an error in Build if it was not set, so why not provide it as a parameter to New? The same actually goes for the logger.

Contributor Author

@jhernand jhernand Dec 19, 2024

This is just a way to make the code cleaner, avoiding long lists of parameters. We could pass the logger, the client (and the transport wrapper, currently only used for tests) as parameters to the "New..." function, but over time that results in long lists of parameters like this:

api = NewManager(common.GetTestLog(), db, testing.GetDummyNotificationStream(ctrl), mockEventApi, nil, nil, nil, nil, &config, &leader.DummyElector{}, nil, nil, true, nil, nil, false)

It is already useful to avoid setting the transport wrapper parameter to nil.

Contributor

But these are required params, and in this case you left them as optional, so I don't actually understand why it is good.

Contributor Author

I believe it is good for several reasons:

  1. It is consistent: all the parameters (required or optional) are provided in the same way.

  2. It makes it clearer what each parameter means. Not in this case, but if you had two parameters that are strings it is not the same to see this:

whatever, err := NewWhatever("foo", "bar")

Than this:

whatever, err := NewWhatever().
        SetUserName("foo").
        SetPassword("bar").
        Build()

In the first case you have to dig deeper to find out the meaning of the parameters, while in the second it is explicit.

  3. It gives room for documenting each parameter separately: the documentation goes in the "Set..." method of the builder.

  4. It simplifies building the object in multiple steps, if needed, for example:

builder := NewWhatever()
builder.SetUserName("foo")
if shouldUsePassword {
        builder.SetPassword("bar")
}
whatever, err := builder.Build()

  5. It simplifies adding multiple values for the same parameter:

whatever, err := NewWhatever().
        SetUserName("foo").
        SetUserName("foo-alias").
        Build()

  6. It allows adding new optional parameters without having to change the call sites.

I don't want to bore you with my opinions about this. If you find this unacceptable I will change it.
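
For readers following the thread, the builder style discussed above could look roughly like the sketch below. `Whatever`, `WhateverBuilder` and their fields are the hypothetical names from the comment, not types from the patch, and the choice of which parameter is mandatory is an assumption made for illustration. Note that required parameters are only checked when `Build` runs, which is the trade-off debated in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Whatever is a hypothetical object built by WhateverBuilder.
type Whatever struct {
	userNames []string
	password  string
}

// WhateverBuilder collects the parameters needed to create a Whatever.
type WhateverBuilder struct {
	userNames []string
	password  string
}

// NewWhatever creates an empty builder.
func NewWhatever() *WhateverBuilder {
	return &WhateverBuilder{}
}

// SetUserName adds a user name. At least one is mandatory (an assumption of
// this sketch). Calling it several times accumulates values.
func (b *WhateverBuilder) SetUserName(value string) *WhateverBuilder {
	b.userNames = append(b.userNames, value)
	return b
}

// SetPassword sets the password. This is optional.
func (b *WhateverBuilder) SetPassword(value string) *WhateverBuilder {
	b.password = value
	return b
}

// Build validates the mandatory parameters at run time and creates the object.
func (b *WhateverBuilder) Build() (*Whatever, error) {
	if len(b.userNames) == 0 {
		return nil, errors.New("at least one user name is mandatory")
	}
	return &Whatever{userNames: b.userNames, password: b.password}, nil
}

func main() {
	whatever, err := NewWhatever().SetUserName("foo").SetUserName("foo-alias").Build()
	if err != nil {
		panic(err)
	}
	fmt.Println(whatever.userNames)

	// This compiles, but the missing mandatory parameter is only reported here.
	_, err = NewWhatever().Build()
	fmt.Println(err)
}
```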

Contributor

I don't know, maybe it's just me. I just believe that if a parameter is required it should be provided as part of the function call. Otherwise, if someone writes
whatever, err := NewWhatever().Build()
it will pass compilation but fail at run time, and I think it is better to find such an error at compile time.
Though that is my personal opinion.

Contributor

Just to be clear, I like your proposal; I just don't think it should be done that way for required params.

Contributor Author

I understand your point of view, and still think that the benefits outweigh the drawbacks. As that is not the key point of this pull request I am changing it to a plain list of parameters. We can have this discussion another time.

// object reference. So to find the cluster deployment we can get all the instances inside the namespace of the
// secret and then select the first one that references it.
clusterDeploymentList := &hivev1.ClusterDeploymentList{}
err = f.hubClient.List(ctx, clusterDeploymentList, ctrlclient.InNamespace(kubeconfigSecret.Namespace))
Contributor

can't we list with filter?

Contributor Author

Not sure what you mean, can you elaborate? Note that the search criteria here is spec.clusterMetadata.adminKubeconfigSecretRef.Name == ..., I think searching by that field isn't supported by the API.
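
The client-side selection mentioned here (list everything in the namespace, then pick the first item that references the secret) can be sketched as follows. The struct types are simplified, hypothetical stand-ins for the Hive types so that the example is self-contained; only the field path spec.clusterMetadata.adminKubeconfigSecretRef.Name is taken from the discussion.

```go
package main

import "fmt"

// Simplified stand-ins for the Hive API types, with only the fields used here.
type secretRef struct {
	Name string
}

type clusterMetadata struct {
	AdminKubeconfigSecretRef secretRef
}

type clusterDeployment struct {
	Name            string
	ClusterMetadata *clusterMetadata
}

// findByKubeconfigSecret returns the first cluster deployment that references
// the given secret name, or nil if none does.
func findByKubeconfigSecret(items []clusterDeployment, secretName string) *clusterDeployment {
	for i := range items {
		metadata := items[i].ClusterMetadata
		if metadata != nil && metadata.AdminKubeconfigSecretRef.Name == secretName {
			return &items[i]
		}
	}
	return nil
}

func main() {
	items := []clusterDeployment{
		{Name: "not-installed-yet"},
		{Name: "my-cluster", ClusterMetadata: &clusterMetadata{
			AdminKubeconfigSecretRef: secretRef{Name: "my-cluster-admin-kubeconfig"},
		}},
	}
	found := findByKubeconfigSecret(items, "my-cluster-admin-kubeconfig")
	if found != nil {
		fmt.Println(found.Name)
	}
}
```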

Contributor

I believe it can be only one cluster deployment per namespace actually. Don't we have owner ref in the secret for clusterdeployment?

Contributor Author

I'd say we don't need to rely on that here.

return
}

func (f *spokeK8sClientFactory) CreateFromSecret(ctx context.Context, secret *corev1.Secret) (result SpokeK8sClient, err error) {
Contributor

I wonder why it is better to return the result this way:
(result SpokeK8sClient, err error)?
I believe 99% of the code doesn't do it this way, and I just wonder why it is better.

Contributor Author

It is probably a matter of taste. I like to have the names of the return parameters: helps understand what to expect. Not very important in this case as the meaning is very clear. I can change it if you want.
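
To make the two styles under discussion concrete, here is a small sketch with hypothetical functions `lookupNamed` and `lookupUnnamed` (not from the patch). Both behave identically; the only difference is whether the results are named in the signature.

```go
package main

import "fmt"

// lookupNamed uses named results: the signature documents what is returned,
// and a bare "return" returns the current values of result and err.
func lookupNamed(data map[string][]byte, key string) (result []byte, err error) {
	value, ok := data[key]
	if !ok {
		err = fmt.Errorf("key %q not found", key)
		return
	}
	result = value
	return
}

// lookupUnnamed uses unnamed results: every return statement lists its values.
func lookupUnnamed(data map[string][]byte, key string) ([]byte, error) {
	value, ok := data[key]
	if !ok {
		return nil, fmt.Errorf("key %q not found", key)
	}
	return value, nil
}

func main() {
	data := map[string][]byte{"kubeconfig": []byte("content")}
	named, _ := lookupNamed(data, "kubeconfig")
	unnamed, _ := lookupUnnamed(data, "kubeconfig")
	fmt.Println(string(named), string(unnamed))
}
```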

Contributor

I agree that it is a matter of taste :) It's just that most of the code uses another style, so why have different styles?

Contributor Author

Fair enough, I will change it.

Contributor Author

Done.

if err != nil {
cf.log.WithError(err).Warnf("Getting kuberenetes config for cluster")
return nil, nil, err
func (f *spokeK8sClientFactory) kubeConfigFromSecret(secret *corev1.Secret) (result []byte, err error) {
Contributor

Can we make it a common function? We have at least 2 more places that do the same.

Contributor Author

Well, at least we have one less place now: I removed similar logic from the spoke client cache in a previous patch. I will try to find where we are doing this.

Contributor Author

I will do this in a different patch.

// Try to find the cluster deployment. If we can't, for whatever the reason, explain it in the log and assume
// it isn't a hosted cluster.
clusterDeployment, err := f.findClusterDeploymentForKubeconfigSecret(ctx, kubeconfigSecret)
if err != nil || clusterDeployment == nil {
Contributor

An error and a missing clusterDeployment seem to be different issues; maybe we should at least split the logging?
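
One way to separate the two cases raised in this comment is sketched below. The function `describeLookup`, the local `clusterDeployment` type and the message texts are hypothetical; the sketch only shows the idea of distinguishing a failed lookup from a lookup that found nothing, while both still fall back to treating the cluster as not hosted.

```go
package main

import (
	"errors"
	"fmt"
)

// clusterDeployment is a simplified stand-in for the Hive type.
type clusterDeployment struct {
	Name string
}

// errExample is a sample error used only for this illustration.
var errExample = errors.New("connection refused")

// describeLookup returns a distinct log message for each outcome of the
// cluster deployment lookup.
func describeLookup(cd *clusterDeployment, err error) string {
	switch {
	case err != nil:
		return fmt.Sprintf("failed to find cluster deployment, assuming it isn't a hosted cluster: %v", err)
	case cd == nil:
		return "no cluster deployment references the kubeconfig secret, assuming it isn't a hosted cluster"
	default:
		return fmt.Sprintf("found cluster deployment %q", cd.Name)
	}
}

func main() {
	fmt.Println(describeLookup(nil, errExample))
	fmt.Println(describeLookup(nil, nil))
	fmt.Println(describeLookup(&clusterDeployment{Name: "my-cluster"}, nil))
}
```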

Contributor Author

OK, will do.

Contributor Author

Done.

@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch from 7b9e4dc to cb06b53 Compare December 19, 2024 12:55
log: cf.log,
// findClusterDeploymentForKubeconfigSecret finds the cluster deployment that corresponds to the given kubeconfig
// secret. It returns nil if there is no such cluster deployment.
func (f *spokeK8sClientFactory) findClusterDeploymentForKubeconfigSecret(ctx context.Context,
Member

@carbonin carbonin Dec 19, 2024

Are we ever in a situation where the caller of this factory doesn't already have a reference to the cluster deployment?

Since (based on the naming) we're talking about "spoke" clusters it seems likely that this could be simplified by either the caller supplying the cluster deployment or by this logic living outside this factory (then we would have an option like "useHubServiceNetwork" or something when creating the client).

Contributor Author

Yes, here we don't know what the cluster deployment is:

spokeClient, err := hr.SpokeClients.Get(kubeconfigSecret)

Member

Okay, thanks.

Side note though ... can we delete the HASC CRD and controller yet?
@gamli75 that effort isn't happening now, right?

Contributor

I'm not familiar with that effort. maybe @CrystalChun

Contributor

I'm not familiar with it either, maybe @danielerez?

There are several situations where assisted service needs to connect to
the API server of a spoke cluster. To do so it uses the kubeconfig
generated during the installation, and that usually contains the
external URL of the API server, and that means that the cluster where
assisted service runs needs to be configured with a proxy that allows
that. But for HyperShift clusters this can be avoided: assisted service
can instead connect via the service network, using the
`kube-apiserver.my-cluster.svc` host name, as the API server runs as a
pod in the same cluster. Doing that reduces the number of round trips
and the potential proxy configuration issues. In order to achieve that
this patch changes the spoke client factory so that it checks if the
cluster is a HyperShift cluster, and then it replaces the API server URL
with `https://kube-apiserver.my-cluster.svc:6443`.

Related: https://issues.redhat.com/browse/MGMT-19120
Signed-off-by: Juan Hernandez <[email protected]>
@jhernand jhernand force-pushed the use_service_network_to_talk_to_hosted_api_server branch from cb06b53 to 4d0cd06 Compare December 19, 2024 19:38
@jhernand
Contributor Author

jhernand commented Jan 7, 2025

@carbonin @tsorya I think I addressed your concerns. Anything else that needs to be changed before merging?

@jhernand
Contributor Author

jhernand commented Jan 7, 2025

/test edge-e2e-ai-operator-disconnected-capi


openshift-ci bot commented Jan 7, 2025

@jhernand: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/edge-e2e-ai-operator-disconnected-capi | 4d0cd06 | link | false | /test edge-e2e-ai-operator-disconnected-capi |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@gamli75
Contributor

gamli75 commented Jan 7, 2025

/test edge-e2e-ai-operator-disconnected-capi

this job is broken again, see: https://issues.redhat.com/browse/MGMT-19358

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
6 participants