
✨ Add MachineDrainRule "WaitCompleted" #11545

Open

vincepri wants to merge 1 commit into main from pod-drain-waitcompleted

Conversation

@vincepri (Member) commented Dec 5, 2024

What this PR does / why we need it:

This PR adds the ability for drain to wait for specific Pods to complete. This is useful in scenarios where drain is handled outside the context of kubectl drain after a Node is cordoned, or where long-running batch Jobs should be allowed to terminate on their own.
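
As a minimal sketch of how a Pod could opt into this behavior during drain: the helper name MakePodDeleteStatusWaitCompleted and the PodDeleteStatus type come from this PR's diff, but the label key/value and the filter function shown here are assumptions for illustration, not the PR's exact implementation:

package drain

import (
	corev1 "k8s.io/api/core/v1"
)

// waitCompletedFilterSketch is hypothetical: if a Pod carries the (assumed)
// wait-completed drain label, drain does not evict it and instead waits for
// the Pod to finish on its own (e.g. a long-running batch Job).
func waitCompletedFilterSketch(pod *corev1.Pod) (PodDeleteStatus, bool) {
	if pod.Labels["cluster.x-k8s.io/drain"] == "wait-completed" {
		return MakePodDeleteStatusWaitCompleted(), true
	}
	// No opinion: let the remaining drain filters decide how to handle the Pod.
	return PodDeleteStatus{}, false
}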

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot added the do-not-merge/work-in-progress label on Dec 5, 2024
@k8s-ci-robot added the cncf-cla: yes, do-not-merge/needs-area, and size/M labels on Dec 5, 2024
@vincepri force-pushed the pod-drain-waitcompleted branch 2 times, most recently from bab1c1b to 211839e on Dec 11, 2024
@vincepri added the area/machine label on Dec 12, 2024
@k8s-ci-robot removed the do-not-merge/needs-area label on Dec 12, 2024
@vincepri added the priority/important-soon and do-not-merge/needs-area labels on Dec 12, 2024
@@ -281,6 +281,7 @@ func (d *Helper) EvictPods(ctx context.Context, podDeleteList *PodDeleteList) Ev
var podsToTriggerEvictionLater []PodDelete
var podsWithDeletionTimestamp []PodDelete

Review comment (Member):

Please check the existing test coverage for the modified funcs and extend accordingly for the new case (we should have test coverage for everything, only new cases should be needed)

Review comment (Member):

Let's also extend the NodeDrainTimeout e2e test to cover waitcompleted (node_drain.go)

(Unfortunately the whole drain implementation is a bit brittle (already before this PR) so e2e coverage would be really good. For example with the current PR even Pods with the label set to waitcompleted would have been drained because the machineDrainRulesFilter would have overwritten the result of the drainLabelFilter with behavior drain)
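
The single line this hunk adds is not visible in the excerpt above; presumably it declares a slice for the new category alongside the existing ones, along these lines (an assumption, inferred from the PodsToWaitCompleted field used further down):

	// Hypothetical: a slice for Pods that drain will not evict, waiting instead for them to complete.
	var podsToWaitCompleted []PodDelete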

@@ -498,6 +507,10 @@ func (r EvictionResult) ConditionMessage(nodeDrainStartTime *metav1.Time) string
conditionMessage = fmt.Sprintf("%s\nAfter above Pods have been removed from the Node, the following Pods will be evicted: %s",
conditionMessage, PodListToString(r.PodsToTriggerEvictionLater, 3))
}
if len(r.PodsToWaitCompleted) > 0 {

Review comment (Member):

@fabriziopandini Is this something that should be specifically handled on higher levels when the condition bubbles up? (like we handled PDB's etc.)

(could also be a follow-up PR)
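
As a sketch of what the truncated branch above presumably appends to the condition message; the exact wording is an assumption, and the structure simply mirrors the PodsToTriggerEvictionLater branch shown in the same hunk:

	if len(r.PodsToWaitCompleted) > 0 {
		// Pods in this list are not evicted; the condition reports that drain is waiting for them.
		conditionMessage = fmt.Sprintf("%s\nDrain will not evict the following Pods and is waiting for them to complete: %s",
			conditionMessage, PodListToString(r.PodsToWaitCompleted, 3))
	}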

func MakePodDeleteStatusWaitCompleted() PodDeleteStatus {
	return PodDeleteStatus{
		DrainBehavior: clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted,
		Reason:        PodDeleteStatusTypeWaitCompleted,
	}
}

Review comment (Member):

Is it intentional that Pods with WaitCompleted basically don't have a drain order?

This means that Pods with "WaitCompleted" would never block eviction of other Pods (which might be a problem if the Pods with "WaitCompleted" depend on some other Pods to keep running)

(let's update the godoc comment of the order field and the webhook implementation according to the outcome of this discussion)
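
For reference, a hedged sketch of a MachineDrainRule object using the new behavior; the field names follow the existing v1beta1 MachineDrainRule API as far as I can tell, and the selector contents are purely illustrative:

	// Assumed imports: metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
	// clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1".
	//
	// Hypothetical rule: Pods matching app=batch-job are not evicted during
	// drain; the controller waits for them to complete instead. No drain order
	// is set, which (per the discussion above) means these Pods would never
	// block eviction of other Pods.
	rule := clusterv1.MachineDrainRule{
		ObjectMeta: metav1.ObjectMeta{Name: "wait-for-batch-jobs", Namespace: "default"},
		Spec: clusterv1.MachineDrainRuleSpec{
			Drain: clusterv1.MachineDrainRuleDrainConfig{
				Behavior: clusterv1.MachineDrainRuleDrainBehaviorWaitCompleted,
			},
			Pods: []clusterv1.MachineDrainRulePodSelector{
				{Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "batch-job"}}},
			},
		},
	}
	_ = rule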


@vincepri changed the title from WIP: ✨ Add MachineDrainRule "WaitCompleted" to ✨ Add MachineDrainRule "WaitCompleted" on Jan 7, 2025
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Jan 7, 2025
@vincepri added the do-not-merge/work-in-progress label and removed the do-not-merge/needs-area label on Jan 7, 2025
@vincepri force-pushed the pod-drain-waitcompleted branch from 211839e to 533803e on Jan 7, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from vincepri. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/L label and removed the size/M label on Jan 7, 2025
@vincepri removed the do-not-merge/work-in-progress label on Jan 9, 2025
@vincepri force-pushed the pod-drain-waitcompleted branch from 533803e to 8826d5b on Jan 9, 2025
@vincepri force-pushed the pod-drain-waitcompleted branch from 8826d5b to e22bc4d on Jan 9, 2025
@vincepri (Member, Author) commented Jan 9, 2025

/test ?

@k8s-ci-robot (Contributor) commented:

@vincepri: The following commands are available to trigger required jobs:

/test pull-cluster-api-build-main
/test pull-cluster-api-e2e-blocking-main
/test pull-cluster-api-e2e-conformance-ci-latest-main
/test pull-cluster-api-e2e-conformance-main
/test pull-cluster-api-e2e-latestk8s-main
/test pull-cluster-api-e2e-main
/test pull-cluster-api-e2e-mink8s-main
/test pull-cluster-api-e2e-upgrade-1-32-1-33-main
/test pull-cluster-api-test-main
/test pull-cluster-api-test-mink8s-main
/test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

/test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-apidiff-main
pull-cluster-api-build-main
pull-cluster-api-e2e-blocking-main
pull-cluster-api-test-main
pull-cluster-api-verify-main

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vincepri (Member, Author) commented Jan 9, 2025

/test pull-cluster-api-e2e-main

@k8s-ci-robot (Contributor) commented:

@vincepri: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit    Required   Rerun command
pull-cluster-api-e2e-main   e22bc4d   true       /test pull-cluster-api-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels: area/machine, cncf-cla: yes, priority/important-soon, size/L
Projects: None yet
3 participants