
Graceful Node Deletion During Rolling Update #17230

Open
peaaceChoi opened this issue Jan 22, 2025 · 1 comment
Labels: kind/feature, kind/office-hours

Comments

@peaaceChoi
Contributor

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.

[Graceful Node Deletion During Rolling Update]

It seems that the drainTerminateAndWait function called during a rolling update doesn't fully clean up nodes. While there is some historical context here, node deletion is currently implemented only for GCP and not for other cloud providers. Adding node deletion for the remaining providers would allow controllers to manage node resources more gracefully.

I encountered this issue when performing a rolling update in an environment with limited volume usage. When updating a StatefulSet with volumes attached, volume detachment during the kOps rolling update can take a long time.

The problem arises because the rolling update follows a drain node -> terminate instance pattern. The kube-controller-manager, which is responsible for volume attach/detach, never learns that the instance behind the node has gone away, so the volume is only released after the 6-minute forced-detach timeout expires, and users are left waiting the full 6 minutes.

If the process were modified to follow drain node -> delete node -> terminate instance, the kube-controller-manager would immediately detect the node deletion and allow for instant volume detachment.

This isn’t just a solution for the kube-controller-manager, but for any controller interacting with the Kubernetes API that requires graceful node termination.

If there is anything I am misunderstanding, please let me know.

2. Feel free to provide a design supporting your feature request.

Modify the code below so that node deletion is performed for all cloud providers (a minimal sketch of the idea follows the link).
https://github.com/cloud-pi/kops/blob/release-1.25-spc/pkg/instancegroups/instancegroups.go#L431-L440
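Since only the GCP path deletes the Node object today, a provider-agnostic version could delete the node through the Kubernetes API between the drain and the instance termination. Below is a minimal sketch of that idea using client-go; the package name and the function `deleteNodeBeforeTermination` are hypothetical stand-ins for illustration, not the actual kops code:

```go
// Package rollingupdate is a hypothetical stand-in for the kops code that
// performs drain -> terminate; the sketch inserts a node deletion in between.
package rollingupdate

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteNodeBeforeTermination removes the Node API object after a successful
// drain so that controllers (e.g. the attach/detach controller in
// kube-controller-manager) observe the node's removal immediately instead of
// waiting for the forced-detach timeout.
func deleteNodeBeforeTermination(ctx context.Context, k8sClient kubernetes.Interface, nodeName string) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	err := k8sClient.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// The node is already gone (e.g. removed by a cloud controller); nothing to do.
		return nil
	}
	if err != nil {
		return fmt.Errorf("deleting node %q: %w", nodeName, err)
	}
	return nil
}
```

Making this step unconditional rather than provider-specific would mean any controller watching Nodes sees the deletion before the instance disappears, which is the graceful termination behavior described above.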

@k8s-ci-robot added the kind/feature label on Jan 22, 2025
@rifelpet
Member

Which cloud provider are you using?

Normally the cloud-controller-manager's node-lifecycle-controller watches for Nodes with a NodeReady status of Unknown and immediately queries the cloud provider for the VM's status to determine if it is being terminated:

https://github.com/kubernetes/cloud-provider/blob/95c3ff6e5a5085203b9fbcd14d50ca6348c11930/controllers/nodelifecycle/node_lifecycle_controller.go#L126-L129

For example, here's the AWS implementation:

https://github.com/kubernetes/cloud-provider-aws/blob/c9d75959d742f857f25be45a3d44dde8d59dc5cd/pkg/providers/v1/aws.go#L922-L923
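For reference, here is a simplified sketch of that flow (not the actual cloud-controller-manager code): when a Node's Ready condition is Unknown, ask the cloud provider whether the backing instance still exists and delete the Node object if it does not. The `instanceChecker` interface and `reconcileUnknownNode` function are illustrative stand-ins for the real cloud-provider Instances API:

```go
// Package nodelifecycle is an illustrative stand-in, not the real controller.
package nodelifecycle

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// instanceChecker abstracts the "does the VM still exist?" call made against
// the cloud provider (hypothetical interface for illustration).
type instanceChecker interface {
	InstanceExists(ctx context.Context, node *v1.Node) (bool, error)
}

// reconcileUnknownNode deletes the Node object when its Ready condition is
// Unknown and the cloud provider reports the backing instance as gone.
func reconcileUnknownNode(ctx context.Context, client kubernetes.Interface, cloud instanceChecker, node *v1.Node) error {
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status == v1.ConditionUnknown {
			exists, err := cloud.InstanceExists(ctx, node)
			if err != nil {
				return err
			}
			if !exists {
				// The VM was terminated; remove the Node so controllers stop
				// waiting on it (e.g. for volume attach/detach).
				return client.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{})
			}
		}
	}
	return nil
}
```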

There have been quite a few improvements to node lifecycle and volume attachment in the migration to the external CSI and CCM controllers. If you're still on Kubernetes 1.25, as your link suggests, it would be wise to upgrade to a more recent version and migrate to the external controllers.
