/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.
[Graceful Node Deletion During Rolling Update]
It seems that the `drainTerminateAndWait` function called during a rolling update doesn't fully clean up nodes. While there is some historical context here, node deletion is currently implemented only for GCP and not for other cloud providers. Adding node deletion for the remaining providers would allow controllers to manage node resources more gracefully.
I encountered this issue when performing a rolling update in an environment with limited volume usage. When updating a StatefulSet with attached volumes, volume detachment during the kOps rolling update can take a long time.
The problem arises because the rolling update follows a drain node -> terminate instance pattern. The kube-controller-manager, which is responsible for volume attach/detach, never sees the Node object deleted, so it does not realize the node is gone. Only after its 6-minute forced-detach timeout does it force-detach the volume, leaving users waiting the full 6 minutes.
If the process were changed to drain node -> delete node -> terminate instance, the kube-controller-manager would observe the Node deletion immediately and the volume could be detached right away (see the sketch in section 2 below).
This isn't just a fix for the kube-controller-manager: any controller that interacts with the Kubernetes API and needs graceful node termination would benefit.
If there is anything I am misunderstanding, please let me know.
2. Feel free to provide a design supporting your feature request.

Modify the code below to make node deletion possible from all providers:

https://github.com/cloud-pi/kops/blob/release-1.25-spc/pkg/instancegroups/instancegroups.go#L431-L440
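A minimal sketch of what that could look like, assuming a client-go clientset is available in the rolling-update code path. The helper names below (`deleteNodeObject`, `rollNode`) are hypothetical, not existing kops functions; the real change would land inside `drainTerminateAndWait`:

```go
// Hypothetical sketch only: deleteNodeObject and rollNode are illustrative
// names, not functions that exist in kops today.
package rollingupdate

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteNodeObject removes the Node object from the API server after the
// node has been drained but before the cloud instance is terminated, so
// API-watching controllers (e.g. the attach/detach controller inside
// kube-controller-manager) see the node disappear immediately instead of
// waiting out the ~6-minute forced-detach timeout.
func deleteNodeObject(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	err := client.CoreV1().Nodes().Delete(ctx, nodeName, metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// Already gone (e.g. removed by a cloud-controller-manager): fine.
		return nil
	}
	return err
}

// rollNode shows where the new step would sit in the rolling-update flow.
func rollNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// 1. Drain: cordon the node and evict its pods (kops already does this;
	//    omitted here).
	// 2. Delete the Node object so controllers react at once.
	if err := deleteNodeObject(ctx, client, nodeName); err != nil {
		return fmt.Errorf("deleting node %q: %w", nodeName, err)
	}
	// 3. Terminate the backing cloud instance (provider-specific; omitted).
	return nil
}
```

Deleting the Node between drain and terminate is also safe to retry, since a NotFound error just means something else (for example a cloud-controller-manager) already cleaned it up.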
Normally the cloud-controller-manager's node-lifecycle-controller watches for Nodes with a NodeReady status of Unknown and immediately queries the cloud provider for the VM's status to determine if it is being terminated:
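The code that the colon above pointed at is not included here. As a rough paraphrase of the described behavior: the sketch below uses the real `InstancesV2` interface from k8s.io/cloud-provider, but the control flow is a simplification of the actual node lifecycle controller, not a copy of it.

```go
// Simplified paraphrase of the cloud-controller-manager's node lifecycle
// behavior; the InstancesV2 interface is real (k8s.io/cloud-provider), but
// the function below is not the actual controller code.
package lifecycle

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	cloudprovider "k8s.io/cloud-provider"
)

// reconcileUnknownNode is called (conceptually) when a Node's Ready
// condition has been Unknown for a while.
func reconcileUnknownNode(ctx context.Context, client kubernetes.Interface,
	instances cloudprovider.InstancesV2, node *v1.Node) error {
	exists, err := instances.InstanceExists(ctx, node)
	if err != nil {
		return err
	}
	if !exists {
		// The backing VM is gone: delete the Node object so volumes and
		// other resources bound to it are released immediately.
		return client.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{})
	}
	shutdown, err := instances.InstanceShutdown(ctx, node)
	if err != nil {
		return err
	}
	if shutdown {
		// The VM is stopped but not deleted: the real controller keeps the
		// Node and applies the node.cloudprovider.kubernetes.io/shutdown
		// taint (tainting omitted in this sketch).
		return nil
	}
	// The VM exists and is running: the Unknown status is likely transient
	// (kubelet or network trouble), so leave the Node alone.
	return nil
}
```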
There have been quite a few improvements to node lifecycle and volume attachment in the migration to the external CSI and CCM controllers. If you're still on Kubernetes 1.25, as your link suggests, it would be wise to upgrade to a more recent version and migrate to the external controllers.