
ThrottlingException and failed to get rate limit token issues #1504

Open
pauldtill opened this issue Nov 19, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@pauldtill

/kind bug

What happened?
We have a highly dynamic integration EKS cluster which creates and deletes resources regularly, including PVCs that use an EFS storage class.

We recently had an issue provisioning volumes (it was happening very slowly) and then found we had thousands of what seem to be orphaned PVs in the cluster, with no matching EFS access point in the account. They were all stuck in a Released state, with the finalizer preventing removal.

On checking the controller logs we found thousands of these errors:

E1119 17:26:39.580943       1 driver.go:107] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-06eb2378eed532625::fsap-0765ce577dfacc49a: Failed to delete access point: fsap-0765ce577dfacc49a, error: operation error EFS: DeleteAccessPoint, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
E1119 17:32:42.641202       1 driver.go:107] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied

Along with a handful of these errors:

E1119 17:32:13.526489       1 driver.go:107] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-06eb2378eed532625::fsap-04ba7accf14fddb47: Failed to delete access point: fsap-04ba7accf14fddb47, error: operation error EFS: DeleteAccessPoint, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: e3decdeb-f8f5-41bc-9f77-7ebabbdc4860, api error ThrottlingException: Rate exceeded

Full logfile - efs-plugin.log

The Access Denied errors seem to be related to the fact that the controller is constantly trying to clean up non-existent access points; if you look at the CloudTrail event, it has no request parameters. We know the controller has the correct permissions, as it is creating/deleting file systems fine too (the policy we are using is attached).

efs-pol.json

What seems to be happening is:

  • The driver fails to delete an access point due to ThrottlingException: Rate exceeded.
  • The access point does actually get removed from the account.
  • The driver carries on trying to remove it (you can see this in the logs), but constantly fails because it no longer exists.
  • The PV itself is left stuck in a Released state and can only be removed by patching out the finalizers (see the sketch after this list).
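
For illustration only (not the driver's actual code): a minimal Go sketch, assuming the AWS SDK for Go v2 EFS client, of a delete that treats "access point not found" as success, so the finalizer can still be cleared when a throttled request already succeeded server-side. The function name is hypothetical.

```go
package driver

import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/efs"
	"github.com/aws/aws-sdk-go-v2/service/efs/types"
)

// deleteAccessPointIdempotent (hypothetical) deletes an EFS access point but
// treats AccessPointNotFound as success: if an earlier, throttled attempt
// actually completed server-side, we report success so the PV finalizer can
// be removed instead of retrying a delete that can never succeed.
func deleteAccessPointIdempotent(ctx context.Context, client *efs.Client, apID string) error {
	_, err := client.DeleteAccessPoint(ctx, &efs.DeleteAccessPointInput{
		AccessPointId: aws.String(apID),
	})
	var notFound *types.AccessPointNotFound
	if errors.As(err, &notFound) {
		return nil // already gone: the delete is effectively done
	}
	return err
}
```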

What you expected to happen?
The driver should handle the rate limiting with some sort of back-off and ensure the access point is successfully removed, allowing the PV to then be deleted too. A rough sketch of such a back-off follows.
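
As an illustration of the kind of back-off meant here (a generic sketch, not the driver's implementation; retryWithBackoff is a hypothetical helper):

```go
package driver

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with capped exponential backoff and full
// jitter, so repeated delete calls slow down under throttling instead of
// failing permanently once a fixed retry quota is exhausted.
func retryWithBackoff(ctx context.Context, attempts int, base, maxDelay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << i // exponential growth: base, 2*base, 4*base, ...
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("retries exhausted: %w", err)
}
```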

How to reproduce it (as minimally and precisely as possible)?
This is going to be difficult. As mentioned above, this is a cluster where dozens of testing pipelines create namespaces (which include EFS-based PVs), perform automated tests, and then delete themselves; we think the sheer volume of this activity is triggering the rate limiting.

Anything else we need to know?: No

Environment

  • Kubernetes version (use kubectl version): v1.30.4-eks-a737599
  • Driver version: 3.0.8

Please also attach debug logs to help us better diagnose

Difficult here due to the manner in which the pods using EFS are created and deleted. We have enabled logLevel 5 on the controller to see if this reveals anything and will upload logs when we have something.

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 19, 2024
@mskanth972
Contributor

mskanth972 commented Nov 19, 2024

Looks like you are hitting this issue, which we fixed in the latest version, v2.1.0. Can you please update to the latest and see if you are still facing the issue?

@pauldtill
Author

@mskanth972 - we are running this version of the image - amazon/aws-efs-csi-driver:v2.1.0

@pauldtill
Author

Some debug logs - efs-plugin-debug.log

@emoreth

emoreth commented Dec 16, 2024

I can't confirm it no longer leaks undeletable PVs, as I can't force this to happen, but from the fix it looks like there is no cleanup of old undeletable resources (leftover finalizers). So this is what worked for me for cleaning up resources:

P.S.A.: Double-check all commands and outputs, as this can destroy your data.

# PVs are cluster-scoped, so no namespace flag is needed; this removes the
# finalizers from every Released PV of the efs-etcd storage class.
for pv in $(kubectl get pv -o json | jq -r '.items[] | select(.spec.storageClassName == "efs-etcd" and .status.phase == "Released") | .metadata.name'); do
  kubectl patch pv "$pv" --type=json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
done
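
Note that removing the finalizer only skips the driver's cleanup step; that is reasonable here precisely because the access points behind these Released PVs are already gone from the account, so there is nothing left for the driver to delete.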

@daro1337

I can confirm I am also experiencing this error on the latest version of efs-csi-controller, 2.1.2.

@mskanth972
Contributor

mskanth972 commented Dec 19, 2024

Hi, we have raised a PR for this that switches to AdaptiveRetry mode; it has already been merged and we are planning to include it in the next release. The PR has more info:
#1520
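
For anyone curious, here is a minimal sketch of what enabling adaptive retry mode looks like in the AWS SDK for Go v2. This is an assumption about the shape of the change, not a copy of it; see the linked PR for the actual fix.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/efs"
)

func main() {
	// Adaptive mode adds client-side rate limiting on top of the standard
	// retryer, so bursts of EFS API calls are slowed down locally instead
	// of exhausting the retry token quota against server-side throttling.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.AddWithMaxAttempts(retry.NewAdaptiveMode(), 5)
		}),
	)
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}
	client := efs.NewFromConfig(cfg)
	_ = client // use the client for CreateAccessPoint / DeleteAccessPoint etc.
}
```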
