
ThrottlingException and failed to get rate limit token issues #1504

Open
pauldtill opened this issue Nov 19, 2024 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@pauldtill

/kind bug

What happened?
We have a highly dynamic integration EKS cluster which creates and deletes resources regularly, including PVCs that use an EFS storage class.

We recently had an issue provisioning volumes (it was happening very slowly) and then found we had thousands of what seem to be orphaned PVs in the cluster, with no matching EFS access point in the account. They were all stuck in a Released state, with the finalizer preventing removal.

On checking the controller logs we found thousands of these errors:

E1119 17:26:39.580943       1 driver.go:107] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-06eb2378eed532625::fsap-0765ce577dfacc49a: Failed to delete access point: fsap-0765ce577dfacc49a, error: operation error EFS: DeleteAccessPoint, failed to get rate limit token, retry quota exceeded, 4 available, 5 requested
E1119 17:32:42.641202       1 driver.go:107] GRPC error: rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied

Along with a handful of these errors:

E1119 17:32:13.526489       1 driver.go:107] GRPC error: rpc error: code = Internal desc = Failed to Delete volume fs-06eb2378eed532625::fsap-04ba7accf14fddb47: Failed to delete access point: fsap-04ba7accf14fddb47, error: operation error EFS: DeleteAccessPoint, exceeded maximum number of attempts, 3, https response error StatusCode: 400, RequestID: e3decdeb-f8f5-41bc-9f77-7ebabbdc4860, api error ThrottlingException: Rate exceeded

Full logfile - efs-plugin.log

The Access Denied errors seem to be related to the fact that the controller is constantly trying to clean up non-existent access points; if you look at the CloudTrail event, it has no request parameters. We know the controller has the correct permissions, as it is creating/deleting file systems fine too (the policy we are using is attached).

efs-pol.json

What seems to be happening is:

  • The driver fails to delete an access point due to ThrottlingException: Rate exceeded.
  • The access point does actually get removed from the account.
  • The driver carries on trying to remove it (you can see this in the logs), but constantly fails because it no longer exists.
  • The PV itself is left stuck in a Released state and can only be removed by patching out the finalizers (see the sketch after this list).
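
For illustration only (not the driver's actual code): a minimal Go sketch, assuming the AWS SDK for Go v2 EFS client, of a delete that treats "access point not found" as success, so the finalizer can still be cleared when a throttled request already succeeded server-side. The function name is hypothetical.

```go
package driver

import (
	"context"
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/efs"
	"github.com/aws/aws-sdk-go-v2/service/efs/types"
)

// deleteAccessPointIdempotent (hypothetical) deletes an EFS access point but
// treats AccessPointNotFound as success: if an earlier, throttled attempt
// actually completed server-side, we report success so the PV finalizer can
// be removed instead of retrying a delete that can never succeed.
func deleteAccessPointIdempotent(ctx context.Context, client *efs.Client, apID string) error {
	_, err := client.DeleteAccessPoint(ctx, &efs.DeleteAccessPointInput{
		AccessPointId: aws.String(apID),
	})
	var notFound *types.AccessPointNotFound
	if errors.As(err, &notFound) {
		return nil // already gone: the delete is effectively done
	}
	return err
}
```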

What you expected to happen?
The driver should handle the rate limiting with some sort of back-off and ensure the access point is successfully removed, allowing the PV to then be deleted too. A rough sketch of such a back-off follows.
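
As an illustration of the kind of back-off meant here (a generic sketch, not the driver's implementation; retryWithBackoff is a hypothetical helper):

```go
package driver

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with capped exponential backoff and full
// jitter, so repeated delete calls slow down under throttling instead of
// failing permanently once a fixed retry quota is exhausted.
func retryWithBackoff(ctx context.Context, attempts int, base, maxDelay time.Duration, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		backoff := base << i // exponential growth: base, 2*base, 4*base, ...
		if backoff > maxDelay {
			backoff = maxDelay
		}
		sleep := time.Duration(rand.Int63n(int64(backoff))) // full jitter
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("retries exhausted: %w", err)
}
```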

How to reproduce it (as minimally and precisely as possible)?
This is going to be difficult. As mentioned above, this is a cluster where dozens of testing pipelines create namespaces (which include EFS-based PVs), perform automated tests, and then delete themselves; we think the sheer volume of this activity is triggering the rate limiting.

Anything else we need to know?: No

Environment

  • Kubernetes version (use kubectl version): v1.30.4-eks-a737599
  • Driver version: 3.0.8

Please also attach debug logs to help us better diagnose

Difficult here due to the manner in which the pods using EFS are created and deleted. We have enabled logLevel 5 on the controller to see if this reveals anything and will upload logs when we have something.

  • Instructions to gather debug logs can be found here
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Nov 19, 2024
@mskanth972
Contributor

mskanth972 commented Nov 19, 2024

Looks like you are hitting this issue, which we fixed in the latest version, v2.1.0. Can you please update to the latest and see if you are still facing the issue?

@pauldtill
Author

@mskanth972 - we are running this version of the image - amazon/aws-efs-csi-driver:v2.1.0

@pauldtill
Author

Some debug logs - efs-plugin-debug.log

@emoreth

emoreth commented Dec 16, 2024

I can't confirm it no longer leaks undeletable PVs, as I can't force this to happen, but from the fix it looks like there is no cleanup of old undeletable resources (leftover finalizers). So this is what worked for me for cleaning up resources:

P.S.A.: Double-check all commands and outputs, as this can destroy your data.

# PVs are cluster-scoped, so no namespace flag is needed; this removes the
# finalizers from every Released PV of the efs-etcd storage class.
for pv in $(kubectl get pv -o json | jq -r '.items[] | select(.spec.storageClassName == "efs-etcd" and .status.phase == "Released") | .metadata.name'); do
  kubectl patch pv "$pv" --type=json -p '[{"op": "remove", "path": "/metadata/finalizers"}]'
done
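
Note that removing the finalizer only skips the driver's cleanup step; that is reasonable here precisely because the access points behind these Released PVs are already gone from the account, so there is nothing left for the driver to delete.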

@daro1337

I can confirm I am also experiencing this error on the latest version of efs-csi-controller, 2.1.2.

@mskanth972
Contributor

mskanth972 commented Dec 19, 2024

Hi, we have raised a PR for this that switches to AdaptiveRetry mode; it has already been merged and we are planning to include it in the next release. The PR has more info:
#1520
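
For anyone curious, here is a minimal sketch of what enabling adaptive retry mode looks like in the AWS SDK for Go v2. This is an assumption about the shape of the change, not a copy of it; see the linked PR for the actual fix.

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/efs"
)

func main() {
	// Adaptive mode adds client-side rate limiting on top of the standard
	// retryer, so bursts of EFS API calls are slowed down locally instead
	// of exhausting the retry token quota against server-side throttling.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.AddWithMaxAttempts(retry.NewAdaptiveMode(), 5)
		}),
	)
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}
	client := efs.NewFromConfig(cfg)
	_ = client // use the client for CreateAccessPoint / DeleteAccessPoint etc.
}
```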
