Hi,
We have been using Cortex for a few years and have some questions about improving the reliability of our ingesters in high-load environments.
Details on our environment:
We run Cortex via the cortex-helm-chart, chart version 2.5.0, and are running Cortex version 1.18.1. Our ingesters run as a StatefulSet with autoscaling (on a slow interval), and each has 32Gi of memory. We are generally stable at around 60-70 ingester replicas. The system handles ~80-100 million time series with 3x replication across ingesters, so ~240-300 million active series held in ingesters. We have both auto-forget and flush-on-shutdown enabled. Our blocks-storage backend is S3.

Our config (showing differences from defaults)
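The most relevant non-default pieces, as a trimmed sketch (the bucket name and exact values here are placeholders rather than our literal config):

```yaml
blocks_storage:
  backend: s3
  s3:
    bucket_name: <our-blocks-bucket>   # placeholder
  tsdb:
    flush_blocks_on_shutdown: true     # the flush-on-shutdown behaviour mentioned above
# Auto-forget for the ingester ring is enabled as well (setting omitted from this sketch).
```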
Issues observed
Slow ingester startup and termination times
Even when the system is healthy, we generally see an ingester take upwards of 5 minutes just to terminate, and after a new pod is created it usually takes ~7 minutes to reach a ready state. Are these times expected given our system load? We realize flush-on-shutdown may lead to longer termination times, but we are wondering whether there are any config options or resource changes we should be looking at that could help these pods start up faster.
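For context, the pod-level knobs we have been looking at are roughly the following (a sketch; the grace period, port, and probe timings are illustrative placeholders rather than our exact chart output):

```yaml
# Excerpt of the ingester StatefulSet pod spec (illustrative values only).
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 2400   # needs to comfortably cover flush-on-shutdown
      containers:
        - name: ingester
          readinessProbe:
            httpGet:
              path: /ready                  # Cortex readiness endpoint
              port: 80                      # match server.http_listen_port in your config
            initialDelaySeconds: 60
            periodSeconds: 30
```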
Ingesters quickly entering an OOMKill state
Generally, our ingesters seem stable at 18-23Gi of usage during normal operation (our HPA has a memory utilization threshold of 70%). However, especially when a single ingester pod terminates (due to rolling updates, etc.), we will see other ingester pods jump to the full 32Gi of memory and get OOMKilled within 5 minutes. This can lead to more ingester pods being OOMKilled, and combined with the long startup times mentioned above it usually takes the system hours to fully recover. Again, can anyone recommend config options we can tune to help in these situations?
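Our autoscaling setup is roughly the following (a sketch; the replica bounds and the scale-down stabilization window are illustrative stand-ins for our "slow interval"):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingester
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: ingester
  minReplicas: 60                        # illustrative; we hover around 60-70 replicas
  maxReplicas: 100                       # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70         # the 70% memory threshold mentioned above
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 3600   # stand-in for our slow scaling interval
```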
According to the Cortex capacity planning docs, each ingester needs ~15Gi of memory per million series. We recognize that our setup under-provisions our ingesters: each one holds upwards of 3 million active series but only has 32Gi of memory allocated. We are hesitant to increase their memory to match the capacity planning docs, since they are generally well within their resource allocations and, at our scale, that much additional memory would be expensive. Can you provide details on how the 15Gi-per-million-series number was calculated, and whether that value is reached consistently or includes extra buffer to handle large spikes like the ones we are seeing?
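To put rough numbers on it: at 15Gi per million series, an ingester holding ~3 million active series would be sized at ~45Gi, whereas we allocate 32Gi and observe 18-23Gi (roughly 6-8Gi per million series) in steady state. So the question is essentially how much of the gap between ~7Gi/million observed and 15Gi/million recommended is headroom for spikes like the ones described above.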
Thanks!
We really appreciate any help or suggestions. Please let me know if there are any more details I can provide on our setup/environment, as I realize these kinds of performance questions can be fairly specific. Thank you so much!