Help configuring ingesters for high load deployments #6548

sam-mcbr · 2025-01-24T21:32:27Z

sam-mcbr
Jan 24, 2025

Hi,

We have been using cortex for a few years and have some questions about improving the reliability of ingesters in high-load environments.

Details on our environment:

We run Cortex using the cortex-helm-chart and are using chart version 2.5.0. We are running Cortex version 1.18.1. We run our ingesters as a StatefulSet with autoscaling (with a slow interval) and each have 32Gi of memory. We generally are stable around 60-70 ingester replicas. Our system handles ~80-100 million time series with 3x duplication across ingesters, so ~240-300 million active series in ingesters. We have both auto-forget and flush on shutdown enabled. Our blocks-storage backend is S3.

Our config (showing differences from defaults)

alertmanager:
  external_url: /api/prom/alertmanager
api:
  response_compression_enabled: true
blocks_storage:
  bucket_store:
    bucket_index:
      enabled: true
      idle_timeout: 2h0m0s
    chunks_cache:
      memcached:
        max_async_buffer_size: 10000000
        max_async_concurrency: 100
        max_get_multi_batch_size: 1000
        max_get_multi_concurrency: 750
        max_idle_connections: 750
        max_item_size: 33554432
        timeout: 300ms
    ignore_deletion_mark_delay: 1h0m0s
    index_cache:
      memcached:
        max_async_buffer_size: 10000000
        max_async_concurrency: 100
        max_get_multi_concurrency: 750
        max_idle_connections: 750
        max_item_size: 16777216
        timeout: 300ms
    index_header_lazy_loading_enabled: true
    metadata_cache:
      memcached:
        max_async_buffer_size: 10000000
        max_async_concurrency: 100
        max_get_multi_batch_size: 1000
        max_get_multi_concurrency: 750
        max_idle_connections: 750
        max_item_size: 16777216
        timeout: 300ms
    sync_dir: /data/tsdb-sync
    sync_interval: 5m0s
  filesystem:
    dir: /data
  s3:
    bucket_name: <our S3 bucket name>
    endpoint: s3.us-east-1.amazonaws.com
    region: us-east-1
    send_content_md5: false
  tsdb:
    dir: /data/tsdb
    flush_blocks_on_shutdown: true
    retention_period: 4h0m0s
    wal_compression_enabled: true
compactor:
  compaction_concurrency: 3
  compaction_interval: 15m0s
  deletion_delay: 4h0m0s
distributor:
  ha_tracker:
    enable_ha_tracker: true
    kvstore:
      consul:
        host: <our consul service host>
  instance_limits:
    max_inflight_push_requests: 5000
  ring:
    kvstore:
      store: memberlist
  shard_by_all_labels: true
frontend:
  grpc_client_config:
    grpc_compression: snappy-block
    max_send_msg_size: 104857600
  log_queries_longer_than: 30s
  max_body_size: 104857600
  query_stats_enabled: true
frontend_worker:
  grpc_client_config:
    grpc_compression: snappy-block
    max_send_msg_size: 104857600
  match_max_concurrent: true
ingester:
  lifecycler:
    join_after: 10s
    num_tokens: 512
    observe_period: 10s
    ring:
      kvstore:
        store: memberlist
    tokens_file_path: /data/tokens
  upload_compacted_blocks_enabled: false
ingester_client:
  grpc_client_config:
    max_recv_msg_size: 10485760
    max_send_msg_size: 10485760
limits:
  accept_ha_samples: true
  drop_labels:
  - prom_id
  ha_cluster_label: prom_id
  ha_replica_label: prometheus_replica
  ingestion_burst_size: 500000
  ingestion_rate: 100000
  max_fetched_chunks_per_query: 4000000
  max_label_names_per_series: 45
  max_metadata_per_user: 0
  max_query_length: 13w
  max_query_lookback: 183d
  max_series_per_metric: 0
  reject_old_samples: true
  reject_old_samples_max_age: 1w
memberlist:
  bind_port: 7950
  dead_node_reclaim_time: 5s
  gossip_interval: 150ms
  join_members:
  - <our memberlist service>
  pull_push_interval: 20s
  randomize_node_name: false
  retransmit_factor: 6
querier:
  active_query_tracker_dir: /data/active-query-tracker
  query_ingesters_within: 4h0m0s
  query_store_after: 3h55m0s
  store_gateway_client:
    grpc_compression: snappy
  timeout: 3m0s
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: true
      memcached:
        expiration: 49h0m0s
      memcached_client:
        timeout: 1s
    compression: snappy
  split_queries_by_interval: 24h0m0s
query_scheduler:
  grpc_client_config:
    grpc_compression: snappy-block
ruler:
  alertmanager_client:
    basic_auth_password: <our alertmanager password>
    basic_auth_username: <our alertmanager username>
  alertmanager_url: <our alertmanager url>
  concurrent_evals_enabled: true
  enable_api: true
  enable_sharding: true
  max_concurrent_evals: 2
  query_stats_enabled: true
  ring:
    kvstore:
      store: memberlist
  ruler_client:
    grpc_compression: snappy-block
ruler_storage:
  backend: local
  local:
    directory: /local-rules
runtime_config:
  file: /etc/cortex-runtime-config/runtime_config.yaml
server:
  grpc_server_max_concurrent_streams: 10000
  grpc_server_max_recv_msg_size: 104857600
  grpc_server_max_send_msg_size: 104857600
  http_listen_port: 8080
  log_level: error
store_gateway:
  sharding_enabled: true
  sharding_ring:
    kvstore:
      store: memberlist
    tokens_file_path: /data/tokens
target: distributor

Issues observed

Slow ingester starting and terminating times

Even when the system is healthy, we generally see an ingester take upwards of 5mins to just terminate. And then after a new pod is created, usually takes ~7mins to reach a ready state. Are these times expected giving our system load? We realize the flush on shutdown may lead to longer termination times, but are wondering if there are any config options/resource changes we should be looking at that could help these pods startup faster.

Ingesters quickly entering an OOMKill state

Generally, our ingesters seem stable around 18-23Gi of usage during normal operation (our HPA has a memory threshold of 70%). However, especially when a single ingester pod terminates (due to rolling updates, etc.), we will see ingester pods jump to using the full 32Gi of memory and OOMKill within 5mins. This can then lead to more ingester pods entering an OOMKill state and due to the long startup times mentioned above means it usually takes the system hours to fully recover. Again, can anyone recommend any config options we can tune to help with these situations?

According to the Cortex capacity planning docs, each ingester needs ~15Gi of memory per million series. We recognize that our setup is under-provisioning our ingesters, as each of our ingesters has upwards of 3 million active series but only has 32Gi of memory allocated. We are hesitant to increase their memory to match the capacity planning docs, as generally they are well within their resource allocations and for our scale that much more memory would be expensive. Can you all provide details on how the 15Gi memory/1 million series number was calculated and if that value was seen to be reached consistently or if there was extra buffer to handle large spikes like we are seeing?

Thanks!

We really appreciate any help or suggestions. Please let me know if there are any more details I can provide on our setup/environment as I realize these kinds of performance questions can be fairly specific. Thank you so much!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help configuring ingesters for high load deployments #6548

{{title}}

Replies: 0 comments

Select a reply

Help configuring ingesters for high load deployments #6548

sam-mcbr Jan 24, 2025

Details on our environment:

Issues observed

Slow ingester starting and terminating times

Ingesters quickly entering an OOMKill state

Thanks!

Replies: 0 comments

sam-mcbr
Jan 24, 2025