duplicate timestamps in kube_job_status_failed metrics for retried Jobs #2565
This issue is currently awaiting triage. If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label.
@shlomitubul My attempt. I also tried:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: sleep-cronjob
  namespace: default
spec:
  schedule: '*/1 * * * *'
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 2
      template:
        spec:
          containers:
            - name: sleep-container
              image: alpine:3.12
              command:
                - sh
                - '-c'
                - |
                  sleep 30
                  if [ ! -f /tmp/1.txt ]; then
                    touch /tmp/1.txt && exit 1
                  fi
                  exit 0
              volumeMounts: [{ name: files-volume, mountPath: /tmp }]
          volumes: [{ name: files-volume, emptyDir: {} }]
          restartPolicy: OnFailure
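The container script in the manifest above is meant to fail on its first run and pass on later ones. Here is a minimal Python sketch of that logic, assuming the /tmp directory survives between runs (as an emptyDir does across container restarts within the same pod). The function names `run_once` and `simulate` are hypothetical, and the `sleep 30` is left out.

```python
import tempfile
from pathlib import Path
from typing import List


def run_once(workdir: Path) -> int:
    """Mimic the shell script: exit 1 if the marker file is missing, else exit 0."""
    marker = workdir / "1.txt"
    if not marker.exists():
        marker.touch()  # corresponds to `touch /tmp/1.txt && exit 1`
        return 1
    return 0


def simulate(runs: int) -> List[int]:
    """Run the script several times against one shared directory (the assumed emptyDir)."""
    with tempfile.TemporaryDirectory() as tmp:
        return [run_once(Path(tmp)) for _ in range(runs)]


exit_codes = simulate(3)
print(exit_codes)  # [1, 0, 0]
```

Under this assumption only the first run fails, which gives the fail-then-succeed sequence the report describes.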
@zoglam It reproduces every time. This is the CronJob I used:
We run Prometheus in each cluster, and it sends metrics to Mimir. Each CronJob/Job runs a single pod, and we don't allow or need concurrency. So once we trigger a job and fail the main process, for example, and then let the next one pass, we get this error.
What happened:
When a Job fails and then succeeds, and failedJobsHistoryLimit is greater than 0, kube_job_status_failed produces duplicate samples, because it has no unique label such as job_pod.

What you expected to happen:
kube_job_* metrics, or at least kube_job_status_failed, should carry a unique label (for example job_pod or retry_index) so that Prometheus doesn't reject the scrape.

How to reproduce it (as minimally and precisely as possible):
Create a CronJob resource with failedJobsHistoryLimit: 1 and backoffLimit: 2, trigger the job and exit the pod, then trigger it again and let it pass.

Anything else we need to know?:

Environment:
kubectl version: v1.30.5-gke.1014003
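To make the "duplicate samples" symptom concrete, here is a small Python sketch that scans Prometheus text exposition output for series that appear more than once with an identical label set. The sample lines, job names, and label values are invented for illustration and are not taken from the reporter's cluster. That the real duplicates share an identical label set is an assumption based on the description above.

```python
import re
from collections import Counter

# Hypothetical scrape output: the first two lines share the same label set.
SAMPLE = """\
kube_job_status_failed{namespace="default",job_name="sleep-cronjob-1",reason="BackoffLimitExceeded"} 1
kube_job_status_failed{namespace="default",job_name="sleep-cronjob-1",reason="BackoffLimitExceeded"} 0
kube_job_status_failed{namespace="default",job_name="sleep-cronjob-2",reason="BackoffLimitExceeded"} 0
"""

# Metric name, optional {labels}, then at least one value token.
SERIES_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+\S+')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def duplicate_series(text):
    """Return {(metric_name, sorted_label_pairs): count} for series seen more than once."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = SERIES_RE.match(line)
        if match is None:
            continue
        labels = tuple(sorted(LABEL_RE.findall(match.group("labels") or "")))
        counts[(match.group("name"), labels)] += 1
    return {series: n for series, n in counts.items() if n > 1}


for (name, labels), n in duplicate_series(SAMPLE).items():
    print(f"{name} {dict(labels)} appears {n} times")
```

A distinguishing label such as the job_pod or retry_index proposed above would give each of those lines its own label set, so a check like this would no longer flag them.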