Rules

ngate-certificate-expiration-monitoring

37.244s ago

893us

Rule State Error Last Evaluation Evaluation Time
alert: NgateCertificateExpiring expr: ng_infra_cert_expiry < 2.592e+06 labels: severity: warning annotations: description: Certificate is about to expire or has expired summary: Certificate {{ $labels.name }} is about to expire or has expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 37.244s ago 343.2us
alert: NgateKeyExpiring expr: ng_infra_pk_expiry < 2.592e+06 labels: severity: warning annotations: description: Key is about to expire or has expired summary: Key {{ $labels.name }} is about to expire or has expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 37.243s ago 223.6us
alert: NgateCertificateExpiring expr: ng_infra_cert_expiry < 1.2096e+06 labels: severity: critical annotations: description: Certificate expires in 14 days or less summary: Certificate {{ $labels.name }} is about to expire or has expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 37.243s ago 177.5us
alert: NgateKeyExpiring expr: ng_infra_pk_expiry < 1.2096e+06 labels: severity: critical annotations: description: Key expires in 14 days or less summary: Key {{ $labels.name }} is about to expire or has expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 37.243s ago 136.3us
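
The two thresholds in this group look like durations in seconds: 2.592e+06 is 30 days and 1.2096e+06 is 14 days, which matches the "14 days or less" wording of the critical rules. The sketch below only checks that arithmetic. It assumes that ng_infra_cert_expiry and ng_infra_pk_expiry report the seconds remaining until expiry, which the listing does not state explicitly; the helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Thresholds copied from the rule expressions above.
WARNING_THRESHOLD = 2.592e+06
CRITICAL_THRESHOLD = 1.2096e+06


def threshold_days(threshold_seconds):
    """Convert a threshold given in seconds to days."""
    return threshold_seconds / SECONDS_PER_DAY


def firing_severities(seconds_left):
    """Return the severities whose expression would be true.

    Assumption: the metric value is the number of seconds until expiry.
    Both rules are independent, so below 14 days both of them match.
    """
    fired = []
    if seconds_left < WARNING_THRESHOLD:
        fired.append("warning")
    if seconds_left < CRITICAL_THRESHOLD:
        fired.append("critical")
    return fired


if __name__ == "__main__":
    print(threshold_days(WARNING_THRESHOLD), threshold_days(CRITICAL_THRESHOLD))
    print(firing_severities(10 * SECONDS_PER_DAY))
```

Under this reading, a certificate with 10 days left matches the warning and the critical expression at the same time, so deduplication, if wanted, would have to happen in Alertmanager routing or inhibition.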

rabbitmq-messages

44.433s ago

15.05ms

Rule State Error Last Evaluation Evaluation Time
alert: NumberOfConsumers expr: sum(rabbitmq_queue_consumers{job="rabbitmq"}) < 600 for: 10m labels: severity: critical annotations: description: Total number of consumers is {{ $value }}. summary: This may mean that not all rabbitmq pods are running. ok 44.433s ago 386.9us
alert: RabbitMQTooManyMessagesInQueue expr: rabbitmq_queue_messages > 100 for: 5m labels: severity: critical annotations: description: There are {{ $value }} messages in some queues. summary: This means a service is overloaded or down. ok 44.432s ago 240.6us
alert: RabbitMQ-SoManyMessagesInQueue-1m expr: sum(increase(cqrs_commands_queue_time_seconds_sum{namespace="doc-production"}[1m])) / sum(increase(cqrs_commands_queue_time_seconds_count{namespace="doc-production"}[1m])) > 0.5 for: 1m labels: severity: warning annotations: description: There are a lot of messages in some queues within 1 min. summary: This means a service is overloaded. Link https://grafana.basis.center/d/hxivlZ1Gk/health?viewPanel=88&from=now-30m ok 44.432s ago 7.465ms
alert: RabbitMQ-SoManyMessagesInQueue-10m expr: sum(increase(cqrs_commands_queue_time_seconds_sum{namespace="doc-production"}[5m])) / sum(increase(cqrs_commands_queue_time_seconds_count{namespace="doc-production"}[5m])) > 0.5 for: 5m labels: severity: critical annotations: description: There are a lot of messages in some queues within 10 min. summary: This means a service is overloaded or down. Link https://grafana.basis.center/d/hxivlZ1Gk/health?viewPanel=88&from=now-30m ok 44.425s ago 6.952ms
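
The two SoManyMessagesInQueue rules divide the increase of a _sum series by the increase of the matching _count series, which gives the mean time a command spent in the queue over the window. A minimal sketch of that arithmetic follows. The sample numbers are made up, the function names are hypothetical, and it assumes cqrs_commands_queue_time_seconds is a regular histogram or summary whose _sum is measured in seconds.

```python
import math

THRESHOLD_SECONDS = 0.5  # taken from the "> 0.5" comparison in both rules


def mean_queue_time(sum_increase, count_increase):
    """Mean queue time in seconds over one window.

    sum_increase:   increase of the *_sum series over the window
    count_increase: increase of the *_count series over the window
    PromQL evaluates 0/0 to NaN, so the same is returned here.
    """
    if count_increase == 0:
        return math.nan
    return sum_increase / count_increase


def expression_matches(sum_increase, count_increase):
    """True when the rule expression would return a sample."""
    # A comparison with NaN is always false, so an idle queue never matches.
    return mean_queue_time(sum_increase, count_increase) > THRESHOLD_SECONDS


if __name__ == "__main__":
    # Hypothetical window: 20 commands that waited 12 seconds in total.
    print(mean_queue_time(12.0, 20.0), expression_matches(12.0, 20.0))
```

Note that the expression being true is not enough for the alert to fire: it has to stay true for the whole `for:` duration.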

alertmanager.rules

37.173s ago

454.8us

Rule State Error Last Evaluation Evaluation Time
alert: AlertmanagerConfigInconsistent expr: count by(namespace, service) (count_values by(namespace, service) ("config_hash", alertmanager_config_hash{job="prometheus-operator-kube-p-alertmanager",namespace="default"})) != 1 for: 5m labels: severity: critical annotations: message: | The configuration of the instances of the Alertmanager cluster `{{ $labels.namespace }}/{{ $labels.service }}` are out of sync. {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service | query }} Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}" {{ end }} ok 37.173s ago 257.3us
alert: AlertmanagerFailedReload expr: alertmanager_config_last_reload_successful{job="prometheus-operator-kube-p-alertmanager",namespace="default"} == 0 for: 10m labels: severity: warning annotations: message: Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}. ok 37.172s ago 92.37us
alert: AlertmanagerMembersInconsistent expr: alertmanager_cluster_members{job="prometheus-operator-kube-p-alertmanager",namespace="default"} != on(service) group_left() count by(service) (alertmanager_cluster_members{job="prometheus-operator-kube-p-alertmanager",namespace="default"}) for: 5m labels: severity: critical annotations: message: Alertmanager has not found all other members of the cluster. ok 37.172s ago 98.03us

general.rules

23.476s ago

2.524ms

Rule State Error Last Evaluation Evaluation Time
alert: TargetDown expr: 100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10 for: 10m labels: severity: warning annotations: message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.' ok 23.476s ago 2.381ms
alert: Watchdog expr: vector(1) labels: severity: none annotations: message: | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty. ok 23.474s ago 133.9us
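
The TargetDown expression computes, per job, namespace and service, the percentage of scrape targets whose up value is 0 and compares it with 10. The sketch below reproduces that calculation for a single group. It is an illustration with invented input, not the way Prometheus evaluates the rule, and the function names are hypothetical.

```python
def percent_down(up_values):
    """Percentage of targets in one (job, namespace, service) group that are down.

    up_values holds the value of the `up` series for every target in the group:
    1 for a successful scrape, 0 for a failed one.
    """
    if not up_values:
        raise ValueError("a group needs at least one target")
    down = sum(1 for value in up_values if value == 0)
    # In PromQL a group with no down targets yields no sample at all;
    # here it simply evaluates to 0.0, which is below the threshold anyway.
    return 100 * down / len(up_values)


def target_down_matches(up_values, threshold_percent=10):
    """True when the rule expression would return a sample for this group."""
    return percent_down(up_values) > threshold_percent


if __name__ == "__main__":
    # Hypothetical group of four targets with one failing scrape.
    print(percent_down([1, 1, 1, 0]), target_down_matches([1, 1, 1, 0]))
```

Because the comparison is strict, a group of ten targets with exactly one down (10%) does not match.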

k8s.rules

48.374s ago

66.57ms

Rule State Error Last Evaluation Evaluation Time
record: namespace:container_cpu_usage_seconds_total:sum_rate expr: sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) ok 48.374s ago 2.82ms
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate expr: sum by(cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) ok 48.371s ago 5.577ms
record: node_namespace_pod_container:container_memory_working_set_bytes expr: container_memory_working_set_bytes{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 48.365s ago 11.37ms
record: node_namespace_pod_container:container_memory_rss expr: container_memory_rss{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 48.354s ago 11.32ms
record: node_namespace_pod_container:container_memory_cache expr: container_memory_cache{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 48.343s ago 11.04ms
record: node_namespace_pod_container:container_memory_swap expr: container_memory_swap{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 48.332s ago 10.54ms
record: namespace:container_memory_usage_bytes:sum expr: sum by(namespace) (container_memory_usage_bytes{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}) ok 48.321s ago 1.731ms
record: namespace:kube_pod_container_resource_requests_memory_bytes:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 48.32s ago 3.191ms
record: namespace:kube_pod_container_resource_requests_cpu_cores:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 48.316s ago 2.52ms
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)") * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (1, max by(replicaset, namespace, owner_name) (kube_replicaset_owner{job="kube-state-metrics"})), "workload", "$1", "owner_name", "(.*)")) labels: workload_type: deployment ok 48.314s ago 5.409ms
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: daemonset ok 48.309s ago 765.3us
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: statefulset ok 48.308s ago 255.7us

kube-apiserver-availability.rules

15.553s ago

7.569s

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + (sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le="5",scope="cluster",verb=~"LIST|GET"}[30d])))) + sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))) / sum(code:apiserver_request_total:increase30d) labels: verb: all ok 15.553s ago 3.083s
record: apiserver_request:availability30d expr: 1 - (sum(increase(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="read"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="read"}) labels: verb: read ok 12.47s ago 1.596s
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="write"}) labels: verb: write ok 10.874s ago 853.7ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="LIST"}[30d])) ok 10.02s ago 775.8ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="GET"}[30d])) ok 9.244s ago 298.4ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="POST"}[30d])) ok 8.946s ago 129.5ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PUT"}[30d])) ok 8.817s ago 112.7ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PATCH"}[30d])) ok 8.704s ago 135.4ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="DELETE"}[30d])) ok 8.569s ago 87.01ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="LIST"}[30d])) ok 8.482s ago 1.358ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="GET"}[30d])) ok 8.48s ago 1.122ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="POST"}[30d])) ok 8.479s ago 976us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PUT"}[30d])) ok 8.478s ago 598.7us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PATCH"}[30d])) ok 8.478s ago 538.5us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="DELETE"}[30d])) ok 8.477s ago 499.3us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="LIST"}[30d])) ok 8.477s ago 46.24ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="GET"}[30d])) ok 8.431s ago 90.18ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="POST"}[30d])) ok 8.34s ago 22.86ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PUT"}[30d])) ok 8.318s ago 50.71ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PATCH"}[30d])) ok 8.267s ago 30.93ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="DELETE"}[30d])) ok 8.236s ago 25ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="LIST"}[30d])) ok 8.211s ago 133.1ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="GET"}[30d])) ok 8.078s ago 46.56ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="POST"}[30d])) ok 8.032s ago 10.94ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PUT"}[30d])) ok 8.021s ago 20.51ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PATCH"}[30d])) ok 8s ago 13.32ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="DELETE"}[30d])) ok 7.987s ago 1.099ms
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"}) labels: verb: read ok 7.986s ago 171.1us
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) labels: verb: write ok 7.986s ago 197.6us
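
The availability30d expressions in this group all follow one pattern: count the requests that were too slow, add the requests that returned a 5xx code, divide by the total number of requests, and subtract the result from 1. A request counts as slow when it falls outside the latency bucket used in the expression (le="1" for writes; le="0.1", le="0.5" or le="5" for reads, depending on scope). The sketch below shows the arithmetic with made-up 30-day totals; the function names are hypothetical.

```python
import math


def slow_requests(total_count, within_threshold_counts):
    """Requests slower than their latency threshold.

    total_count:             increase of *_duration_seconds_count
    within_threshold_counts: increases of the chosen `le` buckets, one per scope
    """
    return total_count - sum(within_threshold_counts)


def availability(total_requests, slow_writes, slow_reads, errors_5xx):
    """1 - (slow requests + 5xx responses) / all requests."""
    if total_requests == 0:
        return math.nan  # PromQL would yield NaN for 0/0 as well
    return 1 - (slow_writes + slow_reads + errors_5xx) / total_requests


if __name__ == "__main__":
    # Hypothetical 30-day totals, for illustration only.
    reads_slow = slow_requests(800_000, [500_000, 250_000, 49_700])  # 300
    writes_slow = slow_requests(200_000, [199_500])                  # 500
    print(availability(1_000_000, writes_slow, reads_slow, 200))
```

The "read" and "write" variants are the same formula with one of the two slow-request terms dropped and the 5xx and total counts restricted to the matching verb label.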

kube-apiserver-slos

25.405s ago

1.151ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01) for: 2m labels: long: 1h severity: critical short: 5m annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 25.405s ago 462.5us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01) for: 15m labels: long: 6h severity: critical short: 30m annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 25.405s ago 288.5us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01) for: 1h labels: long: 1d severity: warning short: 2h annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 25.404s ago 194.6us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01) for: 3h labels: long: 3d severity: warning short: 6h annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 25.404s ago 191.7us
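
Each KubeAPIErrorBudgetBurn rule compares a burn rate with a factor multiplied by 0.01. Read as a multi-window burn-rate alert, 0.01 would be the error budget of a 99% objective and the factor says how many times faster than sustainable the budget is being spent. The sketch below works out how much of the budget is gone if the long window burns at exactly the threshold. It assumes a 30-day SLO period, which is suggested by the availability30d recording rules but not stated in this group; the names are hypothetical.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period
ERROR_BUDGET = 0.01         # the 0.01 factor used in all four expressions

# (burn-rate factor, long window in hours) as listed in the four rules above.
WINDOWS = [(14.4, 1), (6, 6), (3, 24), (1, 72)]


def threshold(factor):
    """Error ratio that the burn-rate series must exceed."""
    return factor * ERROR_BUDGET


def budget_consumed(factor, window_hours):
    """Fraction of the whole period's budget spent during one long window
    when burning at exactly `factor` times the sustainable rate."""
    return factor * window_hours / SLO_PERIOD_HOURS


if __name__ == "__main__":
    for factor, hours in WINDOWS:
        print(factor, hours, threshold(factor), budget_consumed(factor, hours))
```

Under these assumptions the two critical rules correspond to roughly 2% and 5% of the budget and the two warning rules to about 10% each.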

kube-apiserver.rules

25.605s ago

3.388s

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1d])) labels: verb: read ok 25.605s ago 499ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1h])) labels: verb: read ok 25.106s ago 316.1ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[2h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[2h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[2h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[2h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[2h])) labels: verb: read ok 24.79s ago 115.7ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])) labels: verb: read ok 24.674s ago 22.7ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[3d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[3d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[3d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[3d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[3d])) labels: verb: read ok 24.652s ago 993.6ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[5m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[5m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[5m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[5m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 23.658s ago 15.37ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[6h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[6h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[6h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h])) labels: verb: read ok 23.643s ago 116.7ms
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) labels: verb: write ok 23.526s ago 168.8ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) labels: verb: write ok 23.358s ago 13.57ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) labels: verb: write ok 23.344s ago 26.41ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) labels: verb: write ok 23.318s ago 10.39ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) labels: verb: write ok 23.307s ago 471.9ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 22.836s ago 7.086ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) labels: verb: write ok 22.829s ago 54.95ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 22.774s ago 5.525ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 22.768s ago 2.598ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET"}[5m]))) > 0 labels: quantile: "0.99" verb: read ok 22.766s ago 116.4ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0 labels: quantile: "0.99" verb: write ok 22.649s ago 59.42ms
record: cluster:apiserver_request_duration_seconds:mean5m expr: sum without(instance, pod) (rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) / sum without(instance, pod) (rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) ok 22.59s ago 7.998ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.99" ok 22.582s ago 120.1ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.9" ok 22.462s ago 123ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.5" ok 22.339s ago 120.2ms
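
Several rules in this group call histogram_quantile on summed bucket rates. The function estimates a quantile by finding the bucket that contains the requested rank and interpolating linearly inside it. The sketch below is a simplified re-implementation for classic cumulative buckets only, written to illustrate the idea. It is not the Prometheus source and skips edge cases such as quantiles outside (0, 1] and non-positive bucket bounds; the bucket values in the example are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
             ending with the +Inf bucket. Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls into the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical request-duration buckets (seconds, cumulative count).
    example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.7, example))
```

The interpolation is why the result depends on the bucket layout: a quantile can never be reported more precisely than the width of the bucket it lands in.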

kube-prometheus-general.rules

43.283s ago

2.606ms

Rule State Error Last Evaluation Evaluation Time
record: count:up1 expr: count without(instance, pod, node) (up == 1) ok 43.283s ago 1.765ms
record: count:up0 expr: count without(instance, pod, node) (up == 0) ok 43.282s ago 829.8us

kube-prometheus-node-recording.rules

10.692s ago

9.295ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_cpu:rate:sum expr: sum by(instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) ok 10.692s ago 1.529ms
record: instance:node_network_receive_bytes:rate:sum expr: sum by(instance) (rate(node_network_receive_bytes_total[3m])) ok 10.691s ago 1.405ms
record: instance:node_network_transmit_bytes:rate:sum expr: sum by(instance) (rate(node_network_transmit_bytes_total[3m])) ok 10.69s ago 1.252ms
record: instance:node_cpu:ratio expr: sum without(cpu, mode) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) / on(instance) group_left() count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total)) ok 10.688s ago 2.442ms
record: cluster:node_cpu:sum_rate5m expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) ok 10.686s ago 920.1us
record: cluster:node_cpu:ratio expr: cluster:node_cpu_seconds_total:rate5m / count(sum by(instance, cpu) (node_cpu_seconds_total)) ok 10.685s ago 1.704ms

kube-state-metrics

2.377s ago

498us

Rule State Error Last Evaluation Evaluation Time
alert: KubeStateMetricsListErrors expr: (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: description: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricslisterrors summary: kube-state-metrics is experiencing errors in list operations. ok 2.377s ago 328.3us
alert: KubeStateMetricsWatchErrors expr: (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: description: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricswatcherrors summary: kube-state-metrics is experiencing errors in watch operations. ok 2.377s ago 162.5us

kubelet.rules

11.561s ago

5.155ms

Rule State Error Last Evaluation Evaluation Time
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.99" ok 11.561s ago 1.993ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.9" ok 11.559s ago 1.393ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.5" ok 11.558s ago 1.755ms
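
The three recording rules above differ only in the quantile passed to histogram_quantile. As a rough illustration of what that function computes from cumulative `le` buckets, here is a simplified Python sketch. The function name, the bucket layout, and the sample values are assumptions made for illustration; this is not Prometheus's implementation, and it ignores details such as negative bucket bounds.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Assumes buckets are sorted by upper bound and end with (math.inf, total),
    mirroring the `le` label of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket (only reachable when rank is 0): nothing to interpolate.
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical relist-duration buckets, in seconds.
example = [(0.5, 50), (1.0, 90), (5.0, 99), (math.inf, 100)]
median = histogram_quantile(0.5, example)
```

Because the result is interpolated within a bucket, its precision depends on how finely the bucket bounds are spaced around the quantile of interest.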

kubernetes-apps

44.43s ago

17.38ms

Rule State Error Last Evaluation Evaluation Time
alert: KubePodCrashLooping expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0 for: 15m labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping summary: Pod is crash looping. ok 44.43s ago 1.816ms
alert: KubePodNotReady expr: sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0 for: 15m labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready summary: Pod has been in a non-ready state for more than 15 minutes. ok 44.428s ago 3.942ms
alert: KubeDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_deployment_metadata_generation{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch summary: Deployment generation mismatch due to possible roll-back ok 44.424s ago 802.4us
alert: KubeDeploymentReplicasMismatch expr: (kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~".*"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch summary: Deployment has not matched the expected number of replicas. ok 44.423s ago 1.256ms
alert: KubeStatefulSetReplicasMismatch expr: (kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch summary: StatefulSet has not matched the expected number of replicas. ok 44.422s ago 281.5us
alert: KubeStatefulSetGenerationMismatch expr: kube_statefulset_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_metadata_generation{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch summary: StatefulSet generation mismatch due to possible roll-back ok 44.422s ago 161.4us
alert: KubeStatefulSetUpdateNotRolledOut expr: (max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics",namespace=~".*"} unless kube_statefulset_status_update_revision{job="kube-state-metrics",namespace=~".*"}) * (kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout summary: StatefulSet update has not been rolled out. ok 44.422s ago 393.5us
alert: KubeDaemonSetRolloutStuck expr: ((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} != 0) or (kube_daemonset_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_available{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck summary: DaemonSet rollout is stuck. ok 44.422s ago 937.6us
alert: KubeContainerWaiting expr: sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*"}) > 0 for: 1h labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting summary: Pod container waiting longer than 1 hour ok 44.421s ago 6.214ms
alert: KubeDaemonSetNotScheduled expr: kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} > 0 for: 10m labels: severity: warning annotations: description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled summary: DaemonSet pods are not scheduled. ok 44.415s ago 194.2us
alert: KubeDaemonSetMisScheduled expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} > 0 for: 15m labels: severity: warning annotations: description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled summary: DaemonSet pods are misscheduled. ok 44.415s ago 80.17us
alert: KubeJobCompletion expr: kube_job_spec_completions{job="kube-state-metrics",namespace=~".*"} - kube_job_status_succeeded{job="kube-state-metrics",namespace=~".*"} > 0 for: 12h labels: severity: warning annotations: description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion summary: Job did not complete in time ok 44.415s ago 570.9us
alert: KubeJobFailed expr: kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0 for: 15m labels: severity: warning annotations: description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed summary: Job failed to complete. ok 44.414s ago 294.3us
alert: KubeHpaReplicasMismatch expr: (kube_hpa_status_desired_replicas{job="kube-state-metrics",namespace=~".*"} != kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} > kube_hpa_spec_min_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} < kube_hpa_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 for: 15m labels: severity: warning annotations: description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch summary: HPA has not matched desired number of replicas. ok 44.414s ago 314.2us
alert: KubeHpaMaxedOut expr: kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} == kube_hpa_spec_max_replicas{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout summary: HPA is running at max replicas ok 44.414s ago 102us
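
The KubePodCrashLooping expression in this group multiplies a per-second rate by 60 * 5 to express it as restarts per five minutes. The Python sketch below shows that arithmetic on hypothetical counter samples. The helper names are made up for illustration, and `simple_rate` is a deliberate simplification: it ignores the counter-reset handling and window extrapolation that PromQL's rate() performs.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified stand-in for PromQL rate(): no counter resets, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def restarts_per_5m(samples):
    """Scale the per-second rate to a five-minute window, as the alert expression does."""
    return simple_rate(samples) * 60 * 5

# Hypothetical restart counter sampled over five minutes: two restarts observed.
window = [(0, 3), (150, 4), (300, 5)]
value = restarts_per_5m(window)
would_match = value > 0  # the rule additionally requires this to hold for 15m
```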

kubernetes-resources

33.163s ago

4.227ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeCPUOvercommit expr: sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores) for: 5m labels: severity: info annotations: description: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit summary: Cluster has overcommitted CPU resource requests. ok 33.163s ago 729.5us
alert: KubeMemoryOvercommit expr: sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes) - 1) / count(kube_node_status_allocatable_memory_bytes) for: 5m labels: severity: warning annotations: description: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit summary: Cluster has overcommitted memory resource requests. ok 33.162s ago 630.6us
alert: KubeCPUQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 for: 5m labels: severity: warning annotations: description: Cluster has overcommitted CPU resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit summary: Cluster has overcommitted CPU resource requests. ok 33.161s ago 366.9us
alert: KubeMemoryQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="memory",type="hard"}) / sum(kube_node_status_allocatable_memory_bytes{job="kube-state-metrics"}) > 1.5 for: 5m labels: severity: warning annotations: description: Cluster has overcommitted memory resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryquotaovercommit summary: Cluster has overcommitted memory resource requests. ok 33.161s ago 305.4us
alert: KubeQuotaAlmostFull expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 0.9 < 1 for: 15m labels: severity: info annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaalmostfull summary: Namespace quota is going to be full. ok 33.161s ago 243.9us
alert: KubeQuotaFullyUsed expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) == 1 for: 15m labels: severity: info annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotafullyused summary: Namespace quota is fully used. ok 33.161s ago 212.2us
alert: KubeQuotaExceeded expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 1 for: 15m labels: severity: warning annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded summary: Namespace quota has exceeded the limits. ok 33.161s ago 228.9us
alert: CPUThrottlingHigh expr: sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100) for: 15m labels: severity: info annotations: description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh summary: Processes experience elevated CPU throttling. ok 33.161s ago 1.486ms
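
KubeCPUOvercommit and KubeMemoryOvercommit in this group compare the requested share of allocatable capacity against (count - 1) / count, where count is the number of nodes. One reading, assuming equally sized nodes, is that this fraction is the capacity that remains after losing a single node. The Python sketch below works through that comparison; the function names and the node sizes are hypothetical.

```python
def overcommit_threshold(node_count):
    """Fraction of capacity left if one of node_count equally sized nodes fails."""
    return (node_count - 1) / node_count

def is_overcommitted(requested, allocatable_per_node):
    """Mirror the alert comparison: requested / allocatable > (n - 1) / n."""
    total = sum(allocatable_per_node)
    return requested / total > overcommit_threshold(len(allocatable_per_node))

# Hypothetical cluster: four nodes with 8 allocatable CPU cores each.
nodes = [8, 8, 8, 8]
threshold = overcommit_threshold(len(nodes))
```

With these numbers the threshold is 0.75, so requests of 24 cores sit exactly on the boundary and do not match, while 25 cores do.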

kubernetes-storage

37.604s ago

1.096ms

Rule State Error Last Evaluation Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} < 0.03 for: 1m labels: severity: critical annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 37.604s ago 384.4us
alert: KubePersistentVolumeFillingUp expr: (kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 for: 1h labels: severity: warning annotations: description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 37.604s ago 519us
alert: KubePersistentVolumeErrors expr: kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0 for: 5m labels: severity: critical annotations: description: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors summary: PersistentVolume is having issues with provisioning. ok 37.603s ago 183.2us
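
The warning variant of KubePersistentVolumeFillingUp uses predict_linear over a 6h window to project available bytes 4 * 24 * 3600 seconds ahead. The sketch below illustrates the idea with an ordinary least-squares fit in plain Python. The function signature and the sample data are assumptions for illustration, not Prometheus's code.

```python
def predict_linear(samples, seconds_ahead, now):
    """Fit value = slope * t + intercept by least squares, then extrapolate.

    samples is a list of (timestamp_seconds, value) pairs with at least two
    distinct timestamps; now is the evaluation time in the same units.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (now + seconds_ahead) + intercept

# Hypothetical volume losing 10 units of free space per hour.
history = [(0, 100.0), (3600, 90.0), (7200, 80.0)]
projected = predict_linear(history, 4 * 24 * 3600, now=7200)
fills_up = projected < 0  # second half of the alert condition
```

The alert also requires less than 15% free space at evaluation time, so a steep trend on a mostly empty volume does not match on its own.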

kubernetes-system-apiserver

8.882s ago

1.725ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 labels: severity: warning annotations: description: A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration summary: Client certificate is about to expire. ok 8.882s ago 903.5us
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400 labels: severity: critical annotations: description: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration summary: Client certificate is about to expire. ok 8.881s ago 542.7us
alert: AggregatedAPIErrors expr: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2 labels: severity: warning annotations: description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors summary: An aggregated API has reported errors. ok 8.88s ago 153.2us
alert: KubeAPIDown expr: absent(up{job="apiserver"} == 1) for: 15m labels: severity: critical annotations: description: KubeAPI has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown summary: Target disappeared from Prometheus target discovery. ok 8.88s ago 111.7us

kubernetes-system-kubelet

17.516s ago

7.734ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0 for: 15m labels: severity: warning annotations: description: '{{ $labels.node }} has been unready for more than 15 minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready summary: Node is not ready. ok 17.516s ago 412.4us
alert: KubeNodeUnreachable expr: (kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring(key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1 for: 15m labels: severity: warning annotations: description: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable summary: Node is unreachable. ok 17.516s ago 386.9us
alert: KubeletTooManyPods expr: count by(node) ((kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance, pod, namespace, cluster) group_left(node) topk by(instance, pod, namespace, cluster) (1, kube_pod_info{job="kube-state-metrics"})) / max by(node) (kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) > 0.95 for: 15m labels: severity: warning annotations: description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods summary: Kubelet is running at capacity. ok 17.516s ago 4.122ms
alert: KubeNodeReadinessFlapping expr: sum by(node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2 for: 15m labels: severity: warning annotations: description: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping summary: Node readiness status is flapping. ok 17.512s ago 339.4us
alert: KubeletPlegDurationHigh expr: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 for: 5m labels: severity: warning annotations: description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh summary: Kubelet Pod Lifecycle Event Generator is taking too long to relist. ok 17.511s ago 157.7us
alert: KubeletPodStartUpLatencyHigh expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"} > 60 for: 15m labels: severity: warning annotations: description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh summary: Kubelet Pod startup latency is too high. ok 17.511s ago 1.635ms
alert: KubeletClientCertificateExpiration expr: kubelet_certificate_manager_client_ttl_seconds < 604800 labels: severity: warning annotations: description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration summary: Kubelet client certificate is about to expire. ok 17.51s ago 56.14us
alert: KubeletClientCertificateExpiration expr: kubelet_certificate_manager_client_ttl_seconds < 86400 labels: severity: critical annotations: description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration summary: Kubelet client certificate is about to expire. ok 17.51s ago 36.42us
alert: KubeletServerCertificateExpiration expr: kubelet_certificate_manager_server_ttl_seconds < 604800 labels: severity: warning annotations: description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration summary: Kubelet server certificate is about to expire. ok 17.51s ago 70.78us
alert: KubeletServerCertificateExpiration expr: kubelet_certificate_manager_server_ttl_seconds < 86400 labels: severity: critical annotations: description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration summary: Kubelet server certificate is about to expire. ok 17.51s ago 95.96us
alert: KubeletClientCertificateRenewalErrors expr: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0 for: 15m labels: severity: warning annotations: description: Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes). runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificaterenewalerrors summary: Kubelet has failed to renew its client certificate. ok 17.51s ago 99.97us
alert: KubeletServerCertificateRenewalErrors expr: increase(kubelet_server_expiration_renew_errors[5m]) > 0 for: 15m labels: severity: warning annotations: description: Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes). runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificaterenewalerrors summary: Kubelet has failed to renew its server certificate. ok 17.51s ago 63.44us
alert: KubeletDown expr: absent(up{job="kubelet",metrics_path="/metrics"} == 1) for: 15m labels: severity: critical annotations: description: Kubelet has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown summary: Target disappeared from Prometheus target discovery. ok 17.51s ago 234.2us
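
The Kubelet certificate rules in this group use the thresholds 604800 and 86400, which are seconds. The short Python sketch below converts them to days and hours and shows which severity a given TTL would map to. The constant and function names are hypothetical. Note that in Prometheus both the warning and the critical rule match when the TTL is below 86400; the helper reports only the highest severity.

```python
from datetime import timedelta

WARNING_TTL_SECONDS = 604800   # threshold used by the severity: warning rules
CRITICAL_TTL_SECONDS = 86400   # threshold used by the severity: critical rules

def highest_severity(ttl_seconds):
    """Return the most severe matching level for a certificate TTL, or None."""
    if ttl_seconds < CRITICAL_TTL_SECONDS:
        return "critical"
    if ttl_seconds < WARNING_TTL_SECONDS:
        return "warning"
    return None

warning_window = timedelta(seconds=WARNING_TTL_SECONDS)    # 7 days
critical_window = timedelta(seconds=CRITICAL_TTL_SECONDS)  # 24 hours
```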

kubernetes-system

4.077s ago

2.47ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeVersionMismatch expr: count(count by(gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "gitVersion", "$1", "gitVersion", "(v[0-9]*.[0-9]*).*"))) > 1 for: 15m labels: severity: warning annotations: description: There are {{ $value }} different semantic versions of Kubernetes components running. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch summary: Different semantic versions of Kubernetes components running. ok 4.077s ago 914.4us
alert: KubeClientErrors expr: (sum by(instance, job) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(instance, job) (rate(rest_client_requests_total[5m]))) > 0.01 for: 15m labels: severity: warning annotations: description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors summary: Kubernetes API server client is experiencing errors. ok 4.076s ago 1.541ms

node-exporter.rules

14.735s ago

9.537ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_num_cpu:sum expr: count without(cpu) (count without(mode) (node_cpu_seconds_total{job="node-exporter"})) ok 14.735s ago 1.894ms
record: instance:node_cpu_utilisation:rate1m expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])) ok 14.733s ago 507.5us
record: instance:node_load1_per_cpu:ratio expr: (node_load1{job="node-exporter"} / instance:node_num_cpu:sum{job="node-exporter"}) ok 14.733s ago 532.6us
record: instance:node_memory_utilisation:ratio expr: 1 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"}) ok 14.732s ago 584.7us
record: instance:node_vmstat_pgmajfault:rate1m expr: rate(node_vmstat_pgmajfault{job="node-exporter"}[1m]) ok 14.732s ago 276us
record: instance_device:node_disk_io_time_seconds:rate1m expr: rate(node_disk_io_time_seconds_total{device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 14.731s ago 555.3us
record: instance_device:node_disk_io_time_weighted_seconds:rate1m expr: rate(node_disk_io_time_weighted_seconds_total{device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 14.731s ago 400.6us
record: instance:node_network_receive_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 14.73s ago 1.217ms
record: instance:node_network_transmit_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 14.729s ago 1.257ms
record: instance:node_network_receive_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_drop_total{device!="lo",job="node-exporter"}[1m])) ok 14.728s ago 1.16ms
record: instance:node_network_transmit_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_drop_total{device!="lo",job="node-exporter"}[1m])) ok 14.727s ago 1.131ms

node-exporter

2.569s ago

40.64ms

Rule State Error Last Evaluation Evaluation Time
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 24 hours. ok 2.569s ago 7.542ms
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 4 hours. ok 2.562s ago 7.135ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 5% space left. ok 2.555s ago 1.612ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 3% space left. ok 2.553s ago 1.394ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 24 hours. ok 2.552s ago 7.184ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 4 hours. ok 2.545s ago 6.689ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 5% inodes left. ok 2.538s ago 1.327ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 3% inodes left. ok 2.537s ago 1.337ms
alert: NodeNetworkReceiveErrs expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworkreceiveerrs summary: Network interface is reporting many receive errors. ok 2.536s ago 2.639ms
alert: NodeNetworkTransmitErrs expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworktransmiterrs summary: Network interface is reporting many transmit errors. ok 2.533s ago 2.577ms
alert: NodeHighNumberConntrackEntriesUsed expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 labels: severity: warning annotations: description: '{{ $value | humanizePercentage }} of conntrack entries are used.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodehighnumberconntrackentriesused summary: The number of conntrack entries is getting close to the limit. ok 2.531s ago 281.7us
alert: NodeTextFileCollectorScrapeError expr: node_textfile_scrape_error{job="node-exporter"} == 1 labels: severity: warning annotations: description: Node Exporter text file collector failed to scrape. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodetextfilecollectorscrapeerror summary: Node Exporter text file collector failed to scrape. ok 2.53s ago 160.3us
alert: NodeClockSkewDetected expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0) for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is out of sync by more than 0.05s. Ensure NTP is configured correctly on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclockskewdetected summary: Clock skew detected. ok 2.53s ago 379.1us
alert: NodeClockNotSynchronising expr: min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16 for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclocknotsynchronising summary: Clock not synchronising. ok 2.53s ago 222.5us
alert: NodeRAIDDegraded expr: node_md_disks_required - ignoring(state) (node_md_disks{state="active"}) > 0 for: 15m labels: severity: critical annotations: description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddegraded summary: RAID Array is degraded ok 2.53s ago 89.44us
alert: NodeRAIDDiskFailure expr: node_md_disks{state="fail"} > 0 labels: severity: warning annotations: description: At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddiskfailure summary: Failed device in RAID array ok 2.53s ago 45.13us
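
The NodeFilesystemSpaceFillingUp and NodeFilesystemFilesFillingUp rules above combine a free-space percentage check with predict_linear over a 6h window. As a rough illustration of that logic, the following Python sketch fits a least-squares line to (timestamp, value) samples and extrapolates it a given number of seconds past the evaluation time. The function names, the sample layout, and the numbers in the usage example are assumptions made for illustration; this is not the Prometheus implementation.

```python
def predict_linear(samples, eval_time, horizon_seconds):
    """Extrapolate a least-squares line through (timestamp, value) samples.

    Timestamps are shifted so that x = 0 is the evaluation time, which makes
    the intercept the fitted value "now" (an assumption of this sketch).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to fit a line")
    xs = [t - eval_time for t, _ in samples]
    ys = [v for _, v in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept + slope * horizon_seconds


def filling_up(avail_percent, predicted_avail, readonly, percent_limit):
    """Mirror the three conditions joined with `and` in the rules above."""
    return avail_percent < percent_limit and predicted_avail < 0 and not readonly


# Hypothetical filesystem losing 0.01 units per second over a 6h window.
samples = [(t, 1000 - 0.01 * t) for t in range(0, 21601, 3600)]
in_4h = predict_linear(samples, eval_time=21600, horizon_seconds=4 * 60 * 60)
in_24h = predict_linear(samples, eval_time=21600, horizon_seconds=24 * 60 * 60)
warning = filling_up(avail_percent=35.0, predicted_avail=in_24h,
                     readonly=False, percent_limit=40)
critical = filling_up(avail_percent=35.0, predicted_avail=in_4h,
                      readonly=False, percent_limit=15)
```

With these made-up samples the 24h extrapolation falls below zero while the 4h one does not, so only the warning-level conditions hold. In the real rules the condition must also stay true for the `for: 1h` duration before the alert fires.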

node-network

7.647s ago

1.558ms

Rule State Error Last Evaluation Evaluation Time
alert: NodeNetworkInterfaceFlapping expr: changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2 for: 2m labels: severity: warning annotations: message: Network interface "{{ $labels.device }}" is changing its up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }} ok 7.647s ago 1.547ms

node.rules

9.345s ago

9.416ms

Rule State Error Last Evaluation Evaluation Time
record: :kube_pod_info_node_count: expr: sum(min by(cluster, node) (kube_pod_info{node!=""})) ok 9.345s ago 1.502ms
record: node_namespace_pod:kube_pod_info: expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)"))) ok 9.344s ago 2.764ms
record: node:node_num_cpu:sum expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)) ok 9.341s ago 4.402ms
record: :node_memory_MemAvailable_bytes:sum expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"})) ok 9.337s ago 732.3us

prometheus-operator

23.39s ago

1.843ms

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusOperatorListErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m]))) > 0.4 for: 15m labels: severity: warning annotations: description: Errors while performing List operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorlisterrors summary: Errors while performing list operations in controller. ok 23.39s ago 416us
alert: PrometheusOperatorWatchErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m]))) > 0.4 for: 15m labels: severity: warning annotations: description: Errors while performing watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorwatcherrors summary: Errors while performing watch operations in controller. ok 23.39s ago 413us
alert: PrometheusOperatorSyncFailed expr: min_over_time(prometheus_operator_syncs{job="prometheus-operator-kube-p-operator",namespace="default",status="failed"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Controller {{ $labels.controller }} in {{ $labels.namespace }} namespace fails to reconcile {{ $value }} objects. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorsyncfailed summary: Last controller reconciliation failed ok 23.39s ago 237.6us
alert: PrometheusOperatorReconcileErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]))) / (sum by(controller, namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]))) > 0.1 for: 10m labels: severity: warning annotations: description: '{{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorreconcileerrors summary: Errors while reconciling controller. ok 23.39s ago 320us
alert: PrometheusOperatorNodeLookupErrors expr: rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]) > 0.1 for: 10m labels: severity: warning annotations: description: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatornodelookuperrors summary: Errors while reconciling Prometheus. ok 23.389s ago 121.4us
alert: PrometheusOperatorNotReady expr: min by(namespace, controller) (max_over_time(prometheus_operator_ready{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]) == 0) for: 5m labels: severity: warning annotations: description: Prometheus operator in {{ $labels.namespace }} namespace isn't ready to reconcile {{ $labels.controller }} resources. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatornotready summary: Prometheus operator not ready ok 23.389s ago 164.4us
alert: PrometheusOperatorRejectedResources expr: min_over_time(prometheus_operator_managed_resources{job="prometheus-operator-kube-p-operator",namespace="default",state="rejected"}[5m]) > 0 for: 5m labels: severity: warning annotations: description: Prometheus operator in {{ $labels.namespace }} namespace rejected {{ printf "%0.0f" $value }} {{ $labels.controller }}/{{ $labels.resource }} resources. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorrejectedresources summary: Resources rejected by Prometheus operator ok 23.389s ago 155.8us

prometheus

2.224s ago

2.915ms

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusBadConfig expr: max_over_time(prometheus_config_last_reload_successful{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) == 0 for: 10m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration. summary: Failed Prometheus configuration reload. ok 2.224s ago 311.5us
alert: PrometheusNotificationQueueRunningFull expr: (predict_linear(prometheus_notifications_queue_length{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) for: 15m labels: severity: warning annotations: description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full. summary: Prometheus alert notification queue predicted to run full in less than 30m. ok 2.224s ago 258.5us
alert: PrometheusErrorSendingAlertsToSomeAlertmanagers expr: (rate(prometheus_notifications_errors_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) * 100 > 1 for: 15m labels: severity: warning annotations: description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.' summary: Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. ok 2.224s ago 168.5us
alert: PrometheusErrorSendingAlertsToAnyAlertmanager expr: min without(alertmanager) (rate(prometheus_notifications_errors_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) * 100 > 3 for: 15m labels: severity: critical annotations: description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.' summary: Prometheus encounters more than 3% errors sending alerts to any Alertmanager. ok 2.223s ago 155.9us
alert: PrometheusNotConnectedToAlertmanagers expr: max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) < 1 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers. summary: Prometheus is not connected to any Alertmanagers. ok 2.223s ago 64.4us
alert: PrometheusTSDBReloadsFailing expr: increase(prometheus_tsdb_reloads_failures_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h. summary: Prometheus has issues reloading blocks from disk. ok 2.223s ago 257.8us
alert: PrometheusTSDBCompactionsFailing expr: increase(prometheus_tsdb_compactions_failed_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h. summary: Prometheus has issues compacting blocks. ok 2.223s ago 195.1us
alert: PrometheusNotIngestingSamples expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) <= 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples. summary: Prometheus is not ingesting samples. ok 2.223s ago 120.8us
alert: PrometheusDuplicateTimestamps expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp. summary: Prometheus is dropping samples with duplicate timestamps. ok 2.223s ago 107.3us
alert: PrometheusOutOfOrderTimestamps expr: rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order. summary: Prometheus drops samples with out-of-order timestamps. ok 2.223s ago 113.7us
alert: PrometheusRemoteStorageFailures expr: (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]))) * 100 > 1 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }} summary: Prometheus fails to send samples to remote storage. ok 2.223s ago 272.2us
alert: PrometheusRemoteWriteBehind expr: (max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) - on(job, instance) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) > 120 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}. summary: Prometheus remote write is behind. ok 2.223s ago 182.1us
alert: PrometheusRemoteWriteDesiredShards expr: (max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name}}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-operator-kube-p-prometheus",namespace="default"}` $labels.instance | query | first | value }}. summary: Prometheus remote write desired shards calculation wants to run more than configured max shards. ok 2.223s ago 120.9us
alert: PrometheusRuleFailures expr: increase(prometheus_rule_evaluation_failures_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m. summary: Prometheus is failing rule evaluations. ok 2.223s ago 241.2us
alert: PrometheusMissingRuleEvaluations expr: increase(prometheus_rule_group_iterations_missed_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m. summary: Prometheus is missing rule evaluations due to slow rule group evaluation. ok 2.223s ago 241.7us
alert: PrometheusTargetLimitHit expr: increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has dropped {{ printf "%.0f" $value }} targets because the number of targets exceeded the configured target_limit. summary: Prometheus has dropped targets because some scrape configs have exceeded the targets limit. ok 2.222s ago 82.42us
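
The two Alertmanager error rules in this group differ only in aggregation: PrometheusErrorSendingAlertsToSomeAlertmanagers checks each Alertmanager separately against 1%, while PrometheusErrorSendingAlertsToAnyAlertmanager takes the minimum across Alertmanagers and compares it to 3%, so it only matches when even the healthiest Alertmanager is failing. A minimal Python sketch of that difference follows. The dictionary layout and function names are hypothetical, and skipping Alertmanagers with a zero send rate is a simplification of how PromQL treats division by zero.

```python
def error_percent(errors_per_second, sent_per_second):
    """Error rate as a percentage of the send rate."""
    return errors_per_second / sent_per_second * 100


def failing_alertmanagers(rates, limit_percent=1.0):
    """Names of the Alertmanagers whose own error percentage exceeds the limit."""
    return sorted(
        name
        for name, (errors, sent) in rates.items()
        if sent > 0 and error_percent(errors, sent) > limit_percent
    )


def all_alertmanagers_failing(rates, limit_percent=3.0):
    """True when the minimum error percentage across Alertmanagers exceeds the limit."""
    percents = [
        error_percent(errors, sent)
        for errors, sent in rates.values()
        if sent > 0
    ]
    return bool(percents) and min(percents) > limit_percent


# Hypothetical per-Alertmanager (errors/s, sent/s) rates.
one_bad = {"am-0": (0.5, 10.0), "am-1": (0.0, 10.0)}
both_bad = {"am-0": (0.5, 10.0), "am-1": (1.0, 10.0)}
```

In this sketch `one_bad` trips only the per-Alertmanager check for `am-0`, while `both_bad` also trips the minimum-based check, which corresponds to the critical severity above.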

kubernetes-storage

34.226s ago

595.9us

Rule State Error Last Evaluation Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} < 0.15 for: 1m labels: severity: critical annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 34.226s ago 582.4us

file-storage

38.354s ago

1.913ms

Rule State Error Last Evaluation Evaluation Time
alert: file-storage-speed expr: (((sum(increase(file_storage_download_seconds_sum{exception="none",namespace="doc-production"}[5m])) / sum(increase(file_storage_download_size{namespace="doc-production"}[5m])) * 1e+06) > bool 4) + ((sum(increase(file_storage_upload_seconds_sum{exception="none",namespace="doc-production"}[5m])) / sum(increase(file_storage_upload_size{namespace="doc-production"}[5m])) * 1e+06) > bool 4)) > 0 for: 5m labels: alertname: analytics-telegram severity: warning annotations: message: "\U0001F4E6 File-storage - The average sec/MB for downloads or uploads exceeds the norm!" ok 38.354s ago 1.2ms
alert: file-storage-error expr: (sum(rate(file_storage_upload_seconds_count{exception!="none",namespace="doc-production"}[5m]) or vector(0)) + sum(rate(file_storage_download_seconds_count{exception!="none",namespace="doc-production"}[5m]) or vector(0))) * 60 > 1 for: 5m labels: alertname: analytics-telegram severity: warning annotations: message: ⚠️ File-storage - the number of errors is above normal! ok 38.353s ago 699.4us
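
The file-storage-speed expression adds two `> bool 4` comparisons together. In PromQL a plain comparison filters: samples that fail the test are dropped. With the `bool` modifier, every sample is kept and its value becomes 0 or 1. An alerting rule is active for every sample its expression returns, whatever the value, so the outermost comparison has to be a filtering one for the threshold to matter. The Python sketch below models this for a single sample. The function names are hypothetical and `None` stands in for "no sample returned".

```python
def compare_gt(value, threshold, bool_modifier=False):
    """Model `value > threshold` for one sample.

    With bool_modifier the result is always a sample (1.0 or 0.0).
    Without it the sample is kept only if the comparison holds,
    and None represents a dropped sample.
    """
    if bool_modifier:
        return 1.0 if value > threshold else 0.0
    return value if value > threshold else None


def speed_alert_active(download_sec_per_mb, upload_sec_per_mb, limit=4.0):
    """Two bool comparisons summed, then a filtering comparison against 0."""
    slow_count = (
        compare_gt(download_sec_per_mb, limit, bool_modifier=True)
        + compare_gt(upload_sec_per_mb, limit, bool_modifier=True)
    )
    # The rule is active only if the outer comparison returns a sample.
    return compare_gt(slow_count, 0.0) is not None


slow_download = speed_alert_active(download_sec_per_mb=5.0, upload_sec_per_mb=1.0)
both_fast = speed_alert_active(download_sec_per_mb=1.0, upload_sec_per_mb=1.0)
```

Here `slow_download` is active because one of the two inner comparisons contributes 1, and `both_fast` is not, because the sum is 0 and the filtering comparison drops it.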