Rules

ngate-certificate-expiration-monitoring

36.661s ago

1.028ms

Rule State Error Last Evaluation Evaluation Time
alert: NgateCertificateExpiring expr: ng_infra_cert_expiry < 2.592e+06 labels: severity: warning annotations: description: Certificate is about to expire or expired summary: Certificate {{ $labels.name }} is about to expire or expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 36.662s ago 356.2us
alert: NgateKeyExpiring expr: ng_infra_pk_expiry < 2.592e+06 labels: severity: warning annotations: description: Key is about to expire or expired summary: Key {{ $labels.name }} is about to expire or expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 36.661s ago 273.6us
alert: NgateCertificateExpiring expr: ng_infra_cert_expiry < 1.2096e+06 labels: severity: critical annotations: description: Certificate expired in 14 days or less summary: Certificate {{ $labels.name }} is about to expire or expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 36.661s ago 231.8us
alert: NgateKeyExpiring expr: ng_infra_pk_expiry < 1.2096e+06 labels: severity: critical annotations: description: Key expired in 14 days or less summary: Key {{ $labels.name }} is about to expire or expired. Link https://grafana.basis.center/d/uU0S3zISz/ngate?orgId=1 ok 36.661s ago 152us

rabbitmq-messages

43.85s ago

22.51ms

Rule State Error Last Evaluation Evaluation Time
alert: NumberOfConsumers expr: sum(rabbitmq_queue_consumers{job="rabbitmq"}) < 600 for: 10m labels: severity: critical annotations: description: Общее количество консьюмеров {{ $value }}. summary: Это может означать что работают не все поды rabbitmq. ok 43.851s ago 701.6us
alert: RabbitMQTooManyMessagesInQueue expr: rabbitmq_queue_messages > 100 for: 5m labels: severity: critical annotations: description: There are {{ $value }} messages in some queues. summary: This means there is a service overloaded or it's down. ok 43.85s ago 371us
alert: RabbitMQ-SoManyMessagesInQueue-1m expr: sum(increase(cqrs_commands_queue_time_seconds_sum{namespace="doc-production"}[1m])) / sum(increase(cqrs_commands_queue_time_seconds_count{namespace="doc-production"}[1m])) > 0.5 for: 1m labels: severity: warning annotations: description: There are a lot of messages in some queues within 1 min. summary: This means there is a service overloaded. Link https://grafana.basis.center/d/hxivlZ1Gk/health?viewPanel=88&from=now-30m ok 43.85s ago 9.507ms
alert: RabbitMQ-SoManyMessagesInQueue-10m expr: sum(increase(cqrs_commands_queue_time_seconds_sum{namespace="doc-production"}[5m])) / sum(increase(cqrs_commands_queue_time_seconds_count{namespace="doc-production"}[5m])) > 0.5 for: 5m labels: severity: critical annotations: description: There are a lot of messages in some queues within 10 min. summary: This means there is a service overloaded or down. Link https://grafana.basis.center/d/hxivlZ1Gk/health?viewPanel=88&from=now-30m ok 43.841s ago 11.92ms

alertmanager.rules

36.591s ago

741.1us

Rule State Error Last Evaluation Evaluation Time
alert: AlertmanagerConfigInconsistent expr: count by(namespace, service) (count_values by(namespace, service) ("config_hash", alertmanager_config_hash{job="prometheus-operator-kube-p-alertmanager",namespace="default"})) != 1 for: 5m labels: severity: critical annotations: message: | The configuration of the instances of the Alertmanager cluster `{{ $labels.namespace }}/{{ $labels.service }}` are out of sync. {{ range printf "alertmanager_config_hash{namespace=\"%s\",service=\"%s\"}" $labels.namespace $labels.service | query }} Configuration hash for pod {{ .Labels.pod }} is "{{ printf "%.f" .Value }}" {{ end }} ok 36.592s ago 335.1us
alert: AlertmanagerFailedReload expr: alertmanager_config_last_reload_successful{job="prometheus-operator-kube-p-alertmanager",namespace="default"} == 0 for: 10m labels: severity: warning annotations: message: Reloading Alertmanager's configuration has failed for {{ $labels.namespace }}/{{ $labels.pod}}. ok 36.592s ago 172.4us
alert: AlertmanagerMembersInconsistent expr: alertmanager_cluster_members{job="prometheus-operator-kube-p-alertmanager",namespace="default"} != on(service) group_left() count by(service) (alertmanager_cluster_members{job="prometheus-operator-kube-p-alertmanager",namespace="default"}) for: 5m labels: severity: critical annotations: message: Alertmanager has not found all other members of the cluster. ok 36.593s ago 218.3us

general.rules

22.897s ago

3.415ms

Rule State Error Last Evaluation Evaluation Time
alert: TargetDown expr: 100 * (count by(job, namespace, service) (up == 0) / count by(job, namespace, service) (up)) > 10 for: 10m labels: severity: warning annotations: message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service }} targets in {{ $labels.namespace }} namespace are down.' ok 22.899s ago 3.217ms
alert: Watchdog expr: vector(1) labels: severity: none annotations: message: | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty. ok 22.895s ago 188.3us

k8s.rules

47.796s ago

69.98ms

Rule State Error Last Evaluation Evaluation Time
record: namespace:container_cpu_usage_seconds_total:sum_rate expr: sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) ok 47.796s ago 3.942ms
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate expr: sum by(cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""})) ok 47.792s ago 6.628ms
record: node_namespace_pod_container:container_memory_working_set_bytes expr: container_memory_working_set_bytes{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 47.786s ago 11.37ms
record: node_namespace_pod_container:container_memory_rss expr: container_memory_rss{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 47.775s ago 10.91ms
record: node_namespace_pod_container:container_memory_cache expr: container_memory_cache{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 47.764s ago 10.46ms
record: node_namespace_pod_container:container_memory_swap expr: container_memory_swap{image!="",job="kubelet",metrics_path="/metrics/cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""})) ok 47.754s ago 10.38ms
record: namespace:container_memory_usage_bytes:sum expr: sum by(namespace) (container_memory_usage_bytes{container!="POD",image!="",job="kubelet",metrics_path="/metrics/cadvisor"}) ok 47.743s ago 2.489ms
record: namespace:kube_pod_container_resource_requests_memory_bytes:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 47.741s ago 3.863ms
record: namespace:kube_pod_container_resource_requests_cpu_cores:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 47.738s ago 3.052ms
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)") * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (1, max by(replicaset, namespace, owner_name) (kube_replicaset_owner{job="kube-state-metrics"})), "workload", "$1", "owner_name", "(.*)")) labels: workload_type: deployment ok 47.735s ago 5.551ms
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: daemonset ok 47.729s ago 790.4us
record: namespace_workload_pod:kube_pod_owner:relabel expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: statefulset ok 47.729s ago 511.3us

kube-apiserver-availability.rules

1m14.977s ago

6.375s

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + (sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le="5",scope="cluster",verb=~"LIST|GET"}[30d])))) + sum(code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))) / sum(code:apiserver_request_total:increase30d) labels: verb: all ok 1m14.989s ago 2.062s
record: apiserver_request:availability30d expr: 1 - (sum(increase(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30d])) - ((sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30d])) or vector(0)) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="read"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="read"}) labels: verb: read ok 1m12.927s ago 1.469s
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="write"} or vector(0))) / sum(code:apiserver_request_total:increase30d{verb="write"}) labels: verb: write ok 1m11.458s ago 1.188s
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="LIST"}[30d])) ok 1m10.27s ago 624.4ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="GET"}[30d])) ok 1m9.646s ago 257.3ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="POST"}[30d])) ok 1m9.389s ago 105.3ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PUT"}[30d])) ok 1m9.283s ago 94.13ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="PATCH"}[30d])) ok 1m9.189s ago 99.12ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"2..",job="apiserver",verb="DELETE"}[30d])) ok 1m9.09s ago 67.3ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="LIST"}[30d])) ok 1m9.023s ago 1.14ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="GET"}[30d])) ok 1m9.022s ago 963.8us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="POST"}[30d])) ok 1m9.021s ago 877.1us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PUT"}[30d])) ok 1m9.02s ago 773.9us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="PATCH"}[30d])) ok 1m9.019s ago 542.5us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"3..",job="apiserver",verb="DELETE"}[30d])) ok 1m9.019s ago 518.7us
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="LIST"}[30d])) ok 1m9.019s ago 45.1ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="GET"}[30d])) ok 1m8.973s ago 66.9ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="POST"}[30d])) ok 1m8.907s ago 21.21ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PUT"}[30d])) ok 1m8.885s ago 31.72ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="PATCH"}[30d])) ok 1m8.854s ago 17.06ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"4..",job="apiserver",verb="DELETE"}[30d])) ok 1m8.837s ago 20.7ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="LIST"}[30d])) ok 1m8.816s ago 127.5ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="GET"}[30d])) ok 1m8.689s ago 43.37ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="POST"}[30d])) ok 1m8.645s ago 8.873ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PUT"}[30d])) ok 1m8.637s ago 9.65ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="PATCH"}[30d])) ok 1m8.627s ago 9.752ms
record: code_verb:apiserver_request_total:increase30d expr: sum by(code, verb) (increase(apiserver_request_total{code=~"5..",job="apiserver",verb="DELETE"}[30d])) ok 1m8.617s ago 786.7us
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"}) labels: verb: read ok 1m8.617s ago 165.9us
record: code:apiserver_request_total:increase30d expr: sum by(code) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"}) labels: verb: write ok 1m8.617s ago 276.3us

kube-apiserver-slos

24.842s ago

1.541ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01) for: 2m labels: long: 1h severity: critical short: 5m annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 24.843s ago 608.6us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01) for: 15m labels: long: 6h severity: critical short: 30m annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 24.843s ago 430.2us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01) for: 1h labels: long: 1d severity: warning short: 2h annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 24.843s ago 279.3us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01) for: 3h labels: long: 3d severity: warning short: 6h annotations: description: The API server is burning too much error budget. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn summary: The API server is burning too much error budget. ok 24.844s ago 205.4us

kube-apiserver.rules

25.044s ago

3.005s

Rule State Error Last Evaluation Evaluation Time
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1d])) labels: verb: read ok 25.078s ago 301ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[1h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[1h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[1h])) labels: verb: read ok 24.777s ago 22.45ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[2h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[2h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[2h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[2h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[2h])) labels: verb: read ok 24.769s ago 34.95ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[30m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[30m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[30m])) labels: verb: read ok 24.734s ago 20.07ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[3d])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[3d])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[3d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[3d])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[3d])) labels: verb: read ok 24.715s ago 824.8ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[5m])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[5m])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[5m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[5m])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 23.89s ago 16.93ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"LIST|GET"}[6h])) - ((sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.1",scope=~"resource|",verb=~"LIST|GET"}[6h])) or vector(0)) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[6h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[6h])))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"LIST|GET"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[6h])) labels: verb: read ok 23.873s ago 91.51ms
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) labels: verb: write ok 23.782s ago 133.1ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) labels: verb: write ok 23.649s ago 10.24ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) labels: verb: write ok 23.639s ago 12.94ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) labels: verb: write ok 23.626s ago 7.508ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) labels: verb: write ok 23.618s ago 359.1ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 23.259s ago 7.682ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) + sum(rate(apiserver_request_total{code=~"5..",job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) / sum(rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) labels: verb: write ok 23.252s ago 41.39ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 23.212s ago 6.805ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 23.205s ago 3.05ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"LIST|GET"}[5m]))) > 0 labels: quantile: "0.99" verb: read ok 23.202s ago 141.9ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0 labels: quantile: "0.99" verb: write ok 23.061s ago 66.56ms
record: cluster:apiserver_request_duration_seconds:mean5m expr: sum without(instance, pod) (rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) / sum without(instance, pod) (rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) ok 23.002s ago 8.325ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.99" ok 22.994s ago 129.5ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.9" ok 22.865s ago 130ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.5" ok 22.735s ago 635.1ms

kube-prometheus-general.rules

42.782s ago

3.991ms

Rule State Error Last Evaluation Evaluation Time
record: count:up1 expr: count without(instance, pod, node) (up == 1) ok 42.782s ago 2.416ms
record: count:up0 expr: count without(instance, pod, node) (up == 0) ok 42.779s ago 1.562ms

kube-prometheus-node-recording.rules

10.191s ago

11.28ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_cpu:rate:sum expr: sum by(instance) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m])) ok 10.191s ago 1.335ms
record: instance:node_network_receive_bytes:rate:sum expr: sum by(instance) (rate(node_network_receive_bytes_total[3m])) ok 10.19s ago 1.067ms
record: instance:node_network_transmit_bytes:rate:sum expr: sum by(instance) (rate(node_network_transmit_bytes_total[3m])) ok 10.189s ago 1.051ms
record: instance:node_cpu:ratio expr: sum without(cpu, mode) (rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) / on(instance) group_left() count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total)) ok 10.188s ago 4.112ms
record: cluster:node_cpu:sum_rate5m expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m])) ok 10.184s ago 1.517ms
record: cluster:node_cpu:ratio expr: cluster:node_cpu_seconds_total:rate5m / count(sum by(instance, cpu) (node_cpu_seconds_total)) ok 10.182s ago 2.184ms

kube-state-metrics

1.875s ago

494.9us

Rule State Error Last Evaluation Evaluation Time
alert: KubeStateMetricsListErrors expr: (sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: description: kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricslisterrors summary: kube-state-metrics is experiencing errors in list operations. ok 1.875s ago 333.6us
alert: KubeStateMetricsWatchErrors expr: (sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) / sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m]))) > 0.01 for: 15m labels: severity: critical annotations: description: kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatemetricswatcherrors summary: kube-state-metrics is experiencing errors in watch operations. ok 1.875s ago 151.8us

kubelet.rules

11.06s ago

6.731ms

Rule State Error Last Evaluation Evaluation Time
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.99" ok 11.06s ago 2.167ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.9" ok 11.058s ago 2.388ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"}) labels: quantile: "0.5" ok 11.055s ago 2.163ms

kubernetes-apps

43.928s ago

25.14ms

Rule State Error Last Evaluation Evaluation Time
alert: KubePodCrashLooping expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace=~".*"}[5m]) * 60 * 5 > 0 for: 15m labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping summary: Pod is crash looping. ok 43.929s ago 2.554ms
alert: KubePodNotReady expr: sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace=~".*",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) topk by(namespace, pod) (1, max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"}))) > 0 for: 15m labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready summary: Pod has been in a non-ready state for more than 15 minutes. ok 43.926s ago 5.38ms
alert: KubeDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_deployment_metadata_generation{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch summary: Deployment generation mismatch due to possible roll-back ok 43.921s ago 1.146ms
alert: KubeDeploymentReplicasMismatch expr: (kube_deployment_spec_replicas{job="kube-state-metrics",namespace=~".*"} != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch summary: Deployment has not matched the expected number of replicas. ok 43.92s ago 1.885ms
alert: KubeStatefulSetReplicasMismatch expr: (kube_statefulset_status_replicas_ready{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas{job="kube-state-metrics",namespace=~".*"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch summary: Deployment has not matched the expected number of replicas. ok 43.918s ago 557.5us
alert: KubeStatefulSetGenerationMismatch expr: kube_statefulset_status_observed_generation{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_metadata_generation{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch summary: StatefulSet generation mismatch due to possible roll-back ok 43.918s ago 265.5us
alert: KubeStatefulSetUpdateNotRolledOut expr: (max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics",namespace=~".*"} unless kube_statefulset_status_update_revision{job="kube-state-metrics",namespace=~".*"}) * (kube_statefulset_replicas{job="kube-state-metrics",namespace=~".*"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout summary: StatefulSet update has not been rolled out. ok 43.918s ago 695.1us
alert: KubeDaemonSetRolloutStuck expr: ((kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} != 0) or (kube_daemonset_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"}) or (kube_daemonset_status_number_available{job="kube-state-metrics",namespace=~".*"} != kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"})) and (changes(kube_daemonset_updated_number_scheduled{job="kube-state-metrics",namespace=~".*"}[5m]) == 0) for: 15m labels: severity: warning annotations: description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has not finished or progressed for at least 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck summary: DaemonSet rollout is stuck. ok 43.917s ago 1.15ms
alert: KubeContainerWaiting expr: sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",namespace=~".*"}) > 0 for: 1h labels: severity: warning annotations: description: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting summary: Pod container waiting longer than 1 hour ok 43.916s ago 8.94ms
alert: KubeDaemonSetNotScheduled expr: kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics",namespace=~".*"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics",namespace=~".*"} > 0 for: 10m labels: severity: warning annotations: description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled summary: DaemonSet pods are not scheduled. ok 43.907s ago 381.1us
alert: KubeDaemonSetMisScheduled expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics",namespace=~".*"} > 0 for: 15m labels: severity: warning annotations: description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled summary: DaemonSet pods are misscheduled. ok 43.907s ago 192.4us
alert: KubeJobCompletion expr: kube_job_spec_completions{job="kube-state-metrics",namespace=~".*"} - kube_job_status_succeeded{job="kube-state-metrics",namespace=~".*"} > 0 for: 12h labels: severity: warning annotations: description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than 12 hours to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion summary: Job did not complete in time ok 43.907s ago 837.9us
alert: KubeJobFailed expr: kube_job_failed{job="kube-state-metrics",namespace=~".*"} > 0 for: 15m labels: severity: warning annotations: description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. Removing failed job after investigation should clear this alert. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed summary: Job failed to complete. ok 43.906s ago 537.8us
alert: KubeHpaReplicasMismatch expr: (kube_hpa_status_desired_replicas{job="kube-state-metrics",namespace=~".*"} != kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} > kube_hpa_spec_min_replicas{job="kube-state-metrics",namespace=~".*"}) and (kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} < kube_hpa_spec_max_replicas{job="kube-state-metrics",namespace=~".*"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 for: 15m labels: severity: warning annotations: description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch summary: HPA has not matched descired number of replicas. ok 43.906s ago 425.6us
alert: KubeHpaMaxedOut expr: kube_hpa_status_current_replicas{job="kube-state-metrics",namespace=~".*"} == kube_hpa_spec_max_replicas{job="kube-state-metrics",namespace=~".*"} for: 15m labels: severity: warning annotations: description: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout summary: HPA is running at max replicas ok 43.906s ago 161.3us

kubernetes-resources

32.662s ago

4.165ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeCPUOvercommit expr: sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores) for: 5m labels: severity: info annotations: description: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit summary: Cluster has overcommitted CPU resource requests. ok 32.662s ago 713.5us
alert: KubeMemoryOvercommit expr: sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes) - 1) / count(kube_node_status_allocatable_memory_bytes) for: 5m labels: severity: warning annotations: description: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit summary: Cluster has overcommitted memory resource requests. ok 32.661s ago 462.5us
alert: KubeCPUQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 for: 5m labels: severity: warning annotations: description: Cluster has overcommitted CPU resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit summary: Cluster has overcommitted CPU resource requests. ok 32.661s ago 240.6us
alert: KubeMemoryQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="memory",type="hard"}) / sum(kube_node_status_allocatable_memory_bytes{job="kube-state-metrics"}) > 1.5 for: 5m labels: severity: warning annotations: description: Cluster has overcommitted memory resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryquotaovercommit summary: Cluster has overcommitted memory resource requests. ok 32.661s ago 233.7us
alert: KubeQuotaAlmostFull expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 0.9 < 1 for: 15m labels: severity: info annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaalmostfull summary: Namespace quota is going to be full. ok 32.661s ago 121.6us
alert: KubeQuotaFullyUsed expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) == 1 for: 15m labels: severity: info annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotafullyused summary: Namespace quota is fully used. ok 32.661s ago 91.22us
alert: KubeQuotaExceeded expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 1 for: 15m labels: severity: warning annotations: description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded summary: Namespace quota has exceeded the limits. ok 32.661s ago 122.8us
alert: CPUThrottlingHigh expr: sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100) for: 15m labels: severity: info annotations: description: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh summary: Processes experience elevated CPU throttling. ok 32.661s ago 2.165ms

kubernetes-storage

37.104s ago

1.1ms

Rule State Error Last Evaluation Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} < 0.03 for: 1m labels: severity: critical annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 37.104s ago 346.5us
alert: KubePersistentVolumeFillingUp expr: (kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}) < 0.15 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"}[6h], 4 * 24 * 3600) < 0 for: 1h labels: severity: warning annotations: description: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 37.104s ago 533.8us
alert: KubePersistentVolumeErrors expr: kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0 for: 5m labels: severity: critical annotations: description: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors summary: PersistentVolume is having issues with provisioning. ok 37.103s ago 211us

kubernetes-system-apiserver

8.382s ago

2.3ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 604800 labels: severity: warning annotations: description: A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration summary: Client certificate is about to expire. ok 8.382s ago 1.226ms
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="apiserver"}[5m]))) < 86400 labels: severity: critical annotations: description: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration summary: Client certificate is about to expire. ok 8.381s ago 719.1us
alert: AggregatedAPIErrors expr: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2 labels: severity: warning annotations: description: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors have increased for it in the past five minutes. High values indicate that the availability of the service changes too often. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors summary: An aggregated API has reported errors. ok 8.38s ago 199.7us
alert: KubeAPIDown expr: absent(up{job="apiserver"} == 1) for: 15m labels: severity: critical annotations: description: KubeAPI has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown summary: Target disappeared from Prometheus target discovery. ok 8.38s ago 143.9us

kubernetes-system-kubelet

17.017s ago

8.248ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0 for: 15m labels: severity: warning annotations: description: '{{ $labels.node }} has been unready for more than 15 minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready summary: Node is not ready. ok 17.017s ago 448.2us
alert: KubeNodeUnreachable expr: (kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} unless ignoring(key, value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"}) == 1 for: 15m labels: severity: warning annotations: description: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable summary: Node is unreachable. ok 17.016s ago 265.2us
alert: KubeletTooManyPods expr: count by(node) ((kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on(instance, pod, namespace, cluster) group_left(node) topk by(instance, pod, namespace, cluster) (1, kube_pod_info{job="kube-state-metrics"})) / max by(node) (kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) > 0.95 for: 15m labels: severity: warning annotations: description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods summary: Kubelet is running at capacity. ok 17.016s ago 4.377ms
alert: KubeNodeReadinessFlapping expr: sum by(node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2 for: 15m labels: severity: warning annotations: description: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping summary: Node readiness status is flapping. ok 17.012s ago 226.8us
alert: KubeletPlegDurationHigh expr: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 for: 5m labels: severity: warning annotations: description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh summary: Kubelet Pod Lifecycle Event Generator is taking too long to relist. ok 17.012s ago 139.9us
alert: KubeletPodStartUpLatencyHigh expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",metrics_path="/metrics"}[5m]))) * on(instance) group_left(node) kubelet_node_name{job="kubelet",metrics_path="/metrics"} > 60 for: 15m labels: severity: warning annotations: description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh summary: Kubelet Pod startup latency is too high. ok 17.012s ago 2.157ms
alert: KubeletClientCertificateExpiration expr: kubelet_certificate_manager_client_ttl_seconds < 604800 labels: severity: warning annotations: description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration summary: Kubelet client certificate is about to expire. ok 17.01s ago 54.05us
alert: KubeletClientCertificateExpiration expr: kubelet_certificate_manager_client_ttl_seconds < 86400 labels: severity: critical annotations: description: Client certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificateexpiration summary: Kubelet client certificate is about to expire. ok 17.01s ago 36.22us
alert: KubeletServerCertificateExpiration expr: kubelet_certificate_manager_server_ttl_seconds < 604800 labels: severity: warning annotations: description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration summary: Kubelet server certificate is about to expire. ok 17.01s ago 33.25us
alert: KubeletServerCertificateExpiration expr: kubelet_certificate_manager_server_ttl_seconds < 86400 labels: severity: critical annotations: description: Server certificate for Kubelet on node {{ $labels.node }} expires in {{ $value | humanizeDuration }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificateexpiration summary: Kubelet server certificate is about to expire. ok 17.01s ago 54.17us
alert: KubeletClientCertificateRenewalErrors expr: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0 for: 15m labels: severity: warning annotations: description: Kubelet on node {{ $labels.node }} has failed to renew its client certificate ({{ $value | humanize }} errors in the last 5 minutes). runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletclientcertificaterenewalerrors summary: Kubelet has failed to renew its client certificate. ok 17.01s ago 96.55us
alert: KubeletServerCertificateRenewalErrors expr: increase(kubelet_server_expiration_renew_errors[5m]) > 0 for: 15m labels: severity: warning annotations: description: Kubelet on node {{ $labels.node }} has failed to renew its server certificate ({{ $value | humanize }} errors in the last 5 minutes). runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletservercertificaterenewalerrors summary: Kubelet has failed to renew its server certificate. ok 17.011s ago 78.59us
alert: KubeletDown expr: absent(up{job="kubelet",metrics_path="/metrics"} == 1) for: 15m labels: severity: critical annotations: description: Kubelet has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown summary: Target disappeared from Prometheus target discovery. ok 17.011s ago 263.2us

kubernetes-system

3.579s ago

3.35ms

Rule State Error Last Evaluation Evaluation Time
alert: KubeVersionMismatch expr: count(count by(gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "gitVersion", "$1", "gitVersion", "(v[0-9]*.[0-9]*).*"))) > 1 for: 15m labels: severity: warning annotations: description: There are {{ $value }} different semantic versions of Kubernetes components running. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch summary: Different semantic versions of Kubernetes components running. ok 3.579s ago 840.6us
alert: KubeClientErrors expr: (sum by(instance, job) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(instance, job) (rate(rest_client_requests_total[5m]))) > 0.01 for: 15m labels: severity: warning annotations: description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors summary: Kubernetes API server client is experiencing errors. ok 3.579s ago 2.5ms

node-exporter.rules

14.249s ago

10.53ms

Rule State Error Last Evaluation Evaluation Time
record: instance:node_num_cpu:sum expr: count without(cpu) (count without(mode) (node_cpu_seconds_total{job="node-exporter"})) ok 14.25s ago 2.953ms
record: instance:node_cpu_utilisation:rate1m expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])) ok 14.247s ago 474.4us
record: instance:node_load1_per_cpu:ratio expr: (node_load1{job="node-exporter"} / instance:node_num_cpu:sum{job="node-exporter"}) ok 14.246s ago 419.8us
record: instance:node_memory_utilisation:ratio expr: 1 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"}) ok 14.246s ago 484.6us
record: instance:node_vmstat_pgmajfault:rate1m expr: rate(node_vmstat_pgmajfault{job="node-exporter"}[1m]) ok 14.246s ago 232us
record: instance_device:node_disk_io_time_seconds:rate1m expr: rate(node_disk_io_time_seconds_total{device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 14.246s ago 490.3us
record: instance_device:node_disk_io_time_weighted_seconds:rate1m expr: rate(node_disk_io_time_weighted_seconds_total{device=~"mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|dasd.+",job="node-exporter"}[1m]) ok 14.246s ago 466.9us
record: instance:node_network_receive_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 14.245s ago 1.152ms
record: instance:node_network_transmit_bytes_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_bytes_total{device!="lo",job="node-exporter"}[1m])) ok 14.244s ago 1.062ms
record: instance:node_network_receive_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_receive_drop_total{device!="lo",job="node-exporter"}[1m])) ok 14.243s ago 1.445ms
record: instance:node_network_transmit_drop_excluding_lo:rate1m expr: sum without(device) (rate(node_network_transmit_drop_total{device!="lo",job="node-exporter"}[1m])) ok 14.242s ago 1.323ms

node-exporter

2.078s ago

45.45ms

Rule State Error Last Evaluation Evaluation Time
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 24 hours. ok 2.078s ago 8.417ms
alert: NodeFilesystemSpaceFillingUp expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 15 and predict_linear(node_filesystem_avail_bytes{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup summary: Filesystem is predicted to run out of space within the next 4 hours. ok 2.07s ago 7.967ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 5% space left. ok 2.062s ago 1.873ms
alert: NodeFilesystemAlmostOutOfSpace expr: (node_filesystem_avail_bytes{fstype!="",job="node-exporter"} / node_filesystem_size_bytes{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutofspace summary: Filesystem has less than 3% space left. ok 2.06s ago 1.789ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 40 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 24 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 24 hours. ok 2.059s ago 7.415ms
alert: NodeFilesystemFilesFillingUp expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 20 and predict_linear(node_filesystem_files_free{fstype!="",job="node-exporter"}[6h], 4 * 60 * 60) < 0 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left and is filling up fast. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemfilesfillingup summary: Filesystem is predicted to run out of inodes within the next 4 hours. ok 2.052s ago 7.854ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 5 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: warning annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 5% inodes left. ok 2.044s ago 1.798ms
alert: NodeFilesystemAlmostOutOfFiles expr: (node_filesystem_files_free{fstype!="",job="node-exporter"} / node_filesystem_files{fstype!="",job="node-exporter"} * 100 < 3 and node_filesystem_readonly{fstype!="",job="node-exporter"} == 0) for: 1h labels: severity: critical annotations: description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemalmostoutoffiles summary: Filesystem has less than 3% inodes left. ok 2.042s ago 1.783ms
alert: NodeNetworkReceiveErrs expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} receive errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworkreceiveerrs summary: Network interface is reporting many receive errors. ok 2.04s ago 2.604ms
alert: NodeNetworkTransmitErrs expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01 for: 1h labels: severity: warning annotations: description: '{{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf "%.0f" $value }} transmit errors in the last two minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodenetworktransmiterrs summary: Network interface is reporting many transmit errors. ok 2.038s ago 2.514ms
alert: NodeHighNumberConntrackEntriesUsed expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit) > 0.75 labels: severity: warning annotations: description: '{{ $value | humanizePercentage }} of conntrack entries are used.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodehighnumberconntrackentriesused summary: Number of conntrack are getting close to the limit. ok 2.036s ago 290us
alert: NodeTextFileCollectorScrapeError expr: node_textfile_scrape_error{job="node-exporter"} == 1 labels: severity: warning annotations: description: Node Exporter text file collector failed to scrape. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodetextfilecollectorscrapeerror summary: Node Exporter text file collector failed to scrape. ok 2.035s ago 146.8us
alert: NodeClockSkewDetected expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0) for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is out of sync by more than 300s. Ensure NTP is configured correctly on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclockskewdetected summary: Clock skew detected. ok 2.036s ago 518.5us
alert: NodeClockNotSynchronising expr: min_over_time(node_timex_sync_status[5m]) == 0 and node_timex_maxerror_seconds >= 16 for: 10m labels: severity: warning annotations: message: Clock on {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodeclocknotsynchronising summary: Clock not synchronising. ok 2.036s ago 280.3us
alert: NodeRAIDDegraded expr: node_md_disks_required - ignoring(state) (node_md_disks{state="active"}) > 0 for: 15m labels: severity: critical annotations: description: RAID array '{{ $labels.device }}' on {{ $labels.instance }} is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddegraded summary: RAID Array is degraded ok 2.036s ago 104.7us
alert: NodeRAIDDiskFailure expr: node_md_disks{state="fail"} > 0 labels: severity: warning annotations: description: At least one device in RAID array on {{ $labels.instance }} failed. Array '{{ $labels.device }}' needs attention and possibly a disk swap. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-noderaiddiskfailure summary: Failed device in RAID array ok 2.088s ago 73.74us

node-network

7.21s ago

1.348ms

Rule State Error Last Evaluation Evaluation Time
alert: NodeNetworkInterfaceFlapping expr: changes(node_network_up{device!~"veth.+",job="node-exporter"}[2m]) > 2 for: 2m labels: severity: warning annotations: message: Network interface "{{ $labels.device }}" changing it's up status often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}" ok 7.21s ago 1.338ms

node.rules

8.908s ago

10.97ms

Rule State Error Last Evaluation Evaluation Time
record: :kube_pod_info_node_count: expr: sum(min by(cluster, node) (kube_pod_info{node!=""})) ok 8.909s ago 1.959ms
record: node_namespace_pod:kube_pod_info: expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)"))) ok 8.907s ago 2.948ms
record: node:node_num_cpu:sum expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)) ok 8.904s ago 4.828ms
record: :node_memory_MemAvailable_bytes:sum expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"})) ok 8.899s ago 1.219ms

prometheus-operator

22.954s ago

1.71ms

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusOperatorListErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_list_operations_failed_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_list_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m]))) > 0.4 for: 15m labels: severity: warning annotations: description: Errors while performing List operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorlisterrors summary: Errors while performing list operations in controller. ok 22.954s ago 511.8us
alert: PrometheusOperatorWatchErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m])) / sum by(controller, namespace) (rate(prometheus_operator_watch_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[10m]))) > 0.4 for: 15m labels: severity: warning annotations: description: Errors while performing watch operations in controller {{$labels.controller}} in {{$labels.namespace}} namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorwatcherrors summary: Errors while performing watch operations in controller. ok 22.954s ago 261.2us
alert: PrometheusOperatorSyncFailed expr: min_over_time(prometheus_operator_syncs{job="prometheus-operator-kube-p-operator",namespace="default",status="failed"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Controller {{ $labels.controller }} in {{ $labels.namespace }} namespace fails to reconcile {{ $value }} objects. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorsyncfailed summary: Last controller reconciliation failed ok 22.954s ago 135us
alert: PrometheusOperatorReconcileErrors expr: (sum by(controller, namespace) (rate(prometheus_operator_reconcile_errors_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]))) / (sum by(controller, namespace) (rate(prometheus_operator_reconcile_operations_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]))) > 0.1 for: 10m labels: severity: warning annotations: description: '{{ $value | humanizePercentage }} of reconciling operations failed for {{ $labels.controller }} controller in {{ $labels.namespace }} namespace.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorreconcileerrors summary: Errors while reconciling controller. ok 22.954s ago 248us
alert: PrometheusOperatorNodeLookupErrors expr: rate(prometheus_operator_node_address_lookup_errors_total{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]) > 0.1 for: 10m labels: severity: warning annotations: description: Errors while reconciling Prometheus in {{ $labels.namespace }} Namespace. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatornodelookuperrors summary: Errors while reconciling Prometheus. ok 22.954s ago 77.18us
alert: PrometheusOperatorNotReady expr: min by(namespace, controller) (max_over_time(prometheus_operator_ready{job="prometheus-operator-kube-p-operator",namespace="default"}[5m]) == 0) for: 5m labels: severity: warning annotations: description: Prometheus operator in {{ $labels.namespace }} namespace isn't ready to reconcile {{ $labels.controller }} resources. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatornotready summary: Prometheus operator not ready ok 22.954s ago 242.2us
alert: PrometheusOperatorRejectedResources expr: min_over_time(prometheus_operator_managed_resources{job="prometheus-operator-kube-p-operator",namespace="default",state="rejected"}[5m]) > 0 for: 5m labels: severity: warning annotations: description: Prometheus operator in {{ $labels.namespace }} namespace rejected {{ printf "%0.0f" $value }} {{ $labels.controller }}/{{ $labels.resource }} resources. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-prometheusoperatorrejectedresources summary: Resources rejected by Prometheus operator ok 22.954s ago 218.7us

prometheus

1.788s ago

2.989ms

Rule State Error Last Evaluation Evaluation Time
alert: PrometheusBadConfig expr: max_over_time(prometheus_config_last_reload_successful{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) == 0 for: 10m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to reload its configuration. summary: Failed Prometheus configuration reload. ok 1.789s ago 275.3us
alert: PrometheusNotificationQueueRunningFull expr: (predict_linear(prometheus_notifications_queue_length{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m], 60 * 30) > min_over_time(prometheus_notifications_queue_capacity{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) for: 15m labels: severity: warning annotations: description: Alert notification queue of Prometheus {{$labels.namespace}}/{{$labels.pod}} is running full. summary: Prometheus alert notification queue predicted to run full in less than 30m. ok 1.789s ago 246.8us
alert: PrometheusErrorSendingAlertsToSomeAlertmanagers expr: (rate(prometheus_notifications_errors_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) * 100 > 1 for: 15m labels: severity: warning annotations: description: '{{ printf "%.1f" $value }}% errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to Alertmanager {{$labels.alertmanager}}.' summary: Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. ok 1.788s ago 206.9us
alert: PrometheusErrorSendingAlertsToAnyAlertmanager expr: min without(alertmanager) (rate(prometheus_notifications_errors_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / rate(prometheus_notifications_sent_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) * 100 > 3 for: 15m labels: severity: critical annotations: description: '{{ printf "%.1f" $value }}% minimum errors while sending alerts from Prometheus {{$labels.namespace}}/{{$labels.pod}} to any Alertmanager.' summary: Prometheus encounters more than 3% errors sending alerts to any Alertmanager. ok 1.797s ago 197us
alert: PrometheusNotConnectedToAlertmanagers expr: max_over_time(prometheus_notifications_alertmanagers_discovered{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) < 1 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not connected to any Alertmanagers. summary: Prometheus is not connected to any Alertmanagers. ok 1.797s ago 98.05us
alert: PrometheusTSDBReloadsFailing expr: increase(prometheus_tsdb_reloads_failures_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} reload failures over the last 3h. summary: Prometheus has issues reloading blocks from disk. ok 1.797s ago 265.3us
alert: PrometheusTSDBCompactionsFailing expr: increase(prometheus_tsdb_compactions_failed_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[3h]) > 0 for: 4h labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has detected {{$value | humanize}} compaction failures over the last 3h. summary: Prometheus has issues compacting blocks. ok 1.797s ago 161.7us
alert: PrometheusNotIngestingSamples expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) <= 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is not ingesting samples. summary: Prometheus is not ingesting samples. ok 1.797s ago 119.4us
alert: PrometheusDuplicateTimestamps expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with different values but duplicated timestamp. summary: Prometheus is dropping samples with duplicate timestamps. ok 1.797s ago 92.06us
alert: PrometheusOutOfOrderTimestamps expr: rate(prometheus_target_scrapes_sample_out_of_order_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 10m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} is dropping {{ printf "%.4g" $value }} samples/s with timestamps arriving out of order. summary: Prometheus drops samples with out-of-order timestamps. ok 1.797s ago 108.3us
alert: PrometheusRemoteStorageFailures expr: (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) / (rate(prometheus_remote_storage_failed_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) + rate(prometheus_remote_storage_succeeded_samples_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]))) * 100 > 1 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} failed to send {{ printf "%.1f" $value }}% of the samples to {{ $labels.remote_name}}:{{ $labels.url }} summary: Prometheus fails to send samples to remote storage. ok 1.797s ago 221.1us
alert: PrometheusRemoteWriteBehind expr: (max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) - on(job, instance) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) > 120 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}. summary: Prometheus remote write is behind. ok 1.797s ago 153.6us
alert: PrometheusRemoteWriteDesiredShards expr: (max_over_time(prometheus_remote_storage_shards_desired{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > max_over_time(prometheus_remote_storage_shards_max{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m])) for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write desired shards calculation wants to run {{ $value }} shards for queue {{ $labels.remote_name}}:{{ $labels.url }}, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%s",job="prometheus-operator-kube-p-prometheus",namespace="default"}` $labels.instance | query | first | value }}. summary: Prometheus remote write desired shards calculation wants to run more than configured max shards. ok 1.797s ago 132.9us
alert: PrometheusRuleFailures expr: increase(prometheus_rule_evaluation_failures_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: critical annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has failed to evaluate {{ printf "%.0f" $value }} rules in the last 5m. summary: Prometheus is failing rule evaluations. ok 1.797s ago 294.3us
alert: PrometheusMissingRuleEvaluations expr: increase(prometheus_rule_group_iterations_missed_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has missed {{ printf "%.0f" $value }} rule group evaluations in the last 5m. summary: Prometheus is missing rule evaluations due to slow rule group evaluation. ok 1.797s ago 304.9us
alert: PrometheusTargetLimitHit expr: increase(prometheus_target_scrape_pool_exceeded_target_limit_total{job="prometheus-operator-kube-p-prometheus",namespace="default"}[5m]) > 0 for: 15m labels: severity: warning annotations: description: Prometheus {{$labels.namespace}}/{{$labels.pod}} has dropped {{ printf "%.0f" $value }} targets because the number of targets exceeded the configured target_limit. summary: Prometheus has dropped targets because some scrape configs have exceeded the targets limit. ok 1.797s ago 90.61us

kubernetes-storage

33.801s ago

410.2us

Rule State Error Last Evaluation Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} / kubelet_volume_stats_capacity_bytes{job="kubelet",metrics_path="/metrics",namespace=~".*"} < 0.15 for: 1m labels: severity: critical annotations: description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup summary: PersistentVolume is filling up. ok 33.801s ago 399.4us

file-storage

37.929s ago

2.399ms

Rule State Error Last Evaluation Evaluation Time
alert: file-storage-speed expr: (((sum(increase(file_storage_download_seconds_sum{exception="none",namespace="doc-production"}[5m])) / sum(increase(file_storage_download_size{namespace="doc-production"}[5m])) * 1e+06) > bool 4) + ((sum(increase(file_storage_upload_seconds_sum{exception="none",namespace="doc-production"}[5m])) / sum(increase(file_storage_upload_size{namespace="doc-production"}[5m])) * 1e+06) > bool 4)) > bool 0 for: 5m labels: alertname: analytics-telegram severity: warning annotations: message: "\U0001F4E6 File-storage - Средний показатель sec/MB для скачивания либо загрузки превышает норму!" ok 37.929s ago 1.658ms
alert: file-storage-error expr: (sum(rate(file_storage_upload_seconds_count{exception!="none",namespace="doc-production"}[5m]) or vector(0)) + sum(rate(file_storage_download_seconds_count{exception!="none",namespace="doc-production"}[5m]) or vector(0))) * 60 > bool 1 for: 5m labels: alertname: analytics-telegram severity: warning annotations: message: ⚠️ File-storage - количество ошибок выше нормы! ok 37.928s ago 728.3us