Metrics collection, alerting, and multi-stack federation. Operates as either Master (central receiver) or Remote (forwarding writer). Auto-generates PrometheusRule alert manifests per linked component with cluster label injection for remote instances.
The prometheus_master attribute controls how this Prometheus instance operates. When you have multiple stacks, one Prometheus acts as the Master and receives metrics from all other stacks' Remote instances.
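1. One stack runs the Master (prometheus_master: true), the rest are Remote.
2. Each Remote links to the Master via prometheus-prometheus (cross-stack, is_external).
3. Each Remote adds a cluster="{system_name}" label to all scraped metrics via scrapeClasses, which tags every metric with its stack origin.
4. Each Remote forwards its metrics to the Master via remoteWrite (Master's remote_host + link's api_key + TLS skip verify).
5. Each Remote sends alerts to the Master's Alertmanager (remote_alerts URL) instead of its own.
6. Alert PromQL expressions on a Remote get cluster="{system_name}" injected so alerts filter by stack.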
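The cluster label injection into alert expressions can be pictured with a small sketch. This is not the generator's actual post-gen code: the function name `inject_cluster` is hypothetical, and the approach is deliberately naive. It only rewrites selectors that already have braces, so bare metric names and braces inside quoted strings are not handled.

```python
import re


def inject_cluster(expr: str, system_name: str) -> str:
    """Naively add cluster="<system_name>" to every {...} selector in a PromQL expression.

    Hypothetical helper for illustration only; it does not parse PromQL.
    """
    matcher = f'cluster="{system_name}"'

    def rewrite(match: re.Match) -> str:
        inner = match.group(1).strip()
        # Empty selector: the cluster matcher stands alone.
        if not inner:
            return "{" + matcher + "}"
        # Non-empty selector: prepend the cluster matcher to the existing ones.
        return "{" + matcher + "," + inner + "}"

    return re.sub(r"\{([^{}]*)\}", rewrite, expr)


if __name__ == "__main__":
    print(inject_cluster('rate(http_requests_total{job="api"}[5m]) > 100', "dev"))
```

With the label in place, the same alert rule evaluated on the Master side can be told apart per stack, for example dev versus staging.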
| Behavior | Master (true) | Remote (false) |
|---|---|---|
| Remote Write Receiver | Enabled: accepts incoming metrics | Disabled: sends metrics outbound |
| TSDB Out-of-Order | outOfOrderTimeWindow: 1h (handles clock skew) | Not configured |
| scrapeClasses | None: no cluster labeling on local metrics | cluster="{system_name}" added to all scraped metrics |
| remoteWrite | Not configured | Sends to Master's remote_host with API key + TLS |
| ruleSelector | {} (matches ALL PrometheusRules) | Scoped to own release label |
| Alertmanager Target | Local Alertmanager | Master's Alertmanager (remote_alerts) |
| Alert PromQL Injection | None: expressions unchanged | cluster="{system_name}" injected into all alert PromQL (post-gen) |
| Alerts Directory | Not included in kustomization | Included if any alert link exists |
Example: the sre stack runs the Master, and the dev and staging stacks link to it via prometheus-prometheus. The dev/staging instances add cluster="dev" / cluster="staging" to all metrics and forward everything to sre's Prometheus. All alerts route to sre's Alertmanager.
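The Master/Remote split can be summarized as a small decision function. This is a sketch under assumptions, not the generator's code: the function `mode_settings` and the dictionary keys are hypothetical names chosen to mirror the behaviors documented here, and `remote_host` / `remote_alerts` stand for the values a Remote obtains from its linked Master.

```python
from typing import Any


def mode_settings(prometheus_master: bool, system_name: str,
                  remote_host: str = "", remote_alerts: str = "") -> dict[str, Any]:
    """Return the mode-dependent behavior for one Prometheus instance (illustrative only)."""
    if prometheus_master:
        return {
            "remote_write_receiver": True,      # accepts incoming metrics
            "out_of_order_time_window": "1h",   # tolerates clock skew between stacks
            "cluster_label": None,              # no cluster labeling on local metrics
            "remote_write_url": None,           # the Master does not forward
            "rule_selector": "all",             # {} matches all PrometheusRules
            "alertmanager": "local",
            "inject_cluster_into_alerts": False,
        }
    return {
        "remote_write_receiver": False,
        "out_of_order_time_window": None,
        "cluster_label": system_name,           # cluster="{system_name}" on all scraped metrics
        # An empty remote_host means no remoteWrite is configured, even if linked.
        "remote_write_url": remote_host or None,
        "rule_selector": "own release",
        # An empty remote_alerts means the Remote keeps its local Alertmanager.
        "alertmanager": remote_alerts or "local",
        "inject_cluster_into_alerts": True,
    }


if __name__ == "__main__":
    sre = mode_settings(True, "sre")
    dev = mode_settings(False, "dev", "https://prom.sre.io/api/v1/write", "https://alerts.sre.io")
    print(sre["remote_write_receiver"], dev["cluster_label"], dev["remote_write_url"])
```

Calling it for sre (Master) and dev (Remote) shows how a single boolean drives the receiver, labeling, forwarding, and alert routing together.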
| Attribute | Example | Description |
|---|---|---|
| namespace (REQ) | prometheus | Kubernetes namespace for all Prometheus resources |
| prometheus_master (REQ) | true / false | Master vs Remote mode: controls remoteWrite, scrapeClasses, ruleSelector, alert injection (see above) |
| retention_time | 15d | TSDB data retention period |
| retention_size | 40GB | TSDB data retention size limit |
| storage_size | 50Gi | PersistentVolumeClaim size for TSDB storage |
| push_gateway | true / false | Deploys PushGateway + ServiceMonitor (pushgateway.yaml, :9091) |
| resource_profile | medium | Selects resource patch: small, medium, large, x-large |
| kube_state_metrics_enabled | true / false | Enables kube-state-metrics deployment and ServiceMonitor |
| chart_version_prometheus | 11.0.2 | kube-prometheus-stack Helm chart version |
| remote_host (REQ) | https://prom.sre.io/api/v1/write | Master only. Required for external federation. URL where Remote instances send metrics via remoteWrite. If empty, no remoteWrite is configured on Remote instances even if linked. |
| remote_alerts (REQ) | https://alerts.sre.io | Master only. Required for external federation. Alertmanager URL where Remote instances send alerts. If empty, Remote uses its own local Alertmanager. |
| smtp_host | smtp.example.com:587 | SMTP host for Alertmanager email notifications (stored in alert.env) |
| from | alerts@example.com | SMTP sender address for alert emails |
| smtp_user | alert-user | SMTP username for authentication |
| smtp_password | secret | SMTP password for authentication |
| slack_api | https://hooks.slack.com/... | Slack webhook URL for alert notifications |
| configmap | key=value | Custom entries appended to alert.env configmap |
| Link Type | Direction | What It Automates |
|---|---|---|
| prometheus-prometheus | Outbound (is_external) | Federation link: Remote sends remoteWrite + alerts to Master across clusters. Auto-configures remoteWrite URL, API key auth, TLS, queue config, additionalAlertManagerConfigs. Link attribute: api_key (authentication for remote write) |
| prometheus-apisix | Outbound | Generates alerts/apisix-alerts.yaml + enables ServiceMonitor on APISIX side |
| prometheus-cert_manager | Outbound | Generates alerts/cert-mger-alert.yaml — Certificate expiry and renewal alerts |
| prometheus-elasticsearch | Outbound | Generates alerts/elasticsearch-alerts.yaml — Cluster health and indexing alerts |
| prometheus-keycloak_operator | Outbound | Generates alerts/keycloak-alerts.yaml — Keycloak availability alerts |
| prometheus-loki | Outbound | Generates alerts/loki-alerts.yaml — Ingestion rate and error alerts |
| prometheus-mongodb | Outbound | Generates alerts/mongo-alerts.yaml + mongo-operator.yaml — Replication, connection, operator alerts |
| prometheus-opa | Outbound | Generates alerts/opa.yaml — OPA Gatekeeper policy violation alerts |
| prometheus-rabbitmq | Outbound | Generates alerts/rabbitmq-alers.yaml — Queue depth and memory alerts |
| prometheus-qdrant | Outbound | Generates alerts/qdrant-alerts.yaml — Collection and memory alerts |
| prometheus-nextcloud | Outbound | Generates alerts/nextcloud-alerts.yaml — Nextcloud availability alerts |
Each prometheus-{component} outbound link generates a PrometheusRule alert file on the Prometheus side and enables a ServiceMonitor on the target component's side. Alert files are removed in post-gen if the link doesn't exist. The entire alerts/ directory is excluded from kustomization when prometheus_master: true or when no alert links exist.
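The link-to-file rule can be sketched as a lookup. The mapping below copies the file names from the link table; the function `active_alert_files` is a hypothetical illustration of the selection logic, not the generator's implementation.

```python
# Alert files generated per linked component, as listed in the link table.
ALERT_FILES = {
    "apisix": ["apisix-alerts.yaml"],
    "cert_manager": ["cert-mger-alert.yaml"],
    "elasticsearch": ["elasticsearch-alerts.yaml"],
    "keycloak_operator": ["keycloak-alerts.yaml"],
    "loki": ["loki-alerts.yaml"],
    "mongodb": ["mongo-alerts.yaml", "mongo-operator.yaml"],
    "opa": ["opa.yaml"],
    "rabbitmq": ["rabbitmq-alers.yaml"],
    "qdrant": ["qdrant-alerts.yaml"],
    "nextcloud": ["nextcloud-alerts.yaml"],
}


def active_alert_files(prometheus_master: bool, links: set[str]) -> list[str]:
    """Return the alert files kept after post-gen for the given links (illustrative only)."""
    # On a Master the whole alerts/ directory is excluded from kustomization.
    if prometheus_master:
        return []
    files: list[str] = []
    for component, names in ALERT_FILES.items():
        # A file survives only when its prometheus-{component} link exists.
        if f"prometheus-{component}" in links:
            files.extend(f"alerts/{name}" for name in names)
    return files


if __name__ == "__main__":
    print(active_alert_files(False, {"prometheus-mongodb", "prometheus-opa"}))
```

An empty result corresponds to the case where the alerts/ directory is left out of the kustomization entirely, either because the instance is the Master or because no alert link exists.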
| File | Condition | Contains |
|---|---|---|
| k8s/deploy/base/namespace.yaml | Always | Namespace |
| k8s/deploy/base/kustomization.yaml | Always | Resources, secretGenerator (alert.env), resource_profile patch, conditional alerts/ directory |
| k8s/deploy/base/pushgateway.yaml | push_gateway: true | PushGateway Deployment + Service + ServiceMonitor (:9091) |
| k8s/deploy/base/patch/resource-{profile}.yaml | Per resource_profile | CPU/memory limits for Prometheus, Alertmanager, Operator (small/medium/large/xlarge) |
| k8s/deploy/base/alerts/kustomization.yaml | NOT master AND any alert link | Kustomize listing of active alert files (conditional per link) |
| k8s/deploy/base/alerts/apisix-alerts.yaml | prometheus-apisix linked | APISIX PrometheusRule alerts |
| k8s/deploy/base/alerts/cert-mger-alert.yaml | prometheus-cert_manager linked | Certificate expiry and renewal alerts |
| k8s/deploy/base/alerts/elasticsearch-alerts.yaml | prometheus-elasticsearch linked | Cluster health and indexing alerts |
| k8s/deploy/base/alerts/keycloak-alerts.yaml | prometheus-keycloak_operator linked | Keycloak availability alerts |
| k8s/deploy/base/alerts/loki-alerts.yaml | prometheus-loki linked | Ingestion rate and error alerts |
| k8s/deploy/base/alerts/mongo-alerts.yaml | prometheus-mongodb linked | MongoDB replication and connection alerts |
| k8s/deploy/base/alerts/mongo-operator.yaml | prometheus-mongodb linked | MongoDB operator alerts |
| k8s/deploy/base/alerts/opa.yaml | prometheus-opa linked | OPA Gatekeeper policy violation alerts |
| k8s/deploy/base/alerts/rabbitmq-alers.yaml | prometheus-rabbitmq linked | Queue depth and memory alerts |
| k8s/deploy/base/alerts/qdrant-alerts.yaml | prometheus-qdrant linked | Collection and memory alerts |
| k8s/deploy/base/alerts/nextcloud-alerts.yaml | prometheus-nextcloud linked | Nextcloud availability alerts |
| k8s/deploy/base/secret/alert.env | Always | SMTP + Slack credentials for Alertmanager notifications + custom configmap entries |
| k8s/deploy/base/secret/minio.env | Always (SOPS) | Remote write credentials (SOPS encrypted) |
| helm/helm-values.yaml | Always | kube-prometheus-stack Helm values: prometheusSpec (scrapeClasses, remoteWrite, ruleSelector, storage), alertmanager, operator, nodeExporter, kubeStateMetrics, additionalAlertManagerConfigs |
| helm/generate-yaml.sh | Always | Helm template render script → outputs prometheus.yaml |