Prometheus

Metrics collection, alerting, and multi-stack federation. Operates as either Master (central receiver) or Remote (forwarding writer). Auto-generates PrometheusRule alert manifests per linked component with cluster label injection for remote instances.

Architecture

Prometheus Server - Metrics collection + TSDB storage (:9090)
Alertmanager - Alert routing + notification dispatch (:9093)
Operator - CRD controller for Prometheus, Alertmanager, ServiceMonitor, PrometheusRule
Node Exporter - Host-level metrics (CPU, memory, disk, network) (:9100)
Kube State Metrics - Kubernetes object state metrics (optional)
PushGateway - Push-based metric ingestion for batch jobs (optional, :9091)

Multi-Stack: Master vs Remote

The prometheus_master attribute controls how this Prometheus instance operates. When you have multiple stacks, one Prometheus acts as the Master and receives metrics from all other stacks' Remote instances.

How multi-stack federation works:
1. Each stack has its own Prometheus. One is set as Master (prometheus_master: true); the rest are Remote.
2. Remote instances link to the Master via prometheus-prometheus (cross-stack, is_external).
3. The Remote automatically adds a cluster="{system_name}" label to all scraped metrics via scrapeClasses, tagging every metric with its stack of origin.
4. The Remote sends all metrics to the Master via remoteWrite, using the Master's remote_host, the link's api_key, and TLS with certificate verification skipped.
5. The Remote sends alerts to the Master's Alertmanager (using Master's remote_alerts URL) instead of its own.
6. In post-gen, all alert PromQL expressions on Remote instances get cluster="{system_name}" injected so alerts filter by stack.
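
Step 6 can be illustrated with a small Python sketch. This is not the generator's actual post-gen code: the function name `inject_cluster` is hypothetical, and the regex only rewrites selectors that already have braces. Bare metric names such as `up` and label values that themselves contain braces are not handled.

```python
import re

# Matches a PromQL label selector body such as {code="500"} or {}.
# Assumption: label values do not themselves contain braces.
_SELECTOR = re.compile(r"\{([^{}]*)\}")


def inject_cluster(expr: str, system_name: str) -> str:
    """Append cluster="<system_name>" to every braced selector in expr."""
    matcher = f'cluster="{system_name}"'

    def _add(match: re.Match) -> str:
        # Drop surrounding whitespace and an optional trailing comma.
        inner = match.group(1).strip().rstrip(",").strip()
        if matcher in inner:
            # Already injected: leave the selector untouched.
            return match.group(0)
        return "{" + (f"{inner},{matcher}" if inner else matcher) + "}"

    return _SELECTOR.sub(_add, expr)


if __name__ == "__main__":
    print(inject_cluster('rate(http_requests_total{code="500"}[5m]) > 0', "dev"))
```

Running the injection twice leaves the expression unchanged, which is convenient if post-gen runs more than once over the same files.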

Behavior	Master (`true`)	Remote (`false`)
Remote Write Receiver	Enabled — accepts incoming metrics	Disabled — sends metrics outbound
TSDB Out-of-Order	`outOfOrderTimeWindow: 1h` (handles clock skew)	Not configured
scrapeClasses	None — no cluster labeling on local metrics	`cluster="{system_name}"` added to all scraped metrics
remoteWrite	Not configured	Sends to Master's `remote_host` with API key + TLS
ruleSelector	`{}` — matches ALL PrometheusRules	Scoped to own release label
Alertmanager Target	Local Alertmanager	Master's Alertmanager (`remote_alerts`)
Alert PromQL Injection	None — expressions unchanged	`cluster="{system_name}"` injected into all alert PromQL (post-gen)
Alerts Directory	Not included in kustomization	Included if any alert link exists
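
The table can be read as one function of prometheus_master. The Python sketch below derives the mode-dependent settings as a dictionary. It rests on several assumptions. The function name, the `release` parameter, the scrape class name `cluster-label`, and the summary key `alertmanagerTarget` are hypothetical. The remaining field names are assumed to follow the Prometheus Operator's PrometheusSpec, and the generator's real helm-values.yaml layout may differ. API key authentication is left out because this page does not state how the key is sent.

```python
from typing import Any


def prometheus_spec(master: bool, system_name: str, release: str,
                    master_remote_host: str = "",
                    master_remote_alerts: str = "") -> dict[str, Any]:
    """Return mode-dependent settings mirroring the Master/Remote table."""
    if master:
        return {
            "enableRemoteWriteReceiver": True,       # accepts incoming metrics
            "tsdb": {"outOfOrderTimeWindow": "1h"},  # tolerates clock skew
            "ruleSelector": {},                      # matches all PrometheusRules
            "alertmanagerTarget": "local",           # hypothetical summary key
        }

    spec: dict[str, Any] = {
        # Default scrape class that stamps every scraped series with the stack name.
        "scrapeClasses": [{
            "name": "cluster-label",  # hypothetical name
            "default": True,
            "relabelings": [{"targetLabel": "cluster", "replacement": system_name}],
        }],
        # Scoped to this instance's own release label.
        "ruleSelector": {"matchLabels": {"release": release}},
        # Falls back to the local Alertmanager when remote_alerts is empty.
        "alertmanagerTarget": master_remote_alerts or "local",
    }
    # No remoteWrite is configured when the Master's remote_host is empty.
    if master_remote_host:
        spec["remoteWrite"] = [{
            "url": master_remote_host,
            "tlsConfig": {"insecureSkipVerify": True},
        }]
    return spec
```

For example, `prometheus_spec(False, "dev", "dev-prometheus", "https://prom.sre.io/api/v1/write", "https://alerts.sre.io")` yields a Remote configuration with one remoteWrite entry, while calling it with `master=True` yields no remoteWrite and no scrapeClasses.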

Example: Stack "sre" has a Master Prometheus. Stacks "dev" and "staging" each have a Remote Prometheus instance linked to "sre-prometheus" via prometheus-prometheus. The dev and staging instances add cluster="dev" and cluster="staging" respectively to all metrics and forward everything to sre's Prometheus. All alerts route to sre's Alertmanager.

Attributes

Attribute	Example	Description
`namespace` REQ	`prometheus`	Kubernetes namespace for all Prometheus resources
`prometheus_master` REQ	`true / false`	Master vs Remote mode — controls remoteWrite, scrapeClasses, ruleSelector, alert injection (see above)
`retention_time`	`15d`	TSDB data retention period
`retention_size`	`40GB`	TSDB data retention size limit
`storage_size`	`50Gi`	PersistentVolumeClaim size for TSDB storage
`push_gateway`	`true / false`	Deploys PushGateway + ServiceMonitor (pushgateway.yaml, :9091)
`resource_profile`	`medium`	Selects resource patch: `small`, `medium`, `large`, `x-large`
`kube_state_metrics_enabled`	`true / false`	Enables kube-state-metrics deployment and ServiceMonitor
`chart_version_prometheus`	`11.0.2`	kube-prometheus-stack Helm chart version
`remote_host` REQ	`https://prom.sre.io/api/v1/write`	Master only — Required for external federation. URL where Remote instances send metrics via remoteWrite. If empty, no remoteWrite is configured on Remote instances even if linked.
`remote_alerts` REQ	`https://alerts.sre.io`	Master only — Required for external federation. Alertmanager URL where Remote instances send alerts. If empty, Remote uses its own local Alertmanager.
`smtp_host`	`smtp.example.com:587`	SMTP host for Alertmanager email notifications (stored in alert.env)
`from`	`alerts@example.com`	SMTP sender address for alert emails
`smtp_user`	`alert-user`	SMTP username for authentication
`smtp_password`	`secret`	SMTP password for authentication
`slack_api`	`https://hooks.slack.com/...`	Slack webhook URL for alert notifications
`configmap`	`key=value`	Custom entries appended to alert.env configmap
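
The attribute rules above can be checked before generation. The Python sketch below is illustrative only: `validate_attributes` and its messages are hypothetical, and the profile set uses the spelling from this table (`x-large`), whereas the Generated Files table writes `xlarge`.

```python
# Allowed values as listed in the Attributes table (assumption: this spelling is canonical).
RESOURCE_PROFILES = {"small", "medium", "large", "x-large"}
# Attributes marked REQ for every instance.
REQUIRED = ("namespace", "prometheus_master")


def validate_attributes(attrs: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED:
        # prometheus_master: False is a valid value, so only test for absence/emptiness.
        if key not in attrs or attrs[key] in ("", None):
            problems.append(f"missing required attribute: {key}")

    profile = attrs.get("resource_profile")
    if profile is not None and profile not in RESOURCE_PROFILES:
        problems.append(f"unknown resource_profile: {profile}")

    # remote_host and remote_alerts only matter on the Master.
    if attrs.get("prometheus_master") is True:
        for key in ("remote_host", "remote_alerts"):
            if not attrs.get(key):
                problems.append(f"{key} is empty on Master")
    return problems
```

A Remote instance with only `namespace` and `prometheus_master: false` passes with no problems. A Master without `remote_host` is flagged because linked Remote instances would then configure no remoteWrite.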

Links

Link Type	Direction	What It Automates
prometheus-prometheus	Outbound (is_external)	Federation link — Remote sends remoteWrite + alerts to Master across clusters. Auto-configures: remoteWrite URL, API key auth, TLS, queue config, additionalAlertManagerConfigs. Link attribute: `api_key` — authentication for remote write
prometheus-apisix	Outbound	Generates `alerts/apisix-alerts.yaml` + enables ServiceMonitor on APISIX side
prometheus-cert_manager	Outbound	Generates `alerts/cert-mger-alert.yaml` — Certificate expiry and renewal alerts
prometheus-elasticsearch	Outbound	Generates `alerts/elasticsearch-alerts.yaml` — Cluster health and indexing alerts
prometheus-keycloak_operator	Outbound	Generates `alerts/keycloak-alerts.yaml` — Keycloak availability alerts
prometheus-loki	Outbound	Generates `alerts/loki-alerts.yaml` — Ingestion rate and error alerts
prometheus-mongodb	Outbound	Generates `alerts/mongo-alerts.yaml` + `mongo-operator.yaml` — Replication, connection, operator alerts
prometheus-opa	Outbound	Generates `alerts/opa.yaml` — OPA Gatekeeper policy violation alerts
prometheus-rabbitmq	Outbound	Generates `alerts/rabbitmq-alers.yaml` — Queue depth and memory alerts
prometheus-qdrant	Outbound	Generates `alerts/qdrant-alerts.yaml` — Collection and memory alerts
prometheus-nextcloud	Outbound	Generates `alerts/nextcloud-alerts.yaml` — Nextcloud availability alerts

Note: Each prometheus-{component} outbound link generates a PrometheusRule alert file on the Prometheus side and enables a ServiceMonitor on the target component's side. Alert files are removed in post-gen if the link doesn't exist. The entire alerts/ directory is excluded from kustomization when prometheus_master: true or when no alert links exist.
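
The inclusion rule in the note can be sketched in Python. The mapping is copied from the Links table, file names included. The function name `active_alert_files` is hypothetical, and the real post-gen step may be implemented differently.

```python
# Link type -> alert files, as listed in the Links table.
ALERT_FILES = {
    "prometheus-apisix": ["apisix-alerts.yaml"],
    "prometheus-cert_manager": ["cert-mger-alert.yaml"],
    "prometheus-elasticsearch": ["elasticsearch-alerts.yaml"],
    "prometheus-keycloak_operator": ["keycloak-alerts.yaml"],
    "prometheus-loki": ["loki-alerts.yaml"],
    "prometheus-mongodb": ["mongo-alerts.yaml", "mongo-operator.yaml"],
    "prometheus-opa": ["opa.yaml"],
    "prometheus-rabbitmq": ["rabbitmq-alers.yaml"],
    "prometheus-qdrant": ["qdrant-alerts.yaml"],
    "prometheus-nextcloud": ["nextcloud-alerts.yaml"],
}


def active_alert_files(master: bool, links: set[str]) -> list[str]:
    """Return the alert files to keep; an empty list means alerts/ is excluded."""
    if master:
        # A Master never includes the alerts/ directory.
        return []
    files = []
    for link, names in ALERT_FILES.items():
        if link in links:
            files.extend(f"alerts/{name}" for name in names)
    return files
```

A Remote linked to Loki and MongoDB keeps three files: `alerts/loki-alerts.yaml`, `alerts/mongo-alerts.yaml`, and `alerts/mongo-operator.yaml`. Links that generate no alert file, such as prometheus-prometheus, are ignored.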

Generated Files

File	Condition	Contains
`k8s/deploy/base/namespace.yaml`	Always	Namespace
`k8s/deploy/base/kustomization.yaml`	Always	Resources, secretGenerator (alert.env), resource_profile patch, conditional alerts/ directory
`k8s/deploy/base/pushgateway.yaml`	push_gateway: true	PushGateway Deployment + Service + ServiceMonitor (:9091)
`k8s/deploy/base/patch/resource-{profile}.yaml`	Per resource_profile	CPU/memory limits for Prometheus, Alertmanager, Operator (small/medium/large/xlarge)
`k8s/deploy/base/alerts/kustomization.yaml`	NOT master AND any alert link	Kustomize listing of active alert files (conditional per link)
`k8s/deploy/base/alerts/apisix-alerts.yaml`	prometheus-apisix linked	APISIX PrometheusRule alerts
`k8s/deploy/base/alerts/cert-mger-alert.yaml`	prometheus-cert_manager linked	Certificate expiry and renewal alerts
`k8s/deploy/base/alerts/elasticsearch-alerts.yaml`	prometheus-elasticsearch linked	Cluster health and indexing alerts
`k8s/deploy/base/alerts/keycloak-alerts.yaml`	prometheus-keycloak_operator linked	Keycloak availability alerts
`k8s/deploy/base/alerts/loki-alerts.yaml`	prometheus-loki linked	Ingestion rate and error alerts
`k8s/deploy/base/alerts/mongo-alerts.yaml`	prometheus-mongodb linked	MongoDB replication and connection alerts
`k8s/deploy/base/alerts/mongo-operator.yaml`	prometheus-mongodb linked	MongoDB operator alerts
`k8s/deploy/base/alerts/opa.yaml`	prometheus-opa linked	OPA Gatekeeper policy violation alerts
`k8s/deploy/base/alerts/rabbitmq-alers.yaml`	prometheus-rabbitmq linked	Queue depth and memory alerts
`k8s/deploy/base/alerts/qdrant-alerts.yaml`	prometheus-qdrant linked	Collection and memory alerts
`k8s/deploy/base/alerts/nextcloud-alerts.yaml`	prometheus-nextcloud linked	Nextcloud availability alerts
`k8s/deploy/base/secret/alert.env`	Always	SMTP + Slack credentials for Alertmanager notifications + custom configmap entries
`k8s/deploy/base/secret/minio.env`	Always (SOPS)	Remote write credentials (SOPS encrypted)
`helm/helm-values.yaml`	Always	kube-prometheus-stack Helm values: prometheusSpec (scrapeClasses, remoteWrite, ruleSelector, storage), alertmanager, operator, nodeExporter, kubeStateMetrics, additionalAlertManagerConfigs
`helm/generate-yaml.sh`	Always	Helm template render script → outputs prometheus.yaml