AI Governance & MCP
The Dark Open-Source Factory — where AI governance is the architecture
In manufacturing, a "dark factory" runs lights-out — fully automated, zero human intervention. Stacktic brings this model to open-source infrastructure. Every stack generates a governed MCP server where the topology you design determines exactly what AI can see, query, and modify. The result: autonomous, metadata-driven operations with zero raw access — an open-source dark factory for topology, relationships, and operations automation.
Why AI governance is the next infrastructure challenge
AI agents are becoming the primary operators of infrastructure. They diagnose incidents, query databases, read logs, and trigger deployments. But today, giving AI access means giving it kubectl — raw, unstructured, unlimited access to your entire cluster.
The industry response is to bolt on governance after the fact — policy engines, permission layers, audit wrappers. This creates another system to maintain, another thing that can drift, another gap between intention and reality.
Stacktic eliminates this problem entirely. The topology you draw IS the governance. When you design your stack — components, links, sub-components — you're simultaneously defining what AI can access, what metadata it receives, what operations it can perform, and what credentials it holds. Nothing to bolt on. Nothing to drift.
Without Stacktic: AI + kubectl = full cluster access. No structural boundaries. Every enterprise is one hallucinated command away from a production incident.
With Stacktic: AI receives structured metadata, typed tools, and scoped credentials — auto-generated from the stack topology. The metadata feed IS the control plane.
You don't govern AI by restricting it. You govern AI by feeding it the right metadata. Stacktic is the interface between AI and production — every tool is validated, every configuration production-hardened, every secret scoped, every operation error-proof. The topology metadata gives AI complete understanding with zero raw access. Lights-out operations.
Stacktic: The Production Interface for AI
AI doesn't talk to your infrastructure directly. Stacktic is the interface. Every tool AI calls was generated from a validated template. Every configuration is production-hardened. Every credential is scoped and resolved at runtime. Every operation is bounded by typed parameters — no arbitrary commands, no room for hallucinated kubectl.
Every tool generated from tested templates with typed parameters
Configurations are production-grade — Helm values, K8s manifests, SOPS secrets
No kubectl, no arbitrary commands — only scoped tools from drawn links
Typed params, credential scoping, write gating — AI cannot make destructive mistakes
AI + kubectl vs Stacktic MCP
| Capability | AI + kubectl (raw access) | Stacktic MCP |
|---|---|---|
| What AI sees | Raw YAML, thousands of resources, no context | Structured topology — components, links, types, groups |
| How AI queries | kubectl get pods -A — wall of text to parse | query_topology(source="type:kafka") — typed JSON |
| Cross-component awareness | AI must guess relationships from labels | Links are first-class — AI knows what connects to what |
| Credential access | AI needs kubectl get secret — sees raw passwords | MCP server holds credentials internally, AI never sees them |
| Write operations | kubectl apply/delete — anything goes | Per-service write gating (MCP_*_WRITE_ACCESS=false) |
| Blast radius | Entire cluster — can delete any namespace | Only services with drawn links. No link = no access |
| Multi-stack | Full kubeconfig access to every cluster | Each stack's MCP exposes only its own topology |
| Observability | AI runs kubectl logs, curl prometheus manually | Typed tools: loki_query(), prom_query(), grafana_list_dashboards() |
| Incident response | AI improvises — different approach every time | Prompt templates: structured playbook, repeatable steps |
| Adding a new service | Manually teach AI how to access it | Draw a link in Stacktic → MCP tools auto-generated |
| Audit trail | Shell history (if saved) | Every MCP call is JSON-RPC — method, params, timestamp |
| Security review | Review every kubectl command AI might run | Review tool definitions once — that's the attack surface |
How It Works
┌─────────────────────────────────────────────────────────────────────┐
│ AI Client (Claude, Cursor, etc.) │
│ Sends: tools/call, prompts/get │
└──────────────────────────────┬──────────────────────────────────────┘
│ MCP Protocol (JSON-RPC)
│ Authorization: Bearer <api_key>
▼
┌─────────────────────────────────────────────────────────────────────┐
│ APISIX Gateway │
│ key-auth, CORS, TLS termination │
└──────────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ FastMCP Server (per stack) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Topology │ │ Prometheus │ │ Loki │ │
│ │ Tools (7) │ │ Tools (7) │ │ Tools (5) │ Tools are │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ auto- │
│ │ CNPG │ │ Grafana │ │ ClickHouse │ generated │
│ │ Tools (5) │ │ Tools (5) │ │ Tools (5) │ from │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ topology │
│ │ Kafka │ │ ArgoCD │ │ Valkey │ links │
│ │ Tools │ │ Tools (6) │ │ Tools (8) │ │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ │
│ │ Prompts (6) │ │ Resources(3) │ │ S3 Tools (5) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Credentials from cloud.env (SOPS encrypted) │
│ Write-access gating per service │
└──────────────────────────────┬──────────────────────────────────────┘
│ HTTP/gRPC/TCP (internal cluster network)
▼
┌─────────────────────────────────────────────────────────────────────┐
│ Stack Services │
│ PostgreSQL │ Kafka │ Prometheus │ Grafana │ Loki │ ... │
└─────────────────────────────────────────────────────────────────────┘
The MCP server is a ConfigMap-mounted Python application. Tool files are generated from templates — when you draw a link from FastMCP to Grafana, the grafana_tools.py file is included. When you remove the link, it's deleted. No manual configuration.
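The link-to-file relationship can be pictured with a minimal sketch. The mapping and file names below are illustrative stand-ins mirroring the tool tables later on this page, not the actual generator code:

```python
# Sketch of link-driven tool generation: each drawn link maps to a tool
# module included in the MCP server ConfigMap. No link means no file,
# which means AI has no tools for that service at all.
LINK_TO_MODULE = {
    "grafana": "grafana_tools.py",
    "loki": "loki_tools.py",
    "cnpg": "cnpg_tools.py",
    "prometheus": "prometheus_tools.py",
}

def mounted_tool_files(drawn_links: set[str]) -> list[str]:
    # Only services with a drawn link contribute a tool module
    return sorted(LINK_TO_MODULE[svc] for svc in drawn_links if svc in LINK_TO_MODULE)

print(mounted_tool_files({"grafana", "cnpg"}))
# ['cnpg_tools.py', 'grafana_tools.py'] — remove the grafana link and its module disappears
```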
Stack Agent API — The Intelligence Layer
The Stack Agent is a lightweight pod deployed per stack that holds the entire topology as structured metadata. It exposes a single API endpoint — POST /metadata/q — that lets you query your stack like a database. The MCP topology tools are a thin wrapper around this API.
What AI gets from the API
- Every component — name, type, namespace, group
- Every sub-component — databases, topics, queues, buckets, models
- Every link — who connects to whom, with direction and type
- Every attribute — ports, credentials, endpoints, config
- Cross-stack boundaries — is_external flag
- Run shell commands with auto-resolved variables
Why this matters for governance
- AI doesn't need kubectl — it queries structured JSON
- Variables like {namespace}, {password} resolve automatically
- AI never sees raw credentials — the Agent substitutes them
- Commands run inside the Agent pod — not from AI's context
- Every query is scoped to the stack's topology
- Dry-run mode lets AI preview without executing
Query Model — Source / Target / Where
Every MCP topology call maps to a POST /metadata/q request with a source/target/where structure. This is the "SQL" of your stack.
Source — WHAT to query
"all" — every component
"type:kafka" — all Kafka instances
"namespace:db" — all in namespace
"group:backend" — all in group
"kafka" — specific by name
Target — WHAT to return
"component" — the component itself
"sub_components" — DBs, topics, queues
"links_to" — outbound connections
"links_from" — inbound connections
"attributes" — config values
Where — HOW to filter
"field": "value" — exact match
"field__contains": "str" — substring
"field__exists": true — field present
"field__in": ["a","b"] — in list
"field__gt": N — greater than
Variable Substitution — Zero Hardcoding
When AI sends a command like kubectl get pods -n {namespace}, the Stack Agent resolves {namespace} to the actual namespace from the topology. The AI never hardcodes service names, ports, or credentials.
| Variable | Resolved from | Example value |
|---|---|---|
| {namespace} | Component metadata | kafka, db, monitoring |
| {name} | Component name | kafka, cnpg, prometheus |
| {port} | Component attributes | 5432, 9092, 9090 |
| {password} | Component attributes | (auto-resolved, AI never sees it) |
| {database} | Sub-component attributes | app_db, analytics |
| {username} | Sub-component attributes | app_user, admin |
| {host} | Computed from type pattern | cluster-db-rw.db.svc.cluster.local |
| {link_name} | Link metadata | fastapi-kafka_topic |
| {linked_namespace} | Linked component | messaging |
| {linked_is_external} | Cross-stack flag | true / false |
Key governance point: AI writes generic commands with {variables}. The Stack Agent resolves them. The same command works across any stack, any environment — zero hardcoded names, ports, or credentials.
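The substitution itself is simple to picture. This sketch assumes a flat metadata dict per component; the actual Stack Agent resolves values from topology layers (component, sub-component, link) server-side, so secrets never reach the AI:

```python
import re

# Illustrative {variable} resolver: placeholders in a generic command are
# replaced with values from component metadata before execution.
def resolve(command: str, metadata: dict) -> str:
    def sub(match):
        key = match.group(1)
        if key not in metadata:
            raise KeyError(f"unknown variable {{{key}}}")
        return str(metadata[key])
    return re.sub(r"\{(\w+)\}", sub, command)

meta = {"host": "cluster-db-rw.db.svc.cluster.local",
        "username": "app_user", "database": "app_db", "password": "s3cur3pw"}

cmd = resolve("psql -h {host} -U {username} -d {database}", meta)
print(cmd)  # psql -h cluster-db-rw.db.svc.cluster.local -U app_user -d app_db
```

Because the command text is generic, the same string resolves correctly against any stack's metadata, which is exactly what makes the prompt playbooks portable.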
Command Pipeline — Test + Diagnose
┌──────────────────────────────────────────────────────────────────────────┐
│ AI calls: run_test(source="type:cnpg", target="sub_components", │
│ command="PGPASSWORD={password} psql -h {host} -U {username} │
│ -d {database} -c 'SELECT 1'") │
└──────────────────────────────┬───────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Stack Agent resolves variables per sub-component: │
│ {password} → "s3cur3pw", {host} → "cluster-db-rw.db.svc...", │
│ {username} → "app_user", {database} → "app_db" │
└──────────────────────────────┬───────────────────────────────────────────┘
│
┌──────────┴──────────┐
│ command executes │
│ (exit code check) │
└──────────┬──────────┘
│
┌───────────┴───────────┐
│ │
exit 0 (pass) exit ≠ 0 (fail)
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ wait │ │ run on_failure │
│ verify_delay │ │ (diagnostics) │
│ then on_success │ └─────────────────┘
└─────────────────┘
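The exit-code branching above can be sketched in a few lines. This is a simplified stand-in (no verify_delay, single command instead of per-sub-component fan-out), assuming a POSIX shell:

```python
import subprocess

# Sketch of the test pipeline: run the resolved command, branch on exit
# code to on_success or on_failure diagnostics. dry_run returns the
# command without executing anything.
def run_test(command, on_success=None, on_failure=None, dry_run=False):
    if dry_run:
        return {"status": "dry_run", "command": command}
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode == 0:
        follow = (subprocess.run(on_success, shell=True, capture_output=True,
                                 text=True).stdout if on_success else None)
        return {"status": "passed", "output": result.stdout.strip(), "on_success": follow}
    diag = (subprocess.run(on_failure, shell=True, capture_output=True,
                           text=True).stdout if on_failure else None)
    return {"status": "failed", "output": result.stderr.strip(), "diagnostics": diag}

print(run_test("echo 1", dry_run=True)["status"])                  # dry_run
print(run_test("true")["status"])                                  # passed
print(run_test("false", on_failure="echo diagnosing")["status"])   # failed
```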
MCP Tools that Wrap the API
| MCP Tool | What it does | Maps to |
|---|---|---|
| query_topology() | Full query with all parameters — source, target, where, command, pipeline | POST /metadata/q with full body |
| list_components() | List all components with types, groups, namespaces | POST /metadata/q → source="all", target="component" |
| get_component_links() | Get outbound or inbound links for a component | POST /metadata/q → target="links_to" or "links_from" |
| run_test() | Run a validation command with full pipeline | POST /metadata/q with command + on_success/on_failure |
| get_stack_structure() | Get component tree with sub-component counts | GET /metadata/structure |
| get_available_fields() | Discover what {variables} are available for a component | GET /metadata/fields/{component} |
MCP Resources — Persistent Context
| Resource URI | What AI gets | Use case |
|---|---|---|
| stack://guide/api | Full API reference — source options, target options, where operators, host patterns, CLI tools, test examples | AI reads once, knows how to query everything |
| stack://live/summary | Live snapshot — domain, component counts by type, namespaces, groups | Quick stack overview before deep queries |
| stack://live/structure | Live topology tree — every component, sub-component count, link list | Full map for navigation and planning |
How it works together: AI reads the stack://guide/api resource once to learn the query language. Then calls get_stack_structure() to understand the topology. Then uses query_topology() and run_test() to investigate specific components. The MCP prompts (diagnose_component, incident_response, etc.) encode this workflow as reusable playbooks.
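The bootstrap sequence described above can be written down as an ordered list of MCP requests a client would issue. The method names are standard MCP methods; the arguments are illustrative:

```python
# The read-guide → map-structure → query → playbook workflow, as the
# sequence of MCP requests a client would send (arguments illustrative).
workflow = [
    ("resources/read", {"uri": "stack://guide/api"}),
    ("tools/call", {"name": "get_stack_structure", "arguments": {}}),
    ("tools/call", {"name": "query_topology",
                    "arguments": {"source": "type:cnpg", "target": "sub_components"}}),
    ("prompts/get", {"name": "diagnose_component",
                     "arguments": {"component_name": "cnpg"}}),
]
for method, params in workflow:
    print(method, "→", params.get("uri") or params.get("name"))
```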
Example: AI Queries the Stack
tools/call → list_components()
→ [{name: "kafka", type: "kafka", namespace: "kafka", group: "messaging"},
{name: "cnpg", type: "cnpg", namespace: "db", group: "databases"},
{name: "fastapi", type: "fastapi", namespace: "app", group: "backend"},
{name: "apisix", type: "apisix", namespace: "ingress", group: "gateway"}, ...]tools/call → query_topology(source="type:cnpg", target="sub_components",
select=["database", "username", "consumers"])
→ [{database: "app_db", username: "app_user", consumers: ["fastapi"]},
{database: "keycloak_db", username: "keycloak", consumers: ["keycloak"]}]tools/call → run_test(
source="type:cnpg", target="sub_components",
command='PGPASSWORD={password} psql -h {host} -p {port} -U {username} -d {database} -c "SELECT 1"',
on_failure='kubectl get pods -n {namespace} -l cnpg.io/cluster --no-headers')
→ {status: "passed", results: [{name: "app_db", status: "passed", output: "1"},
{name: "keycloak_db", status: "passed", output: "1"}]}tools/call → query_topology(source="apisix", target="links_to",
command="curl -sk https://{subdomain}.{domain}",
dry_run=true)
→ [{name: "fastapi", command: "curl -sk https://api.stack-3.source-lab.io"},
{name: "grafana", command: "curl -sk https://grafana.stack-3.source-lab.io"},
{name: "langflow", command: "curl -sk https://langflow.stack-3.source-lab.io"}]AI never hardcoded a single name, port, or password. It queried the topology, got structured data, and used {variables} for commands. The Stack Agent resolved everything from the metadata. Same queries work on any stack.
Governance by Architecture — 4 Layers
Layer 1: Generation (Day 0)
What CAN exist — 70 curated templates. AI can't invent components or create arbitrary connections. Link types are predefined.
Templates → Hooks → Deterministic output
Layer 2: Topology (Runtime)
What AI KNOWS — Stack Agent provides structured metadata. AI sees components through a typed API, not raw kubectl output.
Stack Agent → Metadata API → Structured JSON
Layer 3: Operations (Day 2)
What AI CAN DO — MCP tools with write-access gating. Prompt templates guide AI through structured workflows. Everything auditable.
MCP Tools → Write Gating → Audit Trail
Layer 4: Multi-Stack Isolation
What SCOPE AI has — Each stack generates its own MCP with its own credentials and topology. Cross-stack access is explicitly controlled.
Per-stack MCP → Isolated credentials → is_external boundaries
360° Stack Metadata for AI
Stacktic provides AI with complete, structured metadata about every layer of the stack. Instead of parsing YAML or guessing service names, AI gets typed, queryable data from a single source of truth.
| Metadata | What AI learns | Source |
|---|---|---|
| Components | Every service — name, type, namespace, group | list_components() |
| Topology | What connects to what — all links with directions | get_component_links() |
| Sub-components | Databases, topics, queues, buckets, models | query_topology(target="sub_components") |
| Attributes | Ports, credentials, config per component | query_topology(target="attributes") |
| Cross-stack | Which services are external, their endpoints | is_external flag in metadata |
| Prometheus metrics | Active targets, alert rules, metric values | prom_query(), prom_alerts() |
| Grafana dashboards | All dashboards, datasources, folders | grafana_list_dashboards() |
| Loki logs | Log labels, streams, LogQL query results | loki_labels(), loki_query_range() |
| Database schemas | Tables, columns, row counts | ch_describe_table(), cnpg tools |
| ArgoCD state | App sync status, health, resource tree | argocd_list_applications() |
Key point: This metadata is not manually configured. It's auto-generated from the topology you draw in Stacktic. Add a component, its metadata becomes queryable. Draw a link, the connection appears in the API.
MCP Tools — Auto-Generated from Links
When you draw a link from FastMCP to a service, the corresponding tool file is generated and mounted into the MCP server. When you remove the link, the file is deleted.
| Link drawn | Tools generated | Operations |
|---|---|---|
| fastmcp → stack_agent | Topology (7 tools) + Prompts (6) | query_topology, run_test, list_components, get_stack_structure, get_available_fields, get_component_links |
| fastmcp → cnpg | CNPG (5 tools) | cnpg_list_databases, cnpg_query, cnpg_table_stats, cnpg_describe_table, cnpg_list_tables |
| fastmcp → prometheus | Prometheus (7 tools) | prom_query, prom_query_range, prom_alerts, prom_targets, prom_rules, prom_label_values, prom_series |
| fastmcp → grafana | Grafana (5-6 tools) | grafana_list_dashboards, grafana_get_dashboard, grafana_list_datasources, grafana_get_alerts, grafana_list_folders, grafana_create_annotation* |
| fastmcp → loki | Loki (5 tools) | loki_query, loki_query_range, loki_labels, loki_label_values, loki_series |
| fastmcp → clickhouse | ClickHouse (5 tools) | ch_list_databases, ch_list_tables, ch_describe_table, ch_query, ch_table_stats |
| fastmcp → rabbitmq | RabbitMQ (varies) | Queue management, message operations |
| fastmcp → valkey | Valkey (6-8 tools) | valkey_info, valkey_dbsize, valkey_scan_keys, valkey_get, valkey_type_and_ttl, valkey_hgetall, valkey_set, valkey_delete |
| fastmcp → argo_cd | ArgoCD (5-6 tools) | argocd_list_applications, argocd_get_application, argocd_get_app_resources, argocd_list_projects, argocd_get_app_events, argocd_sync_application* |
| fastmcp → seaweedfs | S3 (5 tools) | s3_list_buckets, s3_list_objects, s3_get_object, s3_read_text, s3_bucket_stats |
| fastmcp → fastapi | FastAPI (varies) | HTTP proxy tools for backend API |
Write-Access Gating (MCP_*_WRITE_ACCESS)
Every service connection has an independent write-access flag. Read tools are always available. Write tools only appear when explicitly enabled.
Read-only (default)
MCP_CLICKHOUSE_WRITE_ACCESS=false
MCP_S3_WRITE_ACCESS=false
MCP_GRAFANA_WRITE_ACCESS=false
MCP_ARGOCD_WRITE_ACCESS=false

AI can query, inspect, and monitor — but cannot modify data, create annotations, or trigger deployments.
Write-enabled (explicit)
MCP_RABBITMQ_WRITE_ACCESS=true
MCP_VALKEY_WRITE_ACCESS=true
MCP_CLICKHOUSE_WRITE_ACCESS=true

AI gains write tools: publish messages, set cache keys, insert rows. Each service is independently controlled.
MCP Prompts — AI Playbooks
MCP prompts are pre-built instruction templates that teach AI agents how to operate the stack. Instead of improvising, AI follows structured, repeatable playbooks.
diagnose_component
Query topology → check outbound links → check inbound links → discover test variables → run health tests → summarize blast radius and status.
explain_stack
Get structure → list all components → map connections → identify architecture patterns → present layered view with data flows.
incident_response
Identify suspect components from symptom → check each one → trace dependency chain → check ArgoCD for recent deploys → present root cause + blast radius.
trace_data_flow
Walk the link graph hop-by-hop from source to destination → test each connection → report protocol, health, and bottlenecks per hop.
run_validation
Systematic test suite: pods → services → component-specific checks → link connectivity → present PASS/FAIL table with health score.
check_logs
Get namespace from topology → query Loki logs for error patterns → correlate with Prometheus metrics → present timeline and root cause.
CNCF Four Pillars of Platform Control
The CNCF published a framework for governing AI in infrastructure (January 2026). Stacktic's MCP architecture maps to all four pillars.
| CNCF Pillar | Definition | Stacktic Implementation |
|---|---|---|
| Golden Paths | Pre-built, approved workflows | MCP Prompts — diagnose_component, incident_response, run_validation |
| Guardrails | Hard limits on what AI can do | Write-access gating per service, tool boundaries, typed parameters |
| Safety Nets | Catch mistakes before damage | dry_run=True on tests, read-only defaults, no raw kubectl |
| Manual Review | Human approves risky actions | Write tools only registered when explicitly enabled by platform team |
Governance Detail — What Controls What
| Governance Mechanism | What it controls | How it works | Example |
|---|---|---|---|
| Template Catalog | What CAN exist | 70 curated templates — AI can't invent components | No random services outside the catalog |
| Link Types | What CAN connect | Defined link types per template | fastapi-kafka_bridge exists, arbitrary links don't |
| Hooks | What configuration is POSSIBLE | Deterministic code generation from links | Prometheus scrape config computed from links, not AI-generated |
| Post-gen cleanup | What files EXIST | Unused tool files deleted at generation time | No Loki link → loki_tools.py deleted → AI has no Loki tools |
| Credential scoping | What AI can AUTHENTICATE to | cloud.env only contains credentials for linked services | No ClickHouse link → no ClickHouse credentials in MCP |
| Write-access gating | What AI can MODIFY | Per-service boolean in environment variables | MCP_RABBITMQ_WRITE_ACCESS=true but MCP_CLICKHOUSE_WRITE_ACCESS=false |
| Tool boundaries | What OPERATIONS are available | Typed tools with defined parameters | AI calls ch_query(sql) not kubectl exec clickhouse |
| MCP Prompts | HOW AI operates | Structured playbooks for multi-step workflows | incident_response follows: suspects → check → trace → fix |
| Stack isolation | What SCOPE AI has | Each stack has its own MCP with its own credentials | SRE MCP can't touch production databases |
| APISIX gateway | WHO can access | API key authentication at ingress | Only authenticated AI clients reach the MCP |
| Stack Agent metadata | What AI UNDERSTANDS | Structured data model, not raw k8s resources | AI sees {type: "kafka", links_to: [...]} not 500 lines of YAML |
Example: AI Diagnoses a Kafka Issue
An AI client calls prompts/get("diagnose_component", {component_name: "kafka"}) via MCP. Here's what happens — all through governed tools, no raw cluster access.
tools/call → query_topology(source="kafka", target="component")
→ {name: "kafka", type: "kafka", namespace: "kafka", group: "messaging"}tools/call → query_topology(source="kafka", target="links_to")
→ 0 outbound links
tools/call → query_topology(source="kafka", target="links_from")
→ [{link: "prometheus-kafka", component: "prometheus"}]tools/call → run_test(
source="kafka",
command='pods=$(kubectl get pods -n {namespace} --no-headers | grep -c Running); test "$pods" -gt 0'
)
→ {status: "passed", output: "9 running"}"Kafka is healthy — 9 pods running in namespace kafka. Prometheus monitors it. No outbound dependencies. Blast radius: only Prometheus scraping would be affected if Kafka goes down."
Every step went through MCP. The AI never touched kubectl directly. It used typed tools, got structured responses, and followed the prompt template. Every call is logged as JSON-RPC.
Roadmap
Auto-generated MCP Tools
- 40+ tools across 11 service types
- 6 MCP prompt templates
- 3 MCP resources (API guide, live summary, structure)
- Write-access gating per service
- APISIX gateway authentication
- SOPS-encrypted credentials
Stack DNA Resource
- One-shot full stack context for AI
- Components + links + sub-components + schemas + data flows
- AI reads once, understands the entire system
- Includes OpenAPI specs from linked FastAPI services
Drift Detection
- Compare live state vs Stacktic-generated desired state
- Detect: missing pods, broken links, expired certs, config drift
- Continuous auditing — AI as governance agent
Autonomous Remediation
- Alert → diagnose → fix → verify (closed loop)
- Governed by write-access flags — AI can only fix what's allowed
- Cross-stack AI coordination with per-stack boundaries
MCP Protocol — Technical Details
The MCP server uses the Model Context Protocol (now part of the Linux Foundation's Agentic AI Foundation). It speaks JSON-RPC 2.0 over streamable HTTP.
| Property | Value |
|---|---|
| Protocol | MCP 2025-03-26 (JSON-RPC 2.0) |
| Transport | Streamable HTTP (POST /mcp) |
| Authentication | Bearer token (Authorization: Bearer <key>) |
| Session management | Mcp-Session-Id header |
| Capabilities | Tools, Prompts, Resources |
| Runtime | Python 3.12 + FastMCP SDK |
| Deployment | Kubernetes pod with ConfigMap-mounted tools |
| Credentials | SOPS-encrypted cloud.env in Kubernetes Secret |
Compatible AI clients: Claude Desktop, Cursor, VS Code Copilot, any MCP-compatible client.
MCP Endpoints
POST /mcp # All MCP operations (initialize, tools/call, prompts/get, etc.)
GET /health # Health check
GET /mcp # SSE endpoint for server-initiated messages
Example: Initialize + Call Tool
# 1. Initialize session
curl -X POST https://fastmcp.your-domain.com/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize",
"params":{"protocolVersion":"2025-03-26",
"clientInfo":{"name":"my-ai","version":"1.0"}}}'
# 2. Call a tool
curl -X POST https://fastmcp.your-domain.com/mcp \
-H "Mcp-Session-Id: SESSION_ID" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/call",
"params":{"name":"list_components","arguments":{}}}'