
AI Governance & MCP

The Dark Open-Source Factory — where AI governance is the architecture

In manufacturing, a "dark factory" runs lights-out — fully automated, zero human intervention. Stacktic brings this model to open-source infrastructure. Every stack generates a governed MCP server where the topology you design determines exactly what AI can see, query, and modify. The result: autonomous, metadata-driven operations with zero raw access — an open-source dark factory for topology, relationships, and operations automation.

Why AI governance is the next infrastructure challenge

AI agents are becoming the primary operators of infrastructure. They diagnose incidents, query databases, read logs, and trigger deployments. But today, giving AI access means giving it kubectl — raw, unstructured, unlimited access to your entire cluster.

The industry response is to bolt on governance after the fact — policy engines, permission layers, audit wrappers. This creates another system to maintain, another thing that can drift, another gap between intention and reality.

Stacktic eliminates this problem entirely. The topology you draw IS the governance. When you design your stack — components, links, sub-components — you're simultaneously defining what AI can access, what metadata it receives, what operations it can perform, and what credentials it holds. Nothing to bolt on. Nothing to drift.

THE INDUSTRY PROBLEM

AI + kubectl = full cluster access. No structural boundaries. Every enterprise is one hallucinated command away from a production incident.

THE STACKTIC ANSWER

AI receives structured metadata, typed tools, and scoped credentials — auto-generated from the stack topology. The metadata feed IS the control plane.

THE DARK FACTORY MODEL

You don't govern AI by restricting it. You govern AI by feeding it the right metadata. Stacktic is the interface between AI and production — every tool is validated, every configuration production-hardened, every secret scoped, every operation error-proof. The topology metadata gives AI complete understanding with zero raw access. Lights-out operations.


Stacktic: The Production Interface for AI

AI doesn't talk to your infrastructure directly. Stacktic is the interface. Every tool AI calls was generated from a validated template. Every configuration is production-hardened. Every credential is scoped and resolved at runtime. Every operation is bounded by typed parameters — no arbitrary commands, no room for hallucinated kubectl.

Validated

Every tool generated from tested templates with typed parameters

Production

Configurations are production-grade — Helm values, K8s manifests, SOPS secrets

Restricted

No kubectl, no arbitrary commands — only scoped tools from drawn links

Error-Proof

Typed params, credential scoping, write gating — AI cannot make destructive mistakes


AI + kubectl vs Stacktic MCP

| Capability | AI + kubectl (raw access) | Stacktic MCP |
|---|---|---|
| What AI sees | Raw YAML, thousands of resources, no context | Structured topology — components, links, types, groups |
| How AI queries | kubectl get pods -A — wall of text to parse | query_topology(source="type:kafka") — typed JSON |
| Cross-component awareness | AI must guess relationships from labels | Links are first-class — AI knows what connects to what |
| Credential access | AI needs kubectl get secret — sees raw passwords | MCP server holds credentials internally, AI never sees them |
| Write operations | kubectl apply/delete — anything goes | Per-service write gating (MCP_*_WRITE_ACCESS=false) |
| Blast radius | Entire cluster — can delete any namespace | Only services with drawn links. No link = no access |
| Multi-stack | Full kubeconfig access to every cluster | Each stack's MCP exposes only its own topology |
| Observability | AI runs kubectl logs, curl prometheus manually | Typed tools: loki_query(), prom_query(), grafana_list_dashboards() |
| Incident response | AI improvises — different approach every time | Prompt templates: structured playbook, repeatable steps |
| Adding a new service | Manually teach AI how to access it | Draw a link in Stacktic → MCP tools auto-generated |
| Audit trail | Shell history (if saved) | Every MCP call is JSON-RPC — method, params, timestamp |
| Security review | Review every kubectl command AI might run | Review tool definitions once — that's the attack surface |

How It Works

┌─────────────────────────────────────────────────────────────────────┐
│ AI Client (Claude, Cursor, etc.) │
│ Sends: tools/call, prompts/get │
└──────────────────────────────┬──────────────────────────────────────┘
│ MCP Protocol (JSON-RPC)
│ Authorization: Bearer <api_key>

┌─────────────────────────────────────────────────────────────────────┐
│ APISIX Gateway │
│ key-auth, CORS, TLS termination │
└──────────────────────────────┬──────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────────┐
│ FastMCP Server (per stack) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Topology │ │ Prometheus │ │ Loki │ │
│ │ Tools (7) │ │ Tools (7) │ │ Tools (5) │ Tools are │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ auto- │
│ │ CNPG │ │ Grafana │ │ ClickHouse │ generated │
│ │ Tools (5) │ │ Tools (5) │ │ Tools (5) │ from │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ topology │
│ │ Kafka │ │ ArgoCD │ │ Valkey │ links │
│ │ Tools │ │ Tools (6) │ │ Tools (8) │ │
│ ├──────────────┤ ├──────────────┤ ├──────────────┤ │
│ │ Prompts (6) │ │ Resources(3) │ │ S3 Tools (5) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Credentials from cloud.env (SOPS encrypted) │
│ Write-access gating per service │
└──────────────────────────────┬──────────────────────────────────────┘
│ HTTP/gRPC/TCP (internal cluster network)

┌─────────────────────────────────────────────────────────────────────┐
│ Stack Services │
│ PostgreSQL │ Kafka │ Prometheus │ Grafana │ Loki │ ... │
└─────────────────────────────────────────────────────────────────────┘

The MCP server is a ConfigMap-mounted Python application. Tool files are generated from templates — when you draw a link from FastMCP to Grafana, the grafana_tools.py file is included. When you remove the link, it's deleted. No manual configuration.
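The link-driven generation described above can be sketched in a few lines of Python. The module names and the catalog mapping here are illustrative assumptions, not the actual Stacktic file layout; the point is that a tool file exists if and only if the corresponding link is drawn.

```python
# Sketch: which tool modules get mounted into the MCP server's ConfigMap,
# derived purely from the links drawn in the topology.

# Hypothetical catalog: link target type -> generated tool file.
TOOL_MODULES = {
    "grafana": "grafana_tools.py",
    "loki": "loki_tools.py",
    "prometheus": "prometheus_tools.py",
    "cnpg": "cnpg_tools.py",
}

def modules_for_links(links):
    """Return the tool files to mount for the drawn fastmcp links.

    Targets outside the catalog produce no tools: no link = no access.
    """
    return sorted(
        TOOL_MODULES[target]
        for target in links
        if target in TOOL_MODULES
    )
```

Removing a link from the input list removes the file from the output, which is exactly the "post-gen cleanup" behavior: the AI's attack surface shrinks with the topology.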


Stack Agent API — The Intelligence Layer

The Stack Agent is a lightweight pod deployed per stack that holds the entire topology as structured metadata. It exposes a single API endpoint — POST /metadata/q — that lets you query your stack like a database. The MCP topology tools are a thin wrapper around this API.

What AI gets from the API

  • Every component — name, type, namespace, group
  • Every sub-component — databases, topics, queues, buckets, models
  • Every link — who connects to whom, with direction and type
  • Every attribute — ports, credentials, endpoints, config
  • Cross-stack boundaries — is_external flag
  • Shell command execution with auto-resolved variables

Why this matters for governance

  • AI doesn't need kubectl — it queries structured JSON
  • Variables like {namespace}, {password} resolve automatically
  • AI never sees raw credentials — the Agent substitutes them
  • Commands run inside the Agent pod — not from AI's context
  • Every query is scoped to the stack's topology
  • Dry-run mode lets AI preview without executing

Query Model — Source / Target / Where

Every MCP topology call maps to a POST /metadata/q request with a source/target/where structure. This is the "SQL" of your stack.

Source — WHAT to query

"all" — every component
"type:kafka" — all Kafka instances
"namespace:db" — all in namespace
"group:backend" — all in group
"kafka" — specific by name

Target — WHAT to return

"component" — the component itself
"sub_components" — DBs, topics, queues
"links_to" — outbound connections
"links_from" — inbound connections
"attributes" — config values

Where — HOW to filter

"field": "value" — exact match
"field__contains": "str" — substring
"field__exists": true — field present
"field__in": ["a","b"] — in list
"field__gt": N — greater than

Variable Substitution — Zero Hardcoding

When AI sends a command like kubectl get pods -n {namespace}, the Stack Agent resolves {namespace} to the actual namespace from the topology. The AI never hardcodes service names, ports, or credentials.

| Variable | Resolved from | Example value |
|---|---|---|
| {namespace} | Component metadata | kafka, db, monitoring |
| {name} | Component name | kafka, cnpg, prometheus |
| {port} | Component attributes | 5432, 9092, 9090 |
| {password} | Component attributes | (auto-resolved, AI never sees it) |
| {database} | Sub-component attributes | app_db, analytics |
| {username} | Sub-component attributes | app_user, admin |
| {host} | Computed from type pattern | cluster-db-rw.db.svc.cluster.local |
| {link_name} | Link metadata | fastapi-kafka_topic |
| {linked_namespace} | Linked component | messaging |
| {linked_is_external} | Cross-stack flag | true / false |

Key governance point: AI writes generic commands with {variables}. The Stack Agent resolves them. The same command works across any stack, any environment — zero hardcoded names, ports, or credentials.
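At its core, the substitution step is a template fill from component metadata. The sketch below uses Python's built-in format-map mechanism and invented metadata values; the real Stack Agent also computes derived fields such as {host} and injects credentials without ever returning them to the AI.

```python
def resolve(command: str, metadata: dict) -> str:
    """Fill {variables} in a generic command from topology metadata.

    Minimal sketch: the agent, not the AI, holds the metadata, so the
    same generic command works on any stack.
    """
    return command.format_map(metadata)

# Illustrative metadata as it might appear for a database component.
metadata = {
    "namespace": "db",
    "host": "cluster-db-rw.db.svc.cluster.local",
    "port": 5432,
}

resolved = resolve("kubectl get pods -n {namespace}", metadata)
# resolved == "kubectl get pods -n db"
```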

Command Pipeline — Test + Diagnose

┌──────────────────────────────────────────────────────────────────────────┐
│ AI calls: run_test(source="type:cnpg", target="sub_components", │
│ command="PGPASSWORD={password} psql -h {host} -U {username} │
│ -d {database} -c 'SELECT 1'") │
└──────────────────────────────┬───────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────────────────────┐
│ Stack Agent resolves variables per sub-component: │
│ {password} → "s3cur3pw", {host} → "cluster-db-rw.db.svc...", │
│ {username} → "app_user", {database} → "app_db" │
└──────────────────────────────┬───────────────────────────────────────────┘

                      ┌──────────┴──────────┐
                      │  command executes   │
                      │  (exit code check)  │
                      └──────────┬──────────┘
                                 │
                     ┌───────────┴───────────┐
                     │                       │
              exit 0 (pass)           exit ≠ 0 (fail)
                     │                       │
                     ▼                       ▼
           ┌─────────────────┐     ┌─────────────────┐
           │ wait            │     │ run on_failure  │
           │ verify_delay    │     │ (diagnostics)   │
           │ then on_success │     └─────────────────┘
           └─────────────────┘
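The branching in the diagram reduces to a small exit-code dispatch. This is a simplified sketch: the real pipeline resolves variables per sub-component first and honors verify_delay before the on_success step, neither of which is modeled here.

```python
import subprocess

def run_test(command, on_success=None, on_failure=None):
    """Sketch of the test pipeline: run the resolved command, then branch
    on the exit code — on_success after a pass, on_failure diagnostics
    after a fail. Returns a small result record."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    passed = result.returncode == 0
    follow_up = on_success if passed else on_failure
    follow_up_output = None
    if follow_up:
        follow_up_output = subprocess.run(
            follow_up, shell=True, capture_output=True, text=True
        ).stdout
    return {
        "status": "passed" if passed else "failed",
        "output": result.stdout.strip(),
        "follow_up": follow_up_output,
    }
```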

MCP Tools that Wrap the API

| MCP Tool | What it does | Maps to |
|---|---|---|
| query_topology() | Full query with all parameters — source, target, where, command, pipeline | POST /metadata/q with full body |
| list_components() | List all components with types, groups, namespaces | POST /metadata/q → source="all", target="component" |
| get_component_links() | Get outbound or inbound links for a component | POST /metadata/q → target="links_to" or "links_from" |
| run_test() | Run a validation command with full pipeline | POST /metadata/q with command + on_success/on_failure |
| get_stack_structure() | Get component tree with sub-component counts | GET /metadata/structure |
| get_available_fields() | Discover what {variables} are available for a component | GET /metadata/fields/{component} |

MCP Resources — Persistent Context

| Resource URI | What AI gets | Use case |
|---|---|---|
| stack://guide/api | Full API reference — source options, target options, where operators, host patterns, CLI tools, test examples | AI reads once, knows how to query everything |
| stack://live/summary | Live snapshot — domain, component counts by type, namespaces, groups | Quick stack overview before deep queries |
| stack://live/structure | Live topology tree — every component, sub-component count, link list | Full map for navigation and planning |
How it works together: AI reads the stack://guide/api resource once to learn the query language. Then calls get_stack_structure() to understand the topology. Then uses query_topology() and run_test() to investigate specific components. The MCP prompts (diagnose_component, incident_response, etc.) encode this workflow as reusable playbooks.

Example: AI Queries the Stack

1. "What's in this stack?"
tools/call → list_components()
→ [{name: "kafka", type: "kafka", namespace: "kafka", group: "messaging"},
 {name: "cnpg", type: "cnpg", namespace: "db", group: "databases"},
 {name: "fastapi", type: "fastapi", namespace: "app", group: "backend"},
 {name: "apisix", type: "apisix", namespace: "ingress", group: "gateway"}, ...]
2. "What databases exist?"
tools/call → query_topology(source="type:cnpg", target="sub_components",
                            select=["database", "username", "consumers"])
→ [{database: "app_db", username: "app_user", consumers: ["fastapi"]},
 {database: "keycloak_db", username: "keycloak", consumers: ["keycloak"]}]
3. "Can the backend reach its database?"
tools/call → run_test(
source="type:cnpg", target="sub_components",
command='PGPASSWORD={password} psql -h {host} -p {port} -U {username} -d {database} -c "SELECT 1"',
on_failure='kubectl get pods -n {namespace} -l cnpg.io/cluster --no-headers')
→ {status: "passed", results: [{name: "app_db", status: "passed", output: "1"},
                              {name: "keycloak_db", status: "passed", output: "1"}]}
4. "Preview before executing" (dry run)
tools/call → query_topology(source="apisix", target="links_to",
                            command="curl -sk https://{subdomain}.{domain}",
                            dry_run=true)
→ [{name: "fastapi", command: "curl -sk https://api.stack-3.source-lab.io"},
 {name: "grafana", command: "curl -sk https://grafana.stack-3.source-lab.io"},
 {name: "langflow", command: "curl -sk https://langflow.stack-3.source-lab.io"}]

AI never hardcoded a single name, port, or password. It queried the topology, got structured data, and used {variables} for commands. The Stack Agent resolved everything from the metadata. Same queries work on any stack.


Governance by Architecture — 4 Layers

Layer 1: Generation (Day 0)

What CAN exist — 70 curated templates. AI can't invent components or create arbitrary connections. Link types are predefined.

Templates → Hooks → Deterministic output

Layer 2: Topology (Runtime)

What AI KNOWS — Stack Agent provides structured metadata. AI sees components through a typed API, not raw kubectl output.

Stack Agent → Metadata API → Structured JSON

Layer 3: Operations (Day 2)

What AI CAN DO — MCP tools with write-access gating. Prompt templates guide AI through structured workflows. Everything auditable.

MCP Tools → Write Gating → Audit Trail

Layer 4: Multi-Stack Isolation

What SCOPE AI has — Each stack generates its own MCP with its own credentials and topology. Cross-stack access is explicitly controlled.

Per-stack MCP → Isolated credentials → is_external boundaries


360° Stack Metadata for AI

Stacktic provides AI with complete, structured metadata about every layer of the stack. Instead of parsing YAML or guessing service names, AI gets typed, queryable data from a single source of truth.

| Metadata | What AI learns | Source |
|---|---|---|
| Components | Every service — name, type, namespace, group | list_components() |
| Topology | What connects to what — all links with directions | get_component_links() |
| Sub-components | Databases, topics, queues, buckets, models | query_topology(target="sub_components") |
| Attributes | Ports, credentials, config per component | query_topology(target="attributes") |
| Cross-stack | Which services are external, their endpoints | is_external flag in metadata |
| Prometheus metrics | Active targets, alert rules, metric values | prom_query(), prom_alerts() |
| Grafana dashboards | All dashboards, datasources, folders | grafana_list_dashboards() |
| Loki logs | Log labels, streams, LogQL query results | loki_labels(), loki_query_range() |
| Database schemas | Tables, columns, row counts | ch_describe_table(), cnpg tools |
| ArgoCD state | App sync status, health, resource tree | argocd_list_applications() |

Key point: This metadata is not manually configured. It's auto-generated from the topology you draw in Stacktic. Add a component, its metadata becomes queryable. Draw a link, the connection appears in the API.


When you draw a link from FastMCP to a service, the corresponding tool file is generated and mounted into the MCP server. When you remove the link, the file is deleted.

| Link drawn | Tools generated | Operations |
|---|---|---|
| fastmcp → stack_agent | Topology (7 tools) + Prompts (6) | query_topology, run_test, list_components, get_stack_structure, get_available_fields, get_component_links |
| fastmcp → cnpg | CNPG (5 tools) | cnpg_list_databases, cnpg_query, cnpg_table_stats, cnpg_describe_table, cnpg_list_tables |
| fastmcp → prometheus | Prometheus (7 tools) | prom_query, prom_query_range, prom_alerts, prom_targets, prom_rules, prom_label_values, prom_series |
| fastmcp → grafana | Grafana (5-6 tools) | grafana_list_dashboards, grafana_get_dashboard, grafana_list_datasources, grafana_get_alerts, grafana_list_folders, grafana_create_annotation* |
| fastmcp → loki | Loki (5 tools) | loki_query, loki_query_range, loki_labels, loki_label_values, loki_series |
| fastmcp → clickhouse | ClickHouse (5 tools) | ch_list_databases, ch_list_tables, ch_describe_table, ch_query, ch_table_stats |
| fastmcp → rabbitmq | RabbitMQ (varies) | Queue management, message operations |
| fastmcp → valkey | Valkey (6-8 tools) | valkey_info, valkey_dbsize, valkey_scan_keys, valkey_get, valkey_type_and_ttl, valkey_hgetall, valkey_set, valkey_delete |
| fastmcp → argo_cd | ArgoCD (5-6 tools) | argocd_list_applications, argocd_get_application, argocd_get_app_resources, argocd_list_projects, argocd_get_app_events, argocd_sync_application* |
| fastmcp → seaweedfs | S3 (5 tools) | s3_list_buckets, s3_list_objects, s3_get_object, s3_read_text, s3_bucket_stats |
| fastmcp → fastapi | FastAPI (varies) | HTTP proxy tools for backend API |
* = write-gated — only registered when MCP_*_WRITE_ACCESS=true

Write-Access Gating

Every service connection has an independent write-access flag. Read tools are always available. Write tools only appear when explicitly enabled.

Read-only (default)

MCP_CLICKHOUSE_WRITE_ACCESS=False
MCP_S3_WRITE_ACCESS=false
MCP_GRAFANA_WRITE_ACCESS=false
MCP_ARGOCD_WRITE_ACCESS=false

AI can query, inspect, and monitor — but cannot modify data, create annotations, or trigger deployments.

Write-enabled (explicit)

MCP_RABBITMQ_WRITE_ACCESS=True
MCP_VALKEY_WRITE_ACCESS=True
MCP_CLICKHOUSE_WRITE_ACCESS=True

AI gains write tools: publish messages, set cache keys, insert rows. Each service is independently controlled.
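The gating pattern amounts to conditional tool registration at server start. This is an illustrative sketch, not the FastMCP decorator-based implementation; only the MCP_&lt;SERVICE&gt;_WRITE_ACCESS naming convention is taken from the flags shown above, and the tool lists are placeholders.

```python
import os

def write_enabled(service: str) -> bool:
    """True only when the per-service flag is explicitly set to true."""
    flag = os.environ.get(f"MCP_{service.upper()}_WRITE_ACCESS", "false")
    return flag.lower() == "true"

def register_tools(service, read_tools, write_tools):
    """Read tools are always registered; write tools only behind the flag."""
    tools = list(read_tools)
    if write_enabled(service):
        tools += write_tools
    return tools
```

Because each service reads its own flag, enabling writes on Valkey says nothing about ClickHouse: the blast radius of a misconfigured flag stays within one service connection.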


MCP Prompts — AI Playbooks

MCP prompts are pre-built instruction templates that teach AI agents how to operate the stack. Instead of improvising, AI follows structured, repeatable playbooks.

diagnose_component

Query topology → check outbound links → check inbound links → discover test variables → run health tests → summarize blast radius and status.

explain_stack

Get structure → list all components → map connections → identify architecture patterns → present layered view with data flows.

incident_response

Identify suspect components from symptom → check each one → trace dependency chain → check ArgoCD for recent deploys → present root cause + blast radius.

trace_data_flow

Walk the link graph hop-by-hop from source to destination → test each connection → report protocol, health, and bottlenecks per hop.

run_validation

Systematic test suite: pods → services → component-specific checks → link connectivity → present PASS/FAIL table with health score.

check_logs

Get namespace from topology → query Loki logs for error patterns → correlate with Prometheus metrics → present timeline and root cause.


CNCF Four Pillars of Platform Control

The CNCF published a framework for governing AI in infrastructure (January 2026). Stacktic's MCP architecture maps to all four pillars.

| CNCF Pillar | Definition | Stacktic Implementation |
|---|---|---|
| Golden Paths | Pre-built, approved workflows | MCP Prompts — diagnose_component, incident_response, run_validation |
| Guardrails | Hard limits on what AI can do | Write-access gating per service, tool boundaries, typed parameters |
| Safety Nets | Catch mistakes before damage | dry_run=True on tests, read-only defaults, no raw kubectl |
| Manual Review | Human approves risky actions | Write tools only registered when explicitly enabled by platform team |

Governance Detail — What Controls What

| Governance Mechanism | What it controls | How it works | Example |
|---|---|---|---|
| Template Catalog | What CAN exist | 70 curated templates — AI can't invent components | No random services outside the catalog |
| Link Types | What CAN connect | Defined link types per template | fastapi-kafka_bridge exists, arbitrary links don't |
| Hooks | What configuration is POSSIBLE | Deterministic code generation from links | Prometheus scrape config computed from links, not AI-generated |
| Post-gen cleanup | What files EXIST | Unused tool files deleted at generation time | No Loki link → loki_tools.py deleted → AI has no Loki tools |
| Credential scoping | What AI can AUTHENTICATE to | cloud.env only contains credentials for linked services | No ClickHouse link → no ClickHouse credentials in MCP |
| Write-access gating | What AI can MODIFY | Per-service boolean in environment variables | MCP_RABBITMQ_WRITE_ACCESS=True but MCP_CLICKHOUSE_WRITE_ACCESS=False |
| Tool boundaries | What OPERATIONS are available | Typed tools with defined parameters | AI calls ch_query(sql) not kubectl exec clickhouse |
| MCP Prompts | HOW AI operates | Structured playbooks for multi-step workflows | incident_response follows: suspects → check → trace → fix |
| Stack isolation | What SCOPE AI has | Each stack has its own MCP with its own credentials | SRE MCP can't touch production databases |
| APISIX gateway | WHO can access | API key authentication at ingress | Only authenticated AI clients reach the MCP |
| Stack Agent metadata | What AI UNDERSTANDS | Structured data model, not raw k8s resources | AI sees {type: "kafka", links_to: [...]} not 500 lines of YAML |

Example: AI Diagnoses a Kafka Issue

An AI client calls prompts/get("diagnose_component", {component_name: "kafka"}) via MCP. Here's what happens — all through governed tools, no raw cluster access.

1
Get component info
tools/call → query_topology(source="kafka", target="component")
→ {name: "kafka", type: "kafka", namespace: "kafka", group: "messaging"}
2
Check dependencies
tools/call → query_topology(source="kafka", target="links_to")
→ 0 outbound links

tools/call → query_topology(source="kafka", target="links_from")
→ [{link: "prometheus-kafka", component: "prometheus"}]
3
Run health test
tools/call → run_test(
source="kafka",
command='pods=$(kubectl get pods -n {namespace} --no-headers | grep -c Running); test "$pods" -gt 0'
)
→ {status: "passed", output: "9 running"}
4
AI summarizes

"Kafka is healthy — 9 pods running in namespace kafka. Prometheus monitors it. No outbound dependencies. Blast radius: only Prometheus scraping would be affected if Kafka goes down."

Every step went through MCP. The AI never touched kubectl directly. It used typed tools, got structured responses, and followed the prompt template. Every call is logged as JSON-RPC.


Roadmap

AVAILABLE TODAY

Auto-generated MCP Tools

  • 40+ tools across 11 service types
  • 6 MCP prompt templates
  • 3 MCP resources (API guide, live summary, structure)
  • Write-access gating per service
  • APISIX gateway authentication
  • SOPS-encrypted credentials
NEXT

Stack DNA Resource

  • One-shot full stack context for AI
  • Components + links + sub-components + schemas + data flows
  • AI reads once, understands the entire system
  • Includes OpenAPI specs from linked FastAPI services
PLANNED

Drift Detection

  • Compare live state vs Stacktic-generated desired state
  • Detect: missing pods, broken links, expired certs, config drift
  • Continuous auditing — AI as governance agent
PLANNED

Autonomous Remediation

  • Alert → diagnose → fix → verify (closed loop)
  • Governed by write-access flags — AI can only fix what's allowed
  • Cross-stack AI coordination with per-stack boundaries

MCP Protocol — Technical Details

The MCP server uses the Model Context Protocol (now part of the Linux Foundation's Agentic AI Foundation). It speaks JSON-RPC 2.0 over streamable HTTP.

| Property | Value |
|---|---|
| Protocol | MCP 2025-03-26 (JSON-RPC 2.0) |
| Transport | Streamable HTTP (POST /mcp) |
| Authentication | Bearer token (Authorization: Bearer <key>) |
| Session management | Mcp-Session-Id header |
| Capabilities | Tools, Prompts, Resources |
| Runtime | Python 3.12 + FastMCP SDK |
| Deployment | Kubernetes pod with ConfigMap-mounted tools |
| Credentials | SOPS-encrypted cloud.env in Kubernetes Secret |

Compatible AI clients: Claude Desktop, Cursor, VS Code Copilot, any MCP-compatible client.

MCP Endpoints

POST /mcp        # All MCP operations (initialize, tools/call, prompts/get, etc.)
GET  /health     # Health check
GET  /mcp        # SSE endpoint for server-initiated messages

Example: Initialize + Call Tool

# 1. Initialize session
curl -X POST https://fastmcp.your-domain.com/mcp \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize",
"params":{"protocolVersion":"2025-03-26",
"clientInfo":{"name":"my-ai","version":"1.0"}}}'

# 2. Call a tool
curl -X POST https://fastmcp.your-domain.com/mcp \
-H "Mcp-Session-Id: SESSION_ID" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"jsonrpc":"2.0","id":2,"method":"tools/call",
"params":{"name":"list_components","arguments":{}}}'