
AI Ops Automation

Stacktic automates the deployment and management of production AI infrastructure on Kubernetes. Design your AI stack visually, connect components, and Stacktic generates everything: LLM serving, vector databases, RAG applications, API gateways, and auto-scaling.

Architecture Overview

        ┌───────────────────────────────────────────────────────────┐
        │                      APISIX Gateway                       │
        │          Routes, AI Plugins, Authentication, SSL          │
        └───────────────────────────────────────────────────────────┘
                                      │
         ┌────────────────────────────┼────────────────────────────┐
         ▼                            ▼                            ▼
┌─────────────────┐          ┌─────────────────┐          ┌─────────────────┐
│   Open-WebUI    │          │    Langflow     │          │   External AI   │
│   (Chat UI)     │          │  (RAG Builder)  │          │  (OpenAI, etc)  │
└────────┬────────┘          └────────┬────────┘          └─────────────────┘
         │                            │
         │        ┌───────────────────┤
         ▼        ▼                   ▼
┌─────────────────────────┐    ┌─────────────────┐
│          vLLM           │    │     Qdrant      │
│  ┌───────────────────┐  │    │   (Vector DB)   │
│  │      Router       │  │    │                 │
│  └─────────┬─────────┘  │    │   collections   │
│            │            │    └─────────────────┘
│  ┌─────────┴─────────┐  │
│  ▼                   ▼  │    ┌─────────────────┐
│  model           model  │◄───│      KEDA       │
│  (embedding)    (chat)  │    │  (Auto-Scale)   │
└─────────────────────────┘    └────────┬────────┘
             ▲                          │
             │                          │
  ┌──────────┴──────────┐               │
  │     Prometheus      │◄──────────────┘
  │      (Metrics)      │
  └─────────────────────┘

Components

vLLM - LLM Serving

High-performance LLM inference with GPU acceleration. Deploy multiple models with automatic load balancing.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace for deployment |
| huggingface | HuggingFace token for gated models |
| nodepool | GPU nodepool name |
| gpu_request | Number of GPUs (0 for CPU mode) |
| route_mem_request | Router memory (recommend 1Gi+) |

Sub-Component: Model

Add models as sub-components. Each model runs as a separate pod:

| Property | Description |
|----------|-------------|
| model_path | HuggingFace model (e.g., Qwen/Qwen2-1.5B-Instruct) |
| model_type | chat for conversation, embedding for vectors |
| max_model_len | Maximum context length |
| gpu_memory_utilization | GPU memory fraction (0.8-0.9 recommended) |

Example Structure:

vLLM
├── gte-qwen2-1-5b-instruct (embedding model)
└── qwen2-1-5b-instruct (chat model)

What Gets Generated:

  • VLLMRuntime CR for each model
  • VLLMRouter CR for load balancing
  • Service endpoints for model access

Qdrant - Vector Database

Vector similarity search for RAG applications. Store and query embeddings.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace |
| cpu_request / mem_request | Resource allocation |

Sub-Component: Collection

Create collections for organizing vector data:

| Property | Description |
|----------|-------------|
| collection_name | Name of the collection |
| vector_size | Dimension size (must match embedding model) |
| distance | Cosine, Euclidean, or Dot |
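Under the hood, a collection sub-component corresponds to a Qdrant create-collection call. As a hedged illustration, a minimal request body for Qdrant's `PUT /collections/{collection_name}` REST endpoint looks like the following (the vector size shown is an example; it must equal your embedding model's output dimension):

```yaml
# PUT http://<qdrant-service>:6333/collections/<collection_name>
vectors:
  size: 1536        # must equal the embedding model's output dimension
  distance: Cosine  # Qdrant's enum values are Cosine, Euclid, Dot
```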

Langflow - RAG Application Builder

Visual tool for building LLM workflows with RAG capabilities.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace |
| superuser / superuser_password | Admin credentials |
| api_key | API key for external access |
| cache_type | async or redis |

Open-WebUI - Chat Interface

ChatGPT-like interface for interacting with your LLMs.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace |
| service_port | Service port (default: 8080) |
| pipelines_enabled | Enable pipelines feature |

KEDA - Auto-Scaling

Scale your LLM models automatically based on demand.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace (default: keda) |

APISIX - API Gateway

Ingress routing with AI-specific plugins for content filtering, rate limiting, and authentication.

Component Properties:

| Property | Description |
|----------|-------------|
| namespace | Kubernetes namespace |
| dashboard | Enable APISIX dashboard |
| external_traffic | Preserve client IP (required for IP whitelisting) |

Sub-Component: AI Gateway

Advanced AI routing with content filtering and rate limiting:

| Property | Description |
|----------|-------------|
| api_key | API key for authentication |
| cors | Enable CORS |
| rate_limiting | Enable request rate limiting |
| whitelist / ip_whitelist | IP whitelisting |
| apisix_timeout | Request timeout configuration |

Relationships & Automation

When you link components in Stacktic, the platform automatically generates all necessary configurations.

Langflow → vLLM

What happens: Langflow is configured to use vLLM as its LLM backend.

Generated: Environment variable OPENAI_API_BASE pointing to vLLM router endpoint.

Result: Langflow can use any model deployed in vLLM through the standard OpenAI API.


Langflow → Qdrant

What happens: Langflow is configured with Qdrant for vector storage.

Generated: QDRANT_URL environment variable with Qdrant service endpoint.

Result: Langflow flows can store and retrieve embeddings from Qdrant.


Langflow → Collection (Qdrant sub-component)

What happens: Langflow is linked to a specific Qdrant collection.

Generated: QDRANT_COLLECTION and QDRANT_VECTOR_SIZE configuration.

Result: Pre-configured collection for RAG workflows.
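Taken together, the three Langflow links above amount to an environment block roughly like the following. This is an illustrative sketch only: the service DNS names, collection name, and vector size are assumptions based on defaults shown elsewhere in this document, not values Stacktic is guaranteed to emit.

```yaml
# Illustrative Langflow environment; exact values depend on your namespaces.
OPENAI_API_BASE: "http://vllm-router-service.vllm.svc.cluster.local:80/v1"
QDRANT_URL: "http://qdrant.qdrant.svc.cluster.local:6333"
QDRANT_COLLECTION: "docs"     # collection_name of the linked collection
QDRANT_VECTOR_SIZE: "1536"    # must match the embedding model's dimension
```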


Open-WebUI → Model (vLLM sub-component)

What happens: Open-WebUI is connected to specific models.

Generated: Model configurations in Open-WebUI, separated by type (chat vs embedding).

Result: Users see linked models in the model selector.


KEDA → Model (vLLM sub-component)

What happens: Auto-scaling is enabled for a specific model.

Link Properties:

| Property | Description |
|----------|-------------|
| scale_metric | Prometheus metric (e.g., vllm:num_requests_waiting) |
| scale_threshold | Threshold to trigger scaling |
| keda_minreplicacount | Minimum replicas |
| keda_maxreplicacount | Maximum replicas |
| polling_interval | How often to check metrics |
| cooldown_period | Wait time before scaling down |

Generated: KEDA ScaledObject that monitors Prometheus metrics and scales the model.

Result: Model pods scale up when queue depth exceeds threshold, scale down when idle.

Example Configuration:

KEDA → Model: qwen2-1-5b-instruct
├── scale_metric: vllm:num_requests_waiting
├── scale_threshold: 5
├── keda_minreplicacount: 1
├── keda_maxreplicacount: 3
└── polling_interval: 20s
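For orientation, a ScaledObject generated from a configuration like this would look roughly as follows. The trigger schema is standard KEDA; the resource names and Prometheus address are assumptions for illustration, not the exact manifest Stacktic produces.

```yaml
# Sketch of a KEDA ScaledObject for one vLLM model (names illustrative).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen2-1-5b-instruct-scaler
  namespace: vllm
spec:
  scaleTargetRef:
    name: qwen2-1-5b-instruct        # Deployment backing the model pod
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 20                # seconds between metric checks
  cooldownPeriod: 300                # wait before scaling back down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: vllm:num_requests_waiting
        threshold: "5"
```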

AI Gateway → Model (vLLM sub-component)

What happens: Per-model API routing with AI-specific plugins.

Link Properties:

| Property | Description |
|----------|-------------|
| subdomain | Route subdomain (e.g., qwen-chat → qwen-chat.yourdomain.com) |
| path | URL path (default: /v1/*) |
| deny_patterns | Blocked regex patterns (one per line) |
| allow_patterns | Required regex patterns (one per line) |
| prompt_decorator | System prompt to inject |
| ai_rate_limit | Token-based rate limiting |

Generated:

  • ApisixRoute with AI plugins
  • ApisixUpstream pointing to model service
  • ApisixTls for SSL
  • ApisixConsumer for API key auth

AI Plugins:

| Plugin | Function |
|--------|----------|
| ai-prompt-guard | Block or require specific content patterns |
| ai-prompt-decorator | Inject system prompts into every request |
| ai-rate-limiting | Limit by tokens (not just requests) |
| key-auth | API key authentication |

Example Configuration:

AI Gateway → Model: qwen2-chat
├── subdomain: qwen-chat
├── path: /v1/*
├── deny_patterns: prohibited_word
├── allow_patterns: allowed_word
├── prompt_decorator:
│     role: "system"
│     content: "You are a helpful assistant"
└── ai_rate_limit:
      limit: 10000
      time_window: 3600
      limit_strategy: total_tokens

Result: Model accessible at https://qwen-chat.yourdomain.com/v1/* with content filtering, prompt injection, and token-based rate limiting.
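As a rough sketch, the generated ApisixRoute could look like the manifest below. The plugin names match the table above; the plugin config keys follow the APISIX plugin documentation but should be verified against your APISIX version, and the backend service name is an assumption for illustration.

```yaml
# Illustrative ApisixRoute with AI plugins (names and keys are assumptions).
apiVersion: apisix.apache.org/v2
kind: ApisixRoute
metadata:
  name: qwen-chat
spec:
  http:
    - name: qwen-chat
      match:
        hosts: ["qwen-chat.yourdomain.com"]
        paths: ["/v1/*"]
      backends:
        - serviceName: qwen2-chat    # model service; name illustrative
          servicePort: 80
      plugins:
        - name: key-auth
          enable: true
        - name: ai-prompt-guard
          enable: true
          config:
            deny_patterns: ["prohibited_word"]
            allow_patterns: ["allowed_word"]
        - name: ai-prompt-decorator
          enable: true
          config:
            prepend:
              - role: system
                content: "You are a helpful assistant"
        - name: ai-rate-limiting
          enable: true
          config:
            limit: 10000
            time_window: 3600
            limit_strategy: total_tokens
```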


AI Gateway → External AI

What happens: Route to external AI providers (OpenAI, Azure, Anthropic) through your gateway.

External AI Component Properties:

| Property | Description |
|----------|-------------|
| ai_provider | openai, azure_openai, or anthropic |
| ai_endpoint | Provider API endpoint |
| ai_provider_api_key | Provider API key |

Generated: Same AI plugins as internal models, but routing to external providers.

Result: Unified API gateway for both internal vLLM models and external AI providers.


Prometheus → vLLM / Qdrant / KEDA

What happens: Metrics collection is enabled.

Generated: ServiceMonitor CRs for Prometheus to scrape metrics.

Result: Metrics available in Grafana dashboards, KEDA can use metrics for scaling.
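A ServiceMonitor for the vLLM router, for example, would follow the standard Prometheus Operator schema sketched below; the label selector and port name are assumptions, since they depend on how the services are labeled in your deployment.

```yaml
# Illustrative ServiceMonitor for vLLM metrics (selector/port assumed).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-router
  namespace: vllm
spec:
  selector:
    matchLabels:
      app: vllm-router   # must match the labels on the target Service
  endpoints:
    - port: metrics      # named port exposing /metrics
      interval: 15s
```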


APISIX → Langflow / Open-WebUI

What happens: Ingress routes are created.

Generated: ApisixRoute with SSL termination.

Result: Services accessible at langflow.yourdomain.com, open-webui.yourdomain.com.


Relationship Summary

| Link | Source | Target | What Gets Configured |
|------|--------|--------|----------------------|
| langflow-vllm | Langflow | vLLM | OpenAI API endpoint |
| langflow-qdrant | Langflow | Qdrant | Vector database URL |
| langflow-collection | Langflow | Collection | Collection name, vector size |
| langflow-db | Langflow | CNPG Database | PostgreSQL connection |
| open-webui-vllm | Open-WebUI | vLLM | OpenAI API endpoint |
| open-webui-model | Open-WebUI | Model | Model selector configuration |
| open-webui-qdrant | Open-WebUI | Qdrant | Vector search |
| keda-model | KEDA | Model | ScaledObject for auto-scaling |
| ai_gateway-model | AI Gateway | Model | AI route with plugins |
| ai_gateway-external_ai | AI Gateway | External AI | External provider route |
| prometheus-vllm | Prometheus | vLLM | ServiceMonitor |
| prometheus-qdrant | Prometheus | Qdrant | ServiceMonitor |
| apisix-langflow | APISIX | Langflow | Ingress route |
| apisix-open-webui | APISIX | Open-WebUI | Ingress route |

Quick Start Guides

Basic RAG Stack

  1. Add vLLM with embedding and chat models
  2. Add Qdrant for vector storage
  3. Add Langflow for RAG orchestration
  4. Link: Langflow → vLLM
  5. Link: Langflow → Qdrant
  6. Add APISIX and link to Langflow

Result: Langflow accessible at langflow.yourdomain.com with LLM and vector search ready.

Auto-Scaling LLM API

  1. Add vLLM with models
  2. Add Prometheus (metrics source)
  3. Add KEDA (auto-scaler)
  4. Link: Prometheus → vLLM
  5. Link: KEDA → Model (for each model to scale)
  6. Add APISIX with AI Gateway sub-component
  7. Link: AI Gateway → Model (for each model)

Result: Models scale 1-3 replicas based on queue depth, accessible via API gateway with authentication.


Testing Your Stack

Test vLLM Models

# Port-forward to router
kubectl port-forward -n vllm svc/vllm-router-service 8000:80

# List models
curl http://localhost:8000/v1/models

# Chat completion
curl -X POST http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Test AI Gateway

curl -X POST 'https://qwen-chat.yourdomain.com/v1/chat/completions' \
  -H 'Content-Type: application/json' \
  -H 'apikey: YOUR_API_KEY' \
  -d '{
    "model": "Qwen/Qwen2-1.5B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Check KEDA Scaling

kubectl get scaledobject -n vllm
kubectl get hpa -n vllm
kubectl get pods -n vllm -w

Troubleshooting

| Issue | Solution |
|-------|----------|
| Model won't start | Check GPU nodepool exists and has capacity |
| Router crashes (OOM) | Increase route_mem_request to 1Gi+ |
| "Model does not exist" | Use full model name from /v1/models endpoint |
| AI Gateway 403 | Verify API key header: apikey: YOUR_KEY |
| Content blocked | Message must contain a word from allow_patterns |
| KEDA not scaling | Verify Prometheus → vLLM link exists |
| max_model_len error | Lower max_model_len in model properties |

Best Practices

  1. Separate Models by Type - Use embedding models for vectors, chat models for conversation
  2. Set Appropriate Timeouts - AI requests can take 30s+, configure apisix_timeout
  3. Use Token Rate Limiting - Prefer ai-rate-limiting over request count for LLM APIs
  4. Monitor GPU Memory - Set gpu_memory_utilization to 0.8-0.9 to avoid OOM
  5. Scale Based on Queue - Use vllm:num_requests_waiting metric for scaling decisions
  6. Protect with Allow Patterns - Use allow_patterns to restrict inputs to expected vocabulary