# vLLM Component

High-throughput LLM inference server built on PagedAttention. Serves multiple models via sub-components, with automatic NVIDIA GPU scheduling and an OpenAI-compatible API.

## Architecture

- **vLLM Server** - inference engine with PagedAttention
- **Model** - one sub-component per LLM model
- **GPU Scheduling** - NVIDIA GPU allocation
- **OpenAI API** - compatible `/v1/completions` endpoint
- **HuggingFace** - model download and caching

## Quick Reference

| Attribute | Example | Default | Effect |
|-----------|---------|---------|--------|
| `namespace` (required) | `vllm` | - | Kubernetes namespace |
| `hf_token` | `hf_xxx...` | - | HuggingFace access token |
| `gpu_memory_utilization` | `0.9` | `0.9` | Fraction of GPU memory vLLM may allocate |

## Link Variables

| Variable | Link Type | Purpose |
|----------|-----------|---------|
| `__prometheus` | `prometheus-vllm` | Inference metrics scraping |
| `__ingress` | `apisix-vllm` | Gateway routing for the API |
| `__model` | (sub-component) | LLM model deployments |

## Model Sub-Component

| Attribute | Example | Purpose |
|-----------|---------|---------|
| `model_name` | `meta-llama/Llama-2-7b-chat-hf` | HuggingFace model ID |
| `gpu_count` | `1` | GPUs per model instance |
| `replicas` | `1` | Model deployment replicas |
| `max_model_len` | `4096` | Maximum sequence length (tokens) |
| `quantization` | `awq` | `awq`, `gptq`, or `none` |
| `dtype` | `half` | `auto`, `half`, `float16`, or `bfloat16` |
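These attributes map onto flags of vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`). A minimal sketch of that mapping, assuming the flag names from vLLM's documented CLI; the helper itself is illustrative, not part of this component:

```python
def vllm_server_args(model_name, gpu_count=1, max_model_len=4096,
                     quantization=None, dtype="auto",
                     gpu_memory_utilization=0.9):
    """Translate the sub-component attributes into vLLM server CLI flags.

    gpu_count maps to --tensor-parallel-size: the model's weights are
    sharded across that many GPUs (relevant for e.g. Llama-2-70b).
    """
    args = [
        "--model", model_name,
        "--tensor-parallel-size", str(gpu_count),
        "--max-model-len", str(max_model_len),
        "--dtype", dtype,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # "none" (or unset) means full-precision weights; skip the flag entirely
    if quantization and quantization != "none":
        args += ["--quantization", quantization]
    return args

# Example: a single-GPU AWQ-quantized Llama-2-7b
print(vllm_server_args("meta-llama/Llama-2-7b-chat-hf",
                       quantization="awq", dtype="half"))
```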

## Popular Model Configurations

| Model | Size | GPU Memory (fp16) | GPU Memory (AWQ) |
|-------|------|-------------------|------------------|
| Llama-2-7b | 7B | ~14 GB | ~4 GB |
| Mistral-7B | 7B | ~14 GB | ~4 GB |
| Llama-2-13b | 13B | ~26 GB | ~8 GB |
| Llama-2-70b | 70B | ~140 GB | ~40 GB (2+ GPUs) |
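The fp16 figures above come from a simple rule of thumb: weight memory is parameter count times bytes per parameter. A small sketch of that arithmetic (weight-only; it excludes the KV cache, which vLLM sizes out of the remaining memory governed by `gpu_memory_utilization`):

```python
FP16_BYTES = 2.0   # half / float16 / bfloat16: 2 bytes per parameter
AWQ_BYTES = 0.5    # 4-bit weights; real usage adds some overhead,
                   # which is why the table shows e.g. ~40 GB, not 35

def weight_memory_gb(params_billion, bytes_per_param):
    """Weight-only memory estimate: 1e9 params x bytes, in GB."""
    return params_billion * bytes_per_param

# Llama-2-7b at fp16: 7 * 2 = ~14 GB, matching the table
print(weight_memory_gb(7, FP16_BYTES))
```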

## Generated Files

| File | Condition | Contains |
|------|-----------|----------|
| `deployment-{model}.yaml` | per model sub-component | vLLM server Deployment |
| `service-{model}.yaml` | per model sub-component | Kubernetes Service |
| `servicemonitor.yaml` | `__prometheus` linked | Prometheus metrics scraping |
| `secret/hf-token.env` | `hf_token` set | HuggingFace token |
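A hedged sketch of what a generated `deployment-{model}.yaml` might look like. The resource name, image tag, and secret key below are assumptions for illustration; the `nvidia.com/gpu` resource limit (how Kubernetes schedules NVIDIA GPUs) and container port 8000 correspond to the facts stated in this document:

```yaml
# Illustrative sketch only - actual generated manifests may differ.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-2-7b          # hypothetical name derived from the model
  namespace: vllm
spec:
  replicas: 1                     # from the sub-component's `replicas`
  selector:
    matchLabels:
      app: vllm-llama-2-7b
  template:
    metadata:
      labels:
        app: vllm-llama-2-7b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000          # OpenAI-compatible API
          env:
            - name: HUGGING_FACE_HUB_TOKEN # read from the generated secret
              valueFrom:
                secretKeyRef:
                  name: hf-token           # key name is an assumption
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1            # from the sub-component's `gpu_count`
```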

## Ports

| Port | Purpose | Protocol |
|------|---------|----------|
| 8000 | OpenAI-compatible API | HTTP |

## OpenAI-Compatible API

- `POST /v1/completions` - text completion
- `POST /v1/chat/completions` - chat completion
- `GET /v1/models` - list available models
- `GET /health` - health check
- `GET /metrics` - Prometheus metrics
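The completion endpoints accept OpenAI-style JSON bodies, so any OpenAI client (or plain HTTP) works against port 8000. A minimal sketch of building a chat-completion request; the model name and service host are placeholders:

```python
import json

def chat_completion_payload(model, user_message,
                            max_tokens=256, temperature=0.7):
    """Build a request body for POST /v1/chat/completions.

    Field names follow the OpenAI chat-completions schema that vLLM
    implements; `model` must match a model served by this deployment.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = chat_completion_payload("meta-llama/Llama-2-7b-chat-hf", "Hello!")
print(json.dumps(payload, indent=2))

# Sent with any HTTP client, e.g.:
#   curl http://<service>:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$(cat payload.json)"
```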

## Technical Info

- Port: 8000
- GPU: NVIDIA (CUDA required)
- Features: PagedAttention, continuous batching, tensor parallelism