vLLM Model Sub-Component
What This Creates
VLLMRuntime CR - Kubernetes custom resource
Deployment - Pod running the model
Service - Internal endpoint for router
Model Types
task: generate (default)
Chat/Instruct models for conversation
Examples: Qwen/Qwen2-1.5B-Instruct, mistralai/Mistral-7B-Instruct-v0.3
task: embed
Embedding models for vector search
Examples: BAAI/bge-small-en-v1.5, Alibaba-NLP/gte-Qwen2-1.5B-instruct
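The two task types can be sketched side by side as below. The YAML layout is illustrative only (the exact schema depends on the platform's config format); the attribute names are the ones documented in this section.

```yaml
# Hypothetical layout -- attribute names from this doc, structure illustrative.
models:
  - model_path: Qwen/Qwen2-1.5B-Instruct
    task: generate           # chat/instruct model (the default)
  - model_path: BAAI/bge-small-en-v1.5
    task: embed              # embedding model for vector search
```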
Required Attributes
model_path - Hugging Face model path
Example: Qwen/Qwen2-1.5B-Instruct
Optional Attributes
task - generate (chat) or embed (embedding)
max_model_len - Maximum context length in tokens
dtype - auto, float16, bfloat16
gpu_request - GPU count (0 = CPU)
cpu_request - CPU core request
mem_request - Memory request
pvc_size - Storage for model cache
enable_lora - Enable LoRA adapters
extra_args - Additional vLLM CLI args
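A model combining the required and optional attributes might look like the sketch below. The layout and all concrete values (resource sizes, context length) are illustrative assumptions; only the attribute names come from the lists above.

```yaml
# Hypothetical layout; attribute names are the ones documented above.
models:
  - model_path: mistralai/Mistral-7B-Instruct-v0.3
    task: generate
    max_model_len: 8192       # must not exceed the model's max_position_embeddings
    dtype: bfloat16           # auto | float16 | bfloat16
    gpu_request: 1            # 0 = CPU mode
    cpu_request: 4            # resource values here are illustrative
    mem_request: 16Gi
    pvc_size: 50Gi            # storage for the model weight cache
    enable_lora: false
    extra_args: ["--enforce-eager"]   # example vLLM CLI flag
```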
Attribute Inheritance
Priority: model attribute > parent attribute > default
Example:
Parent: gpu_request: 1
Model: gpu_request: 0
Result: Model runs on CPU (0 overrides 1)
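In config form, the override example above might read as follows (structure hypothetical; the inheritance rule is the one stated above):

```yaml
# Parent-level attributes apply to every model unless overridden per model.
gpu_request: 1                # parent default: one GPU per model
models:
  - model_path: Qwen/Qwen2-1.5B-Instruct
                              # no gpu_request here -> inherits 1 from the parent
  - model_path: BAAI/bge-small-en-v1.5
    gpu_request: 0            # explicit 0 overrides the parent's 1: runs on CPU
```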
max_model_len Configuration
Must not exceed model's max_position_embeddings
Qwen/Qwen2-1.5B-Instruct: 32768
BAAI/bge-small-en-v1.5: 512
Alibaba-NLP/gte-Qwen2-1.5B-instruct: 32768
If omitted, vLLM uses the model's default
GPU vs CPU Mode
gpu_request: 0 - CPU mode (no GPU nodepool)
gpu_request: 1+ - GPU mode (uses parent nodepool)
CPU mode is useful for small embedding models
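A CPU-only embedding model pulls these settings together; the layout and resource values are illustrative, while max_model_len: 512 matches the max_position_embeddings listed for this model above.

```yaml
models:
  - model_path: BAAI/bge-small-en-v1.5
    task: embed
    max_model_len: 512        # model's max_position_embeddings
    gpu_request: 0            # CPU mode: no GPU nodepool required
    cpu_request: 2            # illustrative resource values
    mem_request: 4Gi
```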