vLLM Model Sub-Component
What This Creates
VLLMRuntime CR - Kubernetes custom resource
Deployment - Pod running the model
Service - Internal endpoint for router
Model Types
task: generate (default)
Chat/Instruct models for conversation
Examples: Qwen/Qwen2-1.5B-Instruct, mistralai/Mistral-7B-Instruct-v0.3
task: embed
Embedding models for vector search
Examples: BAAI/bge-small-en-v1.5, Alibaba-NLP/gte-Qwen2-1.5B-instruct
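The two task types can be sketched side by side as below. The YAML layout is illustrative only (the exact schema depends on the platform's config format); the attribute names are the ones documented in this section.

```yaml
# Hypothetical layout -- attribute names from this doc, structure illustrative.
models:
  - model_path: Qwen/Qwen2-1.5B-Instruct
    task: generate           # chat/instruct model (the default)
  - model_path: BAAI/bge-small-en-v1.5
    task: embed              # embedding model for vector search
```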
Required Attributes
model_path - Hugging Face model path
Example: Qwen/Qwen2-1.5B-Instruct
Optional Attributes
task - generate (chat) or embed (embedding)
max_model_len - Maximum context length in tokens
dtype - auto, float16, bfloat16
gpu_request - GPU count (0 = CPU)
cpu_request - CPU core request
mem_request - Memory request
pvc_size - Storage for model cache
enable_lora - Enable LoRA adapters
extra_args - Additional vLLM CLI args
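A model combining the required and optional attributes might look like the sketch below. The layout and all concrete values (resource sizes, context length) are illustrative assumptions; only the attribute names come from the lists above.

```yaml
# Hypothetical layout; attribute names are the ones documented above.
models:
  - model_path: mistralai/Mistral-7B-Instruct-v0.3
    task: generate
    max_model_len: 8192       # must not exceed the model's max_position_embeddings
    dtype: bfloat16           # auto | float16 | bfloat16
    gpu_request: 1            # 0 = CPU mode
    cpu_request: 4            # resource values here are illustrative
    mem_request: 16Gi
    pvc_size: 50Gi            # storage for the model weight cache
    enable_lora: false
    extra_args: ["--enforce-eager"]   # example vLLM CLI flag
```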
Attribute Inheritance
Priority: model attribute > parent attribute > default
Example:
Parent: gpu_request: 1
Model: gpu_request: 0
Result: Model runs on CPU (0 overrides 1)
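In config form, the override example above might read as follows (structure hypothetical; the inheritance rule is the one stated above):

```yaml
# Parent-level attributes apply to every model unless overridden per model.
gpu_request: 1                # parent default: one GPU per model
models:
  - model_path: Qwen/Qwen2-1.5B-Instruct
                              # no gpu_request here -> inherits 1 from the parent
  - model_path: BAAI/bge-small-en-v1.5
    gpu_request: 0            # explicit 0 overrides the parent's 1: runs on CPU
```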
max_model_len Configuration
Must not exceed model's max_position_embeddings
Qwen/Qwen2-1.5B-Instruct: 32768
BAAI/bge-small-en-v1.5: 512
Alibaba-NLP/gte-Qwen2-1.5B-instruct: 32768
If omitted, vLLM uses the model's default
GPU vs CPU Mode
gpu_request: 0 - CPU mode (no GPU nodepool)
gpu_request: 1+ - GPU mode (uses parent nodepool)
CPU mode is useful for small embedding models
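A CPU-only embedding model pulls these settings together; the layout and resource values are illustrative, while max_model_len: 512 matches the max_position_embeddings listed for this model above.

```yaml
models:
  - model_path: BAAI/bge-small-en-v1.5
    task: embed
    max_model_len: 512        # model's max_position_embeddings
    gpu_request: 0            # CPU mode: no GPU nodepool required
    cpu_request: 2            # illustrative resource values
    mem_request: 4Gi
```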