# vLLM Component

High-throughput LLM inference server built on PagedAttention. Serves multiple models via sub-components, with automatic NVIDIA GPU scheduling and an OpenAI-compatible API.

## Architecture

- **vLLM Server** - inference engine with PagedAttention
- **Model** - one sub-component per LLM model
- **GPU Scheduling** - NVIDIA GPU allocation
- **OpenAI API** - compatible `/v1/completions` endpoint
- **HuggingFace** - model download and caching

## Quick Reference

| Attribute | Example | Default | Effect |
|-----------|---------|---------|--------|
| `namespace` (required) | `vllm` | - | Kubernetes namespace |
| `hf_token` | `hf_xxx...` | - | HuggingFace access token |
| `gpu_memory_utilization` | `0.9` | `0.9` | Fraction of GPU memory vLLM may allocate |

## Link Variables

| Variable | Link Type | Purpose |
|----------|-----------|---------|
| `__prometheus` | `prometheus-vllm` | Inference metrics scraping |
| `__ingress` | `apisix-vllm` | Gateway routing for the API |
| `__model` | (sub-component) | LLM model deployments |

## Model Sub-Component

| Attribute | Example | Purpose |
|-----------|---------|---------|
| `model_name` | `meta-llama/Llama-2-7b-chat-hf` | HuggingFace model ID |
| `gpu_count` | `1` | GPUs per model instance |
| `replicas` | `1` | Model deployment replicas |
| `max_model_len` | `4096` | Maximum sequence length (tokens) |
| `quantization` | `awq` | `awq`, `gptq`, or `none` |
| `dtype` | `half` | `auto`, `half`, `float16`, or `bfloat16` |
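These attributes map onto flags of vLLM's OpenAI-compatible server (`python -m vllm.entrypoints.openai.api_server`). A minimal sketch of that mapping, assuming the flag names from vLLM's documented CLI; the helper itself is illustrative, not part of this component:

```python
def vllm_server_args(model_name, gpu_count=1, max_model_len=4096,
                     quantization=None, dtype="auto",
                     gpu_memory_utilization=0.9):
    """Translate the sub-component attributes into vLLM server CLI flags.

    gpu_count maps to --tensor-parallel-size: the model's weights are
    sharded across that many GPUs (relevant for e.g. Llama-2-70b).
    """
    args = [
        "--model", model_name,
        "--tensor-parallel-size", str(gpu_count),
        "--max-model-len", str(max_model_len),
        "--dtype", dtype,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    # "none" (or unset) means full-precision weights; skip the flag entirely
    if quantization and quantization != "none":
        args += ["--quantization", quantization]
    return args

# Example: a single-GPU AWQ-quantized Llama-2-7b
print(vllm_server_args("meta-llama/Llama-2-7b-chat-hf",
                       quantization="awq", dtype="half"))
```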

## Popular Model Configurations

| Model | Size | GPU Memory (fp16) | GPU Memory (AWQ) |
|-------|------|-------------------|------------------|
| Llama-2-7b | 7B | ~14 GB | ~4 GB |
| Mistral-7B | 7B | ~14 GB | ~4 GB |
| Llama-2-13b | 13B | ~26 GB | ~8 GB |
| Llama-2-70b | 70B | ~140 GB | ~40 GB (2+ GPUs) |
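The fp16 figures above come from a simple rule of thumb: weight memory is parameter count times bytes per parameter. A small sketch of that arithmetic (weight-only; it excludes the KV cache, which vLLM sizes out of the remaining memory governed by `gpu_memory_utilization`):

```python
FP16_BYTES = 2.0   # half / float16 / bfloat16: 2 bytes per parameter
AWQ_BYTES = 0.5    # 4-bit weights; real usage adds some overhead,
                   # which is why the table shows e.g. ~40 GB, not 35

def weight_memory_gb(params_billion, bytes_per_param):
    """Weight-only memory estimate: 1e9 params x bytes, in GB."""
    return params_billion * bytes_per_param

# Llama-2-7b at fp16: 7 * 2 = ~14 GB, matching the table
print(weight_memory_gb(7, FP16_BYTES))
```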

## Generated Files

| File | Condition | Contains |
|------|-----------|----------|
| `deployment-{model}.yaml` | per model sub-component | vLLM server Deployment |
| `service-{model}.yaml` | per model sub-component | Kubernetes Service |
| `servicemonitor.yaml` | `__prometheus` linked | Prometheus metrics scraping |
| `secret/hf-token.env` | `hf_token` set | HuggingFace token |
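A hedged sketch of what a generated `deployment-{model}.yaml` might look like. The resource name, image tag, and secret key below are assumptions for illustration; the `nvidia.com/gpu` resource limit (how Kubernetes schedules NVIDIA GPUs) and container port 8000 correspond to the facts stated in this document:

```yaml
# Illustrative sketch only - actual generated manifests may differ.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-2-7b          # hypothetical name derived from the model
  namespace: vllm
spec:
  replicas: 1                     # from the sub-component's `replicas`
  selector:
    matchLabels:
      app: vllm-llama-2-7b
  template:
    metadata:
      labels:
        app: vllm-llama-2-7b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000          # OpenAI-compatible API
          env:
            - name: HUGGING_FACE_HUB_TOKEN # read from the generated secret
              valueFrom:
                secretKeyRef:
                  name: hf-token           # key name is an assumption
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1            # from the sub-component's `gpu_count`
```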

## Ports

| Port | Purpose | Protocol |
|------|---------|----------|
| 8000 | OpenAI-compatible API | HTTP |

## OpenAI-Compatible API

- `POST /v1/completions` - text completion
- `POST /v1/chat/completions` - chat completion
- `GET /v1/models` - list available models
- `GET /health` - health check
- `GET /metrics` - Prometheus metrics
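The completion endpoints accept OpenAI-style JSON bodies, so any OpenAI client (or plain HTTP) works against port 8000. A minimal sketch of building a chat-completion request; the model name and service host are placeholders:

```python
import json

def chat_completion_payload(model, user_message,
                            max_tokens=256, temperature=0.7):
    """Build a request body for POST /v1/chat/completions.

    Field names follow the OpenAI chat-completions schema that vLLM
    implements; `model` must match a model served by this deployment.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = chat_completion_payload("meta-llama/Llama-2-7b-chat-hf", "Hello!")
print(json.dumps(payload, indent=2))

# Sent with any HTTP client, e.g.:
#   curl http://<service>:8000/v1/chat/completions \
#        -H "Content-Type: application/json" -d "$(cat payload.json)"
```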

## Technical Info

- Port: 8000
- GPU: NVIDIA (CUDA required)
- Features: PagedAttention, continuous batching, tensor parallelism