# vLLM Component

High-throughput LLM inference server built on PagedAttention. Supports multiple models via sub-components, with automatic GPU scheduling and an OpenAI-compatible API.
## Architecture

- **vLLM Server**: inference engine with PagedAttention
- **Model**: one sub-component per LLM model
- **GPU Scheduling**: NVIDIA GPU allocation
- **OpenAI API**: compatible `/v1/completions` endpoint
- **HuggingFace**: model download and caching
## Quick Reference

| Attribute | Example | Default | Effect |
|---|---|---|---|
| `namespace` (REQ) | `vllm` | - | Kubernetes namespace |
| `hf_token` | `hf_xxx...` | - | HuggingFace access token |
| `gpu_memory_utilization` | `0.9` | `0.9` | GPU memory allocation ratio |
## Link Variables

| Variable | Link Type | Purpose |
|---|---|---|
| `__prometheus` | `prometheus-vllm` | Inference metrics scraping |
| `__ingress` | `apisix-vllm` | Gateway routing for API |
| `__model` | (sub-component) | LLM model deployments |
## Model Sub-Component

| Attribute | Example | Purpose |
|---|---|---|
| `model_name` | `meta-llama/Llama-2-7b-chat-hf` | HuggingFace model ID |
| `gpu_count` | `1` | GPUs per model instance |
| `replicas` | `1` | Model deployment replicas |
| `max_model_len` | `4096` | Maximum sequence length |
| `quantization` | `awq` | `awq`, `gptq`, or none |
| `dtype` | `half` | `auto`, `half`, `float16`, `bfloat16` |
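These attributes correspond closely to vLLM server launch flags. The sketch below shows one plausible translation; the mapping of `gpu_count` to `--tensor-parallel-size` is an assumption about how this component drives multi-GPU serving, and the `vllm_args` helper is illustrative, not part of the component itself.

```python
def vllm_args(model: dict) -> list[str]:
    """Translate a model sub-component's attributes into vLLM server flags.

    Illustrative sketch: flag names match vLLM's CLI, but the
    attribute-to-flag mapping here is an assumption, not this
    component's actual generator code.
    """
    args = [
        "--model", model["model_name"],
        "--max-model-len", str(model.get("max_model_len", 4096)),
        "--dtype", model.get("dtype", "auto"),
        # Assumption: gpu_count is realized as tensor parallelism.
        "--tensor-parallel-size", str(model.get("gpu_count", 1)),
    ]
    if model.get("quantization"):
        args += ["--quantization", model["quantization"]]
    return args

# Example: the Llama-2-7b row from the table above.
print(vllm_args({
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "max_model_len": 4096,
    "quantization": "awq",
    "dtype": "half",
    "gpu_count": 1,
}))
```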
## Popular Model Configurations

| Model | Size | GPU Memory (fp16) | GPU Memory (AWQ) |
|---|---|---|---|
| Llama-2-7b | 7B | ~14GB | ~4GB |
| Mistral-7B | 7B | ~14GB | ~4GB |
| Llama-2-13b | 13B | ~26GB | ~8GB |
| Llama-2-70b | 70B | ~140GB | ~40GB (2+ GPUs) |
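The figures above follow a simple rule of thumb: weight memory is roughly parameters times bytes per parameter, with 2 bytes/param for fp16 and about 0.55 bytes/param for 4-bit AWQ/GPTQ once quantization scales and zero-points are counted. These are ballpark estimates, not vLLM's actual allocator math, and they exclude KV-cache memory (which is governed by `gpu_memory_utilization`).

```python
def est_weight_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
    """Rough weight-memory estimate in GB.

    params_b: model size in billions of parameters.
    bytes_per_param: 2.0 for fp16/bf16; ~0.55 for 4-bit AWQ/GPTQ
    including quantization overhead (ballpark assumption).
    """
    return params_b * bytes_per_param

print(est_weight_gb(7))         # fp16: 14.0 GB, matching the table
print(est_weight_gb(70, 0.55))  # 4-bit: roughly 38.5 GB, in line with "~40GB"
```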
## Generated Files

| File | Condition | Contains |
|---|---|---|
| `deployment-{model}.yaml` | Per model sub-component | vLLM server deployment |
| `service-{model}.yaml` | Per model | K8s Service |
| `servicemonitor.yaml` | `__prometheus` linked | Prometheus metrics |
| `secret/hf-token.env` | `hf_token` set | HuggingFace token |
## Ports

| Port | Purpose | Protocol |
|---|---|---|
| 8000 | OpenAI-compatible API | HTTP |
## OpenAI-Compatible API

- `POST /v1/completions` - text completion
- `POST /v1/chat/completions` - chat completion
- `GET /v1/models` - list available models
- `GET /health` - health check
- `GET /metrics` - Prometheus metrics
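A minimal stdlib-only sketch of calling the chat endpoint, assuming the service is reachable at `localhost:8000` (e.g. via a port-forward; in-cluster you would use the Service DNS name instead). The base URL and model name are illustrative assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: port-forwarded vLLM service

def chat_request(model: str, prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completion request for the vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("meta-llama/Llama-2-7b-chat-hf", "Hello!")
print(req.full_url)
# To actually send it (requires a running server, not executed here):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official `openai` client also works against this endpoint by pointing its `base_url` at the server.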
## Technical Info

- **Port:** 8000
- **GPU:** NVIDIA (CUDA required)
- **Features:** PagedAttention, continuous batching, tensor parallelism