From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
kdevops@lists.linux.dev
Cc: Devasena Inupakutika <devasena.i@samsung.com>,
Dongjoo Seo <dongjoo.seo1@samsung.com>,
Joel Fernandes <Joelagnelf@nvidia.com>,
Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH v2 4/4] defconfigs: Add composable fragments for Lambda Labs vLLM deployment
Date: Sat, 4 Oct 2025 09:38:14 -0700
Message-ID: <20251004163816.3303237-5-mcgrof@kernel.org>
In-Reply-To: <20251004163816.3303237-1-mcgrof@kernel.org>
This introduces a fragment-based approach to defconfig composition, allowing
users to combine infrastructure provisioning with workflow configurations.
Two new config fragments are added to defconfigs/configs/:
- lambdalabs-gpu-1x-a10.config: Terraform configuration for Lambda Labs A10
GPU instance provisioning with automatic region inference and SSH key
generation.
- vllm-production-stack-gpu.config: vLLM production stack configuration with
GPU-accelerated inference, Kubernetes deployment via minikube, monitoring,
autoscaling, and benchmarking capabilities.
These fragments are combined into a new defconfig lambdalabs-vllm-gpu-1x-a10
which enables end-to-end deployment: provision a Lambda Labs A10 GPU instance
($0.75/hr) and deploy the vLLM production stack for LLM inference workloads.
The fragment approach allows users to compose configurations by combining
infrastructure providers (Lambda Labs, AWS, Azure, bare metal) with different
workflows (vLLM, fstests, blktests) without maintaining separate defconfigs
for every combination.
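Conceptually, the composition is just concatenation of fragments into one
defconfig. A minimal sketch of the idea (fragment contents and paths here are
illustrative; the actual kdevops Makefile logic may differ):

```shell
#!/bin/sh
# Sketch of fragment-based defconfig composition: concatenate an
# infrastructure fragment with a workflow fragment to build one defconfig.
# Fragment contents and paths are illustrative, not the real kdevops ones.
set -e

mkdir -p demo/defconfigs/configs

# Infrastructure fragment (provider-specific provisioning options)
printf '%s\n' 'CONFIG_TERRAFORM=y' 'CONFIG_TERRAFORM_LAMBDALABS=y' \
    > demo/defconfigs/configs/infra.config

# Workflow fragment (what runs on the provisioned instance)
printf '%s\n' 'CONFIG_WORKFLOWS=y' 'CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y' \
    > demo/defconfigs/configs/workflow.config

# Compose: any provider fragment can pair with any workflow fragment
cat demo/defconfigs/configs/infra.config \
    demo/defconfigs/configs/workflow.config \
    > demo/defconfigs/composed

cat demo/defconfigs/composed
```

This is why no per-combination defconfig needs to be maintained: each new
provider or workflow only adds one fragment, not N combined files.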
Example usage:
make defconfig-lambdalabs-vllm-gpu-1x-a10
make bringup # Provisions Lambda Labs A10 GPU instance
make vllm # Deploys vLLM production stack
make vllm-benchmark # Runs performance benchmarks
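The A10's compute capability of 8.6 clears the >= 8.0 floor that vLLM
v0.10.x's FlashInfer kernels require, as documented in the defconfig header
below. A small pre-bringup sanity check can be sketched as follows; the
capability value is stubbed here, since on a real host it would come from
`nvidia-smi --query-gpu=compute_cap --format=csv,noheader`:

```shell
#!/bin/sh
# Sanity-check a GPU's CUDA compute capability before bringup.
# vLLM v0.10.x FlashInfer kernels need compute capability >= 8.0;
# the NVIDIA A10 reports 8.6. The value is stubbed below since the
# machine running this sketch may have no GPU at all.

cap_ok() {
    # Compare the major version of an "X.Y" capability against the 8.0 floor.
    major="${1%%.*}"
    [ "$major" -ge 8 ]
}

CAP="8.6"  # stubbed A10 value; query nvidia-smi on a real host
if cap_ok "$CAP"; then
    echo "compute capability $CAP: compatible with vLLM v0.10.x"
else
    echo "compute capability $CAP: too old for FlashInfer kernels" >&2
fi
```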
Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
.../configs/lambdalabs-gpu-1x-a10.config | 8 ++
.../configs/vllm-production-stack-gpu.config | 61 +++++++++++
defconfigs/lambdalabs-vllm-gpu-1x-a10 | 103 ++++++++++++++++++
3 files changed, 172 insertions(+)
create mode 100644 defconfigs/configs/lambdalabs-gpu-1x-a10.config
create mode 100644 defconfigs/configs/vllm-production-stack-gpu.config
create mode 100644 defconfigs/lambdalabs-vllm-gpu-1x-a10
diff --git a/defconfigs/configs/lambdalabs-gpu-1x-a10.config b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
new file mode 100644
index 00000000..c85dae4e
--- /dev/null
+++ b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
@@ -0,0 +1,8 @@
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
diff --git a/defconfigs/configs/vllm-production-stack-gpu.config b/defconfigs/configs/vllm-production-stack-gpu.config
new file mode 100644
index 00000000..75b11a9f
--- /dev/null
+++ b/defconfigs/configs/vllm-production-stack-gpu.config
@@ -0,0 +1,61 @@
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - CPU inference explicitly disabled
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
diff --git a/defconfigs/lambdalabs-vllm-gpu-1x-a10 b/defconfigs/lambdalabs-vllm-gpu-1x-a10
new file mode 100644
index 00000000..926be1bd
--- /dev/null
+++ b/defconfigs/lambdalabs-vllm-gpu-1x-a10
@@ -0,0 +1,103 @@
+#
+# Lambda Labs vLLM Production Stack - 1x A10 GPU ($0.75/hr)
+#
+# This combines:
+# - defconfigs/configs/lambdalabs-gpu-1x-a10.config (Terraform provisioning)
+# - defconfigs/configs/vllm-production-stack-gpu.config (vLLM deployment)
+#
+# Provisions a Lambda Labs GPU instance with NVIDIA A10 (24GB) and deploys
+# the vLLM production stack for LLM inference workloads.
+#
+# ============================================================================
+# NVIDIA GPU COMPATIBILITY (CUDA):
+# ============================================================================
+#
+# vLLM v0.10.x uses FlashInfer CUDA kernels that require NVIDIA GPUs with
+# compute capability >= 8.0. Older NVIDIA GPUs will fail with:
+# "RuntimeError: TopPSamplingFromProbs failed with error code
+# too many resources requested for launch"
+#
+# NVIDIA A10 Compatibility:
+# - Compute Capability: 8.6 ✓ COMPATIBLE
+# - Memory: 24GB GDDR6
+# - Cost: $0.75/hour on Lambda Labs
+# - Perfect for: Production LLM inference, fine-tuning
+#
+# ============================================================================
+# Usage:
+# make defconfig-lambdalabs-vllm-gpu-1x-a10
+# make bringup # Provisions A10 GPU instance
+# make vllm # Deploys vLLM production stack
+# make vllm-benchmark # Runs performance benchmarks
+# ============================================================================
+#
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - CPU inference explicitly disabled
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
--
2.51.0