public inbox for kdevops@lists.linux.dev
From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
	kdevops@lists.linux.dev
Cc: Devasena Inupakutika <devasena.i@samsung.com>,
	Dongjoo Seo <dongjoo.seo1@samsung.com>,
	Joel Fernandes <Joelagnelf@nvidia.com>,
	Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH v2 4/4] defconfigs: Add composable fragments for Lambda Labs vLLM deployment
Date: Sat,  4 Oct 2025 09:38:14 -0700	[thread overview]
Message-ID: <20251004163816.3303237-5-mcgrof@kernel.org> (raw)
In-Reply-To: <20251004163816.3303237-1-mcgrof@kernel.org>

Introduce a fragment-based approach to defconfig composition, allowing users
to combine infrastructure provisioning with workflow configurations.

Two new config fragments are added to defconfigs/configs/:

- lambdalabs-gpu-1x-a10.config: Terraform configuration for Lambda Labs A10
  GPU instance provisioning with automatic region inference and SSH key
  generation.

- vllm-production-stack-gpu.config: vLLM production stack configuration with
  GPU-accelerated inference, Kubernetes deployment via minikube, monitoring,
  autoscaling, and benchmarking capabilities.

These fragments are combined into a new defconfig lambdalabs-vllm-gpu-1x-a10
which enables end-to-end deployment: provision a Lambda Labs A10 GPU instance
($0.75/hr) and deploy the vLLM production stack for LLM inference workloads.

The fragment approach allows users to compose configurations by combining
infrastructure providers (Lambda Labs, AWS, Azure, bare metal) with different
workflows (vLLM, fstests, blktests) without maintaining separate defconfigs
for every combination.
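At its core, composition is just concatenation of Kconfig fragments. The
sketch below illustrates the idea with stand-in fragment names and contents
(the defconfig added by this patch is a pre-merged copy of the two fragments
above; the real kdevops merge logic may differ):

```shell
#!/bin/sh
# Sketch of fragment composition: a defconfig is the concatenation of an
# infrastructure fragment and a workflow fragment. Paths and contents here
# are illustrative stand-ins, not the actual kdevops merge mechanism.
set -e
mkdir -p demo/configs
printf 'CONFIG_TERRAFORM=y\nCONFIG_TERRAFORM_LAMBDALABS=y\n' \
    > demo/configs/provider.config
printf 'CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y\n' \
    > demo/configs/workflow.config
# Compose: infrastructure fragment first, then the workflow fragment.
cat demo/configs/provider.config demo/configs/workflow.config \
    > demo/composed-defconfig
cat demo/composed-defconfig
```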

Example usage:
  make defconfig-lambdalabs-vllm-gpu-1x-a10
  make bringup        # Provisions Lambda Labs A10 GPU instance
  make vllm           # Deploys vLLM production stack
  make vllm-benchmark # Runs performance benchmarks
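After the defconfig target runs, a quick sanity check is to confirm the
resulting config carries options from both fragments. A minimal sketch
(demo.config is a stand-in file; point the check at the .config actually
generated in your kdevops tree):

```shell
#!/bin/sh
# Sketch: verify a composed config contains options from both fragments.
# demo.config is a stand-in; replace it with the real generated .config.
set -e
cat > demo.config <<'EOF'
CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
EOF
# One option from the infrastructure fragment, one from the workflow fragment.
for opt in CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10 \
           CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM; do
    grep -q "^${opt}=y" demo.config && echo "${opt}: present"
done
```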

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 .../configs/lambdalabs-gpu-1x-a10.config      |   8 ++
 .../configs/vllm-production-stack-gpu.config  |  61 +++++++++++
 defconfigs/lambdalabs-vllm-gpu-1x-a10         | 103 ++++++++++++++++++
 3 files changed, 172 insertions(+)
 create mode 100644 defconfigs/configs/lambdalabs-gpu-1x-a10.config
 create mode 100644 defconfigs/configs/vllm-production-stack-gpu.config
 create mode 100644 defconfigs/lambdalabs-vllm-gpu-1x-a10

diff --git a/defconfigs/configs/lambdalabs-gpu-1x-a10.config b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
new file mode 100644
index 00000000..c85dae4e
--- /dev/null
+++ b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
@@ -0,0 +1,8 @@
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
diff --git a/defconfigs/configs/vllm-production-stack-gpu.config b/defconfigs/configs/vllm-production-stack-gpu.config
new file mode 100644
index 00000000..75b11a9f
--- /dev/null
+++ b/defconfigs/configs/vllm-production-stack-gpu.config
@@ -0,0 +1,61 @@
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - CPU inference explicitly disabled
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
diff --git a/defconfigs/lambdalabs-vllm-gpu-1x-a10 b/defconfigs/lambdalabs-vllm-gpu-1x-a10
new file mode 100644
index 00000000..926be1bd
--- /dev/null
+++ b/defconfigs/lambdalabs-vllm-gpu-1x-a10
@@ -0,0 +1,103 @@
+#
+# Lambda Labs vLLM Production Stack - 1x A10 GPU ($0.75/hr)
+#
+# This combines:
+#   - defconfigs/configs/lambdalabs-gpu-1x-a10.config (Terraform provisioning)
+#   - defconfigs/configs/vllm-production-stack-gpu.config (vLLM deployment)
+#
+# Provisions a Lambda Labs GPU instance with NVIDIA A10 (24GB) and deploys
+# the vLLM production stack for LLM inference workloads.
+#
+# ============================================================================
+# NVIDIA GPU COMPATIBILITY (CUDA):
+# ============================================================================
+#
+# vLLM v0.10.x uses FlashInfer CUDA kernels that require NVIDIA GPUs with
+# compute capability >= 8.0. Older NVIDIA GPUs will fail with:
+#   "RuntimeError: TopPSamplingFromProbs failed with error code
+#    too many resources requested for launch"
+#
+# NVIDIA A10 Compatibility:
+#   - Compute Capability: 8.6 ✓ COMPATIBLE
+#   - Memory: 24GB GDDR6
+#   - Cost: $0.75/hour on Lambda Labs
+#   - Perfect for: Production LLM inference, fine-tuning
+#
+# ============================================================================
+# Usage:
+#   make defconfig-lambdalabs-vllm-gpu-1x-a10
+#   make bringup        # Provisions A10 GPU instance
+#   make vllm           # Deploys vLLM production stack
+#   make vllm-benchmark # Runs performance benchmarks
+# ============================================================================
+#
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - CPU inference explicitly disabled
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
-- 
2.51.0


Thread overview: 13+ messages
2025-10-04 16:38 [PATCH v2 0/4] vLLM and the vLLM production stack Luis Chamberlain
2025-10-04 16:38 ` [PATCH v2 1/4] workflows: Add vLLM workflow for LLM inference and production deployment Luis Chamberlain
2025-10-04 16:38 ` [PATCH v2 2/4] vllm: Add DECLARE_HOSTS support for bare metal and existing infrastructure Luis Chamberlain
2025-10-04 16:38 ` [PATCH v2 3/4] vllm: Add GPU-enabled defconfig with compatibility documentation Luis Chamberlain
2025-10-04 16:38 ` Luis Chamberlain [this message]
2025-10-04 16:39 ` [PATCH v2 0/4] vLLM and the vLLM production stack Luis Chamberlain
2025-10-04 16:55 ` Chuck Lever
2025-10-04 17:03   ` Luis Chamberlain
2025-10-04 17:14     ` Chuck Lever
2025-10-08 17:46       ` Chuck Lever
2025-10-10  0:55         ` Luis Chamberlain
2025-10-10 12:38           ` Chuck Lever
2025-10-10 16:20             ` Chuck Lever
