From mboxrd@z Thu Jan 1 00:00:00 1970
From: Luis Chamberlain
To: Chuck Lever, Daniel Gomez, kdevops@lists.linux.dev
Cc: Devasena Inupakutika, Dongjoo Seo, Joel Fernandes, Luis Chamberlain
Subject: [PATCH v2 4/4] defconfigs: Add composable fragments for Lambda Labs vLLM deployment
Date: Sat, 4 Oct 2025 09:38:14 -0700
Message-ID: <20251004163816.3303237-5-mcgrof@kernel.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251004163816.3303237-1-mcgrof@kernel.org>
References: <20251004163816.3303237-1-mcgrof@kernel.org>
Precedence: bulk
X-Mailing-List: kdevops@lists.linux.dev
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This introduces a fragment-based approach to defconfig composition,
allowing users to combine infrastructure provisioning with workflow
configurations.

Two new config fragments are added to defconfigs/configs/:

- lambdalabs-gpu-1x-a10.config: Terraform configuration for Lambda Labs
  A10 GPU instance provisioning with automatic region inference and SSH
  key generation.

- vllm-production-stack-gpu.config: vLLM production stack configuration
  with GPU-accelerated inference, Kubernetes deployment via minikube,
  monitoring, autoscaling, and benchmarking capabilities.

These fragments are combined into a new defconfig,
lambdalabs-vllm-gpu-1x-a10, which enables end-to-end deployment:
provision a Lambda Labs A10 GPU instance ($0.75/hr) and deploy the vLLM
production stack for LLM inference workloads.
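Conceptually, the combined defconfig is just the two fragments
concatenated below a short documentation header. As an illustrative
sketch only (the actual kdevops defconfig machinery may assemble this
differently), the same file could be produced by hand with:

  # Illustration: compose a defconfig by concatenating fragments.
  cat defconfigs/configs/lambdalabs-gpu-1x-a10.config \
      defconfigs/configs/vllm-production-stack-gpu.config \
      >> defconfigs/lambdalabs-vllm-gpu-1x-a10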
The fragment approach allows users to compose configurations by
combining infrastructure providers (Lambda Labs, AWS, Azure, bare
metal) with different workflows (vLLM, fstests, blktests) without
maintaining separate defconfigs for every combination.

Example usage:

  make defconfig-lambdalabs-vllm-gpu-1x-a10
  make bringup          # Provisions Lambda Labs A10 GPU instance
  make vllm             # Deploys vLLM production stack
  make vllm-benchmark   # Runs performance benchmarks

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain
---
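Notes:

Once the stack is up, the router's OpenAI-compatible API can be
smoke-tested. This is a sketch, assuming the API is reachable on
localhost:8000 (CONFIG_VLLM_API_PORT), that no API key is configured
(CONFIG_VLLM_API_KEY=""), and that the served model name matches
CONFIG_VLLM_MODEL_NAME; the reachable host depends on how the minikube
service is exposed:

  # Hypothetical smoke test; adjust host, port, and model name as needed.
  curl -s http://localhost:8000/v1/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'

To confirm the compute capability requirement (>= 8.0) noted in the
defconfig header, recent NVIDIA drivers support:

  nvidia-smi --query-gpu=name,compute_cap --format=csv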
 .../configs/lambdalabs-gpu-1x-a10.config      |   8 ++
 .../configs/vllm-production-stack-gpu.config  |  61 +++++++++++
 defconfigs/lambdalabs-vllm-gpu-1x-a10         | 103 ++++++++++++++++++
 3 files changed, 172 insertions(+)
 create mode 100644 defconfigs/configs/lambdalabs-gpu-1x-a10.config
 create mode 100644 defconfigs/configs/vllm-production-stack-gpu.config
 create mode 100644 defconfigs/lambdalabs-vllm-gpu-1x-a10

diff --git a/defconfigs/configs/lambdalabs-gpu-1x-a10.config b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
new file mode 100644
index 00000000..c85dae4e
--- /dev/null
+++ b/defconfigs/configs/lambdalabs-gpu-1x-a10.config
@@ -0,0 +1,8 @@
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
diff --git a/defconfigs/configs/vllm-production-stack-gpu.config b/defconfigs/configs/vllm-production-stack-gpu.config
new file mode 100644
index 00000000..75b11a9f
--- /dev/null
+++ b/defconfigs/configs/vllm-production-stack-gpu.config
@@ -0,0 +1,61 @@
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - EXPLICITLY DISABLED CPU INFERENCE
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
diff --git a/defconfigs/lambdalabs-vllm-gpu-1x-a10 b/defconfigs/lambdalabs-vllm-gpu-1x-a10
new file mode 100644
index 00000000..926be1bd
--- /dev/null
+++ b/defconfigs/lambdalabs-vllm-gpu-1x-a10
@@ -0,0 +1,103 @@
+#
+# Lambda Labs vLLM Production Stack - 1x A10 GPU ($0.75/hr)
+#
+# This combines:
+# - defconfigs/configs/lambdalabs-gpu-1x-a10.config (Terraform provisioning)
+# - defconfigs/configs/vllm-production-stack-gpu.config (vLLM deployment)
+#
+# Provisions a Lambda Labs GPU instance with NVIDIA A10 (24GB) and deploys
+# the vLLM production stack for LLM inference workloads.
+#
+# ============================================================================
+# NVIDIA GPU COMPATIBILITY (CUDA):
+# ============================================================================
+#
+# vLLM v0.10.x uses FlashInfer CUDA kernels that require NVIDIA GPUs with
+# compute capability >= 8.0. Older NVIDIA GPUs will fail with:
+#   "RuntimeError: TopPSamplingFromProbs failed with error code
+#   too many resources requested for launch"
+#
+# NVIDIA A10 Compatibility:
+# - Compute Capability: 8.6 ✓ COMPATIBLE
+# - Memory: 24GB GDDR6
+# - Cost: $0.75/hour on Lambda Labs
+# - Perfect for: Production LLM inference, fine-tuning
+#
+# ============================================================================
+# Usage:
+#   make defconfig-lambdalabs-vllm-gpu-1x-a10
+#   make bringup          # Provisions A10 GPU instance
+#   make vllm             # Deploys vLLM production stack
+#   make vllm-benchmark   # Run performance benchmarks
+# ============================================================================
+#
+# Lambda Labs GPU 1x A10 instance configuration
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_LAMBDALABS=y
+CONFIG_TERRAFORM_LAMBDALABS_REGION_SMART_INFER=y
+CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE_GPU_1X_A10=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_OVERWRITE=y
+CONFIG_TERRAFORM_SSH_CONFIG_GENKEY_EMPTY_PASSPHRASE=y
+
+# vLLM Production Stack with GPU support
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# vLLM Production Stack with Kubernetes
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - EXPLICITLY DISABLED CPU INFERENCE
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
-- 
2.51.0