From: Luis Chamberlain
To: Chuck Lever, Daniel Gomez, kdevops@lists.linux.dev
Cc: Luis Chamberlain
Subject: [PATCH 7/8] terraform: Document tier-based GPU selection for Lambda Labs
Date: Sat, 6 Dec 2025 08:56:21 -0800
Message-ID: <20251206165624.2640158-8-mcgrof@kernel.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251206165624.2640158-1-mcgrof@kernel.org>
References: <20251206165624.2640158-1-mcgrof@kernel.org>
X-Mailing-List: kdevops@lists.linux.dev

Add comprehensive documentation for the tier-based GPU selection
feature to the Lambda Labs README. It covers the capacity-checking
and tier-selection scripts, the tier groups available for both
single-GPU and multi-GPU configurations, and quick start examples.

The documentation explains how tier-based selection works, with
automatic fallback from higher to lower GPU tiers when capacity is
unavailable. It also updates the defconfigs table and the scripts
reference to include the new tier-based options.
Generated-by: Claude AI
Signed-off-by: Luis Chamberlain
---
 terraform/lambdalabs/README.md | 103 +++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/terraform/lambdalabs/README.md b/terraform/lambdalabs/README.md
index 4ec1ac44..71da2490 100644
--- a/terraform/lambdalabs/README.md
+++ b/terraform/lambdalabs/README.md
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 For more details on the dynamic configuration system, see
 [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: It tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: It deploys to the first region with available capacity
+
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check a specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
+
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find the best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find the best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+  Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
+
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: The system handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |
-- 
2.51.0