From: Luis Chamberlain
To: Chuck Lever, Daniel Gomez, kdevops@lists.linux.dev
Cc: Luis Chamberlain
Subject: [PATCH 7/8] terraform: Document tier-based GPU selection for Lambda Labs
Date: Sat, 6 Dec 2025 08:56:21 -0800
Message-ID: <20251206165624.2640158-8-mcgrof@kernel.org>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251206165624.2640158-1-mcgrof@kernel.org>
References: <20251206165624.2640158-1-mcgrof@kernel.org>
X-Mailing-List: kdevops@lists.linux.dev

Add comprehensive documentation for the tier-based GPU selection
feature to the Lambda Labs README. It covers the capacity-checking
and tier-selection scripts, the tier groups available for both
single-GPU and multi-GPU configurations, and quick start examples.

The documentation explains how tier-based selection works, with
automatic fallback from higher to lower GPU tiers when capacity is
unavailable. It also updates the defconfigs table and the scripts
reference to include the new tier-based options.
Generated-by: Claude AI
Signed-off-by: Luis Chamberlain
---
 terraform/lambdalabs/README.md | 103 +++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/terraform/lambdalabs/README.md b/terraform/lambdalabs/README.md
index 4ec1ac44..71da2490 100644
--- a/terraform/lambdalabs/README.md
+++ b/terraform/lambdalabs/README.md
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 For more details on the dynamic configuration system, see
 [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: It tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: It deploys to the first region with available capacity
+
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check a specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
+
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find the best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find the best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+  Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
+
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: The system handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |
-- 
2.51.0