public inbox for kdevops@lists.linux.dev
From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
	kdevops@lists.linux.dev
Cc: Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH 7/8] terraform: Document tier-based GPU selection for Lambda Labs
Date: Sat,  6 Dec 2025 08:56:21 -0800	[thread overview]
Message-ID: <20251206165624.2640158-8-mcgrof@kernel.org> (raw)
In-Reply-To: <20251206165624.2640158-1-mcgrof@kernel.org>

Add comprehensive documentation for the tier-based GPU selection feature
to the Lambda Labs README. This includes documentation for the capacity
checking and tier selection scripts, the available tier groups for both
single GPU and multi-GPU configurations, and quick start examples.

The documentation covers how tier-based selection works with automatic
fallback from higher to lower GPU tiers when capacity is unavailable.
It also updates the defconfigs table and scripts reference to include
the new tier-based options.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 terraform/lambdalabs/README.md | 103 +++++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)

diff --git a/terraform/lambdalabs/README.md b/terraform/lambdalabs/README.md
index 4ec1ac44..71da2490 100644
--- a/terraform/lambdalabs/README.md
+++ b/terraform/lambdalabs/README.md
@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
 - [Prerequisites](#prerequisites)
 - [Quick Start](#quick-start)
 - [Dynamic Configuration](#dynamic-configuration)
+- [Tier-Based GPU Selection](#tier-based-gpu-selection)
 - [SSH Key Security](#ssh-key-security)
 - [Configuration Options](#configuration-options)
 - [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list
 
 For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).
 
+## Tier-Based GPU Selection
+
+Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying
+a single instance type, you can specify a maximum tier and kdevops will automatically select
+the highest available GPU within that tier.
+
+### How It Works
+
+1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
+2. **Capacity Check**: The system queries the Lambda Labs API for available instances
+3. **Tier Fallback**: Tries each tier from highest to lowest until one is available
+4. **Auto-Provision**: Deploys to the first region with available capacity
+
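+The fallback flow above can be sketched as a small shell loop. This is an
+illustration only, not the actual implementation (which lives in
+`scripts/lambdalabs_select_tier.py`); the instance-type list beyond
+`gpu_1x_h100_sxm5` and the grep against the JSON output are assumptions:
+
+```bash
+# Hypothetical sketch: walk the h100-or-less tiers from highest to lowest
+for itype in gpu_1x_h100_sxm5 gpu_1x_h100_pcie gpu_1x_a100_sxm4; do
+    if python3 scripts/lambdalabs_check_capacity.py --instance-type "$itype" --json \
+        | grep -q region; then
+        echo "Selected: $itype"
+        break
+    fi
+done
+```
+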
+### Single GPU Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
+| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
+| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
+| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |
+
+### Multi-GPU (8x) Tier Groups
+
+| Tier Group | Fallback Order | Use Case |
+|------------|----------------|----------|
+| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
+| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
+| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |
+
+### Quick Start with Tier Selection
+
+```bash
+# Single GPU - best available up to H100
+make defconfig-lambdalabs-h100-or-less
+make bringup
+
+# Single GPU - best available up to GH200
+make defconfig-lambdalabs-gh200-or-less
+make bringup
+
+# 8x GPU - best available up to H100
+make defconfig-lambdalabs-8x-h100-or-less
+make bringup
+```
+
+### Checking Capacity
+
+Before deploying, you can check current GPU availability:
+
+```bash
+# Check all available GPU instances
+python3 scripts/lambdalabs_check_capacity.py
+
+# Check specific instance type
+python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5
+
+# JSON output for scripting
+python3 scripts/lambdalabs_check_capacity.py --json
+```
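+
+The JSON output can be piped into standard tooling for filtering or
+monitoring. A minimal sketch (assumes `jq` is installed; the JSON schema
+itself is whatever the script emits):
+
+```bash
+# Pretty-print the capacity report for inspection
+python3 scripts/lambdalabs_check_capacity.py --json | jq .
+```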
+
+### Tier Selection Script
+
+The tier selection script finds the best available GPU:
+
+```bash
+# Find best single GPU up to H100
+python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose
+
+# Find best 8x GPU up to H100
+python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose
+
+# List all available tier groups
+python3 scripts/lambdalabs_select_tier.py --list-tiers
+```
+
+Example output:
+```
+Checking tier group: h100-or-less
+Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10
+
+Checking tier 'h100-sxm': gpu_1x_h100_sxm5
+  Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1
+
+Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
+gpu_1x_h100_sxm5 us-west-1
+```
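+
+The final `<instance-type> <region>` line is intended for machine
+consumption; assuming that two-field format, a shell sketch for capturing
+the result:
+
+```bash
+# Capture the selected instance type and region from the last output line
+read -r instance region < <(python3 scripts/lambdalabs_select_tier.py h100-or-less | tail -n1)
+echo "Provisioning $instance in $region"
+```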
+
+### Benefits of Tier-Based Selection
+
+- **Higher Success Rate**: Automatically falls back to available GPUs
+- **No Manual Intervention**: System handles capacity changes
+- **Best Performance**: Always gets the highest tier available
+- **Simple Configuration**: One defconfig covers multiple GPU types
+
 ## SSH Key Security
 
 ### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
 |--------|-------------|----------|
 | `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
 | `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
+| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
+| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
+| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
+| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
+| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |
 
 ### Manual Configuration
 
@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
 |--------|---------|
 | `lambdalabs_api.py` | Main API integration, generates Kconfig |
 | `lambdalabs_smart_inference.py` | Smart instance/region selection |
+| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
+| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
 | `lambdalabs_ssh_keys.py` | SSH key management |
 | `lambdalabs_list_instances.py` | List running instances |
 | `lambdalabs_credentials.py` | Manage API credentials |
-- 
2.51.0



Thread overview: 18+ messages
2025-12-06 16:56 [PATCH 0/8] neoclouds: add new datacrunch / verda support Luis Chamberlain
2025-12-06 16:56 ` [PATCH 1/8] terraform: Use directory checksum in SSH key filenames Luis Chamberlain
2025-12-06 22:28   ` Chuck Lever
2025-12-12 19:14     ` Chuck Lever
2025-12-15 15:41       ` Chuck Lever
2025-12-06 16:56 ` [PATCH 2/8] devconfig: Add tmux.conf copying to target systems Luis Chamberlain
2025-12-06 16:56 ` [PATCH 3/8] terraform: Enable fact gathering for localhost Luis Chamberlain
2025-12-07 16:23   ` Chuck Lever
2025-12-06 16:56 ` [PATCH 4/8] terraform: Add DataCrunch GPU cloud provider integration Luis Chamberlain
2025-12-16 16:12   ` Chuck Lever
2025-12-06 16:56 ` [PATCH 5/8] kconfig: Add support for merging defconfig fragments Luis Chamberlain
2025-12-07 16:25   ` Chuck Lever
2025-12-07 20:37   ` Daniel Gomez
2025-12-06 16:56 ` [PATCH 6/8] terraform: Add tier-based GPU selection for Lambda Labs Luis Chamberlain
2025-12-16 18:05   ` Chuck Lever
2025-12-06 16:56 ` Luis Chamberlain [this message]
2025-12-16 19:30   ` [PATCH 7/8] terraform: Document " Chuck Lever
2025-12-06 16:56 ` [PATCH 8/8] docs: Organize cloud providers with Neoclouds section Luis Chamberlain
