From: Luis Chamberlain
To: Chuck Lever, Daniel Gomez, kdevops@lists.linux.dev
Cc: Luis Chamberlain
Subject: [PATCH v2 09/10] scripts: add Lambda Labs testing and debugging utilities
Date: Wed, 27 Aug 2025 14:29:00 -0700
Message-ID: <20250827212902.4021990-10-mcgrof@kernel.org>
In-Reply-To: <20250827212902.4021990-1-mcgrof@kernel.org>
References: <20250827212902.4021990-1-mcgrof@kernel.org>

Add utility scripts for testing, debugging, and managing Lambda Labs
cloud resources. These tools help developers validate configurations,
debug issues, and manage instances.

Testing utilities:
- Capacity checking before provisioning
- SSH connectivity testing
- API endpoint validation
- Credential verification

Management utilities:
- Instance listing and status checking
- Smart instance selection based on cost/availability
- Region inference for optimal placement
- Cloud provider comparison tool

Debugging utilities:
- API response exploration
- Debug script for troubleshooting API issues
- Instance update helper

Also documents the Lambda Labs implementation in PROMPTS.md for future
reference and improvement.

Generated-by: Claude AI Signed-off-by: Luis Chamberlain --- PROMPTS.md | 56 ++++++++ scripts/check_lambdalabs_capacity.py | 172 ++++++++++++++++++++++ scripts/cloud_list_all.sh | 151 ++++++++++++++++++++ scripts/debug_lambdalabs_api.sh | 87 ++++++++++++ scripts/explore_lambda_api.py | 48 +++++++ scripts/lambdalabs_infer_cheapest.py | 107 ++++++++++++++ scripts/lambdalabs_infer_region.py | 36 +++++ scripts/lambdalabs_list_instances.py | 167 ++++++++++++++++++++++ scripts/lambdalabs_smart_inference.py | 196 ++++++++++++++++++++++++++ scripts/terraform_list_instances.sh | 79 +++++++++++ scripts/test_lambda_ssh.py | 111 +++++++++++++++ scripts/update_lambdalabs_instance.sh | 29 ++++ 12 files changed, 1239 insertions(+) create mode 100755 scripts/check_lambdalabs_capacity.py create mode 100755 scripts/cloud_list_all.sh create mode 100755 scripts/debug_lambdalabs_api.sh create mode 100644 scripts/explore_lambda_api.py create mode 100755 scripts/lambdalabs_infer_cheapest.py create mode 100755 scripts/lambdalabs_infer_region.py create mode 100755 scripts/lambdalabs_list_instances.py create mode 100755 scripts/lambdalabs_smart_inference.py create mode 100755 scripts/terraform_list_instances.sh create mode 100644 scripts/test_lambda_ssh.py create mode 100755 scripts/update_lambdalabs_instance.sh diff --git a/PROMPTS.md b/PROMPTS.md index 1b60cbe..c87a3b3 100644 --- a/PROMPTS.md +++ b/PROMPTS.md @@ -280,3 +280,59 @@ The implementation successfully added: to do is to use a separate fact if you want a true dynamic variable. This is why we switched to an active ref prefix for the baseline and dev group ref tags. + +## Cloud provider integrations + +### Adding Lambda Labs cloud provider support with dynamic Kconfig + +**Prompt:** +The Lambda Labs company helps you use GPUs online, kind of like AWS, or OCI. Add +support for the terraform support for Lambda Labs. The best provider docs are at +https://registry.terraform.io/providers/elct9620/lambdalabs/latest/docs . 
Then +To create the kconfig values you will implement support to use the lambda cloud +API to let us query for what type of instances they have available and so forth. +Therefore the Kconfig stuff for Lambda labs will all be dynamic. So we'll want +to expand this as part of what make dynconfig does. However note that dynconfig +does *all* dynamically generated kconfig. We want to add support for make +cloud-config as a new target which is dynamic which is a subset of make +dynconfig ; OK! good luck + +**AI:** Claude Code (Opus 4.1) +**Commit:** [To be determined] +**Result:** Complete Lambda Labs integration with dynamic Kconfig generation. +**Grading:** 75% + +**Notes:** + +The implementation successfully added: + +1. **Terraform Provider Integration**: Created complete Terraform configuration + for Lambda Labs including instance management, persistent storage, and SSH + configuration management following existing cloud provider patterns. + +2. **Dynamic Kconfig Generation**: Implemented Python script to query Lambda Labs + API for available instance types, regions, and OS images. Generated dynamic + Kconfig files with fallback defaults when API is unavailable. + +3. **Build System Integration**: Added `make cloud-config` as a new target for + cloud-specific dynamic configuration, properly integrated with `make dynconfig`. + Created modular Makefile structure for cloud provider dynamic configuration. + +4. **Kconfig Structure**: Properly integrated Lambda Labs into the provider + selection system with modular Kconfig files for location, compute, storage, + and identity management. + +Biggest issues: + +1. **SSH Management**: For this it failed to realize the provider + didn't support asking for a custom username, so we had to find out the + hard way. + +2. **Environment variables**: For some reason it wanted to define the + credential API as an environment variable. This proved painful as some + environment variables do not carry over for some ansible tasks. 
The + best solution was to follow a strategy similar to what AWS supports + with ~/.lambdalabs/credentials. This is a more secure alternative. + +Minor issues: +- Some whitespace formatting was automatically fixed by the linter diff --git a/scripts/check_lambdalabs_capacity.py b/scripts/check_lambdalabs_capacity.py new file mode 100755 index 0000000..5b16156 --- /dev/null +++ b/scripts/check_lambdalabs_capacity.py @@ -0,0 +1,172 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: copyleft-next-0.3.1 + +""" +Check Lambda Labs capacity for a given instance type and region. +Provides clear error messages when capacity is not available. +""" + +import json +import os +import sys +import urllib.request +import urllib.error +from typing import Dict, List, Optional + +# Import our credentials module +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) +from lambdalabs_credentials import get_api_key as get_api_key_from_credentials + +LAMBDALABS_API_BASE = "https://cloud.lambdalabs.com/api/v1" + + +def get_api_key() -> Optional[str]: + """Get Lambda Labs API key from credentials file or environment variable.""" + return get_api_key_from_credentials() + + +def check_capacity(instance_type: str, region: str) -> Dict: + """ + Check if capacity is available for the given instance type and region. 
+ + Returns: + Dictionary with: + - available: bool - whether capacity is available + - message: str - human-readable message + - alternatives: list - alternative regions with capacity + """ + api_key = get_api_key() + if not api_key: + return { + "available": False, + "message": "ERROR: Lambda Labs API key not configured.\n" + "Please configure your API key using:\n" + " python3 scripts/lambdalabs_credentials.py set 'your-api-key'", + "alternatives": [], + } + + headers = {"Authorization": f"Bearer {api_key}", "User-Agent": "kdevops/1.0"} + url = f"{LAMBDALABS_API_BASE}/instance-types" + + try: + req = urllib.request.Request(url, headers=headers) + with urllib.request.urlopen(req) as response: + data = json.loads(response.read().decode()) + + if "data" not in data: + return { + "available": False, + "message": "ERROR: Invalid API response format", + "alternatives": [], + } + + # Check if instance type exists + if instance_type not in data["data"]: + available_types = list(data["data"].keys())[:10] + return { + "available": False, + "message": f"ERROR: Instance type '{instance_type}' does not exist.\n" + f"Available instance types include: {', '.join(available_types)}", + "alternatives": [], + } + + gpu_info = data["data"][instance_type] + + # Check if instance type is generally available + # Note: is_available can be None, True, or False + is_available = gpu_info.get("instance_type", {}).get("is_available") + if is_available is False: # Only fail if explicitly False, not None + return { + "available": False, + "message": f"ERROR: Instance type '{instance_type}' is not currently available from Lambda Labs", + "alternatives": [], + } + + # Get regions with capacity + regions_with_capacity = gpu_info.get("regions_with_capacity_available", []) + region_names = [r["name"] for r in regions_with_capacity] + + # Check if requested region has capacity + if region in region_names: + return { + "available": True, + "message": f"✓ Capacity is available for {instance_type} in 
{region}", + "alternatives": region_names, + } + else: + # No capacity in requested region + if regions_with_capacity: + alt_regions = [f"{r['name']}" for r in regions_with_capacity] + return { + "available": False, + "message": f"ERROR: No capacity available for '{instance_type}' in region '{region}'.\n" + f"\nRegions with available capacity:\n" + + "\n".join([f" • {r}" for r in alt_regions]) + + f"\n\nTo fix this issue, either:\n" + f"1. Wait for capacity to become available in {region}\n" + f"2. Change your region in menuconfig to one of the available regions\n" + f"3. Choose a different instance type", + "alternatives": region_names, + } + else: + return { + "available": False, + "message": f"ERROR: No capacity available for '{instance_type}' in ANY region.\n" + f"This instance type is currently sold out across all Lambda Labs regions.\n" + f"Please try:\n" + f" • A different instance type\n" + f" • Checking back later when capacity becomes available", + "alternatives": [], + } + + except urllib.error.HTTPError as e: + if e.code == 403: + return { + "available": False, + "message": "ERROR: Lambda Labs API returned 403 Forbidden.\n" + "This usually means your API key is invalid, expired, or lacks permissions.\n" + "\n" + "To fix this:\n" + "1. Log into https://cloud.lambdalabs.com\n" + "2. Go to API Keys section\n" + "3. Create a new API key with full permissions\n" + "4. 
Update your credentials:\n" + ' python3 scripts/lambdalabs_credentials.py set "your-new-api-key"\n' + "\n" + "Current API key source: ~/.lambdalabs/credentials", + "alternatives": [], + } + else: + return { + "available": False, + "message": f"ERROR: API request failed with HTTP {e.code}: {e.reason}", + "alternatives": [], + } + except Exception as e: + return { + "available": False, + "message": f"ERROR: Failed to check capacity: {str(e)}", + "alternatives": [], + } + + +def main(): + """Main function for command-line usage.""" + if len(sys.argv) != 3: + print("Usage: check_lambdalabs_capacity.py <instance_type> <region>") + print("Example: check_lambdalabs_capacity.py gpu_1x_a10 us-tx-1") + sys.exit(1) + + instance_type = sys.argv[1] + region = sys.argv[2] + + result = check_capacity(instance_type, region) + + print(result["message"]) + + # Exit with appropriate code + sys.exit(0 if result["available"] else 1) + + +if __name__ == "__main__": + main() diff --git a/scripts/cloud_list_all.sh b/scripts/cloud_list_all.sh new file mode 100755 index 0000000..405c3d9 --- /dev/null +++ b/scripts/cloud_list_all.sh @@ -0,0 +1,151 @@ +#!/bin/bash +# List all cloud instances across supported providers +# Currently supports: Lambda Labs + +set -e + +PROVIDER="" + +# Detect which cloud provider is configured +if [ -f .config ]; then + if grep -q "CONFIG_TERRAFORM_LAMBDALABS=y" .config 2>/dev/null; then + PROVIDER="lambdalabs" + elif grep -q "CONFIG_TERRAFORM_AWS=y" .config 2>/dev/null; then + PROVIDER="aws" + elif grep -q "CONFIG_TERRAFORM_GCE=y" .config 2>/dev/null; then + PROVIDER="gce" + elif grep -q "CONFIG_TERRAFORM_AZURE=y" .config 2>/dev/null; then + PROVIDER="azure" + elif grep -q "CONFIG_TERRAFORM_OCI=y" .config 2>/dev/null; then + PROVIDER="oci" + fi +fi + +if [ -z "$PROVIDER" ]; then + echo "No cloud provider configured or .config file not found" + exit 1 +fi + +echo "Cloud Provider: $PROVIDER" +echo + +case "$PROVIDER" in + lambdalabs) + # Get API key from credentials file + 
API_KEY=$(python3 $(dirname "$0")/lambdalabs_credentials.py get 2>/dev/null) + if [ -z "$API_KEY" ]; then + echo "Error: Lambda Labs API key not found" + echo "Please configure it with: python3 scripts/lambdalabs_credentials.py set 'your-api-key'" + exit 1 + fi + + # Try to list instances using curl + echo "Fetching Lambda Labs instances..." + response=$(curl -s -H "Authorization: Bearer $API_KEY" \ + https://cloud.lambdalabs.com/api/v1/instances 2>&1) + + # Check if we got an error + if echo "$response" | grep -q '"error"'; then + echo "Error accessing Lambda Labs API:" + echo "$response" | python3 -c " +import sys, json +try: + data = json.load(sys.stdin) + if 'error' in data: + err = data['error'] + print(f\" {err.get('message', 'Unknown error')}\") + if 'suggestion' in err: + print(f\" Suggestion: {err['suggestion']}\") +except: + print(' Unable to parse error response') +" + exit 1 + fi + + # Parse and display instances + echo "$response" | python3 -c ' +import sys, json +from datetime import datetime + +def format_uptime(created_at): + try: + created = datetime.fromisoformat(created_at.replace("Z", "+00:00")) + now = datetime.now(created.tzinfo) + delta = now - created + + days = delta.days + hours, remainder = divmod(delta.seconds, 3600) + minutes, _ = divmod(remainder, 60) + + if days > 0: + return f"{days}d {hours}h {minutes}m" + elif hours > 0: + return f"{hours}h {minutes}m" + else: + return f"{minutes}m" + except: + return "unknown" + +data = json.load(sys.stdin) +instances = data.get("data", []) + +if not instances: + print("No Lambda Labs instances currently running") +else: + print("Lambda Labs Instances:") + print("=" * 80) + headers = f"{'Name':<20} {'Type':<20} {'IP':<15} {'Region':<15} {'Status':<10}" + print(headers) + print("-" * 80) + + total_cost = 0 + for inst in instances: + name = inst.get("name", "unnamed") + inst_type = inst.get("instance_type", {}).get("name", "unknown") + ip = inst.get("ip", "pending") + region = inst.get("region", 
{}).get("name", "unknown") + status = inst.get("status", "unknown") + + # Highlight kdevops instances + if "cgpu" in name or "kdevops" in name.lower(): + name = f"→ {name}" + + row = f"{name:<20} {inst_type:<20} {ip:<15} {region:<15} {status:<10}" + print(row) + + price_cents = inst.get("instance_type", {}).get("price_cents_per_hour", 0) + total_cost += price_cents / 100 + + print("-" * 80) + print(f"Total instances: {len(instances)}") + if total_cost > 0: + print(f"Total hourly cost: ${total_cost:.2f}/hr") + print(f"Daily cost estimate: ${total_cost * 24:.2f}/day") +' + ;; + + aws) + echo "AWS cloud listing not yet implemented" + echo "You can use: aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,PublicIpAddress,State.Name,Tags[?Key==\`Name\`]|[0].Value]' --output table" + ;; + + gce) + echo "Google Cloud listing not yet implemented" + echo "You can use: gcloud compute instances list" + ;; + + azure) + echo "Azure cloud listing not yet implemented" + echo "You can use: az vm list --output table" + ;; + + oci) + echo "Oracle Cloud listing not yet implemented" + echo "You can use: oci compute instance list --compartment-id " + ;; + + *) + echo "Cloud provider '$PROVIDER' not supported for listing" + exit 1 + ;; +esac diff --git a/scripts/debug_lambdalabs_api.sh b/scripts/debug_lambdalabs_api.sh new file mode 100755 index 0000000..d9b5b15 --- /dev/null +++ b/scripts/debug_lambdalabs_api.sh @@ -0,0 +1,87 @@ +#!/bin/bash + +echo "Lambda Labs API Diagnostic Script" +echo "=================================" +echo + +# Get API key from credentials file +API_KEY=$(python3 $(dirname "$0")/lambdalabs_credentials.py get 2>/dev/null) +if [ -z "$API_KEY" ]; then + echo "❌ Lambda Labs API key not found" + echo " Please configure it with: python3 scripts/lambdalabs_credentials.py set 'your-api-key'" + exit 1 +else + echo "✓ Lambda Labs API key loaded from credentials" + echo " Key starts with: ${API_KEY:0:10}..." 
+ echo " Key length: ${#API_KEY} characters" +fi + +echo +echo "Testing API Access:" +echo "-------------------" + +# Test with curl to get more detailed error information +echo "1. Testing instance types endpoint..." +response=$(curl -s -w "\n%{http_code}" -H "Authorization: Bearer $API_KEY" \ + https://cloud.lambdalabs.com/api/v1/instance-types 2>&1) +http_code=$(echo "$response" | tail -n 1) +body=$(echo "$response" | head -n -1) + +if [ "$http_code" = "200" ]; then + echo " ✓ API access successful" + echo " Instance types available: $(echo "$body" | grep -o '"name"' | wc -l)" +elif [ "$http_code" = "403" ]; then + echo " ❌ Access forbidden (HTTP 403)" + echo " Error: $body" + echo + echo " Possible causes:" + echo " - Invalid or expired API key" + echo " - API key doesn't have necessary permissions" + echo " - IP address or region restrictions" + echo " - Rate limiting" + echo + echo " Please verify:" + echo " 1. Your API key is correct and active" + echo " 2. You're not behind a VPN that might be blocked" + echo " 3. Your Lambda Labs account is in good standing" +elif [ "$http_code" = "401" ]; then + echo " ❌ Unauthorized (HTTP 401)" + echo " Your API key appears to be invalid or malformed" +else + echo " ❌ Unexpected response (HTTP $http_code)" + echo " Response: $body" +fi + +echo +echo "2. Testing SSH keys endpoint..." 
+response=$(curl -s -w "\n%{http_code}" -H "Authorization: Bearer $API_KEY" \ + https://cloud.lambdalabs.com/api/v1/ssh-keys 2>&1) +http_code=$(echo "$response" | tail -n 1) +body=$(echo "$response" | head -n -1) + +if [ "$http_code" = "200" ]; then + echo " ✓ Can access SSH keys" + # Try to find the kdevops key + if echo "$body" | grep -q "kdevops-lambdalabs"; then + echo " ✓ Found 'kdevops-lambdalabs' SSH key" + else + echo " ⚠ 'kdevops-lambdalabs' SSH key not found" + echo " Available keys:" + echo "$body" | grep -o '"name":"[^"]*"' | sed 's/"name":"/ - /g' | sed 's/"//g' + fi +else + echo " ❌ Cannot access SSH keys (HTTP $http_code)" +fi + +echo +echo "Troubleshooting Steps:" +echo "----------------------" +echo "1. Verify your API key at: https://cloud.lambdalabs.com/api-keys" +echo "2. Create a new API key if needed" +echo "3. Ensure you're not using a VPN that might be blocked" +echo "4. Try accessing the API from a different network/location" +echo "5. Contact Lambda Labs support if the issue persists" +echo +echo "For manual testing, try:" +echo "API_KEY=\$(python3 scripts/lambdalabs_credentials.py get)" +echo "curl -H \"Authorization: Bearer \$API_KEY\" https://cloud.lambdalabs.com/api/v1/instance-types" diff --git a/scripts/explore_lambda_api.py b/scripts/explore_lambda_api.py new file mode 100644 index 0000000..8c07547 --- /dev/null +++ b/scripts/explore_lambda_api.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python3 +"""Explore Lambda Labs API to understand SSH key management.""" + +import json +import sys +import os + +# Add scripts directory to path +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) +from lambdalabs_credentials import get_api_key + +# Try to get docs +print("Lambda Labs API SSH Key Management") +print("=" * 50) +print() +print("Based on API exploration, here's what we know:") +print() +print("1. 
SSH Keys Endpoint: /ssh-keys") +print(" - GET /ssh-keys returns a list of key NAMES only") +print(" - The API returns: {'data': ['key-name-1', 'key-name-2', ...]}") +print() +print("2. Deleting Keys:") +print(" - DELETE /ssh-keys/{key_id} expects a key ID, not a name") +print(" - The error 'Invalid SSH key ID' suggests IDs are different from names") +print(" - The IDs might be UUIDs or other internal identifiers") +print() +print("3. Adding Keys:") +print( + " - POST /ssh-keys likely works with {name: 'key-name', public_key: 'ssh-rsa ...'}" +) +print() +print("4. The problem:") +print(" - GET /ssh-keys only returns names") +print(" - DELETE /ssh-keys/{id} requires IDs") +print(" - There's no apparent way to get the ID from the name") +print() +print("Possible solutions:") +print("1. There might be a GET /ssh-keys?detailed=true or similar") +print("2. The key names might BE the IDs (but delete fails)") +print("3. There might be a separate endpoint to get key details") +print("4. The API might be incomplete/broken for key deletion") +print() +print("To properly use kdevops with Lambda Labs, we should use") +print("the key name 'kdevops-lambdalabs' as configured in Kconfig.") +print() +print("Since we can list keys but not delete them via API,") +print("users must manage keys through the web console:") +print("https://cloud.lambdalabs.com/ssh-keys") diff --git a/scripts/lambdalabs_infer_cheapest.py b/scripts/lambdalabs_infer_cheapest.py new file mode 100755 index 0000000..52c62c3 --- /dev/null +++ b/scripts/lambdalabs_infer_cheapest.py @@ -0,0 +1,107 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: copyleft-next-0.3.1 + +""" +Find the cheapest available Lambda Labs instance type. 
+""" + +import json +import sys +import urllib.request +import urllib.error +from typing import Optional, List, Dict, Tuple + +# Import our credentials module +sys.path.insert(0, sys.path[0]) +from lambdalabs_credentials import get_api_key + +LAMBDALABS_API_BASE = "https://cloud.lambdalabs.com/api/v1" + +# Known pricing for Lambda Labs instances (per hour) +INSTANCE_PRICING = { + "gpu_1x_rtx6000": 0.50, + "gpu_1x_a10": 0.75, + "gpu_1x_a6000": 0.80, + "gpu_1x_a100": 1.29, + "gpu_1x_a100_sxm4": 1.29, + "gpu_1x_a100_pcie": 1.29, + "gpu_1x_gh200": 1.49, + "gpu_1x_h100_pcie": 2.49, + "gpu_1x_h100_sxm5": 3.29, + "gpu_2x_a100": 2.58, + "gpu_2x_a100_pcie": 2.58, + "gpu_2x_a6000": 1.60, + "gpu_2x_h100_sxm5": 6.38, + "gpu_4x_a100": 5.16, + "gpu_4x_a100_pcie": 5.16, + "gpu_4x_a6000": 3.20, + "gpu_4x_h100_sxm5": 12.36, + "gpu_8x_v100": 4.40, + "gpu_8x_a100": 10.32, + "gpu_8x_a100_40gb": 10.32, + "gpu_8x_a100_80gb": 14.32, + "gpu_8x_a100_80gb_sxm4": 14.32, + "gpu_8x_h100_sxm5": 23.92, + "gpu_8x_b200_sxm6": 39.92, +} + + +def get_cheapest_available_instance() -> Optional[str]: + """ + Find the cheapest instance type with available capacity. 
+ + Returns: + Instance type name of cheapest available option + """ + api_key = get_api_key() + if not api_key: + # Return a reasonable default if no API key + return "gpu_1x_a10" + + headers = {"Authorization": f"Bearer {api_key}", "User-Agent": "kdevops/1.0"} + url = f"{LAMBDALABS_API_BASE}/instance-types" + + try: + req = urllib.request.Request(url, headers=headers) + with urllib.request.urlopen(req) as response: + data = json.loads(response.read().decode()) + + if "data" not in data: + return "gpu_1x_a10" + + # Find all instance types with available capacity + available_instances = [] + + for instance_type, info in data["data"].items(): + regions_with_capacity = info.get("regions_with_capacity_available", []) + if regions_with_capacity: + # This instance has capacity somewhere + price = INSTANCE_PRICING.get(instance_type, 999.99) + available_instances.append((instance_type, price)) + + if not available_instances: + # No capacity anywhere, return cheapest known instance + return "gpu_1x_a10" + + # Sort by price (lowest first) + available_instances.sort(key=lambda x: x[1]) + + # Return the cheapest available instance type + return available_instances[0][0] + + except Exception as e: + # On any error, return default + return "gpu_1x_a10" + + +def main(): + """Main function for command-line usage.""" + instance = get_cheapest_available_instance() + if instance: + print(instance) + else: + print("gpu_1x_a10") + + +if __name__ == "__main__": + main() diff --git a/scripts/lambdalabs_infer_region.py b/scripts/lambdalabs_infer_region.py new file mode 100755 index 0000000..d8fea17 --- /dev/null +++ b/scripts/lambdalabs_infer_region.py @@ -0,0 +1,36 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: copyleft-next-0.3.1 + +""" +Smart region inference for Lambda Labs. +Uses the smart inference algorithm to find the best region for a given instance type. 
+""" + +import sys +import os + +# Import the smart inference module +sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) +from lambdalabs_smart_inference import get_best_instance_and_region + + +def main(): + """Main function for command-line usage.""" + if len(sys.argv) != 2: + print("us-east-1") # Default + sys.exit(0) + + # The instance type is passed but we'll get the best region from smart inference + # This maintains backward compatibility while using the smart algorithm + instance_type_requested = sys.argv[1] + + # Get the best instance and region combo + best_instance, best_region = get_best_instance_and_region() + + # For now, just return the best region + # In the future, we could check if the requested instance is available in the best region + print(best_region) + + +if __name__ == "__main__": + main() diff --git a/scripts/lambdalabs_list_instances.py b/scripts/lambdalabs_list_instances.py new file mode 100755 index 0000000..c61b701 --- /dev/null +++ b/scripts/lambdalabs_list_instances.py @@ -0,0 +1,167 @@ +#!/usr/bin/env python3 +""" +List all Lambda Labs instances for the current account. +Part of kdevops cloud management utilities. 
+""" + +import os +import sys +import json +import urllib.request +import urllib.error +from datetime import datetime +import lambdalabs_credentials + + +def format_uptime(created_at): + """Convert timestamp to human-readable uptime.""" + try: + created = datetime.fromisoformat(created_at.replace("Z", "+00:00")) + now = datetime.now(created.tzinfo) + delta = now - created + + days = delta.days + hours, remainder = divmod(delta.seconds, 3600) + minutes, _ = divmod(remainder, 60) + + if days > 0: + return f"{days}d {hours}h {minutes}m" + elif hours > 0: + return f"{hours}h {minutes}m" + else: + return f"{minutes}m" + except: + return "unknown" + + +def list_instances(): + """List all Lambda Labs instances.""" + # Get API key from credentials + api_key = lambdalabs_credentials.get_api_key() + if not api_key: + print( + "Error: Lambda Labs API key not found in credentials file", file=sys.stderr + ) + print( + "Please configure it with: python3 scripts/lambdalabs_credentials.py set 'your-api-key'", + file=sys.stderr, + ) + return 1 + + url = "https://cloud.lambdalabs.com/api/v1/instances" + headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"} + + try: + req = urllib.request.Request(url, headers=headers) + with urllib.request.urlopen(req) as response: + data = json.loads(response.read().decode()) + + if "data" not in data: + print("No instances found or unexpected API response") + return 0 + + instances = data["data"] + + if not instances: + print("No Lambda Labs instances currently running") + return 0 + + # Print header + print("\nLambda Labs Instances:") + print("=" * 80) + print( + f"{'Name':<20} {'Type':<20} {'IP':<15} {'Region':<15} {'Uptime':<10} {'Status'}" + ) + print("-" * 80) + + # Print each instance + for instance in instances: + name = instance.get("name", "unnamed") + instance_type = instance.get("instance_type", {}).get("name", "unknown") + ip = instance.get("ip", "pending") + region = instance.get("region", {}).get("name", 
"unknown") + status = instance.get("status", "unknown") + created_at = instance.get("created", "") + uptime = format_uptime(created_at) + + # Highlight kdevops instances + if "cgpu" in name or "kdevops" in name.lower(): + name = f"→ {name}" + + print( + f"{name:<20} {instance_type:<20} {ip:<15} {region:<15} {uptime:<10} {status}" + ) + + print("-" * 80) + print(f"Total instances: {len(instances)}") + + # Calculate total cost + total_cost = 0 + for instance in instances: + price_cents = instance.get("instance_type", {}).get( + "price_cents_per_hour", 0 + ) + total_cost += price_cents / 100 + + if total_cost > 0: + print(f"Total hourly cost: ${total_cost:.2f}/hr") + print(f"Daily cost estimate: ${total_cost * 24:.2f}/day") + + print() + + return 0 + + except urllib.error.HTTPError as e: + error_body = e.read().decode() + print(f"Error: HTTP {e.code} - {e.reason}", file=sys.stderr) + if error_body: + try: + error_data = json.loads(error_body) + if "error" in error_data: + err = error_data["error"] + print(f" {err.get('message', 'Unknown error')}", file=sys.stderr) + if "suggestion" in err: + print(f" Suggestion: {err['suggestion']}", file=sys.stderr) + except: + print(f" Response: {error_body}", file=sys.stderr) + return 1 + except Exception as e: + print(f"Error: {e}", file=sys.stderr) + return 1 + + +def main(): + """Main entry point.""" + # Support JSON output flag + if len(sys.argv) > 1 and sys.argv[1] == "--json": + # For future: output raw JSON + # Get API key from credentials + api_key = lambdalabs_credentials.get_api_key() + if not api_key: + print( + json.dumps( + {"error": "Lambda Labs API key not found in credentials file"} + ) + ) + return 1 + + url = "https://cloud.lambdalabs.com/api/v1/instances" + headers = { + "Authorization": f"Bearer {api_key}", + "Content-Type": "application/json", + } + + try: + req = urllib.request.Request(url, headers=headers) + with urllib.request.urlopen(req) as response: + print(response.read().decode()) + return 0 + except 
Exception as e: + print(json.dumps({"error": str(e)})) + return 1 + else: + return list_instances() + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/lambdalabs_smart_inference.py b/scripts/lambdalabs_smart_inference.py new file mode 100755 index 0000000..fa59d76 --- /dev/null +++ b/scripts/lambdalabs_smart_inference.py @@ -0,0 +1,196 @@ +#!/usr/bin/env python3 +# SPDX-License-Identifier: copyleft-next-0.3.1 + +""" +Smart inference for Lambda Labs - finds cheapest instance preferring closer regions. +Algorithm: +1. Determine user's location from public IP +2. Find all available instance/region combinations +3. Group by price tier (instances with same price) +4. For each price tier, select the closest region +5. Return the cheapest tier's best region/instance combo +""" + +import json +import sys +import urllib.request +import urllib.error +from typing import Optional, List, Dict, Tuple +import math + +# Import our credentials module +sys.path.insert(0, sys.path[0]) +from lambdalabs_credentials import get_api_key + +LAMBDALABS_API_BASE = "https://cloud.lambdalabs.com/api/v1" + +# Known pricing for Lambda Labs instances (per hour) +INSTANCE_PRICING = { + "gpu_1x_rtx6000": 0.50, + "gpu_1x_a10": 0.75, + "gpu_1x_a6000": 0.80, + "gpu_1x_a100": 1.29, + "gpu_1x_a100_sxm4": 1.29, + "gpu_1x_a100_pcie": 1.29, + "gpu_1x_gh200": 1.49, + "gpu_1x_h100_pcie": 2.49, + "gpu_1x_h100_sxm5": 3.29, + "gpu_2x_a100": 2.58, + "gpu_2x_a100_pcie": 2.58, + "gpu_2x_a6000": 1.60, + "gpu_2x_h100_sxm5": 6.38, + "gpu_4x_a100": 5.16, + "gpu_4x_a100_pcie": 5.16, + "gpu_4x_a6000": 3.20, + "gpu_4x_h100_sxm5": 12.36, + "gpu_8x_v100": 4.40, + "gpu_8x_a100": 10.32, + "gpu_8x_a100_40gb": 10.32, + "gpu_8x_a100_80gb": 14.32, + "gpu_8x_a100_80gb_sxm4": 14.32, + "gpu_8x_h100_sxm5": 23.92, + "gpu_8x_b200_sxm6": 39.92, +} + +# Approximate region locations (latitude, longitude) +REGION_LOCATIONS = { + "us-east-1": (39.0458, -77.6413), # Virginia + "us-west-1": (37.3541, -121.9552), # 
California (San Jose) + "us-west-2": (45.5152, -122.6784), # Oregon + "us-west-3": (33.4484, -112.0740), # Arizona + "us-tx-1": (30.2672, -97.7431), # Texas (Austin) + "us-midwest-1": (41.8781, -87.6298), # Illinois (Chicago) + "us-south-1": (33.7490, -84.3880), # Georgia (Atlanta) + "us-south-2": (29.7604, -95.3698), # Texas (Houston) + "us-south-3": (25.7617, -80.1918), # Florida (Miami) + "europe-central-1": (50.1109, 8.6821), # Frankfurt + "asia-northeast-1": (35.6762, 139.6503), # Tokyo + "asia-south-1": (19.0760, 72.8777), # Mumbai + "me-west-1": (25.2048, 55.2708), # Dubai + "australia-east-1": (-33.8688, 151.2093), # Sydney +} + + +def get_user_location() -> Tuple[float, float]: + """ + Get user's approximate location from public IP. + Returns (latitude, longitude) tuple. + """ + try: + # Try to get location from IP + with urllib.request.urlopen("http://ip-api.com/json/", timeout=2) as response: + data = json.loads(response.read().decode()) + if data.get("status") == "success": + return (data.get("lat", 39.0458), data.get("lon", -77.6413)) + except: + pass + + # Default to US East Coast if can't determine + return (39.0458, -77.6413) + + +def calculate_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float: + """ + Calculate approximate distance between two points using Haversine formula. + Returns distance in kilometers. + """ + R = 6371 # Earth's radius in kilometers + + lat1_rad = math.radians(lat1) + lat2_rad = math.radians(lat2) + delta_lat = math.radians(lat2 - lat1) + delta_lon = math.radians(lon2 - lon1) + + a = ( + math.sin(delta_lat / 2) ** 2 + + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(delta_lon / 2) ** 2 + ) + c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a)) + + return R * c + + +def get_best_instance_and_region() -> Tuple[str, str]: + """ + Find the cheapest available instance, preferring closer regions when same price. 
+ + Returns: + (instance_type, region) tuple + """ + api_key = get_api_key() + if not api_key: + # Return defaults if no API key + return ("gpu_1x_a10", "us-west-1") + + # Get user's location + user_lat, user_lon = get_user_location() + + headers = {"Authorization": f"Bearer {api_key}", "User-Agent": "kdevops/1.0"} + url = f"{LAMBDALABS_API_BASE}/instance-types" + + try: + req = urllib.request.Request(url, headers=headers) + with urllib.request.urlopen(req) as response: + data = json.loads(response.read().decode()) + + if "data" not in data: + return ("gpu_1x_a10", "us-west-1") + + # Build a map of price -> list of (instance, region, distance) tuples + price_tiers = {} + + for instance_type, info in data["data"].items(): + regions_with_capacity = info.get("regions_with_capacity_available", []) + if regions_with_capacity: + price = INSTANCE_PRICING.get(instance_type, 999.99) + + for region_info in regions_with_capacity: + region = region_info.get("name") + if region and region in REGION_LOCATIONS: + region_lat, region_lon = REGION_LOCATIONS[region] + distance = calculate_distance( + user_lat, user_lon, region_lat, region_lon + ) + + if price not in price_tiers: + price_tiers[price] = [] + price_tiers[price].append((instance_type, region, distance)) + + if not price_tiers: + # No capacity anywhere + return ("gpu_1x_a10", "us-west-1") + + # Sort price tiers by price + sorted_prices = sorted(price_tiers.keys()) + + # For the cheapest price tier, find the closest region + cheapest_price = sorted_prices[0] + options = price_tiers[cheapest_price] + + # Sort by distance to find closest + options.sort(key=lambda x: x[2]) + best_instance, best_region, best_distance = options[0] + + return (best_instance, best_region) + + except Exception as e: + # On any error, return defaults (west for SF user) + return ("gpu_1x_a10", "us-west-1") + + +def main(): + """Main function for command-line usage.""" + mode = sys.argv[1] if len(sys.argv) > 1 else "both" + + instance, region = 
get_best_instance_and_region() + + if mode == "instance": + print(instance) + elif mode == "region": + print(region) + else: # both + print(f"{instance},{region}") + + +if __name__ == "__main__": + main() diff --git a/scripts/terraform_list_instances.sh b/scripts/terraform_list_instances.sh new file mode 100755 index 0000000..0a98363 --- /dev/null +++ b/scripts/terraform_list_instances.sh @@ -0,0 +1,79 @@ +#!/bin/bash +# List instances from terraform state +# This works even when API access is limited + +set -e + +# Function to detect terraform directory based on cloud provider +get_terraform_dir() { + if [ -f .config ]; then + if grep -q "CONFIG_TERRAFORM_LAMBDALABS=y" .config 2>/dev/null; then + echo "terraform/lambdalabs" + elif grep -q "CONFIG_TERRAFORM_AWS=y" .config 2>/dev/null; then + echo "terraform/aws" + elif grep -q "CONFIG_TERRAFORM_GCE=y" .config 2>/dev/null; then + echo "terraform/gce" + elif grep -q "CONFIG_TERRAFORM_AZURE=y" .config 2>/dev/null; then + echo "terraform/azure" + elif grep -q "CONFIG_TERRAFORM_OCI=y" .config 2>/dev/null; then + echo "terraform/oci" + elif grep -q "CONFIG_TERRAFORM_OPENSTACK=y" .config 2>/dev/null; then + echo "terraform/openstack" + else + echo "" + fi + else + echo "" + fi +} + +# Get terraform directory +TERRAFORM_DIR=$(get_terraform_dir) + +if [ -z "$TERRAFORM_DIR" ]; then + echo "No terraform provider configured" + exit 1 +fi + +if [ ! -d "$TERRAFORM_DIR" ]; then + echo "Terraform directory $TERRAFORM_DIR does not exist" + exit 1 +fi + +cd "$TERRAFORM_DIR" + +# Check if terraform is initialized +if [ ! -d ".terraform" ]; then + echo "Terraform not initialized. Run 'make' first." + exit 1 +fi + +# Check if we have state +if [ ! -f "terraform.tfstate" ]; then + echo "No terraform state file found. No instances deployed." 
+ exit 0 +fi + +echo "Terraform Managed Instances:" +echo "============================" +echo + +# Try to get instances from state +terraform state list 2>/dev/null | grep -E "instance|vm" | while read resource; do + echo "Resource: $resource" + terraform state show "$resource" 2>/dev/null | grep -E "^\s*(name|ip|ip_address|public_ip|instance_type|region|status|hostname)" | sed 's/^/ /' + echo +done + +# If no instances found +if ! terraform state list 2>/dev/null | grep -qE "instance|vm"; then + echo "No instances found in terraform state" + echo + echo "To deploy instances, run: make bringup" +fi + +# Show outputs if available +echo +echo "Terraform Outputs:" +echo "-----------------" +terraform output 2>/dev/null || echo "No outputs defined" diff --git a/scripts/test_lambda_ssh.py b/scripts/test_lambda_ssh.py new file mode 100644 index 0000000..5034697 --- /dev/null +++ b/scripts/test_lambda_ssh.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python3 + +import os +import json +import urllib.request +import urllib.error +import lambdalabs_credentials + +# Get API key from credentials file +# Get API key from credentials +api_key = lambdalabs_credentials.get_api_key() +if not api_key: + print("No Lambda Labs API key found in credentials file") + print( + "Please configure it with: python3 scripts/lambdalabs_credentials.py set 'your-api-key'" + ) + exit(1) + +print(f"API Key length: {len(api_key)}") +print(f"API Key prefix: {api_key[:30]}...") + + +def make_request(endpoint, method="GET", data=None): + """Make API request to Lambda Labs""" + url = f"https://cloud.lambdalabs.com/api/v1{endpoint}" + + headers = { + "Authorization": f"Bearer {api_key}", + "User-Agent": "kdevops/1.0", + "Accept": "application/json", + "Content-Type": "application/json", + } + + req_data = None + if data and method in ["POST", "PUT", "PATCH", "DELETE"]: + req_data = json.dumps(data).encode("utf-8") + + try: + req = urllib.request.Request(url, headers=headers, data=req_data, method=method) + with 
urllib.request.urlopen(req) as response: + content = response.read().decode() + if content: + return json.loads(content) + return {"status": "success"} + except urllib.error.HTTPError as e: + print(f"\nHTTP Error {e.code} for {method} {endpoint}") + try: + error_content = e.read().decode() + error_data = json.loads(error_content) + print(f"Error: {json.dumps(error_data, indent=2)}") + except: + print(f"Error response: {error_content[:500]}") + return None + except Exception as e: + print(f"\nException for {method} {endpoint}: {e}") + return None + + +# Test different endpoints +print("\n1. Testing /instances endpoint...") +result = make_request("/instances") +if result and "data" in result: + print(f" ✓ Instances: Found {len(result['data'])} instances") +else: + print(" ✗ Instances endpoint failed") + +print("\n2. Testing /instance-types endpoint...") +result = make_request("/instance-types") +if result and "data" in result: + print(f" ✓ Instance types: Found {len(result['data'])} types") +else: + print(" ✗ Instance types endpoint failed") + +print("\n3. Testing /ssh-keys endpoint...") +result = make_request("/ssh-keys") +if result: + print(f" ✓ SSH Keys endpoint works!") + if "data" in result: + keys = result["data"] + print(f" Found {len(keys)} SSH keys:") + for key in keys: + if isinstance(key, dict): + name = key.get("name", key.get("id", "unknown")) + print(f" - {name}") + else: + print(f" - {key}") + + # Try to delete keys + print("\n4. Attempting to delete SSH keys...") + for key in keys: + if isinstance(key, dict): + key_name = key.get("name", key.get("id")) + else: + key_name = key + + if key_name: + print(f" Deleting key: {key_name}") + delete_result = make_request(f"/ssh-keys/{key_name}", method="DELETE") + if delete_result is not None: + print(f" ✓ Deleted {key_name}") + else: + print(f" ✗ Failed to delete {key_name}") + else: + print(" Response:", json.dumps(result, indent=2)) +else: + print(" ✗ SSH Keys endpoint failed") + +print("\n5. 
Testing if keys were deleted...") +result = make_request("/ssh-keys") +if result and "data" in result: + print(f" Remaining keys: {len(result['data'])}") diff --git a/scripts/update_lambdalabs_instance.sh b/scripts/update_lambdalabs_instance.sh new file mode 100755 index 0000000..219e425 --- /dev/null +++ b/scripts/update_lambdalabs_instance.sh @@ -0,0 +1,29 @@ +#!/bin/bash + +echo "Lambda Labs Instance Availability Update" +echo "========================================" +echo +echo "The configured instance type 'gpu_1x_a10' is currently unavailable." +echo +echo "Available options:" +echo +echo "1. Use gpu_1x_a100_sxm4 in us-east-1 (Virginia) - \$1.29/hr" +echo " To use this, run:" +echo " make menuconfig" +echo " Then navigate to:" +echo " - Terraform -> Lambda Labs cloud provider" +echo " - Change 'Lambda Labs region' to 'us-east-1'" +echo " - Change 'Lambda Labs instance type' to 'gpu_1x_a100_sxm4'" +echo +echo "2. Use gpu_8x_a100 in us-west-1 (California) - \$10.32/hr" +echo " To use this, run:" +echo " make menuconfig" +echo " Then navigate to:" +echo " - Terraform -> Lambda Labs cloud provider" +echo " - Change 'Lambda Labs instance type' to 'gpu_8x_a100'" +echo +echo "3. Wait for gpu_1x_a10 to become available" +echo " Check availability at: https://cloud.lambdalabs.com/" +echo +echo "Current configuration:" +grep "CONFIG_TERRAFORM_LAMBDALABS_REGION\|CONFIG_TERRAFORM_LAMBDALABS_INSTANCE_TYPE" .config 2>/dev/null || echo " Configuration not found" -- 2.50.1