[PATCH v3 2/3] aws: enable GPU AMI support for GPU instances

From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
	kdevops@lists.linux.dev
Cc: Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH v3 2/3] aws: enable GPU AMI support for GPU instances
Date: Mon,  8 Sep 2025 17:56:42 -0700
Message-ID: <20250909005644.798127-3-mcgrof@kernel.org>
In-Reply-To: <20250909005644.798127-1-mcgrof@kernel.org>

Add support for using GPU-optimized Amazon Machine Images (AMIs) when
deploying GPU instances on AWS. This enables automatic selection of
Deep Learning AMIs with pre-installed NVIDIA drivers, CUDA toolkit,
and ML frameworks for GPU-accelerated workloads.

Key changes:
- Add CONFIG_TERRAFORM_AWS_USE_GPU_AMI to enable GPU AMI selection
- Support GPU AMI name/owner configuration via Kconfig
- Default to Deep Learning OSS Nvidia Driver AMI with PyTorch for Ubuntu 22.04
- Update terraform.tfvars template to conditionally use GPU AMI settings
- Add aws-gpu-g6e-ai defconfig for g6e.2xlarge with GPU AMI
- Generate GPU AMI Kconfig options dynamically via the AWS API
- Provide fallback GPU AMI defaults when the AWS CLI is unavailable

The aws-gpu-g6e-ai defconfig demonstrates usage with:
- g6e.2xlarge instance (8 vCPUs, 64 GiB RAM, one NVIDIA L40S GPU with 48 GB of GPU memory)
- Deep Learning AMI with NVIDIA drivers 535+, CUDA 12.x, PyTorch
- 200GB storage for datasets and models
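
The default AMI name is a wildcard pattern rather than an exact name.
As a rough illustration of which names such a pattern selects, the
sketch below uses Python's fnmatch as a stand-in for the EC2 name
filter. The candidate AMI names are made up for illustration and are
not taken from a real AWS query.

```python
from fnmatch import fnmatchcase

# Patterns taken from this patch: the new default and the one it replaces.
NEW_DEFAULT = "Deep Learning OSS Nvidia Driver AMI GPU PyTorch*Ubuntu 22.04*"
OLD_DEFAULT = "Deep Learning AMI GPU TensorFlow*"

# Hypothetical AMI names, invented for this example only.
candidates = [
    "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Ubuntu 22.04) 20250801",
    "Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.7 (Amazon Linux 2023) 20250801",
    "Deep Learning AMI GPU TensorFlow 2.13 (Ubuntu 20.04) 20230901",
]


def matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    for name in matching(NEW_DEFAULT, candidates):
        print(name)
```

With these sample names, only the PyTorch image for Ubuntu 22.04 matches
the new default, while the old default would match only the TensorFlow
image.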

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 defconfigs/aws-gpu-g6e-ai                     | 40 +++++++++++++++++++
 .../templates/aws/terraform.tfvars.j2         |  5 +++
 scripts/aws_api.py                            |  4 +-
 scripts/dynamic-cloud-kconfig.Makefile        |  5 +++
 4 files changed, 52 insertions(+), 2 deletions(-)
 create mode 100644 defconfigs/aws-gpu-g6e-ai
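
Note for reviewers: the terraform.tfvars.j2 hunk below picks one of two
variable pairs for aws_name_search and aws_ami_owner. The same branch
can be sketched in Python as follows. The function name and the sample
values are hypothetical; only the variable names come from the template.

```python
def select_ami_settings(tf_vars):
    """Mirror the if/else branch added to terraform.tfvars.j2.

    The GPU pair is used only when terraform_aws_use_gpu_ami is both
    defined and true; otherwise the regular AMI settings are used.
    """
    if tf_vars.get("terraform_aws_use_gpu_ami"):
        return {
            "aws_name_search": tf_vars["terraform_aws_gpu_ami_name"],
            "aws_ami_owner": tf_vars["terraform_aws_gpu_ami_owner"],
        }
    return {
        "aws_name_search": tf_vars["terraform_aws_ns"],
        "aws_ami_owner": tf_vars["terraform_aws_ami_owner"],
    }


if __name__ == "__main__":
    # Placeholder values for the non-GPU settings, not real defaults.
    sample = {
        "terraform_aws_ns": "example-distro-*",
        "terraform_aws_ami_owner": "000000000000",
        "terraform_aws_use_gpu_ami": True,
        "terraform_aws_gpu_ami_name": "Deep Learning OSS Nvidia Driver AMI GPU PyTorch*Ubuntu 22.04*",
        "terraform_aws_gpu_ami_owner": "amazon",
    }
    print(select_ami_settings(sample))
```

Configurations that never set terraform_aws_use_gpu_ami take the else
branch, so existing non-GPU setups keep their current AMI settings.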

diff --git a/defconfigs/aws-gpu-g6e-ai b/defconfigs/aws-gpu-g6e-ai
new file mode 100644
index 00000000..a168bbc3
--- /dev/null
+++ b/defconfigs/aws-gpu-g6e-ai
@@ -0,0 +1,40 @@
+# AWS G6e.2xlarge GPU instance with Deep Learning AMI for AI/ML workloads
+# This configuration sets up an AWS G6e.2xlarge instance with NVIDIA L40S GPU
+# optimized for machine learning, AI inference, and GPU-accelerated workloads
+
+# Cloud provider configuration
+CONFIG_KDEVOPS_ENABLE_TERRAFORM=y
+CONFIG_TERRAFORM=y
+CONFIG_TERRAFORM_AWS=y
+
+
+# AWS Instance configuration - G6E family with NVIDIA L40S GPU
+# G6E.2XLARGE specifications:
+# - 8 vCPUs (3rd Gen AMD EPYC processors)
+# - 64 GiB system RAM
+# - 1x NVIDIA L40S Tensor Core GPU
+# - 48 GB GPU memory
+# - Up to 15 Gbps network performance
+# - Up to 10 Gbps EBS bandwidth
+CONFIG_TERRAFORM_AWS_INSTANCE_TYPE_G6E=y
+CONFIG_TERRAFORM_AWS_INSTANCE_G6E_2XLARGE=y
+
+# AWS Region - US East (N. Virginia) - primary availability for G6E
+CONFIG_TERRAFORM_AWS_REGION_US_EAST_1=y
+
+# GPU-optimized Deep Learning AMI
+# Includes: NVIDIA drivers 535+, CUDA 12.x, cuDNN, TensorFlow, PyTorch, MXNet
+CONFIG_TERRAFORM_AWS_USE_GPU_AMI=y
+CONFIG_TERRAFORM_AWS_GPU_AMI_DEEP_LEARNING=y
+CONFIG_TERRAFORM_AWS_GPU_AMI_NAME="Deep Learning OSS Nvidia Driver AMI GPU PyTorch*Ubuntu 22.04*"
+CONFIG_TERRAFORM_AWS_GPU_AMI_OWNER="amazon"
+
+# Storage configuration optimized for ML workloads
+# 200 GB for datasets, models, and experiment artifacts
+CONFIG_TERRAFORM_AWS_DATA_VOLUME_SIZE=200
+
+# Note: After provisioning, the instance will have:
+# - Jupyter notebook server ready for ML experiments
+# - Pre-installed deep learning frameworks
+# - NVIDIA GPU drivers and CUDA toolkit
+# - Docker with NVIDIA Container Toolkit for containerized ML workloads
diff --git a/playbooks/roles/gen_tfvars/templates/aws/terraform.tfvars.j2 b/playbooks/roles/gen_tfvars/templates/aws/terraform.tfvars.j2
index d880254b..f8f4c842 100644
--- a/playbooks/roles/gen_tfvars/templates/aws/terraform.tfvars.j2
+++ b/playbooks/roles/gen_tfvars/templates/aws/terraform.tfvars.j2
@@ -1,8 +1,13 @@
 aws_profile = "{{ terraform_aws_profile }}"
 aws_region = "{{ terraform_aws_region }}"
 aws_availability_zone = "{{ terraform_aws_av_zone }}"
+{% if terraform_aws_use_gpu_ami is defined and terraform_aws_use_gpu_ami %}
+aws_name_search = "{{ terraform_aws_gpu_ami_name }}"
+aws_ami_owner = "{{ terraform_aws_gpu_ami_owner }}"
+{% else %}
 aws_name_search = "{{ terraform_aws_ns }}"
 aws_ami_owner = "{{ terraform_aws_ami_owner }}"
+{% endif %}
 aws_instance_type = "{{ terraform_aws_instance_type }}"
 aws_ebs_volumes_per_instance = "{{ terraform_aws_ebs_volumes_per_instance }}"
 aws_ebs_volume_size = {{ terraform_aws_ebs_volume_size }}
diff --git a/scripts/aws_api.py b/scripts/aws_api.py
index a9180b31..fe66b3e0 100755
--- a/scripts/aws_api.py
+++ b/scripts/aws_api.py
@@ -956,7 +956,7 @@ if TERRAFORM_AWS_GPU_AMI_DEEP_LEARNING
 config TERRAFORM_AWS_GPU_AMI_NAME
     string
     output yaml
-    default "Deep Learning AMI GPU TensorFlow*"
+    default "Deep Learning OSS Nvidia Driver AMI GPU PyTorch*Ubuntu 22.04*"
     help
       AMI name pattern for AWS Deep Learning AMI.
 
@@ -1061,7 +1061,7 @@ if TERRAFORM_AWS_GPU_AMI_DEEP_LEARNING
 config TERRAFORM_AWS_GPU_AMI_NAME
     string
     output yaml
-    default "Deep Learning AMI GPU TensorFlow*"
+    default "Deep Learning OSS Nvidia Driver AMI GPU PyTorch*Ubuntu 22.04*"
 
 config TERRAFORM_AWS_GPU_AMI_OWNER
     string
diff --git a/scripts/dynamic-cloud-kconfig.Makefile b/scripts/dynamic-cloud-kconfig.Makefile
index 9c9d718e..dbcda506 100644
--- a/scripts/dynamic-cloud-kconfig.Makefile
+++ b/scripts/dynamic-cloud-kconfig.Makefile
@@ -110,6 +110,11 @@ cloud-update:
 		sed -i 's/Kconfig\.\([^.]*\)\.generated/Kconfig.\1.static/g' $(AWS_KCONFIG_DIR)/Kconfig.location.static; \
 		echo "  Created $(AWS_KCONFIG_DIR)/Kconfig.location.static"; \
 	fi
+	$(Q)if [ -f $(AWS_KCONFIG_DIR)/Kconfig.gpu-amis.generated ]; then \
+		cp $(AWS_KCONFIG_DIR)/Kconfig.gpu-amis.generated $(AWS_KCONFIG_DIR)/Kconfig.gpu-amis.static; \
+		sed -i 's/Kconfig\.\([^.]*\)\.generated/Kconfig.\1.static/g' $(AWS_KCONFIG_DIR)/Kconfig.gpu-amis.static; \
+		echo "  Created $(AWS_KCONFIG_DIR)/Kconfig.gpu-amis.static"; \
+	fi
 	# AWS instance type families
 	$(Q)for file in $(AWS_INSTANCE_TYPES_DIR)/Kconfig.*.generated; do \
 		if [ -f "$$file" ]; then \
-- 
2.50.1
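
The cloud-update hunk above copies Kconfig.gpu-amis.generated to a
.static file and then runs sed over the copy, presumably so that any
references inside it point at .static files as well. A minimal Python
sketch of that substitution follows. The function name and the sample
source line, including its path, are hypothetical.

```python
import re

# Same pattern as the sed expression in the Makefile hunk:
#   s/Kconfig\.\([^.]*\)\.generated/Kconfig.\1.static/g
GENERATED_RE = re.compile(r"Kconfig\.([^.]*)\.generated")


def to_static(text):
    """Rewrite every Kconfig.<name>.generated reference to Kconfig.<name>.static."""
    return GENERATED_RE.sub(r"Kconfig.\1.static", text)


if __name__ == "__main__":
    # Hypothetical line such as a generated Kconfig file might contain.
    line = 'source "terraform/aws/kconfigs/Kconfig.gpu-amis.generated"'
    print(to_static(line))
```

Because the captured name may not contain a dot, a reference such as
Kconfig.a.b.generated is left unchanged by this expression.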
