From: Luis Chamberlain <mcgrof@kernel.org>
To: Chuck Lever <cel@kernel.org>, Daniel Gomez <da.gomez@kruces.com>,
kdevops@lists.linux.dev
Cc: Devasena Inupakutika <devasena.i@samsung.com>,
Dongjoo Seo <dongjoo.seo1@samsung.com>,
Joel Fernandes <Joelagnelf@nvidia.com>,
Luis Chamberlain <mcgrof@kernel.org>
Subject: [PATCH v2 3/4] vllm: Add GPU-enabled defconfig with compatibility documentation
Date: Sat, 4 Oct 2025 09:38:13 -0700
Message-ID: <20251004163816.3303237-4-mcgrof@kernel.org> (raw)
In-Reply-To: <20251004163816.3303237-1-mcgrof@kernel.org>
This introduces GPU support for vLLM deployments on declared hosts with
comprehensive hardware compatibility documentation for both NVIDIA and AMD GPUs.
The new defconfig enables GPU-accelerated inference by disabling CPU-only mode
and configuring GPU device passthrough. To prevent cryptic deployment failures,
the playbook now validates kernel CONFIG_VETH support before starting minikube,
as Docker networking requires virtual ethernet devices. The check attempts to
load the veth module if built as CONFIG_VETH=m and provides clear error messages
when the kernel lacks required networking support.
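The validation added to the playbook can be reproduced standalone; this
sketch mirrors the same logic as the new Ansible task (same config paths
and fallbacks), so a user can run it by hand before a deployment:

```shell
# Standalone version of the CONFIG_VETH check the playbook performs
# before starting minikube. Docker bridge networking needs veth pairs.
if [ -f /proc/config.gz ]; then
    veth=$(zcat /proc/config.gz | grep "^CONFIG_VETH=" || echo "CONFIG_VETH_NOT_SET")
elif [ -f "/boot/config-$(uname -r)" ]; then
    veth=$(grep "^CONFIG_VETH=" "/boot/config-$(uname -r)" || echo "CONFIG_VETH_NOT_SET")
else
    # No kernel config available to inspect on this system
    veth="CONFIG_VETH=unknown"
fi
echo "$veth"
```

If this prints CONFIG_VETH=m, `modprobe veth` must succeed before
minikube will work; CONFIG_VETH_NOT_SET means the kernel needs a rebuild.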
GPU compatibility documentation covers both NVIDIA CUDA and AMD ROCm
platforms, addressing a critical limitation in vLLM v0.10.x: its FlashInfer
CUDA kernels require NVIDIA GPUs with compute capability >= 8.0. Older
NVIDIA GPUs (Tesla T4, V100, P100) fail with resource errors because
FlashInfer's fused attention kernels exceed the hardware limits of earlier
architectures. AMD GPUs are qualified by GFX architecture version rather
than compute capability, with the MI300X providing the best AMD support and
the W7900 requiring Flash Attention to be disabled during the build.
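For NVIDIA, the compatibility rule reduces to a numeric comparison against
the 8.0 floor; a minimal sketch, where the `cc` value is a placeholder you
would fill from `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`:

```shell
# Decide whether a given NVIDIA compute capability can run vLLM v0.10.x
# GPU inference (FlashInfer needs CC >= 8.0; only the major version matters
# for this threshold).
cc="8.6"              # placeholder; substitute your GPU's reported value
major=${cc%%.*}
if [ "$major" -ge 8 ]; then
    echo "CC $cc: compatible with vLLM v0.10.x GPU inference"
else
    echo "CC $cc: incompatible - use CPU mode or an older vLLM release"
fi
```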
The documentation provides clear workarounds for incompatible hardware,
including CPU inference mode, older vLLM versions, and hardware upgrade
paths. Commands for verifying GPU compatibility on both platforms are
included.
This was tested on a Lambda Labs lambdalabs-gpu-1x-a10 instance.
Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
.../vllm-production-stack-declared-hosts-gpu | 118 +++++++++++
.../vllm/tasks/install-deps/debian/main.yml | 33 ++-
.../roles/vllm/tasks/setup-kubernetes.yml | 99 +++++++--
workflows/vllm/README.md | 200 ++++++++++++++++++
4 files changed, 435 insertions(+), 15 deletions(-)
create mode 100644 defconfigs/vllm-production-stack-declared-hosts-gpu
diff --git a/defconfigs/vllm-production-stack-declared-hosts-gpu b/defconfigs/vllm-production-stack-declared-hosts-gpu
new file mode 100644
index 00000000..9136755e
--- /dev/null
+++ b/defconfigs/vllm-production-stack-declared-hosts-gpu
@@ -0,0 +1,118 @@
+#
+# vLLM Production Stack with declared hosts - GPU ENABLED
+#
+# ============================================================================
+# NVIDIA GPU COMPATIBILITY (CUDA):
+# ============================================================================
+#
+# vLLM v0.10.x uses FlashInfer CUDA kernels that require NVIDIA GPUs with
+# compute capability >= 8.0. Older NVIDIA GPUs will fail with:
+# "RuntimeError: TopPSamplingFromProbs failed with error code
+# too many resources requested for launch"
+#
+# INCOMPATIBLE NVIDIA GPUs (compute capability < 8.0):
+# - Tesla T4 (7.5) - WILL NOT WORK with vLLM v0.10.x+
+# - Tesla V100 (7.0)
+# - Tesla P100 (6.0)
+# - GTX 1080 Ti (6.1)
+#
+# COMPATIBLE NVIDIA GPUs (compute capability >= 8.0):
+# - A100 (8.0)
+# - A10G (8.6)
+# - A30 (8.0)
+# - H100 (9.0)
+# - RTX 3090 (8.6)
+# - RTX 4090 (8.9)
+#
+# ============================================================================
+# AMD GPU COMPATIBILITY (ROCm):
+# ============================================================================
+#
+# AMD GPUs use ROCm instead of CUDA and have DIFFERENT requirements.
+# vLLM supports AMD GPUs with ROCm 6.2+:
+#
+# FULLY COMPATIBLE AMD GPUs:
+# - MI300X (gfx942) - BEST AMD support, vLLM V1 optimized
+# - MI250/MI250X (gfx90a) - Production ready
+# - MI210 (gfx90a) - Production ready
+# - W7900 (gfx1100) - Supported but requires BUILD_FA=0
+# - RX 7900 XTX/XT (gfx1100) - Supported but requires BUILD_FA=0
+#
+# Note: AMD uses GFX architecture versions, not compute capability numbers.
+# W7900 and RX 7900 series require disabling Flash Attention during
+# vLLM build due to RDNA 3 architecture limitations.
+#
+# ============================================================================
+# WORKAROUNDS for incompatible NVIDIA GPUs:
+# ============================================================================
+# 1. Use an older vLLM version (v0.6.x or earlier)
+# 2. Use CPU inference mode (vllm-production-stack-declared-hosts)
+# 3. Use a compatible NVIDIA GPU (CC >= 8.0) or AMD GPU (see above)
+#
+# Automatically generated file; DO NOT EDIT.
+# kdevops 5.0.2 Configuration
+#
+CONFIG_WORKFLOWS=y
+CONFIG_WORKFLOWS_TESTS=y
+CONFIG_WORKFLOWS_LINUX_TESTS=y
+CONFIG_WORKFLOWS_DEDICATED_WORKFLOW=y
+CONFIG_KDEVOPS_WORKFLOW_DEDICATE_VLLM=y
+
+# Skip bringup for declared hosts
+CONFIG_SKIP_BRINGUP=y
+CONFIG_KDEVOPS_USE_DECLARED_HOSTS=y
+
+# vLLM Production Stack with Kubernetes on declared hosts
+CONFIG_VLLM_PRODUCTION_STACK=y
+CONFIG_VLLM_K8S_MINIKUBE=y
+CONFIG_VLLM_VERSION_STABLE=y
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.10.2"
+CONFIG_VLLM_HELM_RELEASE_NAME="vllm-prod"
+CONFIG_VLLM_HELM_NAMESPACE="vllm-system"
+
+# Production Stack components
+CONFIG_VLLM_PROD_STACK_REPO="https://vllm-project.github.io/production-stack"
+CONFIG_VLLM_PROD_STACK_CHART_VERSION="latest"
+CONFIG_VLLM_PROD_STACK_ROUTER_IMAGE="ghcr.io/vllm-project/production-stack/router"
+CONFIG_VLLM_PROD_STACK_ROUTER_TAG="latest"
+CONFIG_VLLM_PROD_STACK_ENABLE_MONITORING=y
+CONFIG_VLLM_PROD_STACK_ENABLE_AUTOSCALING=y
+CONFIG_VLLM_PROD_STACK_MIN_REPLICAS=2
+CONFIG_VLLM_PROD_STACK_MAX_REPLICAS=5
+CONFIG_VLLM_PROD_STACK_TARGET_GPU_UTILIZATION=80
+
+# Model configuration
+CONFIG_VLLM_MODEL_URL="facebook/opt-125m"
+CONFIG_VLLM_MODEL_NAME="opt-125m"
+
+# GPU configuration - EXPLICITLY DISABLED CPU INFERENCE
+# CONFIG_VLLM_USE_CPU_INFERENCE is not set
+CONFIG_VLLM_REQUEST_GPU=1
+CONFIG_VLLM_GPU_TYPE=""
+CONFIG_VLLM_GPU_MEMORY_UTILIZATION="0.5"
+CONFIG_VLLM_TENSOR_PARALLEL_SIZE=1
+
+# Engine configuration for GPU
+CONFIG_VLLM_REPLICA_COUNT=1
+CONFIG_VLLM_REQUEST_CPU=8
+CONFIG_VLLM_REQUEST_MEMORY="16Gi"
+CONFIG_VLLM_MAX_MODEL_LEN=1024
+CONFIG_VLLM_DTYPE="auto"
+
+# Router and observability
+CONFIG_VLLM_ROUTER_ENABLED=y
+CONFIG_VLLM_ROUTER_ROUND_ROBIN=y
+CONFIG_VLLM_OBSERVABILITY_ENABLED=y
+CONFIG_VLLM_GRAFANA_PORT=3000
+CONFIG_VLLM_PROMETHEUS_PORT=9090
+
+# API configuration
+CONFIG_VLLM_API_PORT=8000
+CONFIG_VLLM_API_KEY=""
+CONFIG_VLLM_HF_TOKEN=""
+
+# Benchmarking
+CONFIG_VLLM_BENCHMARK_ENABLED=y
+CONFIG_VLLM_BENCHMARK_DURATION=60
+CONFIG_VLLM_BENCHMARK_CONCURRENT_USERS=10
+CONFIG_VLLM_BENCHMARK_RESULTS_DIR="/data/vllm-benchmark"
diff --git a/playbooks/roles/vllm/tasks/install-deps/debian/main.yml b/playbooks/roles/vllm/tasks/install-deps/debian/main.yml
index a7a82193..b40f0717 100644
--- a/playbooks/roles/vllm/tasks/install-deps/debian/main.yml
+++ b/playbooks/roles/vllm/tasks/install-deps/debian/main.yml
@@ -6,7 +6,16 @@
update_cache: true
tags: vllm
-- name: Install vLLM system dependencies
+- name: Check if docker-ce is already installed
+ become: true
+ become_method: sudo
+ ansible.builtin.command: dpkg -l docker-ce
+ register: docker_ce_check
+ failed_when: false
+ changed_when: false
+ tags: ["vllm", "deps"]
+
+- name: Install vLLM system dependencies (with docker.io if docker-ce not present)
become: true
become_method: sudo
ansible.builtin.apt:
@@ -25,6 +34,28 @@
- conntrack
state: present
update_cache: true
+ when: docker_ce_check.rc != 0
+ tags: ["vllm", "deps"]
+
+- name: Install vLLM system dependencies (without docker.io when docker-ce present)
+ become: true
+ become_method: sudo
+ ansible.builtin.apt:
+ name:
+ - git
+ - curl
+ - wget
+ - python3
+ - python3-venv
+ - ca-certificates
+ - gnupg
+ - lsb-release
+ - apt-transport-https
+ - iptables
+ - conntrack
+ state: present
+ update_cache: true
+ when: docker_ce_check.rc == 0
tags: ["vllm", "deps"]
- name: Install Python development dependencies
diff --git a/playbooks/roles/vllm/tasks/setup-kubernetes.yml b/playbooks/roles/vllm/tasks/setup-kubernetes.yml
index c3cde217..fd5cd72d 100644
--- a/playbooks/roles/vllm/tasks/setup-kubernetes.yml
+++ b/playbooks/roles/vllm/tasks/setup-kubernetes.yml
@@ -83,17 +83,82 @@
remote_src: yes
become: yes
- - name: Check if minikube is running
+ - name: Ensure docker socket has correct permissions
+ ansible.builtin.file:
+ path: /var/run/docker.sock
+ mode: '0666'
+ become: yes
+ ignore_errors: yes
+
+ - name: Check if veth kernel support is available
+ ansible.builtin.shell:
+ cmd: |
+ if [ -f /proc/config.gz ]; then
+ zcat /proc/config.gz | grep "^CONFIG_VETH=" || echo "CONFIG_VETH_NOT_SET"
+ elif [ -f /boot/config-$(uname -r) ]; then
+ grep "^CONFIG_VETH=" /boot/config-$(uname -r) || echo "CONFIG_VETH_NOT_SET"
+ else
+ echo "CONFIG_VETH=unknown"
+ fi
+ register: veth_check
+ failed_when: false
+ changed_when: false
+
+ - name: Try to load veth module if configured as module
ansible.builtin.command:
- cmd: minikube status
- register: minikube_status
+ cmd: modprobe veth
+ become: yes
+ when: "'CONFIG_VETH=m' in veth_check.stdout"
+ failed_when: false
+ changed_when: false
+
+ - name: Verify veth module is loaded
+ ansible.builtin.shell:
+ cmd: lsmod | grep -q veth && echo "loaded" || echo "not_loaded"
+ register: veth_loaded
+ failed_when: false
+ changed_when: false
+
+ - name: Determine if veth is available
+ ansible.builtin.set_fact:
+ veth_available: "{{ (veth_check.stdout == 'CONFIG_VETH=y' or veth_check.stdout == 'CONFIG_VETH=m') or veth_loaded.stdout == 'loaded' }}"
+
+ - name: Fail if veth support is missing (required for Docker/minikube)
+ ansible.builtin.fail:
+ msg: |
+ ERROR: Kernel veth support is REQUIRED for Docker networking with minikube.
+
+ Current status:
+ Kernel configuration: {{ veth_check.stdout }}
+ veth module status: {{ veth_loaded.stdout }}
+ Kernel version: {{ ansible_kernel }}
+
+ REQUIRED: Rebuild your kernel with CONFIG_VETH enabled:
+ CONFIG_VETH=y (built-in, recommended)
+ OR
+ CONFIG_VETH=m (loadable module)
+
+ Most distribution kernels include this by default.
+ Custom/RC kernels may need to explicitly enable it.
+ when:
+ - vllm_k8s_minikube | default(false)
+ - not veth_available
+
+ - name: Check if minikube is running
+ ansible.builtin.shell:
+ cmd: minikube status 2>&1 | grep -q "Running" && echo "RUNNING" || echo "NOT_RUNNING"
+ register: minikube_status_check
failed_when: false
changed_when: false
environment:
MINIKUBE_HOME: /data/minikube
+ - name: Set minikube running status
+ ansible.builtin.set_fact:
+ minikube_is_running: "{{ minikube_status_check.stdout == 'RUNNING' }}"
+
- name: Check if minikube container exists but is stopped
- when: minikube_status.rc != 0
+ when: not minikube_is_running
ansible.builtin.command:
cmd: docker ps -a --format "table {% raw %}{{.Names}}\t{{.Status}}{% endraw %}" | grep minikube || true
register: minikube_container
@@ -102,7 +167,7 @@
- name: Clean up stopped minikube container if exists
when:
- - minikube_status.rc != 0
+ - not minikube_is_running
- "'minikube' in minikube_container.stdout"
ansible.builtin.command:
cmd: minikube delete --all --purge
@@ -151,18 +216,23 @@
append: yes
become: yes
- - name: Ensure /data/minikube has correct permissions for minikube
- ansible.builtin.file:
- path: /data/minikube
- state: directory
- owner: kdevops
- group: docker
- mode: '0775'
- recurse: yes
+ - name: Ensure /data/minikube directory exists
+ ansible.builtin.shell:
+ cmd: |
+ # Handle symlink case - ensure target exists
+ if [ -L /data ]; then
+ target=$(readlink -f /data)
+ mkdir -p "$target"
+ fi
+ mkdir -p /data/minikube
+ chown kdevops:docker /data/minikube
+ chmod 0775 /data/minikube
+ args:
+ creates: /data/minikube
become: yes
- name: Start minikube with appropriate resources
- when: minikube_status.rc != 0
+ when: not minikube_is_running
ansible.builtin.command:
cmd: >-
minikube start
@@ -172,6 +242,7 @@
--memory={{ [(ansible_memtotal_mb * 0.75) | int, 49152] | min }}
--disk-size=50g
--delete-on-failure=true
+ {{ '--gpus all' if not vllm_use_cpu_inference|default(false) else '' }}
environment:
MINIKUBE_HOME: /data/minikube
register: minikube_start
diff --git a/workflows/vllm/README.md b/workflows/vllm/README.md
index 8335e0c7..0acd1941 100644
--- a/workflows/vllm/README.md
+++ b/workflows/vllm/README.md
@@ -291,6 +291,206 @@ kubectl describe deployment -n vllm-system vllm
helm list -n vllm-system
```
+## GPU Compatibility
+
+### NVIDIA GPU Requirements (CUDA)
+
+vLLM v0.10.x and later versions use **FlashInfer** CUDA kernels for optimized attention computation on NVIDIA GPUs. FlashInfer requires NVIDIA GPUs with **compute capability >= 8.0**. Using older NVIDIA GPUs will result in runtime failures during inference.
+
+**Important**: The compute capability requirements below apply **only to NVIDIA CUDA GPUs**. AMD GPUs use ROCm and have different compatibility requirements (see AMD GPU section below).
+
+#### Error Symptoms
+
+If you attempt to use an incompatible GPU, vLLM will fail during engine initialization with:
+
+```
+RuntimeError: TopPSamplingFromProbs failed with error code too many resources requested for launch
+```
+
+This error occurs when FlashInfer CUDA kernels try to allocate more GPU resources (registers, shared memory, thread blocks) than the GPU architecture can provide.
+
+#### Incompatible GPUs (Compute Capability < 8.0)
+
+The following GPUs **WILL NOT WORK** with vLLM v0.10.x+ GPU inference:
+
+| GPU Model | Compute Capability | Status |
+|-----------|-------------------|--------|
+| Tesla T4 | 7.5 | ❌ Incompatible |
+| Tesla V100 | 7.0 | ❌ Incompatible |
+| Tesla P100 | 6.0 | ❌ Incompatible |
+| GTX 1080 Ti | 6.1 | ❌ Incompatible |
+| GTX 1070 | 6.1 | ❌ Incompatible |
+| Quadro P6000 | 6.1 | ❌ Incompatible |
+
+#### Compatible GPUs (Compute Capability >= 8.0)
+
+The following GPUs **WILL WORK** with vLLM v0.10.x+ GPU inference:
+
+| GPU Model | Compute Capability | Status |
+|-----------|-------------------|--------|
+| A100 | 8.0 | ✅ Compatible |
+| A10G | 8.6 | ✅ Compatible |
+| A30 | 8.0 | ✅ Compatible |
+| H100 | 9.0 | ✅ Compatible |
+| L40 | 8.9 | ✅ Compatible |
+| RTX 3090 | 8.6 | ✅ Compatible |
+| RTX 4090 | 8.9 | ✅ Compatible |
+| RTX A6000 | 8.6 | ✅ Compatible |
+
+#### Workarounds for Incompatible GPUs
+
+If you have a GPU with compute capability < 8.0, you have several options:
+
+**Option 1: Use CPU Inference**
+```bash
+make defconfig-vllm-production-stack-declared-hosts
+# This uses CPU-optimized vLLM images (openeuler/vllm-cpu)
+```
+
+**Option 2: Use Older vLLM Version**
+
+vLLM v0.6.x and earlier versions don't use FlashInfer and work with older GPUs. You can modify the defconfig to use an older engine image:
+
+```bash
+CONFIG_VLLM_ENGINE_IMAGE_TAG="v0.6.3"
+```
+
+**Note**: Older versions lack production stack features and may have different API compatibility.
+
+**Option 3: Upgrade to Compatible GPU**
+
+For production GPU inference with vLLM v0.10.x+, upgrade to a GPU with compute capability >= 8.0 (see compatible GPUs table above).
+
+#### Technical Background
+
+FlashInfer implements fused CUDA kernels for attention computation that use advanced GPU features:
+- **Dynamic shared memory allocation**: Requires larger shared memory per block
+- **Warp-level primitives**: Uses newer warp shuffle and reduction operations
+- **Thread block size**: Requires support for larger thread blocks
+- **Register file size**: Needs more registers per thread than older architectures provide
+
+GPUs with compute capability < 8.0 have architectural limitations in:
+- Maximum shared memory per block (48KB on CC 7.x vs 164KB on CC 8.0)
+- Register file size per SM
+- Maximum thread blocks per SM
+- Warp scheduling efficiency
+
+When FlashInfer kernels launch on these older GPUs, the CUDA runtime returns `too many resources requested for launch` because the kernel configuration exceeds the hardware's architectural limits.
+
+#### Verifying NVIDIA GPU Compatibility
+
+To check your NVIDIA GPU's compute capability:
+
+```bash
+# Using nvidia-smi
+nvidia-smi --query-gpu=name,compute_cap --format=csv
+
+# Using CUDA samples (if installed)
+/usr/local/cuda/extras/demo_suite/deviceQuery
+```
+
+### AMD GPU Requirements (ROCm)
+
+AMD GPUs use **ROCm** instead of CUDA and have **different compatibility requirements** than NVIDIA GPUs. vLLM supports AMD GPUs through ROCm 6.2+ with architecture-specific optimizations.
+
+#### Supported AMD GPU Architectures
+
+| GPU Model | Architecture | ROCm Support | Flash Attention | Notes |
+|-----------|-------------|--------------|-----------------|-------|
+| **MI300X/MI300A** | gfx942 (CDNA 3) | ✅ Excellent | ✅ Yes | Best AMD support, FP8 KV cache, vLLM V1 optimized |
+| **MI250X/MI250** | gfx90a (CDNA 2) | ✅ Full | ✅ Yes | Production ready, well tested |
+| **MI210** | gfx90a (CDNA 2) | ✅ Full | ✅ Yes | Production ready |
+| **W7900** | gfx1100 (RDNA 3) | ✅ Supported | ❌ No | Requires `BUILD_FA=0` |
+| **RX 7900 XTX** | gfx1100 (RDNA 3) | ✅ Supported | ❌ No | Requires `BUILD_FA=0` |
+| **RX 7900 XT** | gfx1100 (RDNA 3) | ✅ Supported | ❌ No | Requires `BUILD_FA=0` |
+
+#### Key Differences from NVIDIA
+
+1. **No Compute Capability**: AMD uses GFX architecture versions (gfx90a, gfx942, gfx1100) instead of NVIDIA's compute capability numbering
+2. **ROCm Instead of CUDA**: Requires ROCm 6.2+ runtime and drivers
+3. **Different Attention Kernels**: Uses CK (Composable Kernel) Flash Attention instead of FlashInfer
+4. **Architecture-Specific Builds**: vLLM must be built with specific GFX targets (e.g., `FX_GFX_ARCHS=gfx90a;gfx942`)
+
+#### AMD W7900 Workstation GPU
+
+The **AMD Radeon Pro W7900** is fully supported but requires special configuration:
+
+**Requirements:**
+- ROCm 6.2 or later
+- Flash Attention must be disabled during build
+- Build command: `BUILD_FA=0 DOCKER_BUILDKIT=1 docker build ...`
+
+**Why disable Flash Attention?**
+The gfx1100 architecture (RDNA 3) used in W7900/RX 7900 series doesn't support CK Flash Attention kernels. vLLM will fall back to standard attention mechanisms, which still provide good performance for workstation inference workloads.
+
+**Performance Notes:**
+- W7900 has 48GB VRAM (excellent for large models)
+- RDNA 3 architecture is optimized for graphics/workstation tasks
+- For maximum LLM inference performance, MI300X (CDNA 3) is preferred
+
+#### AMD MI300X Data Center GPU
+
+The **AMD Instinct MI300X** has the **best vLLM support** among AMD GPUs:
+
+**Advantages:**
+- ✅ vLLM V1 engine fully optimized for MI300X
+- ✅ FP8 KV cache support (MI300+ exclusive)
+- ✅ CK Flash Attention enabled by default
+- ✅ 192GB HBM3 memory per GPU
+- ✅ Extensively tested and documented by AMD ROCm team
+
+**Use Cases:**
+- Large-scale production LLM serving
+- Multi-GPU distributed inference
+- Models requiring >80GB VRAM (e.g., Llama-70B, Mixtral-8x22B)
+
+#### Building vLLM for AMD GPUs
+
+**For MI300X/MI250 (CDNA):**
+```bash
+# Flash Attention enabled (default)
+export FX_GFX_ARCHS="gfx90a;gfx942"
+docker build -t vllm-rocm .
+```
+
+**For W7900/RX 7900 (RDNA 3):**
+```bash
+# Flash Attention must be disabled
+export FX_GFX_ARCHS="gfx1100"
+BUILD_FA=0 DOCKER_BUILDKIT=1 docker build -t vllm-rocm .
+```
+
+#### Verifying AMD GPU Compatibility
+
+To check your AMD GPU architecture:
+
+```bash
+# Using rocminfo
+rocminfo | grep "Name:" | grep -E "gfx"
+
+# Using rocm-smi
+rocm-smi --showproductname
+
+# Check ROCm version
+cat /opt/rocm/.info/version
+```
+
+Expected output examples:
+- MI300X: `gfx942` (CDNA 3)
+- MI250: `gfx90a` (CDNA 2)
+- W7900: `gfx1100` (RDNA 3)
+
+#### AMD vs NVIDIA: Summary
+
+| Feature | NVIDIA (CUDA) | AMD (ROCm) |
+|---------|--------------|------------|
+| **Compatibility Metric** | Compute Capability (e.g., 8.0) | GFX Architecture (e.g., gfx942) |
+| **Minimum Requirement** | CC >= 8.0 for FlashInfer | ROCm 6.2+, architecture-dependent |
+| **Attention Kernels** | FlashInfer (CUDA) | CK Flash Attention (ROCm) |
+| **Best GPU for vLLM** | H100, A100 | MI300X |
+| **Workstation GPU** | RTX 4090 | W7900 (Flash Attn disabled) |
+| **Budget Option** | Not compatible (need CC 8.0+) | W7900 (48GB VRAM) |
+
## Integration with kdevops Workflows
The vLLM workflow integrates with kdevops features:
--
2.51.0
Thread overview: 13+ messages
2025-10-04 16:38 [PATCH v2 0/4] vLLM and the vLLM production stack Luis Chamberlain
2025-10-04 16:38 ` [PATCH v2 1/4] workflows: Add vLLM workflow for LLM inference and production deployment Luis Chamberlain
2025-10-04 16:38 ` [PATCH v2 2/4] vllm: Add DECLARE_HOSTS support for bare metal and existing infrastructure Luis Chamberlain
2025-10-04 16:38 ` Luis Chamberlain [this message]
2025-10-04 16:38 ` [PATCH v2 4/4] defconfigs: Add composable fragments for Lambda Labs vLLM deployment Luis Chamberlain
2025-10-04 16:39 ` [PATCH v2 0/4] vLLM and the vLLM production stack Luis Chamberlain
2025-10-04 16:55 ` Chuck Lever
2025-10-04 17:03 ` Luis Chamberlain
2025-10-04 17:14 ` Chuck Lever
2025-10-08 17:46 ` Chuck Lever
2025-10-10 0:55 ` Luis Chamberlain
2025-10-10 12:38 ` Chuck Lever
2025-10-10 16:20 ` Chuck Lever