From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A2E3A3CEB90; Mon, 20 Apr 2026 13:17:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776691072; cv=none; b=By7tvS+0kljRB0iqW6Ulnk7juRUB+218IPQvtcPXf+ozqjz5rg/NisGjRYNrQppGEekQ0FdI/M+r1RDXchusmSZFAd057ISnb4jQAUu7lpsyxIwM/CMsXFNuRMnepqfKwUJy4ulNIikXRrvlJOjPSArllYPn0sw9/aiFyjAG/N0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1776691072; c=relaxed/simple; bh=sXs0e0ZxqTG7ZrHQ090u7oJNMiq7KcY5riKMVzChehA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=GKi/kmYYC3vmkFnOO3RdZk6vFuQvidsLC+gOUeFHg3fZL8j7Kk/ruT2wHUjRP0VWhPyxk4uWyYhdHWyRf1asYQpyTErgWFX0A350V6vuwPS0VjfNb6jzCbxSPqxKP0AnDTDi978SmR016rjHXm9y/aiy+AHLceBMectQOveo/FU= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=O2eFMblt; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="O2eFMblt" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3A1A0C2BCB8; Mon, 20 Apr 2026 13:17:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1776691072; bh=sXs0e0ZxqTG7ZrHQ090u7oJNMiq7KcY5riKMVzChehA=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=O2eFMbltCjXz3hDprKRQinsbzQFrQ7DCFpkrt4zyp8YB97t/+bc8GBZDSZtNKDY+u CWt0BRcH6SfO/mDMOWOMMO2MGjSdmK4lT4D2rnK6cmDdCr58VYuR5l6OT12AQyifjO HePIMQQtwNZwDuyuCveOsqKDobFSvVI5RsFrSQbFxSoHRZF09eeAlT7XljsiVhvaXz eoTzGzBn+iY9dXZtpWxazaaHabCzWtfxWPrFfB224kc4nuXtejtz2TKWqkgqO3djuw hAL/0Qcp6YGYGsi2H09E9bDf1OyRQpzlGC/agORSxF+UqacUC1S71JisJBot9c4eyw nIiKibYrX14gw== From: Sasha Levin To: patches@lists.linux.dev, stable@vger.kernel.org Cc: Donet Tom , Felix Kuehling , Alex Deucher , Sasha Levin , Felix.Kuehling@amd.com, christian.koenig@amd.com, Xinhui.Pan@amd.com, airlied@gmail.com, daniel@ffwll.ch, amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org Subject: [PATCH AUTOSEL 6.18] drm/amdkfd: Fix queue preemption/eviction failures by aligning control stack size to GPU page size Date: Mon, 20 Apr 2026 09:09:03 -0400 Message-ID: <20260420131539.986432-77-sashal@kernel.org> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260420131539.986432-1-sashal@kernel.org> References: <20260420131539.986432-1-sashal@kernel.org> Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-stable: review X-Patchwork-Hint: Ignore X-stable-base: Linux 6.18.23 Content-Transfer-Encoding: 8bit From: Donet Tom [ Upstream commit 78746a474e92fc7aaed12219bec7c78ae1bd6156 ] The control stack size is calculated based on the number of CUs and waves, and is then aligned to PAGE_SIZE. When the resulting control stack size is aligned to 64 KB, GPU hangs and queue preemption failures are observed while running RCCL unit tests on systems with more than two GPUs. amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: GPU reset begin!. Source: 4 amdgpu 0048:0f:00.0: amdgpu: Queue preemption failed for queue with doorbell_id: 80030008 amdgpu 0048:0f:00.0: amdgpu: Failed to evict process queues amdgpu 0048:0f:00.0: amdgpu: Failed to restore process queues This issue is observed on both 4 KB and 64 KB system page-size configurations. This patch fixes the issue by aligning the control stack size to AMDGPU_GPU_PAGE_SIZE instead of PAGE_SIZE, so the control stack size will not be 64 KB on systems with a 64 KB page size and queue preemption works correctly. Additionally, In the current code, wg_data_size is aligned to PAGE_SIZE, which can waste memory if the system page size is large. In this patch, wg_data_size is aligned to AMDGPU_GPU_PAGE_SIZE. The cwsr_size, calculated from wg_data_size and the control stack size, is aligned to PAGE_SIZE. Reviewed-by: Felix Kuehling Signed-off-by: Donet Tom Signed-off-by: Alex Deucher (cherry picked from commit a3e14436304392fbada359edd0f1d1659850c9b7) Signed-off-by: Sasha Levin --- LLM Generated explanations, may be completely bogus: Error: Failed to generate final synthesis drivers/gpu/drm/amd/amdkfd/kfd_queue.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c index 2822c90bd7be4..b97f4a51db6e3 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_queue.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_queue.c @@ -444,10 +444,11 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev) min(cu_num * 40, props->array_count / props->simd_arrays_per_engine * 512) : cu_num * 32; - wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), PAGE_SIZE); + wg_data_size = ALIGN(cu_num * WG_CONTEXT_DATA_SIZE_PER_CU(gfxv, props), + AMDGPU_GPU_PAGE_SIZE); ctl_stack_size = wave_num * CNTL_STACK_BYTES_PER_WAVE(gfxv) + 8; ctl_stack_size = ALIGN(SIZEOF_HSA_USER_CONTEXT_SAVE_AREA_HEADER + ctl_stack_size, - PAGE_SIZE); + AMDGPU_GPU_PAGE_SIZE); if ((gfxv / 10000 * 10000) == 100000) { /* HW design limits control stack size to 0x7000. @@ -459,7 +460,7 @@ void kfd_queue_ctx_save_restore_size(struct kfd_topology_device *dev) props->ctl_stack_size = ctl_stack_size; props->debug_memory_size = ALIGN(wave_num * DEBUGGER_BYTES_PER_WAVE, DEBUGGER_BYTES_ALIGN); - props->cwsr_size = ctl_stack_size + wg_data_size; + props->cwsr_size = ALIGN(ctl_stack_size + wg_data_size, PAGE_SIZE); if (gfxv == 80002) /* GFX_VERSION_TONGA */ props->eop_buffer_size = 0x8000; -- 2.53.0