From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Jonathan Kim <jonathan.kim@amd.com>,
Alex Deucher <alexander.deucher@amd.com>,
Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>,
Sasha Levin <sashal@kernel.org>,
Felix.Kuehling@amd.com, amd-gfx@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/amdkfd: fix suspend/resume all calls in mes based eviction path
Date: Mon, 27 Oct 2025 20:38:55 -0400 [thread overview]
Message-ID: <20251028003940.884625-11-sashal@kernel.org> (raw)
In-Reply-To: <20251028003940.884625-1-sashal@kernel.org>
From: Jonathan Kim <jonathan.kim@amd.com>
[ Upstream commit 079ae5118e1f0dcf5b1ab68ffdb5760b06ed79a2 ]
Suspend/resume all gangs should be done with the device lock is held.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
## BACKPORT RECOMMENDATION: YES
### Summary
This commit fixes a **locking correctness bug** in the AMD KFD (GPU
compute) driver's MES-based eviction path. The bug allows
suspend_all_queues_mes() and resume_all_queues_mes() to be called
without holding the required device queue manager lock, creating race
conditions that can cause GPU hangs and system instability.
---
### Semantic Analysis Tools Used
1. **mcp__semcode__find_function**: Located evict_process_queues_cpsch,
suspend_all_queues_mes, resume_all_queues_mes, and
kfd_evict_process_device functions
2. **mcp__semcode__find_callers**: Identified 4 direct callers of
kfd_evict_process_device:
- kfd_set_dbg_ev_from_interrupt (debug interrupts)
- kfd_dbg_send_exception_to_runtime (ioctl handler)
- kfd_signal_vm_fault_event_with_userptr (VM fault handler)
- cik_event_interrupt_wq (interrupt handler)
3. **mcp__semcode__find_callchain**: Traced call paths showing user-
space can trigger this via kfd_ioctl_set_debug_trap
4. **Git history analysis**: Determined bug was introduced in v6.12
(commit 9a16042f02cd0) and fixed in v6.18-rc2
---
### Code Analysis
**The Bug (OLD CODE in kfd_dqm_evict_pasid_mes):**
```c
dqm_lock(dqm);
if (qpd->evicted) { ... }
dqm_unlock(dqm); // ← Lock released here
ret = suspend_all_queues_mes(dqm); // ← Called WITHOUT lock
ret = dqm->ops.evict_process_queues(dqm, qpd);
ret = resume_all_queues_mes(dqm); // ← Called WITHOUT lock
```
The old code released the dqm lock, then called suspend/resume without
re-acquiring it. This violates the locking contract stated in the commit
message: "Suspend/resume all gangs should be done with the device lock
is held."
**The Fix (NEW CODE in evict_process_queues_cpsch):**
```c
dqm_lock(dqm); // ← Lock held from start
if (dqm->dev->kfd->shared_resources.enable_mes) {
retval = suspend_all_queues_mes(dqm); // ← Called WITH lock
if (retval) goto out;
}
// ... eviction work ...
if (dqm->dev->kfd->shared_resources.enable_mes) {
retval = resume_all_queues_mes(dqm); // ← Called WITH lock
}
out:
dqm_unlock(dqm); // ← Lock held until end
```
The fix moves suspend/resume calls inside evict_process_queues_cpsch
where the dqm lock is held throughout the entire operation. It also:
- Eliminates the buggy kfd_dqm_evict_pasid_mes wrapper entirely
- Improves error handling with early exit on suspend failure
- Changes error path from continuing with `retval = err` to immediately
exiting with `goto out`
---
### Impact Assessment
**Severity: Medium-High**
- **User-triggerable:** YES - via ioctl (kfd_ioctl_set_debug_trap) and
VM fault handlers
- **Affected kernels:** v6.12 through v6.17 (6 major versions)
- **Subsystem criticality:** GPU compute device queue management
- **Potential consequences:**
- Race conditions during queue eviction
- GPU hangs and device state corruption
- System instability
- Possible kernel crashes
**Scope: Well-contained**
- Changes confined to AMD KFD driver (drivers/gpu/drm/amd/amdkfd/)
- Only affects MES-based queue eviction path
- No architectural changes or API modifications
- Eliminates 44 lines of buggy code, adds proper locking discipline
---
### Stable Tree Compliance
✅ **Fixes an important bug:** Locking correctness issue causing race
conditions
✅ **Does not introduce new features:** Pure bug fix
✅ **No architectural changes:** Refactors existing code path
✅ **Minimal regression risk:** Simplifies code and improves locking
discipline
✅ **Self-contained:** Limited to single driver subsystem
✅ **User-impacting:** Affects systems running AMD GPU compute workloads
⚠️ **Missing Fixes: tag:** The commit doesn't have "Fixes:
9a16042f02cd0" tag, but this doesn't disqualify it from backporting
---
### Recommendation Rationale
1. **Clear bug with clear fix:** The commit message explicitly states
the locking requirement that was violated
2. **User-reachable code path:** Semantic analysis confirmed user-space
can trigger this via ioctl
3. **Multi-version impact:** Bug has existed since v6.12 (June 2024),
affecting 6 kernel versions
4. **Low backport risk:** Code change is straightforward - moves
function calls inside locked region
5. **Historical precedent:** Similar locking fixes in this subsystem
have been backported (e.g., commit 70df8273ca0ce)
This commit should be backported to **stable kernels v6.12+** to prevent
GPU hangs and system instability on AMD compute workloads.
.../drm/amd/amdkfd/kfd_device_queue_manager.c | 73 ++++++-------------
1 file changed, 21 insertions(+), 52 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 6c5c7c1bf5eda..6e7bc983fc0b6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1209,6 +1209,15 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
pr_debug_ratelimited("Evicting process pid %d queues\n",
pdd->process->lead_thread->pid);
+ if (dqm->dev->kfd->shared_resources.enable_mes) {
+ pdd->last_evict_timestamp = get_jiffies_64();
+ retval = suspend_all_queues_mes(dqm);
+ if (retval) {
+ dev_err(dev, "Suspending all queues failed");
+ goto out;
+ }
+ }
+
/* Mark all queues as evicted. Deactivate all active queues on
* the qpd.
*/
@@ -1221,23 +1230,27 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
decrement_queue_count(dqm, qpd, q);
if (dqm->dev->kfd->shared_resources.enable_mes) {
- int err;
-
- err = remove_queue_mes(dqm, q, qpd);
- if (err) {
+ retval = remove_queue_mes(dqm, q, qpd);
+ if (retval) {
dev_err(dev, "Failed to evict queue %d\n",
q->properties.queue_id);
- retval = err;
+ goto out;
}
}
}
- pdd->last_evict_timestamp = get_jiffies_64();
- if (!dqm->dev->kfd->shared_resources.enable_mes)
+
+ if (!dqm->dev->kfd->shared_resources.enable_mes) {
+ pdd->last_evict_timestamp = get_jiffies_64();
retval = execute_queues_cpsch(dqm,
qpd->is_debug ?
KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES :
KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
USE_DEFAULT_GRACE_PERIOD);
+ } else {
+ retval = resume_all_queues_mes(dqm);
+ if (retval)
+ dev_err(dev, "Resuming all queues failed");
+ }
out:
dqm_unlock(dqm);
@@ -3098,61 +3111,17 @@ int kfd_dqm_suspend_bad_queue_mes(struct kfd_node *knode, u32 pasid, u32 doorbel
return ret;
}
-static int kfd_dqm_evict_pasid_mes(struct device_queue_manager *dqm,
- struct qcm_process_device *qpd)
-{
- struct device *dev = dqm->dev->adev->dev;
- int ret = 0;
-
- /* Check if process is already evicted */
- dqm_lock(dqm);
- if (qpd->evicted) {
- /* Increment the evicted count to make sure the
- * process stays evicted before its terminated.
- */
- qpd->evicted++;
- dqm_unlock(dqm);
- goto out;
- }
- dqm_unlock(dqm);
-
- ret = suspend_all_queues_mes(dqm);
- if (ret) {
- dev_err(dev, "Suspending all queues failed");
- goto out;
- }
-
- ret = dqm->ops.evict_process_queues(dqm, qpd);
- if (ret) {
- dev_err(dev, "Evicting process queues failed");
- goto out;
- }
-
- ret = resume_all_queues_mes(dqm);
- if (ret)
- dev_err(dev, "Resuming all queues failed");
-
-out:
- return ret;
-}
-
int kfd_evict_process_device(struct kfd_process_device *pdd)
{
struct device_queue_manager *dqm;
struct kfd_process *p;
- int ret = 0;
p = pdd->process;
dqm = pdd->dev->dqm;
WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
- if (dqm->dev->kfd->shared_resources.enable_mes)
- ret = kfd_dqm_evict_pasid_mes(dqm, &pdd->qpd);
- else
- ret = dqm->ops.evict_process_queues(dqm, &pdd->qpd);
-
- return ret;
+ return dqm->ops.evict_process_queues(dqm, &pdd->qpd);
}
int reserve_debug_trap_vmid(struct device_queue_manager *dqm,
--
2.51.0
next prev parent reply other threads:[~2025-10-28 0:40 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-28 0:38 [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible memory leak in smb2_read() Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-5.4] NFS4: Fix state renewals missing after boot Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: remove two invalid BUG_ON()s Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-5.15] NFS: check if suid/sgid was cleared after a write as needed Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-6.12] HID: logitech-hidpp: Add HIDPP_QUIRK_RESET_HI_RES_SCROLL Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-5.4] ASoC: max98090/91: fixed max98091 ALSA widget powering up/down Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Fix mute led for HP Omen 17-cb0xxx Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-5.10] RISC-V: clear hot-unplugged cores from all task mm_cpumasks to avoid rfence errors Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17] ASoC: nau8821: Avoid unnecessary blocking in IRQ handler Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-5.4] HID: quirks: avoid Cooler Master MM712 dongle wakeup bug Sasha Levin
2025-10-28 0:38 ` Sasha Levin [this message]
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-6.12] exfat: fix improper check of dentry.stream.valid_size Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17] io_uring: fix unexpected placement on same size resizing Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17] drm/amd: Disable ASPM on SI Sasha Levin
2025-10-28 0:38 ` [PATCH AUTOSEL 6.17-6.6] riscv: acpi: avoid errors caused by probing DT devices when ACPI is used Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.1] drm/amd/pm: Disable MCLK switching on SI at high pixel clocks Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.12] NFS4: Apply delay_retrans to async operations Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.1] drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_FEATURES_NEGOTIATE mbox cmd Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_GET_PF_LINK_STATE mailbox operation Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.6] HID: quirks: Add ALWAYS_POLL quirk for VRS R295 steering wheel Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17] HID: intel-thc-hid: intel-quickspi: Add ARL PCI Device Id's Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.12] HID: nintendo: Wait longer for initial probe Sasha Levin
2025-10-28 0:39 ` [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible refcount leak in smb2_sess_setup() Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251028003940.884625-11-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=Felix.Kuehling@amd.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=harish.kasiviswanathan@amd.com \
--cc=jonathan.kim@amd.com \
--cc=patches@lists.linux.dev \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).