patches.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: Jonathan Kim <jonathan.kim@amd.com>,
	Alex Deucher <alexander.deucher@amd.com>,
	Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>,
	Sasha Levin <sashal@kernel.org>,
	Felix.Kuehling@amd.com, amd-gfx@lists.freedesktop.org
Subject: [PATCH AUTOSEL 6.17] drm/amdkfd: fix suspend/resume all calls in mes based eviction path
Date: Mon, 27 Oct 2025 20:38:55 -0400	[thread overview]
Message-ID: <20251028003940.884625-11-sashal@kernel.org> (raw)
In-Reply-To: <20251028003940.884625-1-sashal@kernel.org>

From: Jonathan Kim <jonathan.kim@amd.com>

[ Upstream commit 079ae5118e1f0dcf5b1ab68ffdb5760b06ed79a2 ]

Suspend/resume all gangs should be done with the device lock is held.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

## BACKPORT RECOMMENDATION: YES

### Summary
This commit fixes a **locking correctness bug** in the AMD KFD (GPU
compute) driver's MES-based eviction path. The bug allows
suspend_all_queues_mes() and resume_all_queues_mes() to be called
without holding the required device queue manager lock, creating race
conditions that can cause GPU hangs and system instability.

---

### Semantic Analysis Tools Used

1. **mcp__semcode__find_function**: Located evict_process_queues_cpsch,
   suspend_all_queues_mes, resume_all_queues_mes, and
   kfd_evict_process_device functions
2. **mcp__semcode__find_callers**: Identified 4 direct callers of
   kfd_evict_process_device:
   - kfd_set_dbg_ev_from_interrupt (debug interrupts)
   - kfd_dbg_send_exception_to_runtime (ioctl handler)
   - kfd_signal_vm_fault_event_with_userptr (VM fault handler)
   - cik_event_interrupt_wq (interrupt handler)
3. **mcp__semcode__find_callchain**: Traced call paths showing user-
   space can trigger this via kfd_ioctl_set_debug_trap
4. **Git history analysis**: Determined bug was introduced in v6.12
   (commit 9a16042f02cd0) and fixed in v6.18-rc2

---

### Code Analysis

**The Bug (OLD CODE in kfd_dqm_evict_pasid_mes):**
```c
dqm_lock(dqm);
if (qpd->evicted) { ... }
dqm_unlock(dqm);  // ← Lock released here

ret = suspend_all_queues_mes(dqm);  // ← Called WITHOUT lock
ret = dqm->ops.evict_process_queues(dqm, qpd);
ret = resume_all_queues_mes(dqm);  // ← Called WITHOUT lock
```

The old code released the dqm lock, then called suspend/resume without
re-acquiring it. This violates the locking contract stated in the commit
message: "Suspend/resume all gangs should be done with the device lock
is held."

**The Fix (NEW CODE in evict_process_queues_cpsch):**
```c
dqm_lock(dqm);  // ← Lock held from start
if (dqm->dev->kfd->shared_resources.enable_mes) {
    retval = suspend_all_queues_mes(dqm);  // ← Called WITH lock
    if (retval) goto out;
}
// ... eviction work ...
if (dqm->dev->kfd->shared_resources.enable_mes) {
    retval = resume_all_queues_mes(dqm);  // ← Called WITH lock
}
out:
    dqm_unlock(dqm);  // ← Lock held until end
```

The fix moves suspend/resume calls inside evict_process_queues_cpsch
where the dqm lock is held throughout the entire operation. It also:
- Eliminates the buggy kfd_dqm_evict_pasid_mes wrapper entirely
- Improves error handling with early exit on suspend failure
- Changes error path from continuing with `retval = err` to immediately
  exiting with `goto out`

---

### Impact Assessment

**Severity: Medium-High**
- **User-triggerable:** YES - via ioctl (kfd_ioctl_set_debug_trap) and
  VM fault handlers
- **Affected kernels:** v6.12 through v6.17 (6 major versions)
- **Subsystem criticality:** GPU compute device queue management
- **Potential consequences:**
  - Race conditions during queue eviction
  - GPU hangs and device state corruption
  - System instability
  - Possible kernel crashes

**Scope: Well-contained**
- Changes confined to AMD KFD driver (drivers/gpu/drm/amd/amdkfd/)
- Only affects MES-based queue eviction path
- No architectural changes or API modifications
- Eliminates 44 lines of buggy code, adds proper locking discipline

---

### Stable Tree Compliance

✅ **Fixes an important bug:** Locking correctness issue causing race
conditions
✅ **Does not introduce new features:** Pure bug fix
✅ **No architectural changes:** Refactors existing code path
✅ **Minimal regression risk:** Simplifies code and improves locking
discipline
✅ **Self-contained:** Limited to single driver subsystem
✅ **User-impacting:** Affects systems running AMD GPU compute workloads

⚠️ **Missing Fixes: tag:** The commit doesn't have "Fixes:
9a16042f02cd0" tag, but this doesn't disqualify it from backporting

---

### Recommendation Rationale

1. **Clear bug with clear fix:** The commit message explicitly states
   the locking requirement that was violated
2. **User-reachable code path:** Semantic analysis confirmed user-space
   can trigger this via ioctl
3. **Multi-version impact:** Bug has existed since v6.12 (June 2024),
   affecting 6 kernel versions
4. **Low backport risk:** Code change is straightforward - moves
   function calls inside locked region
5. **Historical precedent:** Similar locking fixes in this subsystem
   have been backported (e.g., commit 70df8273ca0ce)

This commit should be backported to **stable kernels v6.12+** to prevent
GPU hangs and system instability on AMD compute workloads.

 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 73 ++++++-------------
 1 file changed, 21 insertions(+), 52 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
index 6c5c7c1bf5eda..6e7bc983fc0b6 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_device_queue_manager.c
@@ -1209,6 +1209,15 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 	pr_debug_ratelimited("Evicting process pid %d queues\n",
 			    pdd->process->lead_thread->pid);
 
+	if (dqm->dev->kfd->shared_resources.enable_mes) {
+		pdd->last_evict_timestamp = get_jiffies_64();
+		retval = suspend_all_queues_mes(dqm);
+		if (retval) {
+			dev_err(dev, "Suspending all queues failed");
+			goto out;
+		}
+	}
+
 	/* Mark all queues as evicted. Deactivate all active queues on
 	 * the qpd.
 	 */
@@ -1221,23 +1230,27 @@ static int evict_process_queues_cpsch(struct device_queue_manager *dqm,
 		decrement_queue_count(dqm, qpd, q);
 
 		if (dqm->dev->kfd->shared_resources.enable_mes) {
-			int err;
-
-			err = remove_queue_mes(dqm, q, qpd);
-			if (err) {
+			retval = remove_queue_mes(dqm, q, qpd);
+			if (retval) {
 				dev_err(dev, "Failed to evict queue %d\n",
 					q->properties.queue_id);
-				retval = err;
+				goto out;
 			}
 		}
 	}
-	pdd->last_evict_timestamp = get_jiffies_64();
-	if (!dqm->dev->kfd->shared_resources.enable_mes)
+
+	if (!dqm->dev->kfd->shared_resources.enable_mes) {
+		pdd->last_evict_timestamp = get_jiffies_64();
 		retval = execute_queues_cpsch(dqm,
 					      qpd->is_debug ?
 					      KFD_UNMAP_QUEUES_FILTER_ALL_QUEUES :
 					      KFD_UNMAP_QUEUES_FILTER_DYNAMIC_QUEUES, 0,
 					      USE_DEFAULT_GRACE_PERIOD);
+	} else {
+		retval = resume_all_queues_mes(dqm);
+		if (retval)
+			dev_err(dev, "Resuming all queues failed");
+	}
 
 out:
 	dqm_unlock(dqm);
@@ -3098,61 +3111,17 @@ int kfd_dqm_suspend_bad_queue_mes(struct kfd_node *knode, u32 pasid, u32 doorbel
 	return ret;
 }
 
-static int kfd_dqm_evict_pasid_mes(struct device_queue_manager *dqm,
-				   struct qcm_process_device *qpd)
-{
-	struct device *dev = dqm->dev->adev->dev;
-	int ret = 0;
-
-	/* Check if process is already evicted */
-	dqm_lock(dqm);
-	if (qpd->evicted) {
-		/* Increment the evicted count to make sure the
-		 * process stays evicted before its terminated.
-		 */
-		qpd->evicted++;
-		dqm_unlock(dqm);
-		goto out;
-	}
-	dqm_unlock(dqm);
-
-	ret = suspend_all_queues_mes(dqm);
-	if (ret) {
-		dev_err(dev, "Suspending all queues failed");
-		goto out;
-	}
-
-	ret = dqm->ops.evict_process_queues(dqm, qpd);
-	if (ret) {
-		dev_err(dev, "Evicting process queues failed");
-		goto out;
-	}
-
-	ret = resume_all_queues_mes(dqm);
-	if (ret)
-		dev_err(dev, "Resuming all queues failed");
-
-out:
-	return ret;
-}
-
 int kfd_evict_process_device(struct kfd_process_device *pdd)
 {
 	struct device_queue_manager *dqm;
 	struct kfd_process *p;
-	int ret = 0;
 
 	p = pdd->process;
 	dqm = pdd->dev->dqm;
 
 	WARN(debug_evictions, "Evicting pid %d", p->lead_thread->pid);
 
-	if (dqm->dev->kfd->shared_resources.enable_mes)
-		ret = kfd_dqm_evict_pasid_mes(dqm, &pdd->qpd);
-	else
-		ret = dqm->ops.evict_process_queues(dqm, &pdd->qpd);
-
-	return ret;
+	return dqm->ops.evict_process_queues(dqm, &pdd->qpd);
 }
 
 int reserve_debug_trap_vmid(struct device_queue_manager *dqm,
-- 
2.51.0


  parent reply	other threads:[~2025-10-28  0:40 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-28  0:38 [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible memory leak in smb2_read() Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] NFS4: Fix state renewals missing after boot Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: remove two invalid BUG_ON()s Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.15] NFS: check if suid/sgid was cleared after a write as needed Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] HID: logitech-hidpp: Add HIDPP_QUIRK_RESET_HI_RES_SCROLL Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] ASoC: max98090/91: fixed max98091 ALSA widget powering up/down Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] ALSA: hda/realtek: Fix mute led for HP Omen 17-cb0xxx Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.10] RISC-V: clear hot-unplugged cores from all task mm_cpumasks to avoid rfence errors Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] ASoC: nau8821: Avoid unnecessary blocking in IRQ handler Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-5.4] HID: quirks: avoid Cooler Master MM712 dongle wakeup bug Sasha Levin
2025-10-28  0:38 ` Sasha Levin [this message]
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.12] exfat: fix improper check of dentry.stream.valid_size Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] io_uring: fix unexpected placement on same size resizing Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17] drm/amd: Disable ASPM on SI Sasha Levin
2025-10-28  0:38 ` [PATCH AUTOSEL 6.17-6.6] riscv: acpi: avoid errors caused by probing DT devices when ACPI is used Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.1] drm/amd/pm: Disable MCLK switching on SI at high pixel clocks Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] drm/amdgpu: hide VRAM sysfs attributes on GPUs without VRAM Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] NFS4: Apply delay_retrans to async operations Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.1] drm/amdgpu: Fix NULL pointer dereference in VRAM logic for APU devices Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_FEATURES_NEGOTIATE mbox cmd Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] ixgbe: handle IXGBE_VF_GET_PF_LINK_STATE mailbox operation Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.6] HID: quirks: Add ALWAYS_POLL quirk for VRS R295 steering wheel Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17] HID: intel-thc-hid: intel-quickspi: Add ARL PCI Device Id's Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.12] HID: nintendo: Wait longer for initial probe Sasha Levin
2025-10-28  0:39 ` [PATCH AUTOSEL 6.17-6.1] smb/server: fix possible refcount leak in smb2_sess_setup() Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251028003940.884625-11-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=Felix.Kuehling@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=harish.kasiviswanathan@amd.com \
    --cc=jonathan.kim@amd.com \
    --cc=patches@lists.linux.dev \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).