[PATCH AUTOSEL 6.17-6.12] drm/amdkfd: Fix mmap write lock not release


From: Sasha Levin @ 2025-10-26 14:49 UTC (permalink / raw)
  To: patches, stable
  Cc: Philip Yang, Harish Kasiviswanathan, Alex Deucher, Sasha Levin,
	Felix.Kuehling, amd-gfx

From: Philip Yang <Philip.Yang@amd.com>

[ Upstream commit 7574f30337e19045f03126b4c51f525b84e5049e ]

If the mmap write lock is taken while draining retry faults, it is never
released, because svm_range_restore_pages calls mmap_read_unlock and then
returns. This causes a deadlock and later system hangs, because neither
the mmap read lock nor the mmap write lock can be taken.

Downgrading the mmap write lock to a read lock when draining retry faults
fixes this bug.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---

LLM Generated explanations, may be completely bogus:

YES
- `svm_range_restore_pages()` can take `mmap_write_lock()` when it must
  recreate a missing SVM range. If the retry-fault draining check then
  fires before the range is rebuilt, the function returns `-EAGAIN` while
  still holding the write lock; the later `mmap_read_unlock()` does not
  release it, so the mm’s mmap_lock stays stuck and future faults hang
  (`drivers/gpu/drm/amd/amdkfd/kfd_svm.c:3022-3029`).
- The fix simply downgrades the lock back to read mode before that early
  return (`drivers/gpu/drm/amd/amdkfd/kfd_svm.c:3026-3027`), matching
  the existing teardown path already used when range creation fails
  (`drivers/gpu/drm/amd/amdkfd/kfd_svm.c:3053-3063`). This ensures the
  subsequent `mmap_read_unlock()` actually drops the lock.
- The regression was introduced by commit f844732e3ad9 (“drm/amdgpu: Fix
  the race condition for draining retry fault”), which is already in
  v6.15 and newer tags, so affected stable trees will deadlock under
  retry-fault draining unless they get this fix.
- Change is tiny, self-contained, and follows existing locking
  conventions; no new APIs or behavioral changes beyond correcting the
  lock lifecycle, so regression risk is low while preventing a user-
  visible hang in the GPU fault handler
  (`drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c:2896-2920`).
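
The lock lifecycle described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions: every
`toy_*` name is hypothetical, the lock is a plain state variable rather
than the kernel's mmap_lock, the order of steps is simplified, and a
mismatched read unlock is modeled as a no-op that leaves the write hold in
place. It is not the driver code.

```c
#include <stdbool.h>

/* Toy stand-in for mm->mmap_lock; NOT the kernel's rw_semaphore. */
enum toy_lock_state { TOY_UNLOCKED, TOY_READ_HELD, TOY_WRITE_HELD };

struct toy_mm {
	enum toy_lock_state lock;
};

static void toy_read_lock(struct toy_mm *mm)
{
	mm->lock = TOY_READ_HELD;
}

static void toy_write_lock(struct toy_mm *mm)
{
	mm->lock = TOY_WRITE_HELD;
}

/* Models a write-to-read downgrade: only meaningful on a write hold. */
static void toy_write_downgrade(struct toy_mm *mm)
{
	if (mm->lock == TOY_WRITE_HELD)
		mm->lock = TOY_READ_HELD;
}

/*
 * Modeling assumption: a read unlock releases only a read hold.
 * On a write hold it does nothing, so the lock stays taken.
 */
static void toy_read_unlock(struct toy_mm *mm)
{
	if (mm->lock == TOY_READ_HELD)
		mm->lock = TOY_UNLOCKED;
}

/*
 * Simplified model of the fault-restore path. Returns the lock state
 * left behind when the function exits.
 *   range_missing: the path had to switch to the write lock
 *   draining:      the early "drop fault, return -EAGAIN" exit is taken
 *   with_fix:      downgrade before the early exit, as the patch does
 */
static enum toy_lock_state toy_restore_pages(bool range_missing,
					     bool draining, bool with_fix)
{
	struct toy_mm mm = { TOY_UNLOCKED };
	bool write_locked = false;

	toy_read_lock(&mm);
	if (range_missing) {
		/* Drop the read hold and retake the lock in write mode. */
		toy_read_unlock(&mm);
		toy_write_lock(&mm);
		write_locked = true;
	}

	if (draining) {
		if (with_fix && write_locked)
			toy_write_downgrade(&mm);
		goto out_unlock;	/* early exit path */
	}

	/* Regular path: downgrade once the write lock is no longer needed. */
	if (write_locked)
		toy_write_downgrade(&mm);

out_unlock:
	toy_read_unlock(&mm);	/* the common exit always does a read unlock */
	return mm.lock;
}
```

In this model, `toy_restore_pages(true, true, false)` leaves the lock in
`TOY_WRITE_HELD`, which corresponds to the hang, while
`toy_restore_pages(true, true, true)` ends in `TOY_UNLOCKED`.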

Possible next steps:
1. Queue 8cf4d56246c236935fc87384b2e2e32d12f57b91 for all stable
   branches that contain f844732e3ad9.
2. Run KFD retry-fault stress tests (e.g., `kfdtest --stress`) on an
   affected GPU to confirm the hang no longer occurs.

 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index 3d8b20828c068..6fa08f12cb429 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3023,6 +3023,8 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid,
 	if (svms->checkpoint_ts[gpuidx] != 0) {
 		if (amdgpu_ih_ts_after_or_equal(ts,  svms->checkpoint_ts[gpuidx])) {
 			pr_debug("draining retry fault, drop fault 0x%llx\n", addr);
+			if (write_locked)
+				mmap_write_downgrade(mm);
 			r = -EAGAIN;
 			goto out_unlock_svms;
 		} else {
-- 
2.51.0

