* [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
@ 2026-06-18 5:52 Mikhail Gavrilov
2026-06-18 6:17 ` sashiko-bot
2026-06-18 8:03 ` Christian König
0 siblings, 2 replies; 4+ messages in thread
From: Mikhail Gavrilov @ 2026-06-18 5:52 UTC (permalink / raw)
To: Alex Deucher, Christian König, Vitaly Prosyak
Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel,
Mikhail Gavrilov
amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
taking a chain of dummy locks in order and calling fs_reclaim_acquire()
in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
teaches lockdep that it is legal to enter memory reclaim with the MMU
notifier lock held:
notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
fs_reclaim
notifier_lock is, however, acquired from the MMU notifier invalidate
callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
reclaim via __mmu_notifier_invalidate_range_start(). That establishes
the mandatory reverse ordering:
fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock
The two together form a cycle. It stays dormant until reclaim first
unmaps a page covered by an amdgpu userptr interval notifier, at which
point kswapd closes the loop and lockdep reports a false circular
locking dependency:
WARNING: possible circular locking dependency detected
kswapd0/268 is trying to acquire lock:
(&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
but task is already holding lock:
(mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one
A lock that is taken inside an MMU notifier callback must never be held
across a reclaiming allocation, so the fs_reclaim annotation does not
belong inside the notifier_lock region. Drop it. The remaining chain
still teaches the intended lock nesting.
Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
index d5d71fd7c70d..eeb3b5007f80 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
@@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
/* Level 6: Reset control lock */
mutex_lock(&reset_ctl.reset_lock);
- /*
- * Mark potential memory reclaim boundary.
- * GPU operations might trigger memory allocation/reclaim.
- */
- fs_reclaim_acquire(GFP_KERNEL);
-
/* Level 7: SRBM register access */
mutex_lock(&srbm_mutex);
@@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
mutex_unlock(&grbm_idx_mutex);
mutex_unlock(&srbm_mutex);
- fs_reclaim_release(GFP_KERNEL);
-
mutex_unlock(&reset_ctl.reset_lock);
up_read(&reset_domain->sem);
mutex_unlock(&vram_lock);
--
2.54.0
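
Editorial note: the two orderings in the commit message above can be reproduced in a small userspace model. The sketch below is not kernel code and does not call lockdep. The lock class names, `record_chain()` and `kswapd_path_reports_cycle()` are hypothetical helpers made up for illustration. It records "taken while held" edges in a graph and reports a cycle when a new acquisition contradicts an existing path, which is roughly the idea behind lockdep's circular dependency check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes for this model; the names mirror the thread. */
enum lock_class {
	NOTIFIER_LOCK,
	VRAM_LOCK,
	RESET_SEM,
	RESET_LOCK,
	FS_RECLAIM,
	MMU_NOTIFIER_RANGE,
	NR_CLASSES
};

struct order_graph {
	/* edge[a][b] is set when b was acquired while a was held */
	bool edge[NR_CLASSES][NR_CLASSES];
};

static void graph_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
static bool reachable(const struct order_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	return false;
}

/*
 * Record one nested acquisition sequence (outermost lock first).
 * Returns true if any new edge would close a cycle in the graph.
 */
static bool record_chain(struct order_graph *g,
			 const enum lock_class *chain, int len)
{
	bool cycle = false;

	for (int i = 1; i < len; i++) {
		for (int held = 0; held < i; held++) {
			bool seen[NR_CLASSES] = { false };

			assert(chain[i] < NR_CLASSES && chain[held] < NR_CLASSES);
			/* An existing path new -> held plus held -> new is a cycle. */
			if (reachable(g, chain[i], chain[held], seen))
				cycle = true;
			g->edge[chain[held]][chain[i]] = true;
		}
	}
	return cycle;
}

/* Train the graph, then replay the reclaim path described in the patch. */
static bool kswapd_path_reports_cycle(bool train_with_fs_reclaim)
{
	static const enum lock_class training[] = {
		NOTIFIER_LOCK, VRAM_LOCK, RESET_SEM, RESET_LOCK, FS_RECLAIM
	};
	static const enum lock_class reclaim[] = {
		FS_RECLAIM, MMU_NOTIFIER_RANGE, NOTIFIER_LOCK
	};
	struct order_graph g;
	int training_len = train_with_fs_reclaim ? 5 : 4;

	graph_init(&g);
	record_chain(&g, training, training_len);
	return record_chain(&g, reclaim, 3);
}
```

In this model, `kswapd_path_reports_cycle(true)` finds the cycle because the training chain already recorded notifier_lock -> fs_reclaim, while `kswapd_path_reports_cycle(false)` does not. The model only illustrates the ordering argument; it takes no position on where the annotation should ultimately live.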

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
2026-06-18 5:52 [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training Mikhail Gavrilov
@ 2026-06-18 6:17 ` sashiko-bot
2026-06-18 6:44 ` Mikhail Gavrilov
2026-06-18 8:03 ` Christian König
1 sibling, 1 reply; 4+ messages in thread
From: sashiko-bot @ 2026-06-18 6:17 UTC (permalink / raw)
To: Mikhail Gavrilov; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
--

commit 4896f663f62249014930226b2bc964c2419486d0
Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>

drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training

The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
from amdgpu_lockdep_init() to silence a circular locking dependency warning,
claiming it is a false positive.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d0..eeb3b5007f80f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> /* Level 6: Reset control lock */
> mutex_lock(&reset_ctl.reset_lock);
>
> - /*
> - * Mark potential memory reclaim boundary.
> - * GPU operations might trigger memory allocation/reclaim.
> - */
> - fs_reclaim_acquire(GFP_KERNEL);
> -

[Severity: High]
This is a pre-existing issue, but does removing this lockdep annotation
silence a true-positive warning about a real ABBA deadlock?

While the circular dependency might not happen in the same thread, couldn't a
cross-thread deadlock occur between memory reclaim and GPU reset?

For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
(e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
dma_resv_wait_timeout() waiting for a hardware fence:

drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
    ...
    mutex_lock(&adev->notifier_lock);
    ...
    r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
                              DMA_RESV_USAGE_BOOKKEEP, false,
                              MAX_SCHEDULE_TIMEOUT);
}

If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
it could be forced into direct memory reclaim (fs_reclaim). This would invoke
the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.

Would it be better to fix the underlying deadlock by wrapping the GPU reset
path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
from entering memory reclaim, instead of removing the lockdep annotation?

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
2026-06-18 6:17 ` sashiko-bot
@ 2026-06-18 6:44 ` Mikhail Gavrilov
0 siblings, 0 replies; 4+ messages in thread
From: Mikhail Gavrilov @ 2026-06-18 6:44 UTC (permalink / raw)
To: sashiko-reviews, Alex Deucher, Christian König, Vitaly Prosyak
Cc: dri-devel, David Airlie, Simona Vetter, amd-gfx list,
Linux List Kernel Mailing

On Thu, Jun 18, 2026 at 11:17 AM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
> --
>
> commit 4896f663f62249014930226b2bc964c2419486d0
> Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>
> drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
>
> The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
> from amdgpu_lockdep_init() to silence a circular locking dependency warning,
> claiming it is a false positive.
>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > index d5d71fd7c70d0..eeb3b5007f80f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> > /* Level 6: Reset control lock */
> > mutex_lock(&reset_ctl.reset_lock);
> >
> > - /*
> > - * Mark potential memory reclaim boundary.
> > - * GPU operations might trigger memory allocation/reclaim.
> > - */
> > - fs_reclaim_acquire(GFP_KERNEL);
> > -
>
> [Severity: High]
> This is a pre-existing issue, but does removing this lockdep annotation
> silence a true-positive warning about a real ABBA deadlock?
>
> While the circular dependency might not happen in the same thread, couldn't a
> cross-thread deadlock occur between memory reclaim and GPU reset?
>
> For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
> (e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
> dma_resv_wait_timeout() waiting for a hardware fence:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
>     ...
>     mutex_lock(&adev->notifier_lock);
>     ...
>     r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
>                               DMA_RESV_USAGE_BOOKKEEP, false,
>                               MAX_SCHEDULE_TIMEOUT);
> }
>
> If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
> to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
> it could be forced into direct memory reclaim (fs_reclaim). This would invoke
> the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.
>
> Would it be better to fix the underlying deadlock by wrapping the GPU reset
> path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
> from entering memory reclaim, instead of removing the lockdep annotation?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1

The reported splat is not the reset/reclaim deadlock described here. It
is single-threaded: kswapd holds fs_reclaim and the mmu_notifier range,
then takes notifier_lock in amdgpu_hmm_invalidate_gfx(). That is the
normal, mandatory direction fs_reclaim -> mmu_notifier -> notifier_lock.
The MMU notifier callback runs from inside reclaim, so notifier_lock is
acquired below fs_reclaim and is never held across a reclaiming
allocation.

amdgpu_lockdep_init() asserts the opposite edge, notifier_lock ->
fs_reclaim, by calling fs_reclaim_acquire() while notifier_lock is held.
That edge does not occur at runtime, so the reported cycle is a false
positive. Dropping the annotation removes the impossible edge and
touches no real lock.

On the cross-thread reset case: if the reset path really holds
reset_domain->sem across a GFP_KERNEL allocation, lockdep learns
reset_sem -> fs_reclaim from that real allocation, not from this
annotation. The fs_reclaim_acquire() here adds nothing for that real
edge; it only injects the impossible notifier_lock -> fs_reclaim one.
And because it fires on the innocent kswapd path, it calls
debug_locks_off() and disables lockdep for the rest of the boot, which
would prevent detecting exactly that reset deadlock.

Note memalloc_noreclaim_save() on the reset path would not silence this
splat: the false edge lives in amdgpu_lockdep_init(), independent of the
reset path. The splat reproduces with a userptr BO + MADV_PAGEOUT and is
gone after this change; I verified both.

If reset is confirmed to allocate under reset_domain->sem with reclaim,
that is a real and separate issue and memalloc_noreclaim_save() there
would be reasonable, but it is a different patch and does not change
this one.

--
Best Regards,
Mike Gavrilov.
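
Editorial note: the point in the reply above, that the first report turns the validator off and can hide a later genuine inversion, can be illustrated with a userspace toy. This is a sketch under stated assumptions, not lockdep: `struct validator`, `note_nesting()` and `later_inversion_reported()` are hypothetical names, and `LOCK_A`/`LOCK_B` stand in for any later, genuinely inverted lock pair rather than for specific amdgpu locks. The `debug_locks` flag models the effect of debug_locks_off() described in the message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes; LOCK_A/LOCK_B are placeholders for a real bug. */
enum lock_class { NOTIFIER_LOCK, FS_RECLAIM, LOCK_A, LOCK_B, NR_CLASSES };

struct validator {
	bool edge[NR_CLASSES][NR_CLASSES]; /* edge[a][b]: b taken while a held */
	bool debug_locks;                  /* cleared by the first report */
	int reports;
};

/* Depth-first search over recorded edges. */
static bool path_exists(const struct validator *v, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_CLASSES; next++)
		if (v->edge[from][next] && !seen[next] &&
		    path_exists(v, next, to, seen))
			return true;
	return false;
}

/* Record that 'taken' was acquired while 'held' was held. */
static void note_nesting(struct validator *v, enum lock_class held,
			 enum lock_class taken)
{
	bool seen[NR_CLASSES] = { false };

	assert(held < NR_CLASSES && taken < NR_CLASSES);
	if (!v->debug_locks)
		return; /* validator already switched off by an earlier report */

	if (path_exists(v, taken, held, seen)) {
		v->reports++;
		v->debug_locks = false; /* one report, then no more checking */
		return;
	}
	v->edge[held][taken] = true;
}

/* Does a later, genuine A/B inversion still get reported? */
static bool later_inversion_reported(bool train_false_edge)
{
	struct validator v = { .debug_locks = true };
	int before;

	if (train_false_edge)
		note_nesting(&v, NOTIFIER_LOCK, FS_RECLAIM); /* training annotation */
	note_nesting(&v, FS_RECLAIM, NOTIFIER_LOCK);         /* kswapd path */

	before = v.reports;
	note_nesting(&v, LOCK_A, LOCK_B);
	note_nesting(&v, LOCK_B, LOCK_A);                    /* genuine inversion */
	return v.reports > before;
}
```

With the false training edge, the kswapd path uses up the single report and `later_inversion_reported(true)` returns false. Without it, the later inversion is the first thing reported and `later_inversion_reported(false)` returns true.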

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
2026-06-18 5:52 [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training Mikhail Gavrilov
2026-06-18 6:17 ` sashiko-bot
@ 2026-06-18 8:03 ` Christian König
1 sibling, 0 replies; 4+ messages in thread
From: Christian König @ 2026-06-18 8:03 UTC (permalink / raw)
To: Mikhail Gavrilov, Alex Deucher, Vitaly Prosyak
Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel

On 6/18/26 07:52, Mikhail Gavrilov wrote:
> amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
> taking a chain of dummy locks in order and calling fs_reclaim_acquire()
> in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
> is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
> teaches lockdep that it is legal to enter memory reclaim with the MMU
> notifier lock held:
>
> notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
> fs_reclaim
>
> notifier_lock is, however, acquired from the MMU notifier invalidate
> callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
> reclaim via __mmu_notifier_invalidate_range_start(). That establishes
> the mandatory reverse ordering:
>
> fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock
>
> The two together form a cycle. It stays dormant until reclaim first
> unmaps a page covered by an amdgpu userptr interval notifier, at which
> point kswapd closes the loop and lockdep reports a false circular
> locking dependency:
>
> WARNING: possible circular locking dependency detected
> kswapd0/268 is trying to acquire lock:
> (&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
> but task is already holding lock:
> (mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one
>
> A lock that is taken inside an MMU notifier callback must never be held
> across a reclaiming allocation, so the fs_reclaim annotation does not
> belong inside the notifier_lock region. Drop it. The remaining chain
> still teaches the intended lock nesting.
>
> Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
> 1 file changed, 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d..eeb3b5007f80 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> /* Level 6: Reset control lock */
> mutex_lock(&reset_ctl.reset_lock);
>
> - /*
> - * Mark potential memory reclaim boundary.
> - * GPU operations might trigger memory allocation/reclaim.
> - */
> - fs_reclaim_acquire(GFP_KERNEL);
> -

This shouldn't be removed, but instead moved outside the notifier lock.

The notifier lock and vram_lock are also in incorrect order.

@Vitaly can you take care of fixing that? Thanks in advance.

Amdgpus VM eviction lock needs to be handled here as well, but has
another ordering bug with fs_reclaim_acquire(). Patches to fix that are
pending on the amdgpu mailing list, but I need to find time to work on
them.

Thanks for the report,
Christian.

> /* Level 7: SRBM register access */
> mutex_lock(&srbm_mutex);
>
> @@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
> mutex_unlock(&grbm_idx_mutex);
> mutex_unlock(&srbm_mutex);
>
> - fs_reclaim_release(GFP_KERNEL);
> -
> mutex_unlock(&reset_ctl.reset_lock);
> up_read(&reset_domain->sem);
> mutex_unlock(&vram_lock);