* [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
@ 2026-06-18 5:52 Mikhail Gavrilov
2026-06-18 6:17 ` sashiko-bot
2026-06-18 8:03 ` Christian König
0 siblings, 2 replies; 4+ messages in thread
From: Mikhail Gavrilov @ 2026-06-18 5:52 UTC
To: Alex Deucher, Christian König, Vitaly Prosyak
Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel,
Mikhail Gavrilov
amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
taking a chain of dummy locks in order and calling fs_reclaim_acquire()
in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
teaches lockdep that it is legal to enter memory reclaim with the MMU
notifier lock held:
notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
fs_reclaim
notifier_lock is, however, acquired from the MMU notifier invalidate
callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
reclaim via __mmu_notifier_invalidate_range_start(). That establishes
the mandatory reverse ordering:
fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock
The two together form a cycle. It stays dormant until reclaim first
unmaps a page covered by an amdgpu userptr interval notifier, at which
point kswapd closes the loop and lockdep reports a false circular
locking dependency:
WARNING: possible circular locking dependency detected
kswapd0/268 is trying to acquire lock:
(&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
but task is already holding lock:
(mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one
A lock that is taken inside an MMU notifier callback must never be held
across a reclaiming allocation, so the fs_reclaim annotation does not
belong inside the notifier_lock region. Drop it. The remaining chain
still teaches the intended lock nesting.
Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
1 file changed, 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
index d5d71fd7c70d..eeb3b5007f80 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
@@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
/* Level 6: Reset control lock */
mutex_lock(&reset_ctl.reset_lock);
- /*
- * Mark potential memory reclaim boundary.
- * GPU operations might trigger memory allocation/reclaim.
- */
- fs_reclaim_acquire(GFP_KERNEL);
-
/* Level 7: SRBM register access */
mutex_lock(&srbm_mutex);
@@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
mutex_unlock(&grbm_idx_mutex);
mutex_unlock(&srbm_mutex);
- fs_reclaim_release(GFP_KERNEL);
-
mutex_unlock(&reset_ctl.reset_lock);
up_read(&reset_domain->sem);
mutex_unlock(&vram_lock);
--
2.54.0
* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
From: sashiko-bot @ 2026-06-18 6:17 UTC
To: Mikhail Gavrilov; +Cc: dri-devel
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
Pre-existing issues:
- [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
--
commit 4896f663f62249014930226b2bc964c2419486d0
Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
from amdgpu_lockdep_init() to silence a circular locking dependency warning,
claiming it is a false positive.
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d0..eeb3b5007f80f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> /* Level 6: Reset control lock */
> mutex_lock(&reset_ctl.reset_lock);
>
> - /*
> - * Mark potential memory reclaim boundary.
> - * GPU operations might trigger memory allocation/reclaim.
> - */
> - fs_reclaim_acquire(GFP_KERNEL);
> -
[Severity: High]
This is a pre-existing issue, but does removing this lockdep annotation
silence a true-positive warning about a real ABBA deadlock?
While the circular dependency might not happen in the same thread, couldn't a
cross-thread deadlock occur between memory reclaim and GPU reset?
For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
(e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
dma_resv_wait_timeout() waiting for a hardware fence:
drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
...
mutex_lock(&adev->notifier_lock);
...
r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
DMA_RESV_USAGE_BOOKKEEP, false,
MAX_SCHEDULE_TIMEOUT);
}
If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
it could be forced into direct memory reclaim (fs_reclaim). This would invoke
the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.
Would it be better to fix the underlying deadlock by wrapping the GPU reset
path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
from entering memory reclaim, instead of removing the lockdep annotation?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1
* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
From: Mikhail Gavrilov @ 2026-06-18 6:44 UTC
To: sashiko-reviews, Alex Deucher, Christian König,
Vitaly Prosyak
Cc: dri-devel, David Airlie, Simona Vetter, amd-gfx list,
Linux List Kernel Mailing
On Thu, Jun 18, 2026 at 11:17 AM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
> --
>
> commit 4896f663f62249014930226b2bc964c2419486d0
> Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>
> drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
>
> The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
> from amdgpu_lockdep_init() to silence a circular locking dependency warning,
> claiming it is a false positive.
>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > index d5d71fd7c70d0..eeb3b5007f80f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> > /* Level 6: Reset control lock */
> > mutex_lock(&reset_ctl.reset_lock);
> >
> > - /*
> > - * Mark potential memory reclaim boundary.
> > - * GPU operations might trigger memory allocation/reclaim.
> > - */
> > - fs_reclaim_acquire(GFP_KERNEL);
> > -
>
> [Severity: High]
> This is a pre-existing issue, but does removing this lockdep annotation
> silence a true-positive warning about a real ABBA deadlock?
>
> While the circular dependency might not happen in the same thread, couldn't a
> cross-thread deadlock occur between memory reclaim and GPU reset?
>
> For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
> (e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
> dma_resv_wait_timeout() waiting for a hardware fence:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
> ...
> mutex_lock(&adev->notifier_lock);
> ...
> r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
> DMA_RESV_USAGE_BOOKKEEP, false,
> MAX_SCHEDULE_TIMEOUT);
> }
>
> If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
> to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
> it could be forced into direct memory reclaim (fs_reclaim). This would invoke
> the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.
>
> Would it be better to fix the underlying deadlock by wrapping the GPU reset
> path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
> from entering memory reclaim, instead of removing the lockdep annotation?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1
The reported splat is not the reset/reclaim deadlock described here.
It is single-threaded: kswapd holds fs_reclaim and the mmu_notifier range,
then takes notifier_lock in amdgpu_hmm_invalidate_gfx(). That is the normal,
mandatory direction fs_reclaim -> mmu_notifier -> notifier_lock. The MMU
notifier callback runs from inside reclaim, so notifier_lock is acquired
below fs_reclaim and is never held across a reclaiming allocation.
amdgpu_lockdep_init() asserts the opposite edge, notifier_lock -> fs_reclaim,
by calling fs_reclaim_acquire() while notifier_lock is held. That edge does
not occur at runtime, so the reported cycle is a false positive. Dropping the
annotation removes the impossible edge and touches no real lock.
On the cross-thread reset case: if the reset path really holds
reset_domain->sem across a GFP_KERNEL allocation, lockdep learns
reset_sem -> fs_reclaim from that real allocation, not from this annotation.
The fs_reclaim_acquire() here adds nothing for that real edge; it only injects
the impossible notifier_lock -> fs_reclaim one. And because it fires on the
innocent kswapd path, it calls debug_locks_off() and disables lockdep for the
rest of the boot, which would prevent detecting exactly that reset deadlock.
Note memalloc_noreclaim_save() on the reset path would not silence this splat:
the false edge lives in amdgpu_lockdep_init(), independent of the reset path.
The splat reproduces with a userptr BO + MADV_PAGEOUT and is gone after this
change; I verified both.
If reset is confirmed to allocate under reset_domain->sem with reclaim, that
is a real and separate issue and memalloc_noreclaim_save() there would be
reasonable, but it is a different patch and does not change this one.
--
Best Regards,
Mike Gavrilov.
* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
From: Christian König @ 2026-06-18 8:03 UTC
To: Mikhail Gavrilov, Alex Deucher, Vitaly Prosyak
Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel
On 6/18/26 07:52, Mikhail Gavrilov wrote:
> amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
> taking a chain of dummy locks in order and calling fs_reclaim_acquire()
> in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
> is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
> teaches lockdep that it is legal to enter memory reclaim with the MMU
> notifier lock held:
>
> notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
> fs_reclaim
>
> notifier_lock is, however, acquired from the MMU notifier invalidate
> callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
> reclaim via __mmu_notifier_invalidate_range_start(). That establishes
> the mandatory reverse ordering:
>
> fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock
>
> The two together form a cycle. It stays dormant until reclaim first
> unmaps a page covered by an amdgpu userptr interval notifier, at which
> point kswapd closes the loop and lockdep reports a false circular
> locking dependency:
>
> WARNING: possible circular locking dependency detected
> kswapd0/268 is trying to acquire lock:
> (&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
> but task is already holding lock:
> (mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one
>
> A lock that is taken inside an MMU notifier callback must never be held
> across a reclaiming allocation, so the fs_reclaim annotation does not
> belong inside the notifier_lock region. Drop it. The remaining chain
> still teaches the intended lock nesting.
>
> Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
> 1 file changed, 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d..eeb3b5007f80 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> /* Level 6: Reset control lock */
> mutex_lock(&reset_ctl.reset_lock);
>
> - /*
> - * Mark potential memory reclaim boundary.
> - * GPU operations might trigger memory allocation/reclaim.
> - */
> - fs_reclaim_acquire(GFP_KERNEL);
> -
This shouldn't be removed, but instead moved outside the notifier lock.
The notifier lock and vram_lock are also in the wrong order.
@Vitaly can you take care of fixing that? Thanks in advance.
amdgpu's VM eviction lock needs to be handled here as well, but it has another ordering bug with fs_reclaim_acquire().
Patches to fix that are pending on the amdgpu mailing list, but I need to find time to work on them.
Thanks for the report,
Christian.
> /* Level 7: SRBM register access */
> mutex_lock(&srbm_mutex);
>
> @@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
> mutex_unlock(&grbm_idx_mutex);
> mutex_unlock(&srbm_mutex);
>
> - fs_reclaim_release(GFP_KERNEL);
> -
> mutex_unlock(&reset_ctl.reset_lock);
> up_read(&reset_domain->sem);
> mutex_unlock(&vram_lock);