[PATCH] drm/amdgpu: do not enter fs_reclaim under notifier

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
@ 2026-06-18  5:52 Mikhail Gavrilov
  2026-06-18  6:17 ` sashiko-bot
  2026-06-18  8:03 ` Christian König
  0 siblings, 2 replies; 4+ messages in thread
From: Mikhail Gavrilov @ 2026-06-18  5:52 UTC (permalink / raw)
  To: Alex Deucher, Christian König, Vitaly Prosyak
  Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel,
	Mikhail Gavrilov

amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
taking a chain of dummy locks in order and calling fs_reclaim_acquire()
in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
teaches lockdep that it is legal to enter memory reclaim with the MMU
notifier lock held:

  notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
  fs_reclaim

notifier_lock is, however, acquired from the MMU notifier invalidate
callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
reclaim via __mmu_notifier_invalidate_range_start(). That establishes
the mandatory reverse ordering:

  fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock

The two together form a cycle. It stays dormant until reclaim first
unmaps a page covered by an amdgpu userptr interval notifier, at which
point kswapd closes the loop and lockdep reports a false circular
locking dependency:

  WARNING: possible circular locking dependency detected
  kswapd0/268 is trying to acquire lock:
   (&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
  but task is already holding lock:
   (mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one

A lock that is taken inside an MMU notifier callback must never be held
across a reclaiming allocation, so the fs_reclaim annotation does not
belong inside the notifier_lock region. Drop it. The remaining chain
still teaches the intended lock nesting.

Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
index d5d71fd7c70d..eeb3b5007f80 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
@@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
 	/* Level 6: Reset control lock */
 	mutex_lock(&reset_ctl.reset_lock);

-	/*
-	 * Mark potential memory reclaim boundary.
-	 * GPU operations might trigger memory allocation/reclaim.
-	 */
-	fs_reclaim_acquire(GFP_KERNEL);
-
 	/* Level 7: SRBM register access */
 	mutex_lock(&srbm_mutex);

@@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
 	mutex_unlock(&grbm_idx_mutex);
 	mutex_unlock(&srbm_mutex);

-	fs_reclaim_release(GFP_KERNEL);
-
 	mutex_unlock(&reset_ctl.reset_lock);
 	up_read(&reset_domain->sem);
 	mutex_unlock(&vram_lock);
-- 
2.54.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
  2026-06-18  5:52 [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training Mikhail Gavrilov
@ 2026-06-18  6:17 ` sashiko-bot
  2026-06-18  6:44   ` Mikhail Gavrilov
  2026-06-18  8:03 ` Christian König
  1 sibling, 1 reply; 4+ messages in thread
From: sashiko-bot @ 2026-06-18  6:17 UTC (permalink / raw)
  To: Mikhail Gavrilov; +Cc: dri-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
--

commit 4896f663f62249014930226b2bc964c2419486d0
Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>

drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training

The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
from amdgpu_lockdep_init() to silence a circular locking dependency warning,
claiming it is a false positive.

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d0..eeb3b5007f80f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
>  	/* Level 6: Reset control lock */
>  	mutex_lock(&reset_ctl.reset_lock);
>  
> -	/*
> -	 * Mark potential memory reclaim boundary.
> -	 * GPU operations might trigger memory allocation/reclaim.
> -	 */
> -	fs_reclaim_acquire(GFP_KERNEL);
> -

[Severity: High]
This is a pre-existing issue, but does removing this lockdep annotation
silence a true-positive warning about a real ABBA deadlock?

While the circular dependency might not happen in the same thread, couldn't a
cross-thread deadlock occur between memory reclaim and GPU reset?

For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
(e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
dma_resv_wait_timeout() waiting for a hardware fence:

drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
	...
	mutex_lock(&adev->notifier_lock);
	...
	r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
				  DMA_RESV_USAGE_BOOKKEEP, false,
				  MAX_SCHEDULE_TIMEOUT);
}

If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
it could be forced into direct memory reclaim (fs_reclaim). This would invoke
the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.

Would it be better to fix the underlying deadlock by wrapping the GPU reset
path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
from entering memory reclaim, instead of removing the lockdep annotation?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
  2026-06-18  6:17 ` sashiko-bot
@ 2026-06-18  6:44   ` Mikhail Gavrilov
  0 siblings, 0 replies; 4+ messages in thread
From: Mikhail Gavrilov @ 2026-06-18  6:44 UTC (permalink / raw)
  To: sashiko-reviews, Alex Deucher, Christian König,
	Vitaly Prosyak
  Cc: dri-devel, David Airlie, Simona Vetter, amd-gfx list,
	Linux List Kernel Mailing

On Thu, Jun 18, 2026 at 11:17 AM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
>
> Pre-existing issues:
> - [High] Removing the `fs_reclaim` lockdep annotation silences a true-positive warning about a real ABBA deadlock between memory reclaim and GPU reset, rather than fixing the underlying bug.
> --
>
> commit 4896f663f62249014930226b2bc964c2419486d0
> Author: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
>
> drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
>
> The commit removes the fs_reclaim_acquire() and fs_reclaim_release() calls
> from amdgpu_lockdep_init() to silence a circular locking dependency warning,
> claiming it is a false positive.
>
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > index d5d71fd7c70d0..eeb3b5007f80f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> > @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
> >       /* Level 6: Reset control lock */
> >       mutex_lock(&reset_ctl.reset_lock);
> >
> > -     /*
> > -      * Mark potential memory reclaim boundary.
> > -      * GPU operations might trigger memory allocation/reclaim.
> > -      */
> > -     fs_reclaim_acquire(GFP_KERNEL);
> > -
>
> [Severity: High]
> This is a pre-existing issue, but does removing this lockdep annotation
> silence a true-positive warning about a real ABBA deadlock?
>
> While the circular dependency might not happen in the same thread, couldn't a
> cross-thread deadlock occur between memory reclaim and GPU reset?
>
> For example, if Thread A executes the MMU notifier amdgpu_hmm_invalidate_gfx()
> (e.g., during direct reclaim), it acquires adev->notifier_lock and blocks in
> dma_resv_wait_timeout() waiting for a hardware fence:
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_hmm.c:amdgpu_hmm_invalidate_gfx() {
>         ...
>         mutex_lock(&adev->notifier_lock);
>         ...
>         r = dma_resv_wait_timeout(vm_root->tbo.base.resv,
>                                   DMA_RESV_USAGE_BOOKKEEP, false,
>                                   MAX_SCHEDULE_TIMEOUT);
> }
>
> If Thread B (the GPU reset thread) takes down_write(&adev->reset_domain->sem)
> to reset the GPU and signal the fences, and performs GFP_KERNEL allocations,
> it could be forced into direct memory reclaim (fs_reclaim). This would invoke
> the MMU notifier, attempting to acquire adev->notifier_lock, forming a cycle.
>
> Would it be better to fix the underlying deadlock by wrapping the GPU reset
> path with memalloc_noreclaim_save() or memalloc_noio_save() to prevent it
> from entering memory reclaim, instead of removing the lockdep annotation?
>
> --
> Sashiko AI review · https://sashiko.dev/#/patchset/20260618055216.56191-1-mikhail.v.gavrilov@gmail.com?part=1

The reported splat is not the reset/reclaim deadlock described here.

It is single-threaded: kswapd holds fs_reclaim and the mmu_notifier range,
then takes notifier_lock in amdgpu_hmm_invalidate_gfx(). That is the normal,
mandatory direction fs_reclaim -> mmu_notifier -> notifier_lock. The MMU
notifier callback runs from inside reclaim, so notifier_lock is acquired
below fs_reclaim and is never held across a reclaiming allocation.

amdgpu_lockdep_init() asserts the opposite edge, notifier_lock -> fs_reclaim,
by calling fs_reclaim_acquire() while notifier_lock is held. That edge does
not occur at runtime, so the reported cycle is a false positive. Dropping the
annotation removes the impossible edge and touches no real lock.

On the cross-thread reset case: if the reset path really holds
reset_domain->sem across a GFP_KERNEL allocation, lockdep learns
reset_sem -> fs_reclaim from that real allocation, not from this annotation.
The fs_reclaim_acquire() here adds nothing for that real edge; it only injects
the impossible notifier_lock -> fs_reclaim one. And because it fires on the
innocent kswapd path, it calls debug_locks_off() and disables lockdep for the
rest of the boot, which would prevent detecting exactly that reset deadlock.

Note memalloc_noreclaim_save() on the reset path would not silence this splat:
the false edge lives in amdgpu_lockdep_init(), independent of the reset path.
The splat reproduces with a userptr BO + MADV_PAGEOUT and is gone after this
change; I verified both.

If reset is confirmed to allocate under reset_domain->sem with reclaim, that
is a real and separate issue and memalloc_noreclaim_save() there would be
reasonable, but it is a different patch and does not change this one.

-- 
Best Regards,
Mike Gavrilov.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training
  2026-06-18  5:52 [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training Mikhail Gavrilov
  2026-06-18  6:17 ` sashiko-bot
@ 2026-06-18  8:03 ` Christian König
  1 sibling, 0 replies; 4+ messages in thread
From: Christian König @ 2026-06-18  8:03 UTC (permalink / raw)
  To: Mikhail Gavrilov, Alex Deucher, Vitaly Prosyak
  Cc: David Airlie, Simona Vetter, amd-gfx, dri-devel, linux-kernel

On 6/18/26 07:52, Mikhail Gavrilov wrote:
> amdgpu_lockdep_init() trains lockdep on the driver lock ordering by
> taking a chain of dummy locks in order and calling fs_reclaim_acquire()
> in the middle of it. The fs_reclaim_acquire()/fs_reclaim_release() pair
> is placed while notifier_lock (amdgpu_notifier_lock_key) is held, which
> teaches lockdep that it is legal to enter memory reclaim with the MMU
> notifier lock held:
> 
>   notifier_lock -> vram_lock -> reset_domain->sem -> reset_lock ->
>   fs_reclaim
> 
> notifier_lock is, however, acquired from the MMU notifier invalidate
> callback amdgpu_hmm_invalidate_gfx(), which mm/ runs from inside memory
> reclaim via __mmu_notifier_invalidate_range_start(). That establishes
> the mandatory reverse ordering:
> 
>   fs_reclaim -> mmu_notifier_invalidate_range_start -> notifier_lock
> 
> The two together form a cycle. It stays dormant until reclaim first
> unmaps a page covered by an amdgpu userptr interval notifier, at which
> point kswapd closes the loop and lockdep reports a false circular
> locking dependency:
> 
>   WARNING: possible circular locking dependency detected
>   kswapd0/268 is trying to acquire lock:
>    (&amdgpu_notifier_lock_key){+.+.}, at: amdgpu_hmm_invalidate_gfx
>   but task is already holding lock:
>    (mmu_notifier_invalidate_range_start){+.+.}, at: try_to_unmap_one
> 
> A lock that is taken inside an MMU notifier callback must never be held
> across a reclaiming allocation, so the fs_reclaim annotation does not
> belong inside the notifier_lock region. Drop it. The remaining chain
> still teaches the intended lock nesting.
> 
> Fixes: 1d0f5838b126 ("drm/amdgpu: Add lockdep annotations for lock ordering validation")
> Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c | 8 --------
>  1 file changed, 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> index d5d71fd7c70d..eeb3b5007f80 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_lockdep.c
> @@ -150,12 +150,6 @@ int amdgpu_lockdep_init(void)
>  	/* Level 6: Reset control lock */
>  	mutex_lock(&reset_ctl.reset_lock);
>  
> -	/*
> -	 * Mark potential memory reclaim boundary.
> -	 * GPU operations might trigger memory allocation/reclaim.
> -	 */
> -	fs_reclaim_acquire(GFP_KERNEL);
> -

This shouldn't be removed, but instead moved outside the notifier lock.

The notifier lock and vram_lock are also in incorrect order.

@Vitaly can you take care of fixing that? Thanks in advance.

Amdgpus VM eviction lock needs to be handled here as well, but has another ordering bug with fs_reclaim_acquire().

Patches to fix that are pending on the amdgpu mailing list, but I need to find time to work on them.

Thanks for the report,
Christian.

>  	/* Level 7: SRBM register access */
>  	mutex_lock(&srbm_mutex);
>  
> @@ -175,8 +169,6 @@ int amdgpu_lockdep_init(void)
>  	mutex_unlock(&grbm_idx_mutex);
>  	mutex_unlock(&srbm_mutex);
>  
> -	fs_reclaim_release(GFP_KERNEL);
> -
>  	mutex_unlock(&reset_ctl.reset_lock);
>  	up_read(&reset_domain->sem);
>  	mutex_unlock(&vram_lock);


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-06-18  8:03 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-18  5:52 [PATCH] drm/amdgpu: do not enter fs_reclaim under notifier_lock in lockdep training Mikhail Gavrilov
2026-06-18  6:17 ` sashiko-bot
2026-06-18  6:44   ` Mikhail Gavrilov
2026-06-18  8:03 ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.