netdev.vger.kernel.org archive mirror
* [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
@ 2025-03-21  9:30 Breno Leitao
  2025-03-21 10:37 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Breno Leitao @ 2025-03-21  9:30 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
  Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team,
	Erik Lundgren, Breno Leitao, Paul E. McKenney

lockdep_unregister_key() is called from critical code paths, including
sections where rtnl_lock() is held. For example, when replacing a qdisc
in a network device, network egress traffic is disabled while
__qdisc_destroy() is called for every network queue.

If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
which gets blocked waiting for synchronize_rcu() to complete.

For example, a simple tc command to replace a qdisc could take 13
seconds:

  # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
    real    0m13.195s
    user    0m0.001s
    sys     0m2.746s

During this time, network egress is completely frozen while waiting for
RCU synchronization.

Use synchronize_rcu_expedited() instead to minimize the impact on
critical operations like network connectivity changes.

This speeds up the tc invocation roughly 10x when replacing the qdisc
of a network card:

   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
     real     0m1.789s
     user     0m0.000s
     sys      0m1.613s

Reported-by: Erik Lundgren <elundgren@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
---
 kernel/locking/lockdep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4470680f02269..a79030ac36dd4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
 	if (need_callback)
 		call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
 
-	/* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
-	synchronize_rcu();
+	/* Wait until is_dynamic_key() has finished accessing k->hash_entry.
+	 * This needs to be quick, since it is called in critical sections
+	 */
+	synchronize_rcu_expedited();
 }
 EXPORT_SYMBOL_GPL(lockdep_unregister_key);
 

---
base-commit: 81e4f8d68c66da301bb881862735bd74c6241a19
change-id: 20250319-lockdep-b1eca0479665

Best regards,
-- 
Breno Leitao <leitao@debian.org>


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-21  9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao
@ 2025-03-21 10:37 ` Eric Dumazet
  2025-03-21 14:22   ` Breno Leitao
  2025-03-24 12:12 ` Peter Zijlstra
  2025-07-09 10:00 ` Breno Leitao
  2 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2025-03-21 10:37 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote:
>
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
>
> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> which gets blocked waiting for synchronize_rcu() to complete.
>
> For example, a simple tc command to replace a qdisc could take 13
> seconds:
>
>   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>     real    0m13.195s
>     user    0m0.001s
>     sys     0m2.746s
>
> During this time, network egress is completely frozen while waiting for
> RCU synchronization.
>
> Use synchronize_rcu_expedited() instead to minimize the impact on
> critical operations like network connectivity changes.

Running 'critical operations' with LOCKDEP enabled kernels seems a bit
strange :)

Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-21 10:37 ` Eric Dumazet
@ 2025-03-21 14:22   ` Breno Leitao
  0 siblings, 0 replies; 35+ messages in thread
From: Breno Leitao @ 2025-03-21 14:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Fri, Mar 21, 2025 at 11:37:31AM +0100, Eric Dumazet wrote:
> On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > lockdep_unregister_key() is called from critical code paths, including
> > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > in a network device, network egress traffic is disabled while
> > __qdisc_destroy() is called for every network queue.
> >
> > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > which gets blocked waiting for synchronize_rcu() to complete.
> >
> > For example, a simple tc command to replace a qdisc could take 13
> > seconds:
> >
> >   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> >     real    0m13.195s
> >     user    0m0.001s
> >     sys     0m2.746s
> >
> > During this time, network egress is completely frozen while waiting for
> > RCU synchronization.
> >
> > Use synchronize_rcu_expedited() instead to minimize the impact on
> > critical operations like network connectivity changes.
> 
> Running 'critical operations' with LOCKDEP enabled kernels seems a bit
> strange :)

Apologies, I meant to say that at Meta, we have certain service tiers
that can accommodate the performance trade-offs of running a "debug"
kernel. This kernel provides valuable feedback about its operations.

The aim is to identify any significant issues early on, rather than
having to debug "hard issues" (such as deadlocks, out-of-memory access,
etc) once the kernel is in production.

> Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks for your support during this investigation,
--breno

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-21  9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao
  2025-03-21 10:37 ` Eric Dumazet
@ 2025-03-24 12:12 ` Peter Zijlstra
  2025-03-24 12:23   ` Eric Dumazet
  2025-07-09 10:00 ` Breno Leitao
  2 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2025-03-24 12:12 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh,
	linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
> 
> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> which gets blocked waiting for synchronize_rcu() to complete.
> 
> For example, a simple tc command to replace a qdisc could take 13
> seconds:
> 
>   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>     real    0m13.195s
>     user    0m0.001s
>     sys     0m2.746s
> 
> During this time, network egress is completely frozen while waiting for
> RCU synchronization.
> 
> Use synchronize_rcu_expedited() instead to minimize the impact on
> critical operations like network connectivity changes.
> 
> This improves 10x the function call to tc, when replacing the qdisc for
> a network card.
> 
>    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>      real     0m1.789s
>      user     0m0.000s
>      sys      0m1.613s
> 
> Reported-by: Erik Lundgren <elundgren@meta.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> ---
>  kernel/locking/lockdep.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 4470680f02269..a79030ac36dd4 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
>  	if (need_callback)
>  		call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
>  
> -	/* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> -	synchronize_rcu();
> +	/* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> +	 * This needs to be quick, since it is called in critical sections
> +	 */
> +	synchronize_rcu_expedited();
>  }
>  EXPORT_SYMBOL_GPL(lockdep_unregister_key);

So I fundamentally despise synchronize_rcu_expedited(), also your
comment style is broken.

Why can't qdisc call this outside of the lock?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-24 12:12 ` Peter Zijlstra
@ 2025-03-24 12:23   ` Eric Dumazet
  2025-03-24 12:24     ` Eric Dumazet
  2025-03-24 19:21     ` Boqun Feng
  0 siblings, 2 replies; 35+ messages in thread
From: Eric Dumazet @ 2025-03-24 12:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Breno Leitao, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> > lockdep_unregister_key() is called from critical code paths, including
> > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > in a network device, network egress traffic is disabled while
> > __qdisc_destroy() is called for every network queue.
> >
> > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > which gets blocked waiting for synchronize_rcu() to complete.
> >
> > For example, a simple tc command to replace a qdisc could take 13
> > seconds:
> >
> >   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> >     real    0m13.195s
> >     user    0m0.001s
> >     sys     0m2.746s
> >
> > During this time, network egress is completely frozen while waiting for
> > RCU synchronization.
> >
> > Use synchronize_rcu_expedited() instead to minimize the impact on
> > critical operations like network connectivity changes.
> >
> > This improves 10x the function call to tc, when replacing the qdisc for
> > a network card.
> >
> >    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> >      real     0m1.789s
> >      user     0m0.000s
> >      sys      0m1.613s
> >
> > Reported-by: Erik Lundgren <elundgren@meta.com>
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> > ---
> >  kernel/locking/lockdep.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 4470680f02269..a79030ac36dd4 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> >       if (need_callback)
> >               call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> >
> > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > -     synchronize_rcu();
> > +     /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > +      * This needs to be quick, since it is called in critical sections
> > +      */
> > +     synchronize_rcu_expedited();
> >  }
> >  EXPORT_SYMBOL_GPL(lockdep_unregister_key);
>
> So I fundamentally despise synchronize_rcu_expedited(), also your
> comment style is broken.
>
> Why can't qdisc call this outside of the lock?

Good luck with that, and anyway the time to call it 256 times would
still hurt Breno's use case.

My suggestion was to change the lockdep_unregister_key() contract, and
use kfree_rcu() there:

> I think we should redesign lockdep_unregister_key() to work on a separately
> allocated piece of memory,
> then use kfree_rcu() in it.
>
> Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
>
> struct ... {
>      struct lock_class_key key;
>      struct rcu_head  rcu;
> }

More work because it requires changing all lockdep_unregister_key() users.
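
For illustration, a minimal sketch of how a caller could use the
separately-allocated key suggested above; lockdep_unregister_key_norcu()
and the struct/function names here are hypothetical, not existing APIs:

	/* Hypothetical sketch only: the key is allocated separately so the
	 * caller can free it after an RCU grace period instead of waiting
	 * synchronously inside lockdep_unregister_key().
	 */
	struct dyn_key {
		struct lock_class_key key;
		struct rcu_head rcu;
	};

	static void dyn_key_release(struct dyn_key *dk)
	{
		/* unhash the key; no synchronize_rcu() in this variant */
		lockdep_unregister_key_norcu(&dk->key);
		/* memory stays valid for any concurrent is_dynamic_key() walker */
		kfree_rcu(dk, rcu);
	}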

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-24 12:23   ` Eric Dumazet
@ 2025-03-24 12:24     ` Eric Dumazet
  2025-03-24 19:21     ` Boqun Feng
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2025-03-24 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Breno Leitao, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Mon, Mar 24, 2025 at 1:23 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> > > lockdep_unregister_key() is called from critical code paths, including
> > > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > > in a network device, network egress traffic is disabled while
> > > __qdisc_destroy() is called for every network queue.
> > >
> > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > > which gets blocked waiting for synchronize_rcu() to complete.
> > >
> > > For example, a simple tc command to replace a qdisc could take 13
> > > seconds:
> > >
> > >   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > >     real    0m13.195s
> > >     user    0m0.001s
> > >     sys     0m2.746s
> > >
> > > During this time, network egress is completely frozen while waiting for
> > > RCU synchronization.
> > >
> > > Use synchronize_rcu_expedited() instead to minimize the impact on
> > > critical operations like network connectivity changes.
> > >
> > > This improves 10x the function call to tc, when replacing the qdisc for
> > > a network card.
> > >
> > >    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > >      real     0m1.789s
> > >      user     0m0.000s
> > >      sys      0m1.613s
> > >
> > > Reported-by: Erik Lundgren <elundgren@meta.com>
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
> > > ---
> > >  kernel/locking/lockdep.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 4470680f02269..a79030ac36dd4 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > >       if (need_callback)
> > >               call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > >
> > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > -     synchronize_rcu();
> > > +     /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > +      * This needs to be quick, since it is called in critical sections
> > > +      */
> > > +     synchronize_rcu_expedited();
> > >  }
> > >  EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> >
> > So I fundamentally despise synchronize_rcu_expedited(), also your
> > comment style is broken.
> >
> > Why can't qdisc call this outside of the lock?
>
> Good luck with that, and anyway the time to call it 256 times would
> still hurt Breno use case.
>
> My suggestion was to change lockdep_unregister_key() contract, and use
> kfree_rcu() there
>
> > I think we should redesign lockdep_unregister_key() to work on a separately
> > allocated piece of memory,
> > then use kfree_rcu() in it.
> >
> > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> >
> > struct ... {
> >      struct lock_class_key key;
> >      struct rcu_head  rcu;
> > }
>
> More work because it requires changing all lockdep_unregister_key() users.

Or add a new function, basically allowing users to be converted when they need to.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-24 12:23   ` Eric Dumazet
  2025-03-24 12:24     ` Eric Dumazet
@ 2025-03-24 19:21     ` Boqun Feng
  2025-03-24 19:30       ` Boqun Feng
  1 sibling, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-24 19:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon,
	Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
[...]
> > > ---
> > >  kernel/locking/lockdep.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 4470680f02269..a79030ac36dd4 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > >       if (need_callback)
> > >               call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > >
> > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > -     synchronize_rcu();

I find even the old comment a bit confusing; normally I would expect
the caller of lockdep_unregister_key() to guarantee the key has been
unpublished, in other words, that there is no way a
lockdep_unregister_key() could race with a
register_lock_class()/lockdep_init_map_type(). The synchronize_rcu()
would not be needed then.

Let's say someone breaks my assumption above; then, when doing a
register_lock_class() with a key about to be unregistered, I cannot see
anything that stops the following:

	CPU 0				CPU 1
	=====				=====
	register_lock_class():
	  ...
	  } else if (... && !is_dynamic_key(lock->key)) {
	  	// ->key is not unregistered yet, so this branch is not
		// taken.
	  	return NULL;
	  }
	  				lockdep_unregister_key(..);
					// key unregister, can be free
					// any time.
	  key = lock->key->subkeys + subclass; // BOOM! UAF.

So either we don't need the synchronize_rcu() here or the
synchronize_rcu() doesn't help at all. Am I missing something subtle
here?

Regards,
Boqun

> > > +     /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > +      * This needs to be quick, since it is called in critical sections
> > > +      */
> > > +     synchronize_rcu_expedited();
> > >  }
> > >  EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> >
> > So I fundamentally despise synchronize_rcu_expedited(), also your
> > comment style is broken.
> >
> > Why can't qdisc call this outside of the lock?
> 
> Good luck with that, and anyway the time to call it 256 times would
> still hurt Breno use case.
> 
> My suggestion was to change lockdep_unregister_key() contract, and use
> kfree_rcu() there
> 
> > I think we should redesign lockdep_unregister_key() to work on a separately
> > allocated piece of memory,
> > then use kfree_rcu() in it.
> >
> > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> >
> > struct ... {
> >      struct lock_class_key key;
> >      struct rcu_head  rcu;
> > }
> 
> More work because it requires changing all lockdep_unregister_key() users.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-24 19:21     ` Boqun Feng
@ 2025-03-24 19:30       ` Boqun Feng
  2025-03-25  0:47         ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-24 19:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon,
	Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> [...]
> > > > ---
> > > >  kernel/locking/lockdep.c | 6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > index 4470680f02269..a79030ac36dd4 100644
> > > > --- a/kernel/locking/lockdep.c
> > > > +++ b/kernel/locking/lockdep.c
> > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > >       if (need_callback)
> > > >               call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > >
> > > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > -     synchronize_rcu();
> 
> I feel a bit confusing even for the old comment, normally I would expect
> the caller of lockdep_unregister_key() should guarantee the key has been
> unpublished, in other words, there is no way a lockdep_unregister_key()
> could race with a register_lock_class()/lockdep_init_map_type(). The
> synchronize_rcu() is not needed then.
> 
> Let's say someone breaks my assumption above, then when doing a
> register_lock_class() with a key about to be unregister, I cannot see
> anything stops the following:
> 
> 	CPU 0				CPU 1
> 	=====				=====
> 	register_lock_class():
> 	  ...
> 	  } else if (... && !is_dynamic_key(lock->key)) {
> 	  	// ->key is not unregistered yet, so this branch is not
> 		// taken.
> 	  	return NULL;
> 	  }
> 	  				lockdep_unregister_key(..);
> 					// key unregister, can be free
> 					// any time.
> 	  key = lock->key->subkeys + subclass; // BOOM! UAF.
> 
> So either we don't need the synchronize_rcu() here or the
> synchronize_rcu() doesn't help at all. Am I missing something subtle
> here?
> 

Oh! Maybe I was missing that register_lock_class() must be called with
irqs disabled, which is also an RCU read-side critical section.

Regards,
Boqun

> Regards,
> Boqun
> 
> > > > +     /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
> > > > +      * This needs to be quick, since it is called in critical sections
> > > > +      */
> > > > +     synchronize_rcu_expedited();
> > > >  }
> > > >  EXPORT_SYMBOL_GPL(lockdep_unregister_key);
> > >
> > > So I fundamentally despise synchronize_rcu_expedited(), also your
> > > comment style is broken.
> > >
> > > Why can't qdisc call this outside of the lock?
> > 
> > Good luck with that, and anyway the time to call it 256 times would
> > still hurt Breno use case.
> > 
> > My suggestion was to change lockdep_unregister_key() contract, and use
> > kfree_rcu() there
> > 
> > > I think we should redesign lockdep_unregister_key() to work on a separately
> > > allocated piece of memory,
> > > then use kfree_rcu() in it.
> > >
> > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to
> > >
> > > struct ... {
> > >      struct lock_class_key key;
> > >      struct rcu_head  rcu;
> > > }
> > 
> > More work because it requires changing all lockdep_unregister_key() users.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-24 19:30       ` Boqun Feng
@ 2025-03-25  0:47         ` Boqun Feng
  2025-03-25  1:56           ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-25  0:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon,
	Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> > [...]
> > > > > ---
> > > > >  kernel/locking/lockdep.c | 6 ++++--
> > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > > index 4470680f02269..a79030ac36dd4 100644
> > > > > --- a/kernel/locking/lockdep.c
> > > > > +++ b/kernel/locking/lockdep.c
> > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > >       if (need_callback)
> > > > >               call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > > >
> > > > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > > -     synchronize_rcu();
> > 
> > I feel a bit confusing even for the old comment, normally I would expect
> > the caller of lockdep_unregister_key() should guarantee the key has been
> > unpublished, in other words, there is no way a lockdep_unregister_key()
> > could race with a register_lock_class()/lockdep_init_map_type(). The
> > synchronize_rcu() is not needed then.
> > 
> > Let's say someone breaks my assumption above, then when doing a
> > register_lock_class() with a key about to be unregister, I cannot see
> > anything stops the following:
> > 
> > 	CPU 0				CPU 1
> > 	=====				=====
> > 	register_lock_class():
> > 	  ...
> > 	  } else if (... && !is_dynamic_key(lock->key)) {
> > 	  	// ->key is not unregistered yet, so this branch is not
> > 		// taken.
> > 	  	return NULL;
> > 	  }
> > 	  				lockdep_unregister_key(..);
> > 					// key unregister, can be free
> > 					// any time.
> > 	  key = lock->key->subkeys + subclass; // BOOM! UAF.
> > 
> > So either we don't need the synchronize_rcu() here or the
> > synchronize_rcu() doesn't help at all. Am I missing something subtle
> > here?
> > 
> 
> Oh! Maybe I was missing register_lock_class() must be called with irq
> disabled, which is also an RCU read-side critical section.
> 

Since register_lock_class() will be called with irqs disabled, maybe
hazard pointers [1] are a better fit, because in most cases we only have
nr_cpus readers, so the potential hazard pointer slots are fixed.

So the patch below reduces the time of the tc command from ~1.7 seconds
real time (v6.14) to ~0.05 seconds (v6.14 + patch) in my test env, which
is not surprising given it is a dedicated hazard pointer for
lock_class_key.

Thoughts?

[1]: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/

Regards,
Boqun

---------------------------------->8
From: Boqun Feng <boqun.feng@gmail.com>
Date: Mon, 24 Mar 2025 13:38:15 -0700
Subject: [PATCH] WIP

Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/locking/lockdep.c | 65 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4470680f0226..0b6e78d56cfe 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -111,6 +111,29 @@ late_initcall(kernel_lockdep_sysctls_init);
 DEFINE_PER_CPU(unsigned int, lockdep_recursion);
 EXPORT_PER_CPU_SYMBOL_GPL(lockdep_recursion);
 
+/* hazptr free always clears the slot */
+DEFINE_FREE(lockdep_key_hazptr, struct lock_class_key **, if (_T) smp_store_release((_T), NULL));
+DEFINE_PER_CPU(struct lock_class_key *, lockdep_key_hazptr);
+
+static void synchronize_lockdep_key_hazptr(struct lock_class_key *key)
+{
+	int cpu;
+
+	if (!key)
+		return;
+	/*
+	 * Synchronizes with the smp_mb() after protecting the pointer with
+	 * WRITE_ONCE().
+	 */
+	smp_mb();
+
+	for_each_possible_cpu(cpu) {
+		struct lock_class_key **hazptr = per_cpu_ptr(&lockdep_key_hazptr, cpu);
+
+		smp_cond_load_acquire(hazptr, VAL != key);
+	}
+}
+
 static __always_inline bool lockdep_enabled(void)
 {
 	if (!debug_locks)
@@ -1283,6 +1306,7 @@ static struct lock_class *
 register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 {
 	struct lockdep_subclass_key *key;
+	struct lock_class_key **key_hazptr __free(lockdep_key_hazptr) = NULL;
 	struct hlist_head *hash_head;
 	struct lock_class *class;
 	int idx;
@@ -1293,11 +1317,25 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 	if (likely(class))
 		goto out_set_class_cache;
 
+	/* Interrupts are disabled hence it's safe to acquire the hazptr slot */
+	key_hazptr = this_cpu_ptr(&lockdep_key_hazptr);
+
+	/* hazptr slot must be unused */
+	BUG_ON(*key_hazptr);
+
 	if (!lock->key) {
 		if (!assign_lock_key(lock))
 			return NULL;
-	} else if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
-		return NULL;
+	} else {
+		/* hazptr: protect the key */
+		WRITE_ONCE(*key_hazptr, lock->key);
+
+		/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
+		smp_mb();
+
+		if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
+			return NULL;
+		}
 	}
 
 	key = lock->key->subkeys + subclass;
@@ -4964,16 +5002,35 @@ void lockdep_init_map_type(struct lockdep_map *lock, const char *name,
 	 */
 	if (DEBUG_LOCKS_WARN_ON(!key))
 		return;
+
+	preempt_disable();
+	struct lock_class_key **hazptr __free(lockdep_key_hazptr) = this_cpu_ptr(&lockdep_key_hazptr);
+
+	/* hazptr: Protect the key */
+	WRITE_ONCE(*hazptr, key);
+
+	/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
+	smp_mb();
+
 	/*
 	 * Sanity check, the lock-class key must either have been allocated
 	 * statically or must have been registered as a dynamic key.
 	 */
 	if (!static_obj(key) && !is_dynamic_key(key)) {
+		preempt_enable();
 		if (debug_locks)
 			printk(KERN_ERR "BUG: key %px has not been registered!\n", key);
 		DEBUG_LOCKS_WARN_ON(1);
 		return;
 	}
+
+	/* hazptr: Release the key */
+	smp_store_release(hazptr, NULL);
+
+	/* Do not auto clean up */
+	hazptr = NULL;
+	preempt_enable();
+
 	lock->key = key;
 
 	if (unlikely(!debug_locks))
@@ -6595,8 +6652,8 @@ void lockdep_unregister_key(struct lock_class_key *key)
 	if (need_callback)
 		call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
 
-	/* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
-	synchronize_rcu();
+	/* Wait until register_lock_class() has finished accessing key. */
+	synchronize_lockdep_key_hazptr(key);
 }
 EXPORT_SYMBOL_GPL(lockdep_unregister_key);
 
-- 
2.48.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25  0:47         ` Boqun Feng
@ 2025-03-25  1:56           ` Waiman Long
  2025-03-25  3:41             ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-25  1:56 UTC (permalink / raw)
  To: Boqun Feng, Eric Dumazet
  Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh,
	linux-kernel, netdev, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On 3/24/25 8:47 PM, Boqun Feng wrote:
> On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
>> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
>>> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
>>> [...]
>>>>>> ---
>>>>>>   kernel/locking/lockdep.c | 6 ++++--
>>>>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>>> index 4470680f02269..a79030ac36dd4 100644
>>>>>> --- a/kernel/locking/lockdep.c
>>>>>> +++ b/kernel/locking/lockdep.c
>>>>>> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
>>>>>>        if (need_callback)
>>>>>>                call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
>>>>>>
>>>>>> -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
>>>>>> -     synchronize_rcu();
>>> I feel a bit confusing even for the old comment, normally I would expect
>>> the caller of lockdep_unregister_key() should guarantee the key has been
>>> unpublished, in other words, there is no way a lockdep_unregister_key()
>>> could race with a register_lock_class()/lockdep_init_map_type(). The
>>> synchronize_rcu() is not needed then.
>>>
>>> Let's say someone breaks my assumption above, then when doing a
>>> register_lock_class() with a key about to be unregister, I cannot see
>>> anything stops the following:
>>>
>>> 	CPU 0				CPU 1
>>> 	=====				=====
>>> 	register_lock_class():
>>> 	  ...
>>> 	  } else if (... && !is_dynamic_key(lock->key)) {
>>> 	  	// ->key is not unregistered yet, so this branch is not
>>> 		// taken.
>>> 	  	return NULL;
>>> 	  }
>>> 	  				lockdep_unregister_key(..);
>>> 					// key unregister, can be free
>>> 					// any time.
>>> 	  key = lock->key->subkeys + subclass; // BOOM! UAF.
>>>
>>> So either we don't need the synchronize_rcu() here or the
>>> synchronize_rcu() doesn't help at all. Am I missing something subtle
>>> here?
>>>
>> Oh! Maybe I was missing register_lock_class() must be called with irq
>> disabled, which is also an RCU read-side critical section.
>>
> Since register_lock_class() will be call with irq disabled, maybe hazard
> pointers [1] is better because most of the case we only have nr_cpus
> readers, so the potential hazard pointer slots are fixed.
>
> So the below patch can reduce the time of the tc command from real ~1.7
> second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
> which is not surprising given it's a dedicated hazard pointers for
> lock_class_key.
>
> Thoughts?

My understanding is that it is not a race between register_lock_class() 
and lockdep_unregister_key(). It is the fact that the structure that 
holds the lock_class_key may be freed immediately after return from 
lockdep_unregister_key(). So any process that is still iterating the 
hash list containing the hash_entry being unregistered may hit a UAF 
problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential 
access of invalid memory in lock_class") for a discussion of this kind 
of UAF problem.

As suggested by Eric, one possible solution is to add a 
lockdep_unregister_key() variant that presumes the structure holding the 
key won't be freed until after an RCU grace period. In this case, we can 
skip the last synchronize_rcu() call. Any caller that needs an immediate 
return should then use kfree_rcu() to free the structure after calling 
the lockdep_unregister_key() variant.
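
A minimal sketch of what such a variant could look like, assuming a
hypothetical lockdep_unregister_key_norcu() and a hypothetical
__lockdep_unregister_key() helper holding the existing unhash/zap logic;
this is not the actual implementation:

	/* Hypothetical sketch, not an existing API: same as
	 * lockdep_unregister_key() except that the final synchronize_rcu()
	 * is left to the caller, who must free the key via kfree_rcu().
	 */
	void lockdep_unregister_key_norcu(struct lock_class_key *key)
	{
		/* hash removal and zap_class() work, as in
		 * lockdep_unregister_key(); returns whether free_zapped_rcu
		 * needs to run. */
		bool need_callback = __lockdep_unregister_key(key);

		if (need_callback)
			call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
		/* no synchronize_rcu(): the caller guarantees an RCU delay
		 * before the key's memory is reused. */
	}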

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25  1:56           ` Waiman Long
@ 2025-03-25  3:41             ` Boqun Feng
       [not found]               ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-25  3:41 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote:
> On 3/24/25 8:47 PM, Boqun Feng wrote:
> > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
> > > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> > > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> > > > [...]
> > > > > > > ---
> > > > > > >   kernel/locking/lockdep.c | 6 ++++--
> > > > > > >   1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > > > > index 4470680f02269..a79030ac36dd4 100644
> > > > > > > --- a/kernel/locking/lockdep.c
> > > > > > > +++ b/kernel/locking/lockdep.c
> > > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > > > >        if (need_callback)
> > > > > > >                call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > > > > > 
> > > > > > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > > > > -     synchronize_rcu();
> > > > I feel a bit confusing even for the old comment, normally I would expect
> > > > the caller of lockdep_unregister_key() should guarantee the key has been
> > > > unpublished, in other words, there is no way a lockdep_unregister_key()
> > > > could race with a register_lock_class()/lockdep_init_map_type(). The
> > > > synchronize_rcu() is not needed then.
> > > > 
> > > > Let's say someone breaks my assumption above, then when doing a
> > > > register_lock_class() with a key about to be unregister, I cannot see
> > > > anything stops the following:
> > > > 
> > > > 	CPU 0				CPU 1
> > > > 	=====				=====
> > > > 	register_lock_class():
> > > > 	  ...
> > > > 	  } else if (... && !is_dynamic_key(lock->key)) {
> > > > 	  	// ->key is not unregistered yet, so this branch is not
> > > > 		// taken.
> > > > 	  	return NULL;
> > > > 	  }
> > > > 	  				lockdep_unregister_key(..);
> > > > 					// key unregister, can be free
> > > > 					// any time.
> > > > 	  key = lock->key->subkeys + subclass; // BOOM! UAF.

This is not a UAF :(

> > > > 
> > > > So either we don't need the synchronize_rcu() here or the
> > > > synchronize_rcu() doesn't help at all. Am I missing something subtle
> > > > here?
> > > > 
> > > Oh! Maybe I was missing register_lock_class() must be called with irq
> > > disabled, which is also an RCU read-side critical section.
> > > 
> > Since register_lock_class() will be call with irq disabled, maybe hazard
> > pointers [1] is better because most of the case we only have nr_cpus
> > readers, so the potential hazard pointer slots are fixed.
> > 
> > So the below patch can reduce the time of the tc command from real ~1.7
> > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
> > which is not surprising given it's a dedicated hazard pointers for
> > lock_class_key.
> > 
> > Thoughts?
> 
> My understanding is that it is not a race between register_lock_class() and
> lockdep_unregister_key(). It is the fact that the structure that holds the
> lock_class_key may be freed immediately after return from
> lockdep_unregister_key(). So any processes that are in the process of
> iterating the hash_list containing the hash_entry to be unregistered may hit

You mean the lock_keys_hash table, right? I used register_lock_class()
as an example, because it's one of the places that iterates
lock_keys_hash. IIUC lock_keys_hash is only used in
lockdep_{un,}register_key() and is_dynamic_key() (which is only called
by lockdep_init_map_type() and register_lock_class()).

> a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential
> access of invalid memory in lock_class") for a discussion of this kind of
> UAF problem.
> 

That commit seemed to fix a race between disabling lockdep and
unregistering a key, and most importantly, to call zap_class() for the
unregistered key even if lockdep is disabled (debug_locks = 0). It might
be related, but I'm not sure that's the reason for putting
synchronize_rcu() there. Say you want to synchronize between
/proc/lockdep and lockdep_unregister_key(), and you have
synchronize_rcu() in lockdep_unregister_key(): what's the RCU read-side
critical section on the /proc/lockdep side?

Regards,
Boqun

> As suggested by Eric, one possible solution is to add a
> lockdep_unregister_key() variant function that presumes the structure
> holding the key won't be freed until after a RCU delay. In this case, we can
> skip the last synchronize_rcu() call. Any callers that need immediate return
> should use kfree_rcu() to free the structure after calling the
> lockdep_unregister_key() variant.
> 
> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
       [not found]               ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>
@ 2025-03-25 14:57                 ` Waiman Long
  2025-03-25 18:45                 ` Boqun Feng
  1 sibling, 0 replies; 35+ messages in thread
From: Waiman Long @ 2025-03-25 14:57 UTC (permalink / raw)
  To: Waiman Long, Boqun Feng
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 3/25/25 10:52 AM, Waiman Long wrote:
> I agree that the commit that I mentioned is not relevant to the 
> current case. You are right that is_dynamic_key() is the only function 
> that is problematic, the other two are protected by the lockdep_lock. 
> So they are safe. Anyway, I believe that the actual race happens in 
> the iteration of the hashed list in is_dynamic_key(). The key that you 
> save in the lockdep_key_hazptr in your proposed patch should never be 
> the key (dead_key) that is passed to lockdep_unregister_key(). In 
> is_dynamic_key():
>
>     hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>                 if (k == key) {
>                         found = true;
>                         break;
>                 }
>         }
>
> key != k (dead_key), but before accessing its content to get to 
> hash_entry, an interrupt/NMI can happen. In the mean time, the 
> structure holding the key is freed and its content can be overwritten 
> with some garbage. When interrupt/NMI returns, hash_entry can point to 
> anything leading to crash or an infinite loop.  Perhaps we can use 
> some kind of synchronization mechanism between is_dynamic_key() and 
> lockdep_unregister_key() to prevent this kind of racing. For example, 
> we can have an atomic counter associated with each head of the hashed 
> table. is_dynamic_key() can increment the counter if it is not zero to 
> proceed and lockdep_unregister_key() have to make sure it can safely 
> decrement the counter to 0 before going ahead. Just a thought!
>
Well, that is essentially an arch_rwlock_t for each list head.
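
For illustration, a minimal sketch of one possible reading of that
per-bucket counter idea (names and the re-arming detail are hypothetical,
not part of lockdep):

	/* Hypothetical sketch: one counter per lock_keys_hash bucket,
	 * starting at 1 ("alive").  Readers take a reference before walking
	 * the bucket; the unregister side waits until it can bring the
	 * counter from 1 down to 0, then re-arms it.
	 */
	static atomic_t keys_hash_refs[KEYHASH_SIZE] = {
		[0 ... KEYHASH_SIZE - 1] = ATOMIC_INIT(1)
	};

	static bool keys_bucket_get(unsigned long b)	/* is_dynamic_key() side */
	{
		return atomic_inc_not_zero(&keys_hash_refs[b]);
	}

	static void keys_bucket_put(unsigned long b)
	{
		atomic_dec(&keys_hash_refs[b]);
	}

	static void keys_bucket_quiesce(unsigned long b)	/* unregister side */
	{
		while (atomic_cmpxchg(&keys_hash_refs[b], 1, 0) != 1)
			cpu_relax();
		atomic_set(&keys_hash_refs[b], 1);
	}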

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
       [not found]               ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>
  2025-03-25 14:57                 ` Waiman Long
@ 2025-03-25 18:45                 ` Boqun Feng
  2025-03-25 19:23                   ` Waiman Long
  1 sibling, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-25 18:45 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Tue, Mar 25, 2025 at 10:52:16AM -0400, Waiman Long wrote:
> On 3/24/25 11:41 PM, Boqun Feng wrote:
> > On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote:
> > > On 3/24/25 8:47 PM, Boqun Feng wrote:
> > > > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
> > > > > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
> > > > > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
> > > > > > [...]
> > > > > > > > > ---
> > > > > > > > >    kernel/locking/lockdep.c | 6 ++++--
> > > > > > > > >    1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > > > > > > > index 4470680f02269..a79030ac36dd4 100644
> > > > > > > > > --- a/kernel/locking/lockdep.c
> > > > > > > > > +++ b/kernel/locking/lockdep.c
> > > > > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
> > > > > > > > >         if (need_callback)
> > > > > > > > >                 call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
> > > > > > > > > 
> > > > > > > > > -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
> > > > > > > > > -     synchronize_rcu();
> > > > > > I feel a bit confusing even for the old comment, normally I would expect
> > > > > > the caller of lockdep_unregister_key() should guarantee the key has been
> > > > > > unpublished, in other words, there is no way a lockdep_unregister_key()
> > > > > > could race with a register_lock_class()/lockdep_init_map_type(). The
> > > > > > synchronize_rcu() is not needed then.
> > > > > > 
> > > > > > Let's say someone breaks my assumption above, then when doing a
> > > > > > register_lock_class() with a key about to be unregister, I cannot see
> > > > > > anything stops the following:
> > > > > > 
> > > > > > 	CPU 0				CPU 1
> > > > > > 	=====				=====
> > > > > > 	register_lock_class():
> > > > > > 	  ...
> > > > > > 	  } else if (... && !is_dynamic_key(lock->key)) {
> > > > > > 	  	// ->key is not unregistered yet, so this branch is not
> > > > > > 		// taken.
> > > > > > 	  	return NULL;
> > > > > > 	  }
> > > > > > 	  				lockdep_unregister_key(..);
> > > > > > 					// key unregister, can be free
> > > > > > 					// any time.
> > > > > > 	  key = lock->key->subkeys + subclass; // BOOM! UAF.
> > This is not a UAF :(
> > 
> > > > > > So either we don't need the synchronize_rcu() here or the
> > > > > > synchronize_rcu() doesn't help at all. Am I missing something subtle
> > > > > > here?
> > > > > > 
> > > > > Oh! Maybe I was missing register_lock_class() must be called with irq
> > > > > disabled, which is also an RCU read-side critical section.
> > > > > 
> > > > Since register_lock_class() will be call with irq disabled, maybe hazard
> > > > pointers [1] is better because most of the case we only have nr_cpus
> > > > readers, so the potential hazard pointer slots are fixed.
> > > > 
> > > > So the below patch can reduce the time of the tc command from real ~1.7
> > > > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
> > > > which is not surprising given it's a dedicated hazard pointers for
> > > > lock_class_key.
> > > > 
> > > > Thoughts?
> > > My understanding is that it is not a race between register_lock_class() and
> > > lockdep_unregister_key(). It is the fact that the structure that holds the
> > > lock_class_key may be freed immediately after return from
> > > lockdep_unregister_key(). So any processes that are in the process of
> > > iterating the hash_list containing the hash_entry to be unregistered may hit
> > You mean the lock_keys_hash table, right? I used register_lock_class()
> > as an example, because it's one of the places that iterates
> > lock_keys_hash. IIUC lock_keys_hash is only used in
> > lockdep_{un,}register_key() and is_dynamic_key() (which are only called
> > by lockdep_init_map_type() and register_lock_class()).
> > 
> > > a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential
> > > access of invalid memory in lock_class") for a discussion of this kind of
> > > UAF problem.
> > > 
> > That commit seemed fixing a race between disabling lockdep and
> > unregistering key, and most importantly, call zap_class() for the
> > unregistered key even if lockdep is disabled (debug_locks = 0). It might
> > be related, but I'm not sure that's the reason of putting
> > synchronize_rcu() there. Say you want to synchronize between
> > /proc/lockdep and lockdep_unregister_key(), and you have
> > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side
> > critical section at /proc/lockdep?
> I agree that the commit that I mentioned is not relevant to the current
> case. You are right that is_dynamic_key() is the only function that is
> problematic, the other two are protected by the lockdep_lock. So they are
> safe. Anyway, I believe that the actual race happens in the iteration of the
> hashed list in is_dynamic_key(). The key that you save in the
> lockdep_key_hazptr in your proposed patch should never be the key (dead_key)

The key stored in lockdep_key_hazptr is the one that the rest of the
function will use after is_dynamic_key() returns true. That is,

	CPU 0				CPU 1
	=====				=====
	WRITE_ONCE(*lockdep_key_hazptr, key);
	smp_mb();

	is_dynamic_key():
	  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
	    if (k == key) {
	      found = true;
	      break;
	    }
	  }
	  				lockdep_unregister_key():
					  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
					    if (k == key) {
					      hlist_del_rcu(&k->hash_entry);
				              found = true;
				              break;
					    }
					  }

				        smp_mb();

					synchronize_lockdep_key_hazptr():
					  for_each_possible_cpu(cpu) {
					    <wait for the hazptr slot on
					    that CPU to be not equal to
					    the removed key>
					  }


, so that if is_dynamic_key() finds a key in the hash, even though the
key may later be removed by lockdep_unregister_key(), the hazard
pointers guarantee that lockdep_unregister_key() will wait for the
hazard pointer to be released.

> that is passed to lockdep_unregister_key(). In is_dynamic_key():
> 
>     hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>                 if (k == key) {
>                         found = true;
>                         break;
>                 }
>         }
> 
> key != k (dead_key), but before accessing its content to get to hash_entry,

It is the dead_key.

> an interrupt/NMI can happen. In the mean time, the structure holding the key
> is freed and its content can be overwritten with some garbage. When
> interrupt/NMI returns, hash_entry can point to anything leading to crash or
> an infinite loop.  Perhaps we can use some kind of synchronization mechanism

No, hash_entry cannot be freed or overwritten because the user has
protected the key with a hazard pointer; only when the user resets the
hazard pointer to NULL can lockdep_unregister_key() return and the key
be freed.

> between is_dynamic_key() and lockdep_unregister_key() to prevent this kind
> of racing. For example, we can have an atomic counter associated with each

The hazard pointer I proposed provides exactly that synchronization ;-)

Regards,
Boqun

> head of the hashed table. is_dynamic_key() can increment the counter if it
> is not zero to proceed and lockdep_unregister_key() have to make sure it can
> safely decrement the counter to 0 before going ahead. Just a thought!
> 
> Cheers,
> Longman

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25 18:45                 ` Boqun Feng
@ 2025-03-25 19:23                   ` Waiman Long
  2025-03-25 19:42                     ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-25 19:23 UTC (permalink / raw)
  To: Boqun Feng, Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 3/25/25 2:45 PM, Boqun Feng wrote:
> On Tue, Mar 25, 2025 at 10:52:16AM -0400, Waiman Long wrote:
>> On 3/24/25 11:41 PM, Boqun Feng wrote:
>>> On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote:
>>>> On 3/24/25 8:47 PM, Boqun Feng wrote:
>>>>> On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote:
>>>>>> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote:
>>>>>>> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote:
>>>>>>> [...]
>>>>>>>>>> ---
>>>>>>>>>>     kernel/locking/lockdep.c | 6 ++++--
>>>>>>>>>>     1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>>>>>>> index 4470680f02269..a79030ac36dd4 100644
>>>>>>>>>> --- a/kernel/locking/lockdep.c
>>>>>>>>>> +++ b/kernel/locking/lockdep.c
>>>>>>>>>> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
>>>>>>>>>>          if (need_callback)
>>>>>>>>>>                  call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
>>>>>>>>>>
>>>>>>>>>> -     /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
>>>>>>>>>> -     synchronize_rcu();
>>>>>>> I feel a bit confusing even for the old comment, normally I would expect
>>>>>>> the caller of lockdep_unregister_key() should guarantee the key has been
>>>>>>> unpublished, in other words, there is no way a lockdep_unregister_key()
>>>>>>> could race with a register_lock_class()/lockdep_init_map_type(). The
>>>>>>> synchronize_rcu() is not needed then.
>>>>>>>
>>>>>>> Let's say someone breaks my assumption above, then when doing a
>>>>>>> register_lock_class() with a key about to be unregister, I cannot see
>>>>>>> anything stops the following:
>>>>>>>
>>>>>>> 	CPU 0				CPU 1
>>>>>>> 	=====				=====
>>>>>>> 	register_lock_class():
>>>>>>> 	  ...
>>>>>>> 	  } else if (... && !is_dynamic_key(lock->key)) {
>>>>>>> 	  	// ->key is not unregistered yet, so this branch is not
>>>>>>> 		// taken.
>>>>>>> 	  	return NULL;
>>>>>>> 	  }
>>>>>>> 	  				lockdep_unregister_key(..);
>>>>>>> 					// key unregister, can be free
>>>>>>> 					// any time.
>>>>>>> 	  key = lock->key->subkeys + subclass; // BOOM! UAF.
>>> This is not a UAF :(
>>>
>>>>>>> So either we don't need the synchronize_rcu() here or the
>>>>>>> synchronize_rcu() doesn't help at all. Am I missing something subtle
>>>>>>> here?
>>>>>>>
>>>>>> Oh! Maybe I was missing register_lock_class() must be called with irq
>>>>>> disabled, which is also an RCU read-side critical section.
>>>>>>
>>>>> Since register_lock_class() will be call with irq disabled, maybe hazard
>>>>> pointers [1] is better because most of the case we only have nr_cpus
>>>>> readers, so the potential hazard pointer slots are fixed.
>>>>>
>>>>> So the below patch can reduce the time of the tc command from real ~1.7
>>>>> second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env,
>>>>> which is not surprising given it's a dedicated hazard pointers for
>>>>> lock_class_key.
>>>>>
>>>>> Thoughts?
>>>> My understanding is that it is not a race between register_lock_class() and
>>>> lockdep_unregister_key(). It is the fact that the structure that holds the
>>>> lock_class_key may be freed immediately after return from
>>>> lockdep_unregister_key(). So any processes that are in the process of
>>>> iterating the hash_list containing the hash_entry to be unregistered may hit
>>> You mean the lock_keys_hash table, right? I used register_lock_class()
>>> as an example, because it's one of the places that iterates
>>> lock_keys_hash. IIUC lock_keys_hash is only used in
>>> lockdep_{un,}register_key() and is_dynamic_key() (which are only called
>>> by lockdep_init_map_type() and register_lock_class()).
>>>
>>>> a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential
>>>> access of invalid memory in lock_class") for a discussion of this kind of
>>>> UAF problem.
>>>>
>>> That commit seemed fixing a race between disabling lockdep and
>>> unregistering key, and most importantly, call zap_class() for the
>>> unregistered key even if lockdep is disabled (debug_locks = 0). It might
>>> be related, but I'm not sure that's the reason of putting
>>> synchronize_rcu() there. Say you want to synchronize between
>>> /proc/lockdep and lockdep_unregister_key(), and you have
>>> synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side
>>> critical section at /proc/lockdep?
>> I agree that the commit that I mentioned is not relevant to the current
>> case. You are right that is_dynamic_key() is the only function that is
>> problematic, the other two are protected by the lockdep_lock. So they are
>> safe. Anyway, I believe that the actual race happens in the iteration of the
>> hashed list in is_dynamic_key(). The key that you save in the
>> lockdep_key_hazptr in your proposed patch should never be the key (dead_key)
> The key stored in lockdep_key_hazptr is the one that the rest of the
> function will use after is_dynamic_key() return true. That is,
>
> 	CPU 0				CPU 1
> 	=====				=====
> 	WRITE_ONCE(*lockdep_key_hazptr, key);
> 	smp_mb();
>
> 	is_dynamic_key():
> 	  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> 	    if (k == key) {
> 	      found = true;
> 	      break;
> 	    }
> 	  }
> 	  				lockdep_unregister_key():
> 					  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> 					    if (k == key) {
> 					      hlist_del_rcu(&k->hash_entry);
> 				              found = true;
> 				              break;
> 					    }
> 					  }
>
> 				        smp_mb();
>
> 					synchronize_lockdep_key_hazptr():
> 					  for_each_possible_cpu(cpu) {
> 					    <wait for the hazptr slot on
> 					    that CPU to be not equal to
> 					    the removed key>
> 					  }
>
>
> , so that if is_dynamic_key() finds a key was in the hash, even though
> later on the key would be removed by lockdep_unregister_key(), the
> hazard pointers guarantee lockdep_unregister_key() would wait for the
> hazard pointer to release.
>
>> that is passed to lockdep_unregister_key(). In is_dynamic_key():
>>
>>      hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>>                  if (k == key) {
>>                          found = true;
>>                          break;
>>                  }
>>          }
>>
>> key != k (dead_key), but before accessing its content to get to hash_entry,
> It is the dead_key.
>
>> an interrupt/NMI can happen. In the mean time, the structure holding the key
>> is freed and its content can be overwritten with some garbage. When
>> interrupt/NMI returns, hash_entry can point to anything leading to crash or
>> an infinite loop.  Perhaps we can use some kind of synchronization mechanism
> No, hash_entry cannot be freed or overwritten because the user has
> protect the key with hazard pointers, only when the user reset the
> hazard pointer to NULL, lockdep_unregister_key() can then return and the
> key can be freed.
>
>> between is_dynamic_key() and lockdep_unregister_key() to prevent this kind
>> of racing. For example, we can have an atomic counter associated with each
> The hazard pointer I proposed provides the exact synchronization ;-)

What I am saying is that register_lock_class() is trying to find a
newkey while lockdep_unregister_key() is trying to take out an oldkey
(newkey != oldkey). If they happen to be in the same hash list,
register_lock_class() will put the newkey into the hazard pointer, but
the synchronize_lockdep_key_hazptr() call from lockdep_unregister_key()
is checking for the oldkey, which is not the one saved in the hazard
pointer. So lockdep_unregister_key() will return and the key will be
freed and reused while is_dynamic_key() may still hold a reference to
the oldkey and be accessing its content, which is invalid. I think this
is a possible scenario.
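
To make the window concrete, here is a simplified sketch of the
is_dynamic_key() iteration as it would look under the proposed per-key
hazard pointer scheme (illustrative only, not the in-tree code), with
the problematic spot marked:

static bool is_dynamic_key(const struct lock_class_key *key)
{
	struct hlist_head *hash_head = keyhashentry(key);
	struct lock_class_key *k;
	bool found = false;

	hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
		/*
		 * k may be an unrelated oldkey that lockdep_unregister_key()
		 * is concurrently removing.  The hazard pointer only holds
		 * "key" (the newkey), so a synchronize on the oldkey does
		 * not wait for this CPU; once the oldkey is freed, the next
		 * step of the iteration reads k->hash_entry.next from freed
		 * memory.
		 */
		if (k == key) {
			found = true;
			break;
		}
	}

	return found;
}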

Cheers, Longman

>
> Regards,
> Boqun
>
>> head of the hashed table. is_dynamic_key() can increment the counter if it
>> is not zero to proceed and lockdep_unregister_key() have to make sure it can
>> safely decrement the counter to 0 before going ahead. Just a thought!
>>
>> Cheers,
>> Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25 19:23                   ` Waiman Long
@ 2025-03-25 19:42                     ` Boqun Feng
  2025-03-25 23:20                       ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-25 19:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote:
[...]
> > > > That commit seemed fixing a race between disabling lockdep and
> > > > unregistering key, and most importantly, call zap_class() for the
> > > > unregistered key even if lockdep is disabled (debug_locks = 0). It might
> > > > be related, but I'm not sure that's the reason of putting
> > > > synchronize_rcu() there. Say you want to synchronize between
> > > > /proc/lockdep and lockdep_unregister_key(), and you have
> > > > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side
> > > > critical section at /proc/lockdep?
> > > I agree that the commit that I mentioned is not relevant to the current
> > > case. You are right that is_dynamic_key() is the only function that is
> > > problematic, the other two are protected by the lockdep_lock. So they are
> > > safe. Anyway, I believe that the actual race happens in the iteration of the
> > > hashed list in is_dynamic_key(). The key that you save in the
> > > lockdep_key_hazptr in your proposed patch should never be the key (dead_key)
> > The key stored in lockdep_key_hazptr is the one that the rest of the
> > function will use after is_dynamic_key() return true. That is,
> > 
> > 	CPU 0				CPU 1
> > 	=====				=====
> > 	WRITE_ONCE(*lockdep_key_hazptr, key);
> > 	smp_mb();
> > 
> > 	is_dynamic_key():
> > 	  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > 	    if (k == key) {
> > 	      found = true;
> > 	      break;
> > 	    }
> > 	  }
> > 	  				lockdep_unregister_key():
> > 					  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > 					    if (k == key) {
> > 					      hlist_del_rcu(&k->hash_entry);
> > 				              found = true;
> > 				              break;
> > 					    }
> > 					  }
> > 
> > 				        smp_mb();
> > 
> > 					synchronize_lockdep_key_hazptr():
> > 					  for_each_possible_cpu(cpu) {
> > 					    <wait for the hazptr slot on
> > 					    that CPU to be not equal to
> > 					    the removed key>
> > 					  }
> > 
> > 
> > , so that if is_dynamic_key() finds a key was in the hash, even though
> > later on the key would be removed by lockdep_unregister_key(), the
> > hazard pointers guarantee lockdep_unregister_key() would wait for the
> > hazard pointer to release.
> > 
> > > that is passed to lockdep_unregister_key(). In is_dynamic_key():
> > > 
> > >      hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > >                  if (k == key) {
> > >                          found = true;
> > >                          break;
> > >                  }
> > >          }
> > > 
> > > key != k (dead_key), but before accessing its content to get to hash_entry,
> > It is the dead_key.
> > 
> > > an interrupt/NMI can happen. In the mean time, the structure holding the key
> > > is freed and its content can be overwritten with some garbage. When
> > > interrupt/NMI returns, hash_entry can point to anything leading to crash or
> > > an infinite loop.  Perhaps we can use some kind of synchronization mechanism
> > No, hash_entry cannot be freed or overwritten because the user has
> > protect the key with hazard pointers, only when the user reset the
> > hazard pointer to NULL, lockdep_unregister_key() can then return and the
> > key can be freed.
> > 
> > > between is_dynamic_key() and lockdep_unregister_key() to prevent this kind
> > > of racing. For example, we can have an atomic counter associated with each
> > The hazard pointer I proposed provides the exact synchronization ;-)
> 
> What I am saying is that register_lock_class() is trying to find a newkey
> while lockdep_unregister_key() is trying to take out an oldkey (newkey !=
> oldkey). If they happens in the same hash list, register_lock_class() will
> put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr()
> call from lockdep_unregister_key() is checking for oldkey which is not the
> one saved in the hazard pointer. So lockdep_unregister_key() will return and
> the key will be freed and reused while is_dynamic_key() may just have a
> reference to the oldkey and trying to access its content which is invalid. I
> think this is a possible scenario.
> 

Oh, I see. And yes, the hazard pointers I proposed cannot prevent this
race unfortunately. (Well, technically we can still use an extra slot to
hold the key in the hash list iteration, but that would generate a lot
of stores, so it won't be ideal.) But...

[...]
> > > head of the hashed table. is_dynamic_key() can increment the counter if it
> > > is not zero to proceed and lockdep_unregister_key() have to make sure it can
> > > safely decrement the counter to 0 before going ahead. Just a thought!
> > > 

Your idea inspires another solution with hazard pointers: we can
put the pointer of the hash_head into the hazard pointer slot ;-)

in register_lock_class():

		/* hazptr: protect the key */
		WRITE_ONCE(*key_hazptr, keyhashentry(lock->key));

		/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
		smp_mb();

		if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
			return NULL;
		}

in lockdep_unregister_key():

	/* Wait until register_lock_class() has finished accessing k->hash_entry. */
	synchronize_lockdep_key_hazptr(keyhashentry(key));


which is more space-efficient than a per-hash-list atomic or lock
unless you have 1024 or more CPUs.
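
For completeness, a minimal sketch of what the waiter side could look
like under this scheme -- one slot per CPU, waiting on the hash bucket
rather than on an individual key (the helper and variable names here
are made up, not taken from an actual patch):

static DEFINE_PER_CPU(void *, lockdep_key_hazptr);

static void synchronize_lockdep_key_hazptr(struct hlist_head *hash_head)
{
	int cpu;

	/* Pairs with the smp_mb() after WRITE_ONCE(*key_hazptr, ...) above. */
	smp_mb();

	for_each_possible_cpu(cpu) {
		/*
		 * Wait until this CPU no longer publishes the bucket being
		 * modified; after that it cannot be iterating it anymore.
		 */
		while (READ_ONCE(per_cpu(lockdep_key_hazptr, cpu)) == hash_head)
			cpu_relax();
	}
}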

Regards,
Boqun

> > > Cheers,
> > > Longman
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25 19:42                     ` Boqun Feng
@ 2025-03-25 23:20                       ` Waiman Long
  2025-03-26  5:25                         ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-25 23:20 UTC (permalink / raw)
  To: Boqun Feng, Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 3/25/25 3:42 PM, Boqun Feng wrote:
> On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote:
> [...]
>>>>> That commit seemed fixing a race between disabling lockdep and
>>>>> unregistering key, and most importantly, call zap_class() for the
>>>>> unregistered key even if lockdep is disabled (debug_locks = 0). It might
>>>>> be related, but I'm not sure that's the reason of putting
>>>>> synchronize_rcu() there. Say you want to synchronize between
>>>>> /proc/lockdep and lockdep_unregister_key(), and you have
>>>>> synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side
>>>>> critical section at /proc/lockdep?
>>>> I agree that the commit that I mentioned is not relevant to the current
>>>> case. You are right that is_dynamic_key() is the only function that is
>>>> problematic, the other two are protected by the lockdep_lock. So they are
>>>> safe. Anyway, I believe that the actual race happens in the iteration of the
>>>> hashed list in is_dynamic_key(). The key that you save in the
>>>> lockdep_key_hazptr in your proposed patch should never be the key (dead_key)
>>> The key stored in lockdep_key_hazptr is the one that the rest of the
>>> function will use after is_dynamic_key() return true. That is,
>>>
>>> 	CPU 0				CPU 1
>>> 	=====				=====
>>> 	WRITE_ONCE(*lockdep_key_hazptr, key);
>>> 	smp_mb();
>>>
>>> 	is_dynamic_key():
>>> 	  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>>> 	    if (k == key) {
>>> 	      found = true;
>>> 	      break;
>>> 	    }
>>> 	  }
>>> 	  				lockdep_unregister_key():
>>> 					  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>>> 					    if (k == key) {
>>> 					      hlist_del_rcu(&k->hash_entry);
>>> 				              found = true;
>>> 				              break;
>>> 					    }
>>> 					  }
>>>
>>> 				        smp_mb();
>>>
>>> 					synchronize_lockdep_key_hazptr():
>>> 					  for_each_possible_cpu(cpu) {
>>> 					    <wait for the hazptr slot on
>>> 					    that CPU to be not equal to
>>> 					    the removed key>
>>> 					  }
>>>
>>>
>>> , so that if is_dynamic_key() finds a key was in the hash, even though
>>> later on the key would be removed by lockdep_unregister_key(), the
>>> hazard pointers guarantee lockdep_unregister_key() would wait for the
>>> hazard pointer to release.
>>>
>>>> that is passed to lockdep_unregister_key(). In is_dynamic_key():
>>>>
>>>>       hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
>>>>                   if (k == key) {
>>>>                           found = true;
>>>>                           break;
>>>>                   }
>>>>           }
>>>>
>>>> key != k (dead_key), but before accessing its content to get to hash_entry,
>>> It is the dead_key.
>>>
>>>> an interrupt/NMI can happen. In the mean time, the structure holding the key
>>>> is freed and its content can be overwritten with some garbage. When
>>>> interrupt/NMI returns, hash_entry can point to anything leading to crash or
>>>> an infinite loop.  Perhaps we can use some kind of synchronization mechanism
>>> No, hash_entry cannot be freed or overwritten because the user has
>>> protect the key with hazard pointers, only when the user reset the
>>> hazard pointer to NULL, lockdep_unregister_key() can then return and the
>>> key can be freed.
>>>
>>>> between is_dynamic_key() and lockdep_unregister_key() to prevent this kind
>>>> of racing. For example, we can have an atomic counter associated with each
>>> The hazard pointer I proposed provides the exact synchronization ;-)
>> What I am saying is that register_lock_class() is trying to find a newkey
>> while lockdep_unregister_key() is trying to take out an oldkey (newkey !=
>> oldkey). If they happens in the same hash list, register_lock_class() will
>> put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr()
>> call from lockdep_unregister_key() is checking for oldkey which is not the
>> one saved in the hazard pointer. So lockdep_unregister_key() will return and
>> the key will be freed and reused while is_dynamic_key() may just have a
>> reference to the oldkey and trying to access its content which is invalid. I
>> think this is a possible scenario.
>>
> Oh, I see. And yes, the hazard pointers I proposed cannot prevent this
> race unfortunately. (Well, technically we can still use an extra slot to
> hold the key in the hash list iteration, but that would generates a lot
> of stores, so it won't be ideal). But...
>
> [...]
>>>> head of the hashed table. is_dynamic_key() can increment the counter if it
>>>> is not zero to proceed and lockdep_unregister_key() have to make sure it can
>>>> safely decrement the counter to 0 before going ahead. Just a thought!
>>>>
> Your idea inspires another solution with hazard pointers, we can
> put the pointer of the hash_head into the hazard pointer slot ;-)
>
> in register_lock_class():
>
> 		/* hazptr: protect the key */
> 		WRITE_ONCE(*key_hazptr, keyhashentry(lock->key));
>
> 		/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
> 		smp_mb();
>
> 		if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
> 			return NULL;
> 		}
>
> in lockdep_unregister_key():
>
> 	/* Wait until register_lock_class() has finished accessing k->hash_entry. */
> 	synchronize_lockdep_key_hazptr(keyhashentry(key));
>
>
> which is more space efficient than per hash list atomic or lock unless
> you have 1024 or more CPUs.

It looks like you are trying hard to find a use case for hazard
pointers in the kernel :-)

Anyway, that may work. The only problem that I see is the issue of
nesting of an interrupt context on top of a task context. It is possible
that the first use of a raw_spinlock may happen in an interrupt context.
If the interrupt happens while the task has set the hazard pointer and
is iterating the hash list, the value of the hazard pointer may be
overwritten. Alternatively, we could have multiple slots for the hazard
pointer, but that will make the code more complicated. Or we could
disable interrupts before setting the hazard pointer.

The solution that I am thinking about is to have a simple unfair rwlock
to protect just the hash list iteration. lockdep_unregister_key() and
lockdep_register_key() take the write lock with interrupts disabled,
while is_dynamic_key() takes the read lock. Nesting in this case isn't a
problem, and we don't need RCU to protect the iteration process, so the
last synchronize_rcu() call isn't needed. The level of contention should
be low enough that livelock isn't an issue.
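
A minimal sketch of the kind of unfair, reader-preference rwlock meant
here, for illustration only (the names are made up and this is not
proposed code):

static atomic_t key_hash_lock = ATOMIC_INIT(0);	/* > 0: readers, < 0: writer */

static void key_hash_read_lock(void)
{
	/* Readers just bump the count unless a writer currently holds it. */
	while (atomic_inc_return(&key_hash_lock) < 0) {
		atomic_dec(&key_hash_lock);
		while (atomic_read(&key_hash_lock) < 0)
			cpu_relax();
	}
}

static void key_hash_read_unlock(void)
{
	smp_mb__before_atomic();	/* order the lookup before the release */
	atomic_dec(&key_hash_lock);
}

/* Callers disable interrupts around the write-side critical section. */
static void key_hash_write_lock(void)
{
	/* Unfair: the writer simply waits for a reader-free window. */
	while (atomic_cmpxchg(&key_hash_lock, 0, INT_MIN) != 0)
		cpu_relax();
}

static void key_hash_write_unlock(void)
{
	atomic_set_release(&key_hash_lock, 0);
}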

Cheers,
Longman



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-25 23:20                       ` Waiman Long
@ 2025-03-26  5:25                         ` Boqun Feng
       [not found]                           ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-26  5:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Tue, Mar 25, 2025 at 07:20:44PM -0400, Waiman Long wrote:
> On 3/25/25 3:42 PM, Boqun Feng wrote:
> > On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote:
> > [...]
> > > > > > That commit seemed fixing a race between disabling lockdep and
> > > > > > unregistering key, and most importantly, call zap_class() for the
> > > > > > unregistered key even if lockdep is disabled (debug_locks = 0). It might
> > > > > > be related, but I'm not sure that's the reason of putting
> > > > > > synchronize_rcu() there. Say you want to synchronize between
> > > > > > /proc/lockdep and lockdep_unregister_key(), and you have
> > > > > > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side
> > > > > > critical section at /proc/lockdep?
> > > > > I agree that the commit that I mentioned is not relevant to the current
> > > > > case. You are right that is_dynamic_key() is the only function that is
> > > > > problematic, the other two are protected by the lockdep_lock. So they are
> > > > > safe. Anyway, I believe that the actual race happens in the iteration of the
> > > > > hashed list in is_dynamic_key(). The key that you save in the
> > > > > lockdep_key_hazptr in your proposed patch should never be the key (dead_key)
> > > > The key stored in lockdep_key_hazptr is the one that the rest of the
> > > > function will use after is_dynamic_key() return true. That is,
> > > > 
> > > > 	CPU 0				CPU 1
> > > > 	=====				=====
> > > > 	WRITE_ONCE(*lockdep_key_hazptr, key);
> > > > 	smp_mb();
> > > > 
> > > > 	is_dynamic_key():
> > > > 	  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > > > 	    if (k == key) {
> > > > 	      found = true;
> > > > 	      break;
> > > > 	    }
> > > > 	  }
> > > > 	  				lockdep_unregister_key():
> > > > 					  hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > > > 					    if (k == key) {
> > > > 					      hlist_del_rcu(&k->hash_entry);
> > > > 				              found = true;
> > > > 				              break;
> > > > 					    }
> > > > 					  }
> > > > 
> > > > 				        smp_mb();
> > > > 
> > > > 					synchronize_lockdep_key_hazptr():
> > > > 					  for_each_possible_cpu(cpu) {
> > > > 					    <wait for the hazptr slot on
> > > > 					    that CPU to be not equal to
> > > > 					    the removed key>
> > > > 					  }
> > > > 
> > > > 
> > > > , so that if is_dynamic_key() finds a key was in the hash, even though
> > > > later on the key would be removed by lockdep_unregister_key(), the
> > > > hazard pointers guarantee lockdep_unregister_key() would wait for the
> > > > hazard pointer to release.
> > > > 
> > > > > that is passed to lockdep_unregister_key(). In is_dynamic_key():
> > > > > 
> > > > >       hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
> > > > >                   if (k == key) {
> > > > >                           found = true;
> > > > >                           break;
> > > > >                   }
> > > > >           }
> > > > > 
> > > > > key != k (dead_key), but before accessing its content to get to hash_entry,
> > > > It is the dead_key.
> > > > 
> > > > > an interrupt/NMI can happen. In the mean time, the structure holding the key
> > > > > is freed and its content can be overwritten with some garbage. When
> > > > > interrupt/NMI returns, hash_entry can point to anything leading to crash or
> > > > > an infinite loop.  Perhaps we can use some kind of synchronization mechanism
> > > > No, hash_entry cannot be freed or overwritten because the user has
> > > > protect the key with hazard pointers, only when the user reset the
> > > > hazard pointer to NULL, lockdep_unregister_key() can then return and the
> > > > key can be freed.
> > > > 
> > > > > between is_dynamic_key() and lockdep_unregister_key() to prevent this kind
> > > > > of racing. For example, we can have an atomic counter associated with each
> > > > The hazard pointer I proposed provides the exact synchronization ;-)
> > > What I am saying is that register_lock_class() is trying to find a newkey
> > > while lockdep_unregister_key() is trying to take out an oldkey (newkey !=
> > > oldkey). If they happens in the same hash list, register_lock_class() will
> > > put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr()
> > > call from lockdep_unregister_key() is checking for oldkey which is not the
> > > one saved in the hazard pointer. So lockdep_unregister_key() will return and
> > > the key will be freed and reused while is_dynamic_key() may just have a
> > > reference to the oldkey and trying to access its content which is invalid. I
> > > think this is a possible scenario.
> > > 
> > Oh, I see. And yes, the hazard pointers I proposed cannot prevent this
> > race unfortunately. (Well, technically we can still use an extra slot to
> > hold the key in the hash list iteration, but that would generates a lot
> > of stores, so it won't be ideal). But...
> > 
> > [...]
> > > > > head of the hashed table. is_dynamic_key() can increment the counter if it
> > > > > is not zero to proceed and lockdep_unregister_key() have to make sure it can
> > > > > safely decrement the counter to 0 before going ahead. Just a thought!
> > > > > 
> > Your idea inspires another solution with hazard pointers, we can
> > put the pointer of the hash_head into the hazard pointer slot ;-)
> > 
> > in register_lock_class():
> > 
> > 		/* hazptr: protect the key */
> > 		WRITE_ONCE(*key_hazptr, keyhashentry(lock->key));
> > 
> > 		/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
> > 		smp_mb();
> > 
> > 		if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
> > 			return NULL;
> > 		}
> > 
> > in lockdep_unregister_key():
> > 
> > 	/* Wait until register_lock_class() has finished accessing k->hash_entry. */
> > 	synchronize_lockdep_key_hazptr(keyhashentry(key));
> > 
> > 
> > which is more space efficient than per hash list atomic or lock unless
> > you have 1024 or more CPUs.
> 
> It looks like you are trying hard to find a use case for hazard pointer in
> the kernel :-)
> 

Well, if it does the job, why not use it ;-) Also this shows how
flexible hazard pointers can be.

At least when using hazard pointers, the reader side of the hash list
iteration is still lockless. Plus, since the synchronization part
doesn't need to wait for the RCU readers in the whole system, it will be
faster (I tried the protecting-the-whole-hash-list approach as well; it
gives the same result on the tc command). This is why I chose to look
into hazard pointers. Another mechanism that can achieve similar
behavior is SRCU, but SRCU is slightly heavier than hazard pointers in
this case (of course SRCU has more functionality).
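
For comparison, a rough sketch of what the SRCU alternative could look
like (illustrative only; the srcu_struct name is made up, and making the
read side NMI-safe would need extra care):

DEFINE_STATIC_SRCU(lockdep_keys_srcu);

static bool is_dynamic_key(const struct lock_class_key *key)
{
	struct hlist_head *hash_head = keyhashentry(key);
	struct lock_class_key *k;
	bool found = false;
	int idx;

	idx = srcu_read_lock(&lockdep_keys_srcu);
	hlist_for_each_entry_rcu(k, hash_head, hash_entry,
				 srcu_read_lock_held(&lockdep_keys_srcu)) {
		if (k == key) {
			found = true;
			break;
		}
	}
	srcu_read_unlock(&lockdep_keys_srcu, idx);

	return found;
}

/*
 * ... and lockdep_unregister_key() would end with
 * synchronize_srcu(&lockdep_keys_srcu) instead of synchronize_rcu(),
 * waiting only for lockdep_keys_srcu readers rather than for every RCU
 * reader in the system.
 */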

We can provide a lockdep_unregister_key_nosync() without the
synchronize_rcu() in it and let users do the synchronization, but it's
going to be hard to enforce and review, especially when someone
refactors the code and moves the freeing code somewhere else.

> Anyway, that may work. The only problem that I see is the issue of nesting
> of an interrupt context on top of a task context. It is possible that the
> first use of a raw_spinlock may happen in an interrupt context. If the
> interrupt happens when the task has set the hazard pointer and iterating the
> hash list, the value of the hazard pointer may be overwritten. Alternatively
> we could have multiple slots for the hazard pointer, but that will make the
> code more complicated. Or we could disable interrupt before setting the
> hazard pointer.

Or we can use lockdep_recursion:

	preempt_disable();
	lockdep_recursion_inc();
	barrier();

	WRITE_ONCE(*hazptr, ...);

, which should prevent lockdep from being re-entered in irq context.
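
A rough sketch of how that could be paired with the matching release
side (the two wrapper names are hypothetical; lockdep_recursion_inc()
and lockdep_recursion_finish() are the existing lockdep-internal
counters):

static void lockdep_key_hazptr_protect(void **hazptr, void *ptr)
{
	preempt_disable();
	lockdep_recursion_inc();	/* no lockdep re-entry from irq from here on */
	barrier();
	WRITE_ONCE(*hazptr, ptr);
	/* Pairs with the smp_mb() on the synchronize side. */
	smp_mb();
}

static void lockdep_key_hazptr_release(void **hazptr)
{
	/* Order the protected accesses before letting the waiter proceed. */
	smp_mb();
	WRITE_ONCE(*hazptr, NULL);
	barrier();
	lockdep_recursion_finish();
	preempt_enable();
}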

> 
> The solution that I am thinking about is to have a simple unfair rwlock to
> protect just the hash list iteration. lockdep_unregister_key() and
> lockdep_register_key() take the write lock with interrupt disabled. While
> is_dynamic_key() takes the read lock. Nesting in this case isn't a problem
> and we don't need RCU to protect the iteration process and so the last
> synchronize_rcu() call isn't needed. The level of contention should be low
> enough that live lock isn't an issue.
> 

This could work; one thing, though, is that locks don't compose. Using a
hash write_lock in lockdep_unregister_key() will create a lockdep_lock()
-> "hash write_lock" dependency, and that means you cannot take
lockdep_lock() while you're holding a hash read_lock. Although that's
not the case today, it certainly complicates the locking design inside
lockdep, where there's no lockdep to help ;-)

Regards,
Boqun

> Cheers,
> Longman
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
       [not found]                           ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>
@ 2025-03-26 16:40                             ` Waiman Long
  2025-03-26 16:47                               ` Boqun Feng
  2025-03-31 17:26                             ` Boqun Feng
  1 sibling, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-26 16:40 UTC (permalink / raw)
  To: Waiman Long, Boqun Feng
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 3/26/25 11:39 AM, Waiman Long wrote:
> On 3/26/25 1:25 AM, Boqun Feng wrote:
>>> It looks like you are trying hard to find a use case for hazard pointer in
>>> the kernel 🙂
>>>
>> Well, if it does the job, why not use it 😉 Also this shows how
>> flexible hazard pointers can be.
>>
>> At least when using hazard pointers, the reader side of the hash list
>> iteration is still lockless. Plus, since the synchronization part
>> doesn't need to wait for the RCU readers in the whole system, it will be
>> faster (I tried with the protecting-the-whole-hash-list approach as
>> well, it's the same result on the tc command). This is why I choose to
>> look into hazard pointers. Another mechanism can achieve the similar
>> behavior is SRCU, but SRCU is slightly heavier compared to hazard
>> pointers in this case (of course SRCU has more functionalities).
>>
>> We can provide a lockdep_unregister_key_nosync() without the
>> synchronize_rcu() in it and let users do the synchronization, but it's
>> going to be hard to enforce and review, especially when someone
>> refactors the code and move the free code to somewhere else.
> Providing a second API and ask callers to do the right thing is 
> probably not a good idea and mistake is going to be made sooner or later.
>>> Anyway, that may work. The only problem that I see is the issue of nesting
>>> of an interrupt context on top of a task context. It is possible that the
>>> first use of a raw_spinlock may happen in an interrupt context. If the
>>> interrupt happens when the task has set the hazard pointer and iterating the
>>> hash list, the value of the hazard pointer may be overwritten. Alternatively
>>> we could have multiple slots for the hazard pointer, but that will make the
>>> code more complicated. Or we could disable interrupt before setting the
>>> hazard pointer.
>> Or we can use lockdep_recursion:
>>
>> 	preempt_disable();
>> 	lockdep_recursion_inc();
>> 	barrier();
>>
>> 	WRITE_ONCE(*hazptr, ...);
>>
>> , it should prevent the re-entrant of lockdep in irq.
> That will probably work. Or we can disable irq. I am fine with both.
>>> The solution that I am thinking about is to have a simple unfair rwlock to
>>> protect just the hash list iteration. lockdep_unregister_key() and
>>> lockdep_register_key() take the write lock with interrupt disabled. While
>>> is_dynamic_key() takes the read lock. Nesting in this case isn't a problem
>>> and we don't need RCU to protect the iteration process and so the last
>>> synchronize_rcu() call isn't needed. The level of contention should be low
>>> enough that live lock isn't an issue.
>>>
>> This could work, one thing though is that locks don't compose. Using a
>> hash write_lock in lockdep_unregister_key() will create a lockdep_lock()
>> -> "hash write_lock" dependency, and that means you cannot
>> lockdep_lock() while you're holding a hash read_lock, although it's
>> not the case today, but it certainly complicates the locking design
>> inside lockdep where there's no lockdep to help 😉
>
> Thinking about it more, doing it in a lockless way is probably a good 
> idea.
>
If we are using hazard pointers for synchronization, should we also take
the "_rcu" suffix off the list iteration/insertion/deletion macros to
avoid giving the impression that RCU is being used?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 16:40                             ` Waiman Long
@ 2025-03-26 16:47                               ` Boqun Feng
  2025-03-26 17:02                                 ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-26 16:47 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote:
> On 3/26/25 11:39 AM, Waiman Long wrote:
> > On 3/26/25 1:25 AM, Boqun Feng wrote:
> > > > It looks like you are trying hard to find a use case for hazard pointer in
> > > > the kernel 🙂
> > > > 
> > > Well, if it does the job, why not use it 😉 Also this shows how
> > > flexible hazard pointers can be.
> > > 
> > > At least when using hazard pointers, the reader side of the hash list
> > > iteration is still lockless. Plus, since the synchronization part
> > > doesn't need to wait for the RCU readers in the whole system, it will be
> > > faster (I tried with the protecting-the-whole-hash-list approach as
> > > well, it's the same result on the tc command). This is why I choose to
> > > look into hazard pointers. Another mechanism can achieve the similar
> > > behavior is SRCU, but SRCU is slightly heavier compared to hazard
> > > pointers in this case (of course SRCU has more functionalities).
> > > 
> > > We can provide a lockdep_unregister_key_nosync() without the
> > > synchronize_rcu() in it and let users do the synchronization, but it's
> > > going to be hard to enforce and review, especially when someone
> > > refactors the code and move the free code to somewhere else.
> > Providing a second API and ask callers to do the right thing is probably
> > not a good idea and mistake is going to be made sooner or later.
> > > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > > of an interrupt context on top of a task context. It is possible that the
> > > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > > interrupt happens when the task has set the hazard pointer and iterating the
> > > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > > we could have multiple slots for the hazard pointer, but that will make the
> > > > code more complicated. Or we could disable interrupt before setting the
> > > > hazard pointer.
> > > Or we can use lockdep_recursion:
> > > 
> > > 	preempt_disable();
> > > 	lockdep_recursion_inc();
> > > 	barrier();
> > > 
> > > 	WRITE_ONCE(*hazptr, ...);
> > > 
> > > , it should prevent the re-entrant of lockdep in irq.
> > That will probably work. Or we can disable irq. I am fine with both.
> > > > The solution that I am thinking about is to have a simple unfair rwlock to
> > > > protect just the hash list iteration. lockdep_unregister_key() and
> > > > lockdep_register_key() take the write lock with interrupt disabled. While
> > > > is_dynamic_key() takes the read lock. Nesting in this case isn't a problem
> > > > and we don't need RCU to protect the iteration process and so the last
> > > > synchronize_rcu() call isn't needed. The level of contention should be low
> > > > enough that live lock isn't an issue.
> > > > 
> > > This could work, one thing though is that locks don't compose. Using a
> > > hash write_lock in lockdep_unregister_key() will create a lockdep_lock()
> > > -> "hash write_lock" dependency, and that means you cannot
> > > lockdep_lock() while you're holding a hash read_lock, although it's
> > > not the case today, but it certainly complicates the locking design
> > > inside lockdep where there's no lockdep to help 😉
> > 
> > Thinking about it more, doing it in a lockless way is probably a good
> > idea.
> > 
> If we are using hazard pointer for synchronization, should we also take off
> "_rcu" from the list iteration/insertion/deletion macros to avoid the
> confusion that RCU is being used?
> 

We can, but we probably want to introduce a new set of APIs with a
"_lockless" suffix or something, because they will still need to work in
a lockless fashion similar to RCU list iteration/insertion/deletion.

Regards,
Boqun

> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 16:47                               ` Boqun Feng
@ 2025-03-26 17:02                                 ` Waiman Long
  2025-03-26 17:10                                   ` Paul E. McKenney
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-26 17:02 UTC (permalink / raw)
  To: Boqun Feng, Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney


On 3/26/25 12:47 PM, Boqun Feng wrote:
> On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote:
>> On 3/26/25 11:39 AM, Waiman Long wrote:
>>> On 3/26/25 1:25 AM, Boqun Feng wrote:
>>>>> It looks like you are trying hard to find a use case for hazard pointer in
>>>>> the kernel 🙂
>>>>>
>>>> Well, if it does the job, why not use it 😉 Also this shows how
>>>> flexible hazard pointers can be.
>>>>
>>>> At least when using hazard pointers, the reader side of the hash list
>>>> iteration is still lockless. Plus, since the synchronization part
>>>> doesn't need to wait for the RCU readers in the whole system, it will be
>>>> faster (I tried with the protecting-the-whole-hash-list approach as
>>>> well, it's the same result on the tc command). This is why I choose to
>>>> look into hazard pointers. Another mechanism can achieve the similar
>>>> behavior is SRCU, but SRCU is slightly heavier compared to hazard
>>>> pointers in this case (of course SRCU has more functionalities).
>>>>
>>>> We can provide a lockdep_unregister_key_nosync() without the
>>>> synchronize_rcu() in it and let users do the synchronization, but it's
>>>> going to be hard to enforce and review, especially when someone
>>>> refactors the code and move the free code to somewhere else.
>>> Providing a second API and ask callers to do the right thing is probably
>>> not a good idea and mistake is going to be made sooner or later.
>>>>> Anyway, that may work. The only problem that I see is the issue of nesting
>>>>> of an interrupt context on top of a task context. It is possible that the
>>>>> first use of a raw_spinlock may happen in an interrupt context. If the
>>>>> interrupt happens when the task has set the hazard pointer and iterating the
>>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively
>>>>> we could have multiple slots for the hazard pointer, but that will make the
>>>>> code more complicated. Or we could disable interrupt before setting the
>>>>> hazard pointer.
>>>> Or we can use lockdep_recursion:
>>>>
>>>> 	preempt_disable();
>>>> 	lockdep_recursion_inc();
>>>> 	barrier();
>>>>
>>>> 	WRITE_ONCE(*hazptr, ...);
>>>>
>>>> , it should prevent the re-entrant of lockdep in irq.
>>> That will probably work. Or we can disable irq. I am fine with both.
>>>>> The solution that I am thinking about is to have a simple unfair rwlock to
>>>>> protect just the hash list iteration. lockdep_unregister_key() and
>>>>> lockdep_register_key() take the write lock with interrupt disabled. While
>>>>> is_dynamic_key() takes the read lock. Nesting in this case isn't a problem
>>>>> and we don't need RCU to protect the iteration process and so the last
>>>>> synchronize_rcu() call isn't needed. The level of contention should be low
>>>>> enough that live lock isn't an issue.
>>>>>
>>>> This could work, one thing though is that locks don't compose. Using a
>>>> hash write_lock in lockdep_unregister_key() will create a lockdep_lock()
>>>> -> "hash write_lock" dependency, and that means you cannot
>>>> lockdep_lock() while you're holding a hash read_lock, although it's
>>>> not the case today, but it certainly complicates the locking design
>>>> inside lockdep where there's no lockdep to help 😉
>>> Thinking about it more, doing it in a lockless way is probably a good
>>> idea.
>>>
>> If we are using hazard pointer for synchronization, should we also take off
>> "_rcu" from the list iteration/insertion/deletion macros to avoid the
>> confusion that RCU is being used?
>>
> We can, but we probably want to introduce a new set of API with suffix
> "_lockless" or something because they will still need a lockless fashion
> similar to RCU list iteration/insertion/deletion.

The lockless part is just the iteration of the list. Insertion and
deletion are protected by lockdep_lock().

The current hlist_*_rcu() macros are doing the right things for the
lockless use case too. We can either document that RCU is not being used
or have some _lockless helpers that just call the _rcu equivalents.
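
For illustration, such naming-only wrappers could be as thin as this
(hypothetical names; they simply forward to the _rcu implementations,
whose lockless publish/consume semantics are what is actually relied
upon):

/* Lockless hash-list helpers: same semantics as the _rcu versions. */
#define hlist_add_head_lockless(n, h)	hlist_add_head_rcu(n, h)
#define hlist_del_lockless(n)		hlist_del_rcu(n)
#define hlist_for_each_entry_lockless(pos, head, member, cond...) \
	hlist_for_each_entry_rcu(pos, head, member, ## cond)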

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 17:02                                 ` Waiman Long
@ 2025-03-26 17:10                                   ` Paul E. McKenney
  2025-03-26 18:42                                     ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2025-03-26 17:10 UTC (permalink / raw)
  To: Waiman Long
  Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote:
> 
> On 3/26/25 12:47 PM, Boqun Feng wrote:
> > On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote:
> > > On 3/26/25 11:39 AM, Waiman Long wrote:
> > > > On 3/26/25 1:25 AM, Boqun Feng wrote:
> > > > > > It looks like you are trying hard to find a use case for hazard pointer in
> > > > > > the kernel 🙂
> > > > > > 
> > > > > Well, if it does the job, why not use it 😉 Also this shows how
> > > > > flexible hazard pointers can be.
> > > > > 
> > > > > At least when using hazard pointers, the reader side of the hash list
> > > > > iteration is still lockless. Plus, since the synchronization part
> > > > > doesn't need to wait for the RCU readers in the whole system, it will be
> > > > > faster (I tried with the protecting-the-whole-hash-list approach as
> > > > > well, it's the same result on the tc command). This is why I choose to
> > > > > look into hazard pointers. Another mechanism can achieve the similar
> > > > > behavior is SRCU, but SRCU is slightly heavier compared to hazard
> > > > > pointers in this case (of course SRCU has more functionalities).
> > > > > 
> > > > > We can provide a lockdep_unregister_key_nosync() without the
> > > > > synchronize_rcu() in it and let users do the synchronization, but it's
> > > > > going to be hard to enforce and review, especially when someone
> > > > > refactors the code and move the free code to somewhere else.
> > > > Providing a second API and ask callers to do the right thing is probably
> > > > not a good idea and mistake is going to be made sooner or later.
> > > > > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > > > > of an interrupt context on top of a task context. It is possible that the
> > > > > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > > > > interrupt happens when the task has set the hazard pointer and iterating the
> > > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > > > > we could have multiple slots for the hazard pointer, but that will make the
> > > > > > code more complicated. Or we could disable interrupt before setting the
> > > > > > hazard pointer.
> > > > > Or we can use lockdep_recursion:
> > > > > 
> > > > > 	preempt_disable();
> > > > > 	lockdep_recursion_inc();
> > > > > 	barrier();
> > > > > 
> > > > > 	WRITE_ONCE(*hazptr, ...);
> > > > > 
> > > > > , it should prevent the re-entrant of lockdep in irq.
> > > > That will probably work. Or we can disable irq. I am fine with both.
> > > > > > The solution that I am thinking about is to have a simple unfair rwlock to
> > > > > > protect just the hash list iteration. lockdep_unregister_key() and
> > > > > > lockdep_register_key() take the write lock with interrupt disabled. While
> > > > > > is_dynamic_key() takes the read lock. Nesting in this case isn't a problem
> > > > > > and we don't need RCU to protect the iteration process and so the last
> > > > > > synchronize_rcu() call isn't needed. The level of contention should be low
> > > > > > enough that live lock isn't an issue.
> > > > > > 
> > > > > This could work, one thing though is that locks don't compose. Using a
> > > > > hash write_lock in lockdep_unregister_key() will create a lockdep_lock()
> > > > > -> "hash write_lock" dependency, and that means you cannot
> > > > > lockdep_lock() while you're holding a hash read_lock, although it's
> > > > > not the case today, but it certainly complicates the locking design
> > > > > inside lockdep where there's no lockdep to help 😉
> > > > Thinking about it more, doing it in a lockless way is probably a good
> > > > idea.
> > > > 
> > > If we are using hazard pointer for synchronization, should we also take off
> > > "_rcu" from the list iteration/insertion/deletion macros to avoid the
> > > confusion that RCU is being used?
> > > 
> > We can, but we probably want to introduce a new set of API with suffix
> > "_lockless" or something because they will still need a lockless fashion
> > similar to RCU list iteration/insertion/deletion.
> 
> The lockless part is just the iteration of the list. Insertion and deletion
> is protected by lockdep_lock().
> 
> The current hlist_*_rcu() macros are doing the right things for lockless use
> case too. We can either document that RCU is not being used or have some
> _lockless helpers that just call the _rcu equivalent.

We used to have _lockless helpers, but we got rid of them.  Not
necessarily meaning that we should not add them back in, but...  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 17:10                                   ` Paul E. McKenney
@ 2025-03-26 18:42                                     ` Boqun Feng
  2025-03-26 21:37                                       ` Paul E. McKenney
  2025-03-31 16:48                                       ` Breno Leitao
  0 siblings, 2 replies; 35+ messages in thread
From: Boqun Feng @ 2025-03-26 18:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Waiman Long, Eric Dumazet, Peter Zijlstra, Breno Leitao,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote:
> On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote:
[...]
> > > > > Thinking about it more, doing it in a lockless way is probably a good
> > > > > idea.
> > > > > 
> > > > If we are using hazard pointer for synchronization, should we also take off
> > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the
> > > > confusion that RCU is being used?
> > > > 
> > > We can, but we probably want to introduce a new set of API with suffix
> > > "_lockless" or something because they will still need a lockless fashion
> > > similar to RCU list iteration/insertion/deletion.
> > 
> > The lockless part is just the iteration of the list. Insertion and deletion
> > is protected by lockdep_lock().
> > 
> > The current hlist_*_rcu() macros are doing the right things for lockless use
> > case too. We can either document that RCU is not being used or have some
> > _lockless helpers that just call the _rcu equivalent.
> 
> We used to have _lockless helper, but we got rid of them.  Not necessarily
> meaning that we should not add them back in, but...  ;-)
> 

I will probably go with using *_rcu() first, with some comments. If this
"hazard pointers for hash table" approach turns out to be a good idea in
other places, we can add *_hazptr() or pick a better name then.

Regards,
Boqun

> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 18:42                                     ` Boqun Feng
@ 2025-03-26 21:37                                       ` Paul E. McKenney
  2025-03-31 16:48                                       ` Breno Leitao
  1 sibling, 0 replies; 35+ messages in thread
From: Paul E. McKenney @ 2025-03-26 21:37 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Waiman Long, Eric Dumazet, Peter Zijlstra, Breno Leitao,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote:
> On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote:
> [...]
> > > > > > Thinking about it more, doing it in a lockless way is probably a good
> > > > > > idea.
> > > > > > 
> > > > > If we are using hazard pointer for synchronization, should we also take off
> > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the
> > > > > confusion that RCU is being used?
> > > > > 
> > > > We can, but we probably want to introduce a new set of API with suffix
> > > > "_lockless" or something because they will still need a lockless fashion
> > > > similar to RCU list iteration/insertion/deletion.
> > > 
> > > The lockless part is just the iteration of the list. Insertion and deletion
> > > is protected by lockdep_lock().
> > > 
> > > The current hlist_*_rcu() macros are doing the right things for lockless use
> > > case too. We can either document that RCU is not being used or have some
> > > _lockless helpers that just call the _rcu equivalent.
> > 
> > We used to have _lockless helper, but we got rid of them.  Not necessarily
> > meaning that we should not add them back in, but...  ;-)
> > 
> 
> I will probably go with using *_rcu() first with some comments, if this
> "hazard pointers for hash table" is a good idea in other places, we can
> add *_hazptr() or pick a better name then.

Works for me!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-26 18:42                                     ` Boqun Feng
  2025-03-26 21:37                                       ` Paul E. McKenney
@ 2025-03-31 16:48                                       ` Breno Leitao
  2025-03-31 17:34                                         ` Boqun Feng
  1 sibling, 1 reply; 35+ messages in thread
From: Breno Leitao @ 2025-03-31 16:48 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Paul E. McKenney, Waiman Long, Eric Dumazet, Peter Zijlstra,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

Hello Boqun, Waiman

On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote:
> On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote:
> > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote:
> [...]
> > > > > > Thinking about it more, doing it in a lockless way is probably a good
> > > > > > idea.
> > > > > > 
> > > > > If we are using hazard pointer for synchronization, should we also take off
> > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the
> > > > > confusion that RCU is being used?
> > > > > 
> > > > We can, but we probably want to introduce a new set of API with suffix
> > > > "_lockless" or something because they will still need a lockless fashion
> > > > similar to RCU list iteration/insertion/deletion.
> > > 
> > > The lockless part is just the iteration of the list. Insertion and deletion
> > > is protected by lockdep_lock().
> > > 
> > > The current hlist_*_rcu() macros are doing the right things for lockless use
> > > case too. We can either document that RCU is not being used or have some
> > > _lockless helpers that just call the _rcu equivalent.
> > 
> > We used to have _lockless helper, but we got rid of them.  Not necessarily
> > meaning that we should not add them back in, but...  ;-)
> > 
> 
> I will probably go with using *_rcu() first with some comments, if this
> "hazard pointers for hash table" is a good idea in other places, we can
> add *_hazptr() or pick a better name then.

I am trying to figure out what the next steps are to get this issue
solved.

Would you mind helping me understand which _rcu() functions you are
suggesting and what they will replace?

Thank you,
--breno

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
       [not found]                           ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>
  2025-03-26 16:40                             ` Waiman Long
@ 2025-03-31 17:26                             ` Boqun Feng
  2025-03-31 17:33                               ` Waiman Long
  2025-03-31 17:42                               ` Eric Dumazet
  1 sibling, 2 replies; 35+ messages in thread
From: Boqun Feng @ 2025-03-31 17:26 UTC (permalink / raw)
  To: Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
[...]
> > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > of an interrupt context on top of a task context. It is possible that the
> > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > interrupt happens when the task has set the hazard pointer and iterating the
> > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > we could have multiple slots for the hazard pointer, but that will make the
> > > code more complicated. Or we could disable interrupt before setting the
> > > hazard pointer.
> > Or we can use lockdep_recursion:
> > 
> > 	preempt_disable();
> > 	lockdep_recursion_inc();
> > 	barrier();
> > 
> > 	WRITE_ONCE(*hazptr, ...);
> > 
> > , it should prevent the re-entrant of lockdep in irq.
> That will probably work. Or we can disable irq. I am fine with both.

Disabling irq may not work in this case, because an NMI can also happen
and call register_lock_class().

I'm experimenting with a new idea here that might be better (for general
cases), and it has a similar spirit to moving the protection scope of a
hazard pointer from a key to a hash_list: we can introduce a wildcard
address, and whenever we do a synchronize_hazptr(), if a hazptr slot
equals the wildcard we treat it as matching any pointer, hence
synchronize_hazptr() will still wait until that slot is zero'd. Not only
could this help in the nesting case, it can also be used if users want
to protect multiple things with this simple hazard pointer
implementation.
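
A small sketch of what that wildcard could look like (the names and the
wildcard value are made up for illustration):

#define HAZPTR_WILDCARD		((void *)1UL)

static DEFINE_PER_CPU(void *, hazptr_slot);

static void synchronize_hazptr(void *ptr)
{
	int cpu;

	/* Pairs with the smp_mb() the reader issues after publishing. */
	smp_mb();

	for_each_possible_cpu(cpu) {
		void *slot;

		for (;;) {
			slot = READ_ONCE(per_cpu(hazptr_slot, cpu));
			/* A wildcard slot blocks every synchronize_hazptr() caller. */
			if (slot != ptr && slot != HAZPTR_WILDCARD)
				break;
			cpu_relax();
		}
	}
}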

Regards,
Boqun

[..]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 17:26                             ` Boqun Feng
@ 2025-03-31 17:33                               ` Waiman Long
  2025-03-31 18:33                                 ` Paul E. McKenney
  2025-03-31 17:42                               ` Eric Dumazet
  1 sibling, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-31 17:33 UTC (permalink / raw)
  To: Boqun Feng, Waiman Long
  Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 3/31/25 1:26 PM, Boqun Feng wrote:
> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
> [...]
>>>> Anyway, that may work. The only problem that I see is the issue of nesting
>>>> of an interrupt context on top of a task context. It is possible that the
>>>> first use of a raw_spinlock may happen in an interrupt context. If the
>>>> interrupt happens when the task has set the hazard pointer and iterating the
>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively
>>>> we could have multiple slots for the hazard pointer, but that will make the
>>>> code more complicated. Or we could disable interrupt before setting the
>>>> hazard pointer.
>>> Or we can use lockdep_recursion:
>>>
>>> 	preempt_disable();
>>> 	lockdep_recursion_inc();
>>> 	barrier();
>>>
>>> 	WRITE_ONCE(*hazptr, ...);
>>>
>>> , it should prevent the re-entrant of lockdep in irq.
>> That will probably work. Or we can disable irq. I am fine with both.
> Disabling irq may not work in this case, because an NMI can also happen
> and call register_lock_class().
Right, disabling irqs doesn't work with NMIs. So incrementing the
recursion count is likely the way to go, and I think it will work even
in the NMI case.

>
> I'm experimenting a new idea here, it might be better (for general
> cases), and this has the similar spirit that we could move the
> protection scope of a hazard pointer from a key to a hash_list: we can
> introduce a wildcard address, and whenever we do a synchronize_hazptr(),
> if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
> hence synchronize_hazptr() will still wait until it's zero'd. Not only
> this could help in the nesting case, it can also be used if the users
> want to protect multiple things with this simple hazard pointer
> implementation.

I think it is a good idea to add a wildcard for the general use case. 
Setting the hazptr to the list head will be enough for this particular case.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 16:48                                       ` Breno Leitao
@ 2025-03-31 17:34                                         ` Boqun Feng
  0 siblings, 0 replies; 35+ messages in thread
From: Boqun Feng @ 2025-03-31 17:34 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Paul E. McKenney, Waiman Long, Eric Dumazet, Peter Zijlstra,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On Mon, Mar 31, 2025 at 09:48:34AM -0700, Breno Leitao wrote:
> Hello Boqun, Waimn
> 
> On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote:
> > On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote:
> > > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote:
> > [...]
> > > > > > > Thinking about it more, doing it in a lockless way is probably a good
> > > > > > > idea.
> > > > > > > 
> > > > > > If we are using hazard pointer for synchronization, should we also take off
> > > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the
> > > > > > confusion that RCU is being used?
> > > > > > 
> > > > > We can, but we probably want to introduce a new set of API with suffix
> > > > > "_lockless" or something because they will still need a lockless fashion
> > > > > similar to RCU list iteration/insertion/deletion.
> > > > 
> > > > The lockless part is just the iteration of the list. Insertion and deletion
> > > > is protected by lockdep_lock().
> > > > 
> > > > The current hlist_*_rcu() macros are doing the right things for lockless use
> > > > case too. We can either document that RCU is not being used or have some
> > > > _lockless helpers that just call the _rcu equivalent.
> > > 
> > > We used to have _lockless helper, but we got rid of them.  Not necessarily
> > > meaning that we should not add them back in, but...  ;-)
> > > 
> > 
> > I will probably go with using *_rcu() first with some comments, if this
> > "hazard pointers for hash table" is a good idea in other places, we can
> > add *_hazptr() or pick a better name then.
> 
> I am trying to figure out what the next steps are to get this issue
> solved.
> 

I will send out a series introducing simple hazard pointers and using
them in lockdep for this case; hopefully that can resolve your issue.

> Would you mind helping me understand which _rcu() functions you are
> suggesting and what they will replace?
> 

The _rcu() functions we are talking about are:
hlist_for_each_entry_rcu() in is_dynamic_key(), hlist_add_head_rcu() in
lockdep_register_key() and hlist_del_rcu() in lockdep_unregister_key(),
because if we move to hazptr, those lists are technically no longer
protected by RCU. The implementation of these functions is still correct
with hazptr; it's just that their names might be confusing, and we may
change them later.
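
For reference, the lookup in question looks roughly like this (paraphrased
from kernel/locking/lockdep.c, so the details may be slightly off):

static bool is_dynamic_key(const struct lock_class_key *key)
{
	struct hlist_head *hash_head;
	struct lock_class_key *k;
	bool found = false;

	if (WARN_ON_ONCE(static_obj((void *)key)))
		return false;

	/* Avoid a use-after-free if lock debugging got disabled. */
	if (!debug_locks)
		return true;

	hash_head = keyhashentry(key);

	rcu_read_lock();
	hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
		if (k == key) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();

	return found;
}

With hazptr, the rcu_read_lock()/rcu_read_unlock() pair would become the
hazptr protect/unprotect of the hash list head, while the list walk itself
stays the same.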

Hope this helps.

Regards,
Boqun


> Thank you,
> --breno

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 17:26                             ` Boqun Feng
  2025-03-31 17:33                               ` Waiman Long
@ 2025-03-31 17:42                               ` Eric Dumazet
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2025-03-31 17:42 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Waiman Long, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On Mon, Mar 31, 2025 at 7:26 PM Boqun Feng <boqun.feng@gmail.com> wrote:
>
> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
> [...]
> > > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > > of an interrupt context on top of a task context. It is possible that the
> > > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > > interrupt happens when the task has set the hazard pointer and iterating the
> > > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > > we could have multiple slots for the hazard pointer, but that will make the
> > > > code more complicated. Or we could disable interrupt before setting the
> > > > hazard pointer.
> > > Or we can use lockdep_recursion:
> > >
> > >     preempt_disable();
> > >     lockdep_recursion_inc();
> > >     barrier();
> > >
> > >     WRITE_ONCE(*hazptr, ...);
> > >
> > > , it should prevent the re-entrant of lockdep in irq.
> > That will probably work. Or we can disable irq. I am fine with both.
>
> Disabling irq may not work in this case, because an NMI can also happen
> and call register_lock_class().
>
> I'm experimenting a new idea here, it might be better (for general
> cases), and this has the similar spirit that we could move the
> protection scope of a hazard pointer from a key to a hash_list: we can
> introduce a wildcard address, and whenever we do a synchronize_hazptr(),
> if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
> hence synchronize_hazptr() will still wait until it's zero'd. Not only
> this could help in the nesting case, it can also be used if the users
> want to protect multiple things with this simple hazard pointer
> implementation.

For the record, this was my suggestion.

Breno, what do you think? Feel free to grab, test, add comments and a
nice changelog, thanks!

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 67964dc4db952ea11d4b88554383ea0ec5946ef9..fb56bcb887ca163525f003cb7880c76511d166e7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -117,7 +117,12 @@ do {						\
 } while (0)

 extern void lockdep_register_key(struct lock_class_key *key);
-extern void lockdep_unregister_key(struct lock_class_key *key);
+
+extern void __lockdep_unregister_key(struct lock_class_key *key, bool sync);
+static inline void lockdep_unregister_key(struct lock_class_key *key)
+{
+       return __lockdep_unregister_key(key, true);
+}

 /*
  * These methods are used by specific locking variants (spinlocks,
@@ -372,8 +377,12 @@ static inline void lockdep_register_key(struct lock_class_key *key)
 {
 }

+static inline void __lockdep_unregister_key(struct lock_class_key *key, bool sync)
+{
+}
 static inline void lockdep_unregister_key(struct lock_class_key *key)
 {
+       return __lockdep_unregister_key(key, true);
 }

 #define lockdep_depth(tsk)     (0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b15757e6362615aeecb1629f8e60375e13a87e6d..3b754c7fdaf19887734b1ee37f7c058f8f751a4e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6574,7 +6574,7 @@ void lockdep_reset_lock(struct lockdep_map *lock)
  * key irrespective of debug_locks to avoid potential invalid access to freed
  * memory in lock_class entry.
  */
-void lockdep_unregister_key(struct lock_class_key *key)
+void __lockdep_unregister_key(struct lock_class_key *key, bool sync)
 {
        struct hlist_head *hash_head = keyhashentry(key);
        struct lock_class_key *k;
@@ -6610,8 +6610,11 @@ void lockdep_unregister_key(struct lock_class_key *key)
        if (need_callback)
                call_rcu(&delayed_free.rcu_head, free_zapped_rcu);

-       /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
-       synchronize_rcu();
+       /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
+        * @sync is false for callers dealing with the sync themselves.
+        */
+       if (sync)
+               synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(lockdep_unregister_key);

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 14ab2f4c190a1e201dd1788b413a06e799a829f2..9affef187237a13559642d435e1a196a909f75f9 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -988,8 +988,8 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,

        return sch;
 errout1:
-       lockdep_unregister_key(&sch->root_lock_key);
-       kfree(sch);
+       __lockdep_unregister_key(&sch->root_lock_key, false);
+       kfree_rcu(sch, rcu);
 errout:
        return ERR_PTR(err);
 }
@@ -1077,7 +1077,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
        if (ops->destroy)
                ops->destroy(qdisc);

-       lockdep_unregister_key(&qdisc->root_lock_key);
+       __lockdep_unregister_key(&qdisc->root_lock_key, false);
        module_put(ops->owner);
        netdev_put(dev, &qdisc->dev_tracker);

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 17:33                               ` Waiman Long
@ 2025-03-31 18:33                                 ` Paul E. McKenney
  2025-03-31 18:57                                   ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Paul E. McKenney @ 2025-03-31 18:33 UTC (permalink / raw)
  To: Waiman Long
  Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote:
> On 3/31/25 1:26 PM, Boqun Feng wrote:
> > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
> > [...]
> > > > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > > > of an interrupt context on top of a task context. It is possible that the
> > > > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > > > interrupt happens when the task has set the hazard pointer and iterating the
> > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > > > we could have multiple slots for the hazard pointer, but that will make the
> > > > > code more complicated. Or we could disable interrupt before setting the
> > > > > hazard pointer.
> > > > Or we can use lockdep_recursion:
> > > > 
> > > > 	preempt_disable();
> > > > 	lockdep_recursion_inc();
> > > > 	barrier();
> > > > 
> > > > 	WRITE_ONCE(*hazptr, ...);
> > > > 
> > > > , it should prevent the re-entrant of lockdep in irq.
> > > That will probably work. Or we can disable irq. I am fine with both.
> > Disabling irq may not work in this case, because an NMI can also happen
> > and call register_lock_class().
> Right, disabling irq doesn't work with NMI. So incrementing the recursion
> count is likely the way to go and I think it will work even in the NMI case.
> 
> > 
> > I'm experimenting a new idea here, it might be better (for general
> > cases), and this has the similar spirit that we could move the
> > protection scope of a hazard pointer from a key to a hash_list: we can
> > introduce a wildcard address, and whenever we do a synchronize_hazptr(),
> > if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
> > hence synchronize_hazptr() will still wait until it's zero'd. Not only
> > this could help in the nesting case, it can also be used if the users
> > want to protect multiple things with this simple hazard pointer
> > implementation.
> 
> I think it is a good idea to add a wildcard for the general use case.
> Setting the hazptr to the list head will be enough for this particular case.

Careful!  If we enable use of wildcards outside of the special case
of synchronize_hazptr(), we give up the small-memory-footprint advantages
of hazard pointers.  You end up having to wait on all hazard-pointer
readers, which was exactly why RCU was troublesome here.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 18:33                                 ` Paul E. McKenney
@ 2025-03-31 18:57                                   ` Waiman Long
  2025-03-31 21:21                                     ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-03-31 18:57 UTC (permalink / raw)
  To: paulmck, Waiman Long
  Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao,
	Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs,
	kernel-team, Erik Lundgren

On 3/31/25 2:33 PM, Paul E. McKenney wrote:
> On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote:
>> On 3/31/25 1:26 PM, Boqun Feng wrote:
>>> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
>>> [...]
>>>>>> Anyway, that may work. The only problem that I see is the issue of nesting
>>>>>> of an interrupt context on top of a task context. It is possible that the
>>>>>> first use of a raw_spinlock may happen in an interrupt context. If the
>>>>>> interrupt happens when the task has set the hazard pointer and iterating the
>>>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively
>>>>>> we could have multiple slots for the hazard pointer, but that will make the
>>>>>> code more complicated. Or we could disable interrupt before setting the
>>>>>> hazard pointer.
>>>>> Or we can use lockdep_recursion:
>>>>>
>>>>> 	preempt_disable();
>>>>> 	lockdep_recursion_inc();
>>>>> 	barrier();
>>>>>
>>>>> 	WRITE_ONCE(*hazptr, ...);
>>>>>
>>>>> , it should prevent the re-entrant of lockdep in irq.
>>>> That will probably work. Or we can disable irq. I am fine with both.
>>> Disabling irq may not work in this case, because an NMI can also happen
>>> and call register_lock_class().
>> Right, disabling irq doesn't work with NMI. So incrementing the recursion
>> count is likely the way to go and I think it will work even in the NMI case.
>>
>>> I'm experimenting a new idea here, it might be better (for general
>>> cases), and this has the similar spirit that we could move the
>>> protection scope of a hazard pointer from a key to a hash_list: we can
>>> introduce a wildcard address, and whenever we do a synchronize_hazptr(),
>>> if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
>>> hence synchronize_hazptr() will still wait until it's zero'd. Not only
>>> this could help in the nesting case, it can also be used if the users
>>> want to protect multiple things with this simple hazard pointer
>>> implementation.
>> I think it is a good idea to add a wildcard for the general use case.
>> Setting the hazptr to the list head will be enough for this particular case.
> Careful!  If we enable use of wildcards outside of the special case
> of synchronize_hazptr(), we give up the small-memory-footprint advantages
> of hazard pointers.  You end up having to wait on all hazard-pointer
> readers, which was exactly why RCU was troublesome here.  ;-)

If the plan is to have one global set of hazard pointers for all the 
possible use cases, supporting wildcard may be a problem. If we allow 
different sets of hazard pointers for different use cases, it will be 
less an issue. Anyway, maybe we should skip wildcard for the current 
case so that we have more time to think through it first.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 18:57                                   ` Waiman Long
@ 2025-03-31 21:21                                     ` Boqun Feng
  2025-03-31 21:47                                       ` Waiman Long
  0 siblings, 1 reply; 35+ messages in thread
From: Boqun Feng @ 2025-03-31 21:21 UTC (permalink / raw)
  To: Waiman Long
  Cc: paulmck, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren

On Mon, Mar 31, 2025 at 02:57:20PM -0400, Waiman Long wrote:
> On 3/31/25 2:33 PM, Paul E. McKenney wrote:
> > On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote:
> > > On 3/31/25 1:26 PM, Boqun Feng wrote:
> > > > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
> > > > [...]
> > > > > > > Anyway, that may work. The only problem that I see is the issue of nesting
> > > > > > > of an interrupt context on top of a task context. It is possible that the
> > > > > > > first use of a raw_spinlock may happen in an interrupt context. If the
> > > > > > > interrupt happens when the task has set the hazard pointer and iterating the
> > > > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively
> > > > > > > we could have multiple slots for the hazard pointer, but that will make the
> > > > > > > code more complicated. Or we could disable interrupt before setting the
> > > > > > > hazard pointer.
> > > > > > Or we can use lockdep_recursion:
> > > > > > 
> > > > > > 	preempt_disable();
> > > > > > 	lockdep_recursion_inc();
> > > > > > 	barrier();
> > > > > > 
> > > > > > 	WRITE_ONCE(*hazptr, ...);
> > > > > > 
> > > > > > , it should prevent the re-entrant of lockdep in irq.
> > > > > That will probably work. Or we can disable irq. I am fine with both.
> > > > Disabling irq may not work in this case, because an NMI can also happen
> > > > and call register_lock_class().
> > > Right, disabling irq doesn't work with NMI. So incrementing the recursion
> > > count is likely the way to go and I think it will work even in the NMI case.
> > > 
> > > > I'm experimenting a new idea here, it might be better (for general
> > > > cases), and this has the similar spirit that we could move the
> > > > protection scope of a hazard pointer from a key to a hash_list: we can
> > > > introduce a wildcard address, and whenever we do a synchronize_hazptr(),
> > > > if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
> > > > hence synchronize_hazptr() will still wait until it's zero'd. Not only
> > > > this could help in the nesting case, it can also be used if the users
> > > > want to protect multiple things with this simple hazard pointer
> > > > implementation.
> > > I think it is a good idea to add a wildcard for the general use case.
> > > Setting the hazptr to the list head will be enough for this particular case.
> > Careful!  If we enable use of wildcards outside of the special case
> > of synchronize_hazptr(), we give up the small-memory-footprint advantages
> > of hazard pointers.  You end up having to wait on all hazard-pointer
> > readers, which was exactly why RCU was troublesome here.  ;-)

Technically, only the hazard-pointer readers that have switched to
wildcard mode because of multiple hazptr critical sections ;-)

> 
> If the plan is to have one global set of hazard pointers for all the

A global set of hazard pointers for all the possible use cases is the
current plan (at least it should be when we have fully-featured hazptr
[1]). Because the hazard pointer value already points to the data to
protect, there is no need to group things into "domain"s.

> possible use cases, supporting wildcard may be a problem. If we allow

I had some off-list discussions with Paul, and I ended up with the idea
of user-specific wildcards (i.e. different users can have different
wildcards) + one global set of hazard pointers. However, it just occurred
to me that it won't quite work in this simple hazard pointer
implementation (one slot per CPU) :( Because you can have a user A's
hazptr critical section interrupted by a user B's interrupt handler, and
if both A & B are using customized wildcards but they don't know about
each other, it's not going to work by setting either wildcard value into
the slot.

To make it clear for the discussion, we have two hazard pointer
implementations:

1. The fully-featured one [1], which allows users to provide memory for
   hazptr slots, so there is no issue with nesting/re-entry etc. And
   wildcard doesn't make sense in this implementation.

2. The simple variant, which is what I've proposed in this thread, and
   since it only has one slot per CPU, either all the users need to
   prevent the re-entries or we need a global wildcard. Also the readers
   of the simple variant need to disable preemption regardless, because
   it only has one hazptr slot to use. That means its read-side critical
   section should usually be short.

I could try to use the fully-featured one in lockdep; what I need to do
is create enough hazard_context structures so we have enough slots for
lockdep, and we may or may not need lockdep_recursion to prevent
re-entries. However, I still believe (though I don't have data to show
otherwise) that the simple variant with one slot per CPU + a global
wildcard will work fine in practice.
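
To make the read side concrete, here is a rough sketch of the simple
variant with the global wildcard covering the nesting case (placeholder
names, untested):

/* Sketch only; shazptr_slot / SHAZPTR_WILDCARD are placeholder names. */
static inline void *shazptr_protect(void *ptr)
{
	void **slot;
	void *old;

	preempt_disable();
	slot = this_cpu_ptr(&shazptr_slot);
	old = *slot;
	/*
	 * Nested read-side section (e.g. from an irq or NMI): switch the
	 * single slot to the global wildcard, so synchronize_shazptr()
	 * waits for this CPU no matter which object the outer section
	 * was protecting.
	 */
	WRITE_ONCE(*slot, old ? SHAZPTR_WILDCARD : ptr);
	smp_mb();	/* pairs with the barrier in synchronize_shazptr() */

	return old;	/* cookie to restore in shazptr_unprotect() */
}

static inline void shazptr_unprotect(void *old)
{
	smp_mb();	/* finish the read-side accesses first */
	WRITE_ONCE(*this_cpu_ptr(&shazptr_slot), old);
	preempt_enable();
}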

So what I would like to do is introduce the simple variant as a general
API with a global wildcard (because without it, it cannot be a general
API, since one user would have to prevent entering another user's
critical section), and lockdep can use it. We can then monitor the
delay of synchronize_shazptr(), and if the wildcard becomes a problem,
move to a fully-featured hazptr implementation. Sounds like a plan?

[1]: https://lore.kernel.org/lkml/20240917143402.930114-2-boqun.feng@gmail.com/

Regards,
Boqun

> different sets of hazard pointers for different use cases, it will be less
> an issue. Anyway, maybe we should skip wildcard for the current case so that
> we have more time to think through it first.
> 
> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-31 21:21                                     ` Boqun Feng
@ 2025-03-31 21:47                                       ` Waiman Long
  0 siblings, 0 replies; 35+ messages in thread
From: Waiman Long @ 2025-03-31 21:47 UTC (permalink / raw)
  To: Boqun Feng, Waiman Long
  Cc: paulmck, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar,
	Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team,
	Erik Lundgren

On 3/31/25 5:21 PM, Boqun Feng wrote:
> On Mon, Mar 31, 2025 at 02:57:20PM -0400, Waiman Long wrote:
>> On 3/31/25 2:33 PM, Paul E. McKenney wrote:
>>> On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote:
>>>> On 3/31/25 1:26 PM, Boqun Feng wrote:
>>>>> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote:
>>>>> [...]
>>>>>>>> Anyway, that may work. The only problem that I see is the issue of nesting
>>>>>>>> of an interrupt context on top of a task context. It is possible that the
>>>>>>>> first use of a raw_spinlock may happen in an interrupt context. If the
>>>>>>>> interrupt happens when the task has set the hazard pointer and iterating the
>>>>>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively
>>>>>>>> we could have multiple slots for the hazard pointer, but that will make the
>>>>>>>> code more complicated. Or we could disable interrupt before setting the
>>>>>>>> hazard pointer.
>>>>>>> Or we can use lockdep_recursion:
>>>>>>>
>>>>>>> 	preempt_disable();
>>>>>>> 	lockdep_recursion_inc();
>>>>>>> 	barrier();
>>>>>>>
>>>>>>> 	WRITE_ONCE(*hazptr, ...);
>>>>>>>
>>>>>>> , it should prevent the re-entrant of lockdep in irq.
>>>>>> That will probably work. Or we can disable irq. I am fine with both.
>>>>> Disabling irq may not work in this case, because an NMI can also happen
>>>>> and call register_lock_class().
>>>> Right, disabling irq doesn't work with NMI. So incrementing the recursion
>>>> count is likely the way to go and I think it will work even in the NMI case.
>>>>
>>>>> I'm experimenting a new idea here, it might be better (for general
>>>>> cases), and this has the similar spirit that we could move the
>>>>> protection scope of a hazard pointer from a key to a hash_list: we can
>>>>> introduce a wildcard address, and whenever we do a synchronize_hazptr(),
>>>>> if the hazptr slot equal to wildcard, we treat as it matches to any ptr,
>>>>> hence synchronize_hazptr() will still wait until it's zero'd. Not only
>>>>> this could help in the nesting case, it can also be used if the users
>>>>> want to protect multiple things with this simple hazard pointer
>>>>> implementation.
>>>> I think it is a good idea to add a wildcard for the general use case.
>>>> Setting the hazptr to the list head will be enough for this particular case.
>>> Careful!  If we enable use of wildcards outside of the special case
>>> of synchronize_hazptr(), we give up the small-memory-footprint advantages
>>> of hazard pointers.  You end up having to wait on all hazard-pointer
>>> readers, which was exactly why RCU was troublesome here.  ;-)
> Technically, only the hazard-pointer readers that have switched to
> wildcard mode because of multiple hazptr critical sections ;-)
>
>> If the plan is to have one global set of hazard pointers for all the
> A global set of hazard pointers for all the possible use cases is the
> current plan (at least it should be when we have fully-featured hazptr
> [1]). Because the hazard pointer value already points to the data to
> protect, there is no need to group things into "domain"s.
>
>> possible use cases, supporting wildcard may be a problem. If we allow
> I had some off-list discussions with Paul, and I ended up with the idea
> of user-specific wildcards (i.e. different users can have different
> wildcards) + one global set of hazard pointers. However, it just occurred
> to me that it won't quite work in this simple hazard pointer
> implementation (one slot per CPU) :( Because you can have a user A's
> hazptr critical section interrupted by a user B's interrupt handler, and
> if both A & B are using customized wildcards but they don't know about
> each other, it's not going to work by setting either wildcard value into
> the slot.
>
> To make it clear for the discussion, we have two hazard pointer
> implementations:
>
> 1. The fully-featured one [1], which allows users to provide memory for
>     hazptr slots, so there is no issue with nesting/re-entry etc. And
>     wildcard doesn't make sense in this implementation.
>
> 2. The simple variant, which is what I've proposed in this thread, and
>     since it only has one slot per CPU, either all the users need to
>     prevent the re-entries or we need a global wildcard. Also the readers
>     of the simple variant need to disable preemption regardless, because
>     it only has one hazptr slot to use. That means its read-side critical
>     section should usually be short.
>
> I could try to use the fully-featured one in lockdep; what I need to do
> is create enough hazard_context structures so we have enough slots for
> lockdep, and we may or may not need lockdep_recursion to prevent
> re-entries. However, I still believe (though I don't have data to show
> otherwise) that the simple variant with one slot per CPU + a global
> wildcard will work fine in practice.
>
> So what I would like to do is introduce the simple variant as a general
> API with a global wildcard (because without it, it cannot be a general
> API, since one user would have to prevent entering another user's
> critical section), and lockdep can use it. We can then monitor the
> delay of synchronize_shazptr(), and if the wildcard becomes a problem,
> move to a fully-featured hazptr implementation. Sounds like a plan?
>
> [1]: https://lore.kernel.org/lkml/20240917143402.930114-2-boqun.feng@gmail.com/

Thanks for the detailed explanation. I am looking forward to your new
hazptr patch series.

Cheers,
Longman

>
> Regards,
> Boqun
>
>> different sets of hazard pointers for different use cases, it will be less
>> an issue. Anyway, maybe we should skip wildcard for the current case so that
>> we have more time to think through it first.
>>
>> Cheers,
>> Longman
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-03-21  9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao
  2025-03-21 10:37 ` Eric Dumazet
  2025-03-24 12:12 ` Peter Zijlstra
@ 2025-07-09 10:00 ` Breno Leitao
  2025-07-09 13:57   ` Waiman Long
  2 siblings, 1 reply; 35+ messages in thread
From: Breno Leitao @ 2025-07-09 10:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
  Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

Hello Waiman, Boqun,

On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> lockdep_unregister_key() is called from critical code paths, including
> sections where rtnl_lock() is held. For example, when replacing a qdisc
> in a network device, network egress traffic is disabled while
> __qdisc_destroy() is called for every network queue.
> 
> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> which gets blocked waiting for synchronize_rcu() to complete.
> 
> For example, a simple tc command to replace a qdisc could take 13
> seconds:
> 
>   # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>     real    0m13.195s
>     user    0m0.001s
>     sys     0m2.746s
> 
> During this time, network egress is completely frozen while waiting for
> RCU synchronization.
> 
> Use synchronize_rcu_expedited() instead to minimize the impact on
> critical operations like network connectivity changes.
> 
> This improves 10x the function call to tc, when replacing the qdisc for
> a network card.
> 
>    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>      real     0m1.789s
>      user     0m0.000s
>      sys      0m1.613s

Can I have this landed as a workaround for the problem above until
hazard pointers get merged?

This is affecting some systems that run Linus' upstream kernel with
some debug flags enabled, and I would like to have them unblocked.

Once the hazard pointer work lands, this will be reverted. Is this a
fair approach?

Thanks for your help,
--breno

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-07-09 10:00 ` Breno Leitao
@ 2025-07-09 13:57   ` Waiman Long
  2025-07-09 14:57     ` Boqun Feng
  0 siblings, 1 reply; 35+ messages in thread
From: Waiman Long @ 2025-07-09 13:57 UTC (permalink / raw)
  To: Breno Leitao, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Boqun Feng
  Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team,
	Erik Lundgren, Paul E. McKenney

On 7/9/25 6:00 AM, Breno Leitao wrote:
> Hello Waiman, Boqun,
>
> On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
>> lockdep_unregister_key() is called from critical code paths, including
>> sections where rtnl_lock() is held. For example, when replacing a qdisc
>> in a network device, network egress traffic is disabled while
>> __qdisc_destroy() is called for every network queue.
>>
>> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
>> which gets blocked waiting for synchronize_rcu() to complete.
>>
>> For example, a simple tc command to replace a qdisc could take 13
>> seconds:
>>
>>    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>>      real    0m13.195s
>>      user    0m0.001s
>>      sys     0m2.746s
>>
>> During this time, network egress is completely frozen while waiting for
>> RCU synchronization.
>>
>> Use synchronize_rcu_expedited() instead to minimize the impact on
>> critical operations like network connectivity changes.
>>
>> This improves 10x the function call to tc, when replacing the qdisc for
>> a network card.
>>
>>     # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>>       real     0m1.789s
>>       user     0m0.000s
>>       sys      0m1.613s
> Can I have this landed as a workaround for the problem above until
> hazard pointers get merged?
>
> This is affecting some systems that run Linus' upstream kernel with
> some debug flags enabled, and I would like to have them unblocked.
>
> Once the hazard pointer work lands, this will be reverted. Is this a
> fair approach?
>
> Thanks for your help,
> --breno

I am fine with this patch going in as a temporary workaround. Boqun, 
what do you think?

Cheers,
Longman


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
  2025-07-09 13:57   ` Waiman Long
@ 2025-07-09 14:57     ` Boqun Feng
  0 siblings, 0 replies; 35+ messages in thread
From: Boqun Feng @ 2025-07-09 14:57 UTC (permalink / raw)
  To: Waiman Long
  Cc: Breno Leitao, Peter Zijlstra, Ingo Molnar, Will Deacon, aeh,
	linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren,
	Paul E. McKenney

On Wed, Jul 09, 2025 at 09:57:44AM -0400, Waiman Long wrote:
> On 7/9/25 6:00 AM, Breno Leitao wrote:
> > Hello Waiman, Boqun,
> > 
> > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote:
> > > lockdep_unregister_key() is called from critical code paths, including
> > > sections where rtnl_lock() is held. For example, when replacing a qdisc
> > > in a network device, network egress traffic is disabled while
> > > __qdisc_destroy() is called for every network queue.
> > > 
> > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
> > > which gets blocked waiting for synchronize_rcu() to complete.
> > > 
> > > For example, a simple tc command to replace a qdisc could take 13
> > > seconds:
> > > 
> > >    # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > >      real    0m13.195s
> > >      user    0m0.001s
> > >      sys     0m2.746s
> > > 
> > > During this time, network egress is completely frozen while waiting for
> > > RCU synchronization.
> > > 
> > > Use synchronize_rcu_expedited() instead to minimize the impact on
> > > critical operations like network connectivity changes.
> > > 
> > > This improves 10x the function call to tc, when replacing the qdisc for
> > > a network card.
> > > 
> > >     # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> > >       real     0m1.789s
> > >       user     0m0.000s
> > >       sys      0m1.613s
> > Can I have this landed as a workaround for the problem above until
> > hazard pointers get merged?
> > 
> > This is affecting some systems that run Linus' upstream kernel with
> > some debug flags enabled, and I would like to have them unblocked.
> > 
> > Once the hazard pointer work lands, this will be reverted. Is this a
> > fair approach?
> > 
> > Thanks for your help,
> > --breno
> 
> I am fine with this patch going in as a temporary workaround. Boqun, what do
> you think?
> 

Seems good to me. We can add a "// TODO" comment to call out that it's a
temporary workaround and should be replaced. Although I believe Peter
had some concern about the IPIs, I would like to get his opinion on
using synchronize_rcu_expedited() as a temporary solution. Peter?
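
Something like this, for instance (exact wording up for grabs):

	/*
	 * Wait until is_dynamic_key() has finished accessing k->hash_entry.
	 *
	 * TODO: temporary workaround; replace with the hazard-pointer-based
	 * scheme (e.g. synchronize_shazptr()) once that series lands.
	 */
	synchronize_rcu_expedited();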

Regards,
Boqun

> Cheers,
> Longman
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2025-03-21  9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao
2025-03-21 10:37 ` Eric Dumazet
2025-03-21 14:22   ` Breno Leitao
2025-03-24 12:12 ` Peter Zijlstra
2025-03-24 12:23   ` Eric Dumazet
2025-03-24 12:24     ` Eric Dumazet
2025-03-24 19:21     ` Boqun Feng
2025-03-24 19:30       ` Boqun Feng
2025-03-25  0:47         ` Boqun Feng
2025-03-25  1:56           ` Waiman Long
2025-03-25  3:41             ` Boqun Feng
     [not found]               ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>
2025-03-25 14:57                 ` Waiman Long
2025-03-25 18:45                 ` Boqun Feng
2025-03-25 19:23                   ` Waiman Long
2025-03-25 19:42                     ` Boqun Feng
2025-03-25 23:20                       ` Waiman Long
2025-03-26  5:25                         ` Boqun Feng
     [not found]                           ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>
2025-03-26 16:40                             ` Waiman Long
2025-03-26 16:47                               ` Boqun Feng
2025-03-26 17:02                                 ` Waiman Long
2025-03-26 17:10                                   ` Paul E. McKenney
2025-03-26 18:42                                     ` Boqun Feng
2025-03-26 21:37                                       ` Paul E. McKenney
2025-03-31 16:48                                       ` Breno Leitao
2025-03-31 17:34                                         ` Boqun Feng
2025-03-31 17:26                             ` Boqun Feng
2025-03-31 17:33                               ` Waiman Long
2025-03-31 18:33                                 ` Paul E. McKenney
2025-03-31 18:57                                   ` Waiman Long
2025-03-31 21:21                                     ` Boqun Feng
2025-03-31 21:47                                       ` Waiman Long
2025-03-31 17:42                               ` Eric Dumazet
2025-07-09 10:00 ` Breno Leitao
2025-07-09 13:57   ` Waiman Long
2025-07-09 14:57     ` Boqun Feng
