* [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization
@ 2025-03-21 9:30 Breno Leitao
2025-03-21 10:37 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Breno Leitao @ 2025-03-21 9:30 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long
Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team,
Erik Lundgren, Breno Leitao, Paul E. McKenney
lockdep_unregister_key() is called from critical code paths, including
sections where rtnl_lock() is held. For example, when replacing a qdisc
in a network device, network egress traffic is disabled while
__qdisc_destroy() is called for every network queue.
If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(),
which gets blocked waiting for synchronize_rcu() to complete.
For example, a simple tc command to replace a qdisc could take 13
seconds:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m13.195s
user 0m0.001s
sys 0m2.746s
During this time, network egress is completely frozen while waiting for
RCU synchronization.
Use synchronize_rcu_expedited() instead to minimize the impact on
critical operations like network connectivity changes.
This speeds up the tc call by about 10x when replacing the qdisc of a
network card:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m1.789s
user 0m0.000s
sys 0m1.613s
Reported-by: Erik Lundgren <elundgren@meta.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org>
---
kernel/locking/lockdep.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 4470680f02269..a79030ac36dd4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key)
if (need_callback)
call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
- /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
- synchronize_rcu();
+ /* Wait until is_dynamic_key() has finished accessing k->hash_entry.
+ * This needs to be quick, since it is called in critical sections
+ */
+ synchronize_rcu_expedited();
}
EXPORT_SYMBOL_GPL(lockdep_unregister_key);
---
base-commit: 81e4f8d68c66da301bb881862735bd74c6241a19
change-id: 20250319-lockdep-b1eca0479665
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply related [flat|nested] 35+ messages in thread* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-21 9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao @ 2025-03-21 10:37 ` Eric Dumazet 2025-03-21 14:22 ` Breno Leitao 2025-03-24 12:12 ` Peter Zijlstra 2025-07-09 10:00 ` Breno Leitao 2 siblings, 1 reply; 35+ messages in thread From: Eric Dumazet @ 2025-03-21 10:37 UTC (permalink / raw) To: Breno Leitao Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote: > > lockdep_unregister_key() is called from critical code paths, including > sections where rtnl_lock() is held. For example, when replacing a qdisc > in a network device, network egress traffic is disabled while > __qdisc_destroy() is called for every network queue. > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > which gets blocked waiting for synchronize_rcu() to complete. > > For example, a simple tc command to replace a qdisc could take 13 > seconds: > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > real 0m13.195s > user 0m0.001s > sys 0m2.746s > > During this time, network egress is completely frozen while waiting for > RCU synchronization. > > Use synchronize_rcu_expedited() instead to minimize the impact on > critical operations like network connectivity changes. Running 'critical operations' with LOCKDEP enabled kernels seems a bit strange :) Reviewed-by: Eric Dumazet <edumazet@google.com> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-21 10:37 ` Eric Dumazet @ 2025-03-21 14:22 ` Breno Leitao 0 siblings, 0 replies; 35+ messages in thread From: Breno Leitao @ 2025-03-21 14:22 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Fri, Mar 21, 2025 at 11:37:31AM +0100, Eric Dumazet wrote: > On Fri, Mar 21, 2025 at 10:31 AM Breno Leitao <leitao@debian.org> wrote: > > > > lockdep_unregister_key() is called from critical code paths, including > > sections where rtnl_lock() is held. For example, when replacing a qdisc > > in a network device, network egress traffic is disabled while > > __qdisc_destroy() is called for every network queue. > > > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > > which gets blocked waiting for synchronize_rcu() to complete. > > > > For example, a simple tc command to replace a qdisc could take 13 > > seconds: > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > real 0m13.195s > > user 0m0.001s > > sys 0m2.746s > > > > During this time, network egress is completely frozen while waiting for > > RCU synchronization. > > > > Use synchronize_rcu_expedited() instead to minimize the impact on > > critical operations like network connectivity changes. > > Running 'critical operations' with LOCKDEP enabled kernels seems a bit > strange :) Apologies, I meant to say that at Meta, we have certain service tiers that can accommodate the performance trade-offs of running a "debug" kernel. This kernel provides valuable feedback about its operations. The aim is to identify any significant issues early on, rather than having to debug "hard issues" (such as deadlocks, out-of-memory access, etc) once the kernel is in production. > Reviewed-by: Eric Dumazet <edumazet@google.com> Thanks for your support during this investigation, --breno ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-21 9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao 2025-03-21 10:37 ` Eric Dumazet @ 2025-03-24 12:12 ` Peter Zijlstra 2025-03-24 12:23 ` Eric Dumazet 2025-07-09 10:00 ` Breno Leitao 2 siblings, 1 reply; 35+ messages in thread From: Peter Zijlstra @ 2025-03-24 12:12 UTC (permalink / raw) To: Breno Leitao Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh, linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: > lockdep_unregister_key() is called from critical code paths, including > sections where rtnl_lock() is held. For example, when replacing a qdisc > in a network device, network egress traffic is disabled while > __qdisc_destroy() is called for every network queue. > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > which gets blocked waiting for synchronize_rcu() to complete. > > For example, a simple tc command to replace a qdisc could take 13 > seconds: > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > real 0m13.195s > user 0m0.001s > sys 0m2.746s > > During this time, network egress is completely frozen while waiting for > RCU synchronization. > > Use synchronize_rcu_expedited() instead to minimize the impact on > critical operations like network connectivity changes. > > This improves 10x the function call to tc, when replacing the qdisc for > a network card. > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > real 0m1.789s > user 0m0.000s > sys 0m1.613s > > Reported-by: Erik Lundgren <elundgren@meta.com> > Signed-off-by: Breno Leitao <leitao@debian.org> > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> > --- > kernel/locking/lockdep.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 4470680f02269..a79030ac36dd4 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > if (need_callback) > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > - synchronize_rcu(); > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. > + * This needs to be quick, since it is called in critical sections > + */ > + synchronize_rcu_expedited(); > } > EXPORT_SYMBOL_GPL(lockdep_unregister_key); So I fundamentally despise synchronize_rcu_expedited(), also your comment style is broken. Why can't qdisc call this outside of the lock? ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-24 12:12 ` Peter Zijlstra @ 2025-03-24 12:23 ` Eric Dumazet 2025-03-24 12:24 ` Eric Dumazet 2025-03-24 19:21 ` Boqun Feng 0 siblings, 2 replies; 35+ messages in thread From: Eric Dumazet @ 2025-03-24 12:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Breno Leitao, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: > > lockdep_unregister_key() is called from critical code paths, including > > sections where rtnl_lock() is held. For example, when replacing a qdisc > > in a network device, network egress traffic is disabled while > > __qdisc_destroy() is called for every network queue. > > > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > > which gets blocked waiting for synchronize_rcu() to complete. > > > > For example, a simple tc command to replace a qdisc could take 13 > > seconds: > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > real 0m13.195s > > user 0m0.001s > > sys 0m2.746s > > > > During this time, network egress is completely frozen while waiting for > > RCU synchronization. > > > > Use synchronize_rcu_expedited() instead to minimize the impact on > > critical operations like network connectivity changes. > > > > This improves 10x the function call to tc, when replacing the qdisc for > > a network card. > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > real 0m1.789s > > user 0m0.000s > > sys 0m1.613s > > > > Reported-by: Erik Lundgren <elundgren@meta.com> > > Signed-off-by: Breno Leitao <leitao@debian.org> > > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> > > --- > > kernel/locking/lockdep.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > index 4470680f02269..a79030ac36dd4 100644 > > --- a/kernel/locking/lockdep.c > > +++ b/kernel/locking/lockdep.c > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > if (need_callback) > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > - synchronize_rcu(); > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. > > + * This needs to be quick, since it is called in critical sections > > + */ > > + synchronize_rcu_expedited(); > > } > > EXPORT_SYMBOL_GPL(lockdep_unregister_key); > > So I fundamentally despise synchronize_rcu_expedited(), also your > comment style is broken. > > Why can't qdisc call this outside of the lock? Good luck with that, and anyway the time to call it 256 times would still hurt Breno use case. My suggestion was to change lockdep_unregister_key() contract, and use kfree_rcu() there > I think we should redesign lockdep_unregister_key() to work on a separately > allocated piece of memory, > then use kfree_rcu() in it. > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to > > struct ... { > struct lock_class_key key; > struct rcu_head rcu; > } More work because it requires changing all lockdep_unregister_key() users. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-24 12:23 ` Eric Dumazet @ 2025-03-24 12:24 ` Eric Dumazet 2025-03-24 19:21 ` Boqun Feng 1 sibling, 0 replies; 35+ messages in thread From: Eric Dumazet @ 2025-03-24 12:24 UTC (permalink / raw) To: Peter Zijlstra Cc: Breno Leitao, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 1:23 PM Eric Dumazet <edumazet@google.com> wrote: > > On Mon, Mar 24, 2025 at 1:12 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: > > > lockdep_unregister_key() is called from critical code paths, including > > > sections where rtnl_lock() is held. For example, when replacing a qdisc > > > in a network device, network egress traffic is disabled while > > > __qdisc_destroy() is called for every network queue. > > > > > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > > > which gets blocked waiting for synchronize_rcu() to complete. > > > > > > For example, a simple tc command to replace a qdisc could take 13 > > > seconds: > > > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > > real 0m13.195s > > > user 0m0.001s > > > sys 0m2.746s > > > > > > During this time, network egress is completely frozen while waiting for > > > RCU synchronization. > > > > > > Use synchronize_rcu_expedited() instead to minimize the impact on > > > critical operations like network connectivity changes. > > > > > > This improves 10x the function call to tc, when replacing the qdisc for > > > a network card. > > > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > > real 0m1.789s > > > user 0m0.000s > > > sys 0m1.613s > > > > > > Reported-by: Erik Lundgren <elundgren@meta.com> > > > Signed-off-by: Breno Leitao <leitao@debian.org> > > > Reviewed-by: "Paul E. McKenney" <paulmck@kernel.org> > > > --- > > > kernel/locking/lockdep.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > index 4470680f02269..a79030ac36dd4 100644 > > > --- a/kernel/locking/lockdep.c > > > +++ b/kernel/locking/lockdep.c > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > if (need_callback) > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > - synchronize_rcu(); > > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. > > > + * This needs to be quick, since it is called in critical sections > > > + */ > > > + synchronize_rcu_expedited(); > > > } > > > EXPORT_SYMBOL_GPL(lockdep_unregister_key); > > > > So I fundamentally despise synchronize_rcu_expedited(), also your > > comment style is broken. > > > > Why can't qdisc call this outside of the lock? > > Good luck with that, and anyway the time to call it 256 times would > still hurt Breno use case. > > My suggestion was to change lockdep_unregister_key() contract, and use > kfree_rcu() there > > > I think we should redesign lockdep_unregister_key() to work on a separately > > allocated piece of memory, > > then use kfree_rcu() in it. > > > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to > > > > struct ... 
{ > > struct lock_class_key key; > > struct rcu_head rcu; > > } > > More work because it requires changing all lockdep_unregister_key() users. Or add a new function, basically allow users to be converted when they need to. ^ permalink raw reply [flat|nested] 35+ messages in thread
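For illustration, a minimal sketch of the layout Eric describes above, assuming a heap-allocated wrapper that the qdisc references by pointer; the type and helper names (dyn_lockdep_key, dyn_lockdep_key_alloc) are hypothetical and not existing kernel code:

#include <linux/lockdep.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/*
 * Illustrative wrapper: the key lives in its own allocation together with
 * an rcu_head, instead of being embedded in struct Qdisc, so the memory
 * can later be freed with kfree_rcu().
 */
struct dyn_lockdep_key {
        struct lock_class_key   key;
        struct rcu_head         rcu;
};

/* The owning object (e.g. a qdisc) keeps only a pointer to the wrapper. */
static struct dyn_lockdep_key *dyn_lockdep_key_alloc(void)
{
        struct dyn_lockdep_key *dkey = kzalloc(sizeof(*dkey), GFP_KERNEL);

        if (dkey)
                lockdep_register_key(&dkey->key);
        return dkey;
}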
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-24 12:23 ` Eric Dumazet 2025-03-24 12:24 ` Eric Dumazet @ 2025-03-24 19:21 ` Boqun Feng 2025-03-24 19:30 ` Boqun Feng 1 sibling, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-24 19:21 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: [...] > > > --- > > > kernel/locking/lockdep.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > index 4470680f02269..a79030ac36dd4 100644 > > > --- a/kernel/locking/lockdep.c > > > +++ b/kernel/locking/lockdep.c > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > if (need_callback) > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > - synchronize_rcu(); I feel a bit confusing even for the old comment, normally I would expect the caller of lockdep_unregister_key() should guarantee the key has been unpublished, in other words, there is no way a lockdep_unregister_key() could race with a register_lock_class()/lockdep_init_map_type(). The synchronize_rcu() is not needed then. Let's say someone breaks my assumption above, then when doing a register_lock_class() with a key about to be unregister, I cannot see anything stops the following: CPU 0 CPU 1 ===== ===== register_lock_class(): ... } else if (... && !is_dynamic_key(lock->key)) { // ->key is not unregistered yet, so this branch is not // taken. return NULL; } lockdep_unregister_key(..); // key unregister, can be free // any time. key = lock->key->subkeys + subclass; // BOOM! UAF. So either we don't need the synchronize_rcu() here or the synchronize_rcu() doesn't help at all. Am I missing something subtle here? Regards, Boqun > > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. > > > + * This needs to be quick, since it is called in critical sections > > > + */ > > > + synchronize_rcu_expedited(); > > > } > > > EXPORT_SYMBOL_GPL(lockdep_unregister_key); > > > > So I fundamentally despise synchronize_rcu_expedited(), also your > > comment style is broken. > > > > Why can't qdisc call this outside of the lock? > > Good luck with that, and anyway the time to call it 256 times would > still hurt Breno use case. > > My suggestion was to change lockdep_unregister_key() contract, and use > kfree_rcu() there > > > I think we should redesign lockdep_unregister_key() to work on a separately > > allocated piece of memory, > > then use kfree_rcu() in it. > > > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to > > > > struct ... { > > struct lock_class_key key; > > struct rcu_head rcu; > > } > > More work because it requires changing all lockdep_unregister_key() users. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-24 19:21 ` Boqun Feng @ 2025-03-24 19:30 ` Boqun Feng 2025-03-25 0:47 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-24 19:30 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: > [...] > > > > --- > > > > kernel/locking/lockdep.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > > index 4470680f02269..a79030ac36dd4 100644 > > > > --- a/kernel/locking/lockdep.c > > > > +++ b/kernel/locking/lockdep.c > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > > if (need_callback) > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > > - synchronize_rcu(); > > I feel a bit confusing even for the old comment, normally I would expect > the caller of lockdep_unregister_key() should guarantee the key has been > unpublished, in other words, there is no way a lockdep_unregister_key() > could race with a register_lock_class()/lockdep_init_map_type(). The > synchronize_rcu() is not needed then. > > Let's say someone breaks my assumption above, then when doing a > register_lock_class() with a key about to be unregister, I cannot see > anything stops the following: > > CPU 0 CPU 1 > ===== ===== > register_lock_class(): > ... > } else if (... && !is_dynamic_key(lock->key)) { > // ->key is not unregistered yet, so this branch is not > // taken. > return NULL; > } > lockdep_unregister_key(..); > // key unregister, can be free > // any time. > key = lock->key->subkeys + subclass; // BOOM! UAF. > > So either we don't need the synchronize_rcu() here or the > synchronize_rcu() doesn't help at all. Am I missing something subtle > here? > Oh! Maybe I was missing register_lock_class() must be called with irq disabled, which is also an RCU read-side critical section. Regards, Boqun > Regards, > Boqun > > > > > + /* Wait until is_dynamic_key() has finished accessing k->hash_entry. > > > > + * This needs to be quick, since it is called in critical sections > > > > + */ > > > > + synchronize_rcu_expedited(); > > > > } > > > > EXPORT_SYMBOL_GPL(lockdep_unregister_key); > > > > > > So I fundamentally despise synchronize_rcu_expedited(), also your > > > comment style is broken. > > > > > > Why can't qdisc call this outside of the lock? > > > > Good luck with that, and anyway the time to call it 256 times would > > still hurt Breno use case. > > > > My suggestion was to change lockdep_unregister_key() contract, and use > > kfree_rcu() there > > > > > I think we should redesign lockdep_unregister_key() to work on a separately > > > allocated piece of memory, > > > then use kfree_rcu() in it. > > > > > > Ie not embed a "struct lock_class_key" in the struct Qdisc, but a pointer to > > > > > > struct ... { > > > struct lock_class_key key; > > > struct rcu_head rcu; > > > } > > > > More work because it requires changing all lockdep_unregister_key() users. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-24 19:30 ` Boqun Feng @ 2025-03-25 0:47 ` Boqun Feng 2025-03-25 1:56 ` Waiman Long 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-25 0:47 UTC (permalink / raw) To: Eric Dumazet Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, Waiman Long, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote: > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: > > [...] > > > > > --- > > > > > kernel/locking/lockdep.c | 6 ++++-- > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > > > index 4470680f02269..a79030ac36dd4 100644 > > > > > --- a/kernel/locking/lockdep.c > > > > > +++ b/kernel/locking/lockdep.c > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > > > if (need_callback) > > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > > > - synchronize_rcu(); > > > > I feel a bit confusing even for the old comment, normally I would expect > > the caller of lockdep_unregister_key() should guarantee the key has been > > unpublished, in other words, there is no way a lockdep_unregister_key() > > could race with a register_lock_class()/lockdep_init_map_type(). The > > synchronize_rcu() is not needed then. > > > > Let's say someone breaks my assumption above, then when doing a > > register_lock_class() with a key about to be unregister, I cannot see > > anything stops the following: > > > > CPU 0 CPU 1 > > ===== ===== > > register_lock_class(): > > ... > > } else if (... && !is_dynamic_key(lock->key)) { > > // ->key is not unregistered yet, so this branch is not > > // taken. > > return NULL; > > } > > lockdep_unregister_key(..); > > // key unregister, can be free > > // any time. > > key = lock->key->subkeys + subclass; // BOOM! UAF. > > > > So either we don't need the synchronize_rcu() here or the > > synchronize_rcu() doesn't help at all. Am I missing something subtle > > here? > > > > Oh! Maybe I was missing register_lock_class() must be called with irq > disabled, which is also an RCU read-side critical section. > Since register_lock_class() will be call with irq disabled, maybe hazard pointers [1] is better because most of the case we only have nr_cpus readers, so the potential hazard pointer slots are fixed. So the below patch can reduce the time of the tc command from real ~1.7 second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env, which is not surprising given it's a dedicated hazard pointers for lock_class_key. Thoughts? 
[1]: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/ Regards, Boqun ---------------------------------->8 From: Boqun Feng <boqun.feng@gmail.com> Date: Mon, 24 Mar 2025 13:38:15 -0700 Subject: [PATCH] WIP Signed-off-by: Boqun Feng <boqun.feng@gmail.com> --- kernel/locking/lockdep.c | 65 +++++++++++++++++++++++++++++++++++++--- 1 file changed, 61 insertions(+), 4 deletions(-) diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 4470680f0226..0b6e78d56cfe 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -111,6 +111,29 @@ late_initcall(kernel_lockdep_sysctls_init); DEFINE_PER_CPU(unsigned int, lockdep_recursion); EXPORT_PER_CPU_SYMBOL_GPL(lockdep_recursion); +/* hazptr free always clears the slot */ +DEFINE_FREE(lockdep_key_hazptr, struct lock_class_key **, if (_T) smp_store_release((_T), NULL)); +DEFINE_PER_CPU(struct lock_class_key *, lockdep_key_hazptr); + +static void synchronize_lockdep_key_hazptr(struct lock_class_key *key) +{ + int cpu; + + if (!key) + return; + /* + * Synchronizes with the smp_mb() after protecting the pointer with + * WRITE_ONCE(). + */ + smp_mb(); + + for_each_possible_cpu(cpu) { + struct lock_class_key **hazptr = per_cpu_ptr(&lockdep_key_hazptr, cpu); + + smp_cond_load_acquire(hazptr, VAL != key); + } +} + static __always_inline bool lockdep_enabled(void) { if (!debug_locks) @@ -1283,6 +1306,7 @@ static struct lock_class * register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) { struct lockdep_subclass_key *key; + struct lock_class_key **key_hazptr __free(lockdep_key_hazptr) = NULL; struct hlist_head *hash_head; struct lock_class *class; int idx; @@ -1293,11 +1317,25 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force) if (likely(class)) goto out_set_class_cache; + /* Interrupts are disabled hence it's safe to acquire the hazptr slot */ + key_hazptr = this_cpu_ptr(&lockdep_key_hazptr); + + /* hazptr slot must be unusued */ + BUG_ON(*key_hazptr); + if (!lock->key) { if (!assign_lock_key(lock)) return NULL; - } else if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) { - return NULL; + } else { + /* hazptr: protect the key */ + WRITE_ONCE(*key_hazptr, lock->key); + + /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */ + smp_mb(); + + if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) { + return NULL; + } } key = lock->key->subkeys + subclass; @@ -4964,16 +5002,35 @@ void lockdep_init_map_type(struct lockdep_map *lock, const char *name, */ if (DEBUG_LOCKS_WARN_ON(!key)) return; + + preempt_disable(); + struct lock_class_key **hazptr __free(lockdep_key_hazptr) = this_cpu_ptr(&lockdep_key_hazptr); + + /* hapztr: Protect the key */ + WRITE_ONCE(*hazptr, key); + + /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */ + smp_mb(); + /* * Sanity check, the lock-class key must either have been allocated * statically or must have been registered as a dynamic key. 
*/ if (!static_obj(key) && !is_dynamic_key(key)) { + preempt_enable(); if (debug_locks) printk(KERN_ERR "BUG: key %px has not been registered!\n", key); DEBUG_LOCKS_WARN_ON(1); return; } + + /* hazptr: Release the key */ + smp_store_release(hazptr, NULL); + + /* Do not auto clean up*/ + hazptr = NULL; + preempt_enable(); + lock->key = key; if (unlikely(!debug_locks)) @@ -6595,8 +6652,8 @@ void lockdep_unregister_key(struct lock_class_key *key) if (need_callback) call_rcu(&delayed_free.rcu_head, free_zapped_rcu); - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ - synchronize_rcu(); + /* Wait until register_lock_class() has finished accessing key. */ + synchronize_lockdep_key_hazptr(key); } EXPORT_SYMBOL_GPL(lockdep_unregister_key); -- 2.48.1 ^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 0:47 ` Boqun Feng @ 2025-03-25 1:56 ` Waiman Long 2025-03-25 3:41 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-25 1:56 UTC (permalink / raw) To: Boqun Feng, Eric Dumazet Cc: Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/24/25 8:47 PM, Boqun Feng wrote: > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote: >> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: >>> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: >>> [...] >>>>>> --- >>>>>> kernel/locking/lockdep.c | 6 ++++-- >>>>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >>>>>> index 4470680f02269..a79030ac36dd4 100644 >>>>>> --- a/kernel/locking/lockdep.c >>>>>> +++ b/kernel/locking/lockdep.c >>>>>> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) >>>>>> if (need_callback) >>>>>> call_rcu(&delayed_free.rcu_head, free_zapped_rcu); >>>>>> >>>>>> - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ >>>>>> - synchronize_rcu(); >>> I feel a bit confusing even for the old comment, normally I would expect >>> the caller of lockdep_unregister_key() should guarantee the key has been >>> unpublished, in other words, there is no way a lockdep_unregister_key() >>> could race with a register_lock_class()/lockdep_init_map_type(). The >>> synchronize_rcu() is not needed then. >>> >>> Let's say someone breaks my assumption above, then when doing a >>> register_lock_class() with a key about to be unregister, I cannot see >>> anything stops the following: >>> >>> CPU 0 CPU 1 >>> ===== ===== >>> register_lock_class(): >>> ... >>> } else if (... && !is_dynamic_key(lock->key)) { >>> // ->key is not unregistered yet, so this branch is not >>> // taken. >>> return NULL; >>> } >>> lockdep_unregister_key(..); >>> // key unregister, can be free >>> // any time. >>> key = lock->key->subkeys + subclass; // BOOM! UAF. >>> >>> So either we don't need the synchronize_rcu() here or the >>> synchronize_rcu() doesn't help at all. Am I missing something subtle >>> here? >>> >> Oh! Maybe I was missing register_lock_class() must be called with irq >> disabled, which is also an RCU read-side critical section. >> > Since register_lock_class() will be call with irq disabled, maybe hazard > pointers [1] is better because most of the case we only have nr_cpus > readers, so the potential hazard pointer slots are fixed. > > So the below patch can reduce the time of the tc command from real ~1.7 > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env, > which is not surprising given it's a dedicated hazard pointers for > lock_class_key. > > Thoughts? My understanding is that it is not a race between register_lock_class() and lockdep_unregister_key(). It is the fact that the structure that holds the lock_class_key may be freed immediately after return from lockdep_unregister_key(). So any processes that are in the process of iterating the hash_list containing the hash_entry to be unregistered may hit a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential access of invalid memory in lock_class") for a discussion of this kind of UAF problem. 
As suggested by Eric, one possible solution is to add a lockdep_unregister_key() variant function that presumes the structure holding the key won't be freed until after an RCU delay. In this case, we can skip the last synchronize_rcu() call. Any callers that need an immediate return should use kfree_rcu() to free the structure after calling the lockdep_unregister_key() variant.

Cheers,
Longman
^ permalink raw reply [flat|nested] 35+ messages in thread
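Continuing the wrapper sketch from earlier in the thread, the teardown side under this proposed contract might look roughly as follows; lockdep_unregister_key_norcu() is an assumed name for the new variant, not an existing API:

#include <linux/lockdep.h>
#include <linux/rcupdate.h>

/*
 * Hypothetical variant: same as lockdep_unregister_key() except that it
 * skips the final synchronize_rcu().  The caller guarantees the memory
 * holding @key stays valid for a full grace period after this returns.
 */
void lockdep_unregister_key_norcu(struct lock_class_key *key);

static void dyn_lockdep_key_free(struct dyn_lockdep_key *dkey)
{
        lockdep_unregister_key_norcu(&dkey->key);
        /*
         * is_dynamic_key() walkers may still be referencing the key's
         * hash_entry, so defer the actual kfree() by one grace period.
         */
        kfree_rcu(dkey, rcu);
}

This keeps the grace-period wait out of the rtnl_lock()-protected teardown path entirely, at the cost of one deferred free per key.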
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 1:56 ` Waiman Long @ 2025-03-25 3:41 ` Boqun Feng [not found] ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com> 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-25 3:41 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote: > On 3/24/25 8:47 PM, Boqun Feng wrote: > > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote: > > > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: > > > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: > > > > [...] > > > > > > > --- > > > > > > > kernel/locking/lockdep.c | 6 ++++-- > > > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > > > > > index 4470680f02269..a79030ac36dd4 100644 > > > > > > > --- a/kernel/locking/lockdep.c > > > > > > > +++ b/kernel/locking/lockdep.c > > > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > > > > > if (need_callback) > > > > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > > > > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > > > > > - synchronize_rcu(); > > > > I feel a bit confusing even for the old comment, normally I would expect > > > > the caller of lockdep_unregister_key() should guarantee the key has been > > > > unpublished, in other words, there is no way a lockdep_unregister_key() > > > > could race with a register_lock_class()/lockdep_init_map_type(). The > > > > synchronize_rcu() is not needed then. > > > > > > > > Let's say someone breaks my assumption above, then when doing a > > > > register_lock_class() with a key about to be unregister, I cannot see > > > > anything stops the following: > > > > > > > > CPU 0 CPU 1 > > > > ===== ===== > > > > register_lock_class(): > > > > ... > > > > } else if (... && !is_dynamic_key(lock->key)) { > > > > // ->key is not unregistered yet, so this branch is not > > > > // taken. > > > > return NULL; > > > > } > > > > lockdep_unregister_key(..); > > > > // key unregister, can be free > > > > // any time. > > > > key = lock->key->subkeys + subclass; // BOOM! UAF. This is not a UAF :( > > > > > > > > So either we don't need the synchronize_rcu() here or the > > > > synchronize_rcu() doesn't help at all. Am I missing something subtle > > > > here? > > > > > > > Oh! Maybe I was missing register_lock_class() must be called with irq > > > disabled, which is also an RCU read-side critical section. > > > > > Since register_lock_class() will be call with irq disabled, maybe hazard > > pointers [1] is better because most of the case we only have nr_cpus > > readers, so the potential hazard pointer slots are fixed. > > > > So the below patch can reduce the time of the tc command from real ~1.7 > > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env, > > which is not surprising given it's a dedicated hazard pointers for > > lock_class_key. > > > > Thoughts? > > My understanding is that it is not a race between register_lock_class() and > lockdep_unregister_key(). It is the fact that the structure that holds the > lock_class_key may be freed immediately after return from > lockdep_unregister_key(). 
So any processes that are in the process of > iterating the hash_list containing the hash_entry to be unregistered may hit You mean the lock_keys_hash table, right? I used register_lock_class() as an example, because it's one of the places that iterates lock_keys_hash. IIUC lock_keys_hash is only used in lockdep_{un,}register_key() and is_dynamic_key() (which are only called by lockdep_init_map_type() and register_lock_class()). > a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential > access of invalid memory in lock_class") for a discussion of this kind of > UAF problem. > That commit seemed fixing a race between disabling lockdep and unregistering key, and most importantly, call zap_class() for the unregistered key even if lockdep is disabled (debug_locks = 0). It might be related, but I'm not sure that's the reason of putting synchronize_rcu() there. Say you want to synchronize between /proc/lockdep and lockdep_unregister_key(), and you have synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side critical section at /proc/lockdep? Regards, Boqun > As suggested by Eric, one possible solution is to add a > lockdep_unregister_key() variant function that presumes the structure > holding the key won't be freed until after a RCU delay. In this case, we can > skip the last synchronize_rcu() call. Any callers that need immediate return > should use kfree_rcu() to free the structure after calling the > lockdep_unregister_key() variant. > > Cheers, > Longman > ^ permalink raw reply [flat|nested] 35+ messages in thread
[parent not found: <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>]
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization [not found] ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com> @ 2025-03-25 14:57 ` Waiman Long 2025-03-25 18:45 ` Boqun Feng 1 sibling, 0 replies; 35+ messages in thread From: Waiman Long @ 2025-03-25 14:57 UTC (permalink / raw) To: Waiman Long, Boqun Feng Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/25/25 10:52 AM, Waiman Long wrote: > I agree that the commit that I mentioned is not relevant to the > current case. You are right that is_dynamic_key() is the only function > that is problematic, the other two are protected by the lockdep_lock. > So they are safe. Anyway, I believe that the actual race happens in > the iteration of the hashed list in is_dynamic_key(). The key that you > save in the lockdep_key_hazptr in your proposed patch should never be > the key (dead_key) that is passed to lockdep_unregister_key(). In > is_dynamic_key(): > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > if (k == key) { > found = true; > break; > } > } > > key != k (dead_key), but before accessing its content to get to > hash_entry, an interrupt/NMI can happen. In the mean time, the > structure holding the key is freed and its content can be overwritten > with some garbage. When interrupt/NMI returns, hash_entry can point to > anything leading to crash or an infinite loop. Perhaps we can use > some kind of synchronization mechanism between is_dynamic_key() and > lockdep_unregister_key() to prevent this kind of racing. For example, > we can have an atomic counter associated with each head of the hashed > table. is_dynamic_key() can increment the counter if it is not zero to > proceed and lockdep_unregister_key() have to make sure it can safely > decrement the counter to 0 before going ahead. Just a thought! > Well, that is essentially an arch_rwlock_t for each list head. Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
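For reference, a rough sketch of the per-bucket counter idea quoted above, which, as noted, behaves much like a per-bucket reader/writer lock; the names are assumptions, KEYHASH_SIZE is assumed to be the existing bucket count of lock_keys_hash, and the helpers assume lockdep.c's existing includes:

/*
 * One gate per key-hash bucket, idle value 1.  Readers pin a bucket with
 * atomic_inc_not_zero(); lockdep_unregister_key() drains it by bringing
 * the count from 1 down to 0 before the key may go away.
 */
static atomic_t lockdep_keyhash_gate[KEYHASH_SIZE] = {
        [0 ... KEYHASH_SIZE - 1] = ATOMIC_INIT(1),
};

static bool keyhash_read_trylock(unsigned long bucket)
{
        /* Fails only while an unregister has the bucket drained. */
        return atomic_inc_not_zero(&lockdep_keyhash_gate[bucket]);
}

static void keyhash_read_unlock(unsigned long bucket)
{
        atomic_dec(&lockdep_keyhash_gate[bucket]);
}

static void keyhash_drain(unsigned long bucket)
{
        /* Succeeds only when the count is exactly 1, i.e. no readers left. */
        while (atomic_cmpxchg(&lockdep_keyhash_gate[bucket], 1, 0) != 1)
                cpu_relax();
}

static void keyhash_reopen(unsigned long bucket)
{
        atomic_set(&lockdep_keyhash_gate[bucket], 1);
}

lockdep_unregister_key() would drain the bucket, unlink the key, then reopen the bucket; any reader entering afterwards can no longer find the key, so its memory may be freed without a grace-period wait.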
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization [not found] ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com> 2025-03-25 14:57 ` Waiman Long @ 2025-03-25 18:45 ` Boqun Feng 2025-03-25 19:23 ` Waiman Long 1 sibling, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-25 18:45 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Tue, Mar 25, 2025 at 10:52:16AM -0400, Waiman Long wrote: > On 3/24/25 11:41 PM, Boqun Feng wrote: > > On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote: > > > On 3/24/25 8:47 PM, Boqun Feng wrote: > > > > On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote: > > > > > On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: > > > > > > On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: > > > > > > [...] > > > > > > > > > --- > > > > > > > > > kernel/locking/lockdep.c | 6 ++++-- > > > > > > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > > > > > > > index 4470680f02269..a79030ac36dd4 100644 > > > > > > > > > --- a/kernel/locking/lockdep.c > > > > > > > > > +++ b/kernel/locking/lockdep.c > > > > > > > > > @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) > > > > > > > > > if (need_callback) > > > > > > > > > call_rcu(&delayed_free.rcu_head, free_zapped_rcu); > > > > > > > > > > > > > > > > > > - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ > > > > > > > > > - synchronize_rcu(); > > > > > > I feel a bit confusing even for the old comment, normally I would expect > > > > > > the caller of lockdep_unregister_key() should guarantee the key has been > > > > > > unpublished, in other words, there is no way a lockdep_unregister_key() > > > > > > could race with a register_lock_class()/lockdep_init_map_type(). The > > > > > > synchronize_rcu() is not needed then. > > > > > > > > > > > > Let's say someone breaks my assumption above, then when doing a > > > > > > register_lock_class() with a key about to be unregister, I cannot see > > > > > > anything stops the following: > > > > > > > > > > > > CPU 0 CPU 1 > > > > > > ===== ===== > > > > > > register_lock_class(): > > > > > > ... > > > > > > } else if (... && !is_dynamic_key(lock->key)) { > > > > > > // ->key is not unregistered yet, so this branch is not > > > > > > // taken. > > > > > > return NULL; > > > > > > } > > > > > > lockdep_unregister_key(..); > > > > > > // key unregister, can be free > > > > > > // any time. > > > > > > key = lock->key->subkeys + subclass; // BOOM! UAF. > > This is not a UAF :( > > > > > > > > So either we don't need the synchronize_rcu() here or the > > > > > > synchronize_rcu() doesn't help at all. Am I missing something subtle > > > > > > here? > > > > > > > > > > > Oh! Maybe I was missing register_lock_class() must be called with irq > > > > > disabled, which is also an RCU read-side critical section. > > > > > > > > > Since register_lock_class() will be call with irq disabled, maybe hazard > > > > pointers [1] is better because most of the case we only have nr_cpus > > > > readers, so the potential hazard pointer slots are fixed. 
> > > > > > > > So the below patch can reduce the time of the tc command from real ~1.7 > > > > second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env, > > > > which is not surprising given it's a dedicated hazard pointers for > > > > lock_class_key. > > > > > > > > Thoughts? > > > My understanding is that it is not a race between register_lock_class() and > > > lockdep_unregister_key(). It is the fact that the structure that holds the > > > lock_class_key may be freed immediately after return from > > > lockdep_unregister_key(). So any processes that are in the process of > > > iterating the hash_list containing the hash_entry to be unregistered may hit > > You mean the lock_keys_hash table, right? I used register_lock_class() > > as an example, because it's one of the places that iterates > > lock_keys_hash. IIUC lock_keys_hash is only used in > > lockdep_{un,}register_key() and is_dynamic_key() (which are only called > > by lockdep_init_map_type() and register_lock_class()). > > > > > a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential > > > access of invalid memory in lock_class") for a discussion of this kind of > > > UAF problem. > > > > > That commit seemed fixing a race between disabling lockdep and > > unregistering key, and most importantly, call zap_class() for the > > unregistered key even if lockdep is disabled (debug_locks = 0). It might > > be related, but I'm not sure that's the reason of putting > > synchronize_rcu() there. Say you want to synchronize between > > /proc/lockdep and lockdep_unregister_key(), and you have > > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side > > critical section at /proc/lockdep? > I agree that the commit that I mentioned is not relevant to the current > case. You are right that is_dynamic_key() is the only function that is > problematic, the other two are protected by the lockdep_lock. So they are > safe. Anyway, I believe that the actual race happens in the iteration of the > hashed list in is_dynamic_key(). The key that you save in the > lockdep_key_hazptr in your proposed patch should never be the key (dead_key) The key stored in lockdep_key_hazptr is the one that the rest of the function will use after is_dynamic_key() return true. That is, CPU 0 CPU 1 ===== ===== WRITE_ONCE(*lockdep_key_hazptr, key); smp_mb(); is_dynamic_key(): hlist_for_each_entry_rcu(k, hash_head, hash_entry) { if (k == key) { found = true; break; } } lockdep_unregister_key(): hlist_for_each_entry_rcu(k, hash_head, hash_entry) { if (k == key) { hlist_del_rcu(&k->hash_entry); found = true; break; } } smp_mb(); synchronize_lockdep_key_hazptr(): for_each_possible_cpu(cpu) { <wait for the hazptr slot on that CPU to be not equal to the removed key> } , so that if is_dynamic_key() finds a key was in the hash, even though later on the key would be removed by lockdep_unregister_key(), the hazard pointers guarantee lockdep_unregister_key() would wait for the hazard pointer to release. > that is passed to lockdep_unregister_key(). In is_dynamic_key(): > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > if (k == key) { > found = true; > break; > } > } > > key != k (dead_key), but before accessing its content to get to hash_entry, It is the dead_key. > an interrupt/NMI can happen. In the mean time, the structure holding the key > is freed and its content can be overwritten with some garbage. When > interrupt/NMI returns, hash_entry can point to anything leading to crash or > an infinite loop. 
Perhaps we can use some kind of synchronization mechanism No, hash_entry cannot be freed or overwritten because the user has protect the key with hazard pointers, only when the user reset the hazard pointer to NULL, lockdep_unregister_key() can then return and the key can be freed. > between is_dynamic_key() and lockdep_unregister_key() to prevent this kind > of racing. For example, we can have an atomic counter associated with each The hazard pointer I proposed provides the exact synchronization ;-) Regards, Boqun > head of the hashed table. is_dynamic_key() can increment the counter if it > is not zero to proceed and lockdep_unregister_key() have to make sure it can > safely decrement the counter to 0 before going ahead. Just a thought! > > Cheers, > Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 18:45 ` Boqun Feng @ 2025-03-25 19:23 ` Waiman Long 2025-03-25 19:42 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-25 19:23 UTC (permalink / raw) To: Boqun Feng, Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/25/25 2:45 PM, Boqun Feng wrote: > On Tue, Mar 25, 2025 at 10:52:16AM -0400, Waiman Long wrote: >> On 3/24/25 11:41 PM, Boqun Feng wrote: >>> On Mon, Mar 24, 2025 at 09:56:25PM -0400, Waiman Long wrote: >>>> On 3/24/25 8:47 PM, Boqun Feng wrote: >>>>> On Mon, Mar 24, 2025 at 12:30:10PM -0700, Boqun Feng wrote: >>>>>> On Mon, Mar 24, 2025 at 12:21:07PM -0700, Boqun Feng wrote: >>>>>>> On Mon, Mar 24, 2025 at 01:23:50PM +0100, Eric Dumazet wrote: >>>>>>> [...] >>>>>>>>>> --- >>>>>>>>>> kernel/locking/lockdep.c | 6 ++++-- >>>>>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >>>>>>>>>> index 4470680f02269..a79030ac36dd4 100644 >>>>>>>>>> --- a/kernel/locking/lockdep.c >>>>>>>>>> +++ b/kernel/locking/lockdep.c >>>>>>>>>> @@ -6595,8 +6595,10 @@ void lockdep_unregister_key(struct lock_class_key *key) >>>>>>>>>> if (need_callback) >>>>>>>>>> call_rcu(&delayed_free.rcu_head, free_zapped_rcu); >>>>>>>>>> >>>>>>>>>> - /* Wait until is_dynamic_key() has finished accessing k->hash_entry. */ >>>>>>>>>> - synchronize_rcu(); >>>>>>> I feel a bit confusing even for the old comment, normally I would expect >>>>>>> the caller of lockdep_unregister_key() should guarantee the key has been >>>>>>> unpublished, in other words, there is no way a lockdep_unregister_key() >>>>>>> could race with a register_lock_class()/lockdep_init_map_type(). The >>>>>>> synchronize_rcu() is not needed then. >>>>>>> >>>>>>> Let's say someone breaks my assumption above, then when doing a >>>>>>> register_lock_class() with a key about to be unregister, I cannot see >>>>>>> anything stops the following: >>>>>>> >>>>>>> CPU 0 CPU 1 >>>>>>> ===== ===== >>>>>>> register_lock_class(): >>>>>>> ... >>>>>>> } else if (... && !is_dynamic_key(lock->key)) { >>>>>>> // ->key is not unregistered yet, so this branch is not >>>>>>> // taken. >>>>>>> return NULL; >>>>>>> } >>>>>>> lockdep_unregister_key(..); >>>>>>> // key unregister, can be free >>>>>>> // any time. >>>>>>> key = lock->key->subkeys + subclass; // BOOM! UAF. >>> This is not a UAF :( >>> >>>>>>> So either we don't need the synchronize_rcu() here or the >>>>>>> synchronize_rcu() doesn't help at all. Am I missing something subtle >>>>>>> here? >>>>>>> >>>>>> Oh! Maybe I was missing register_lock_class() must be called with irq >>>>>> disabled, which is also an RCU read-side critical section. >>>>>> >>>>> Since register_lock_class() will be call with irq disabled, maybe hazard >>>>> pointers [1] is better because most of the case we only have nr_cpus >>>>> readers, so the potential hazard pointer slots are fixed. >>>>> >>>>> So the below patch can reduce the time of the tc command from real ~1.7 >>>>> second (v6.14) to real ~0.05 second (v6.14 + patch) in my test env, >>>>> which is not surprising given it's a dedicated hazard pointers for >>>>> lock_class_key. >>>>> >>>>> Thoughts? >>>> My understanding is that it is not a race between register_lock_class() and >>>> lockdep_unregister_key(). 
It is the fact that the structure that holds the >>>> lock_class_key may be freed immediately after return from >>>> lockdep_unregister_key(). So any processes that are in the process of >>>> iterating the hash_list containing the hash_entry to be unregistered may hit >>> You mean the lock_keys_hash table, right? I used register_lock_class() >>> as an example, because it's one of the places that iterates >>> lock_keys_hash. IIUC lock_keys_hash is only used in >>> lockdep_{un,}register_key() and is_dynamic_key() (which are only called >>> by lockdep_init_map_type() and register_lock_class()). >>> >>>> a UAF problem. See commit 61cc4534b6550 ("locking/lockdep: Avoid potential >>>> access of invalid memory in lock_class") for a discussion of this kind of >>>> UAF problem. >>>> >>> That commit seemed fixing a race between disabling lockdep and >>> unregistering key, and most importantly, call zap_class() for the >>> unregistered key even if lockdep is disabled (debug_locks = 0). It might >>> be related, but I'm not sure that's the reason of putting >>> synchronize_rcu() there. Say you want to synchronize between >>> /proc/lockdep and lockdep_unregister_key(), and you have >>> synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side >>> critical section at /proc/lockdep? >> I agree that the commit that I mentioned is not relevant to the current >> case. You are right that is_dynamic_key() is the only function that is >> problematic, the other two are protected by the lockdep_lock. So they are >> safe. Anyway, I believe that the actual race happens in the iteration of the >> hashed list in is_dynamic_key(). The key that you save in the >> lockdep_key_hazptr in your proposed patch should never be the key (dead_key) > The key stored in lockdep_key_hazptr is the one that the rest of the > function will use after is_dynamic_key() return true. That is, > > CPU 0 CPU 1 > ===== ===== > WRITE_ONCE(*lockdep_key_hazptr, key); > smp_mb(); > > is_dynamic_key(): > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > if (k == key) { > found = true; > break; > } > } > lockdep_unregister_key(): > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > if (k == key) { > hlist_del_rcu(&k->hash_entry); > found = true; > break; > } > } > > smp_mb(); > > synchronize_lockdep_key_hazptr(): > for_each_possible_cpu(cpu) { > <wait for the hazptr slot on > that CPU to be not equal to > the removed key> > } > > > , so that if is_dynamic_key() finds a key was in the hash, even though > later on the key would be removed by lockdep_unregister_key(), the > hazard pointers guarantee lockdep_unregister_key() would wait for the > hazard pointer to release. > >> that is passed to lockdep_unregister_key(). In is_dynamic_key(): >> >> hlist_for_each_entry_rcu(k, hash_head, hash_entry) { >> if (k == key) { >> found = true; >> break; >> } >> } >> >> key != k (dead_key), but before accessing its content to get to hash_entry, > It is the dead_key. > >> an interrupt/NMI can happen. In the mean time, the structure holding the key >> is freed and its content can be overwritten with some garbage. When >> interrupt/NMI returns, hash_entry can point to anything leading to crash or >> an infinite loop. Perhaps we can use some kind of synchronization mechanism > No, hash_entry cannot be freed or overwritten because the user has > protect the key with hazard pointers, only when the user reset the > hazard pointer to NULL, lockdep_unregister_key() can then return and the > key can be freed. 
> >> between is_dynamic_key() and lockdep_unregister_key() to prevent this kind >> of racing. For example, we can have an atomic counter associated with each > The hazard pointer I proposed provides the exact synchronization ;-) What I am saying is that register_lock_class() is trying to find a newkey while lockdep_unregister_key() is trying to take out an oldkey (newkey != oldkey). If they happens in the same hash list, register_lock_class() will put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr() call from lockdep_unregister_key() is checking for oldkey which is not the one saved in the hazard pointer. So lockdep_unregister_key() will return and the key will be freed and reused while is_dynamic_key() may just have a reference to the oldkey and trying to access its content which is invalid. I think this is a possible scenario. Cheers, Longman > > Regards, > Boqun > >> head of the hashed table. is_dynamic_key() can increment the counter if it >> is not zero to proceed and lockdep_unregister_key() have to make sure it can >> safely decrement the counter to 0 before going ahead. Just a thought! >> >> Cheers, >> Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 19:23 ` Waiman Long @ 2025-03-25 19:42 ` Boqun Feng 2025-03-25 23:20 ` Waiman Long 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-25 19:42 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote: [...] > > > > That commit seemed fixing a race between disabling lockdep and > > > > unregistering key, and most importantly, call zap_class() for the > > > > unregistered key even if lockdep is disabled (debug_locks = 0). It might > > > > be related, but I'm not sure that's the reason of putting > > > > synchronize_rcu() there. Say you want to synchronize between > > > > /proc/lockdep and lockdep_unregister_key(), and you have > > > > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side > > > > critical section at /proc/lockdep? > > > I agree that the commit that I mentioned is not relevant to the current > > > case. You are right that is_dynamic_key() is the only function that is > > > problematic, the other two are protected by the lockdep_lock. So they are > > > safe. Anyway, I believe that the actual race happens in the iteration of the > > > hashed list in is_dynamic_key(). The key that you save in the > > > lockdep_key_hazptr in your proposed patch should never be the key (dead_key) > > The key stored in lockdep_key_hazptr is the one that the rest of the > > function will use after is_dynamic_key() return true. That is, > > > > CPU 0 CPU 1 > > ===== ===== > > WRITE_ONCE(*lockdep_key_hazptr, key); > > smp_mb(); > > > > is_dynamic_key(): > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > if (k == key) { > > found = true; > > break; > > } > > } > > lockdep_unregister_key(): > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > if (k == key) { > > hlist_del_rcu(&k->hash_entry); > > found = true; > > break; > > } > > } > > > > smp_mb(); > > > > synchronize_lockdep_key_hazptr(): > > for_each_possible_cpu(cpu) { > > <wait for the hazptr slot on > > that CPU to be not equal to > > the removed key> > > } > > > > > > , so that if is_dynamic_key() finds a key was in the hash, even though > > later on the key would be removed by lockdep_unregister_key(), the > > hazard pointers guarantee lockdep_unregister_key() would wait for the > > hazard pointer to release. > > > > > that is passed to lockdep_unregister_key(). In is_dynamic_key(): > > > > > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > > if (k == key) { > > > found = true; > > > break; > > > } > > > } > > > > > > key != k (dead_key), but before accessing its content to get to hash_entry, > > It is the dead_key. > > > > > an interrupt/NMI can happen. In the mean time, the structure holding the key > > > is freed and its content can be overwritten with some garbage. When > > > interrupt/NMI returns, hash_entry can point to anything leading to crash or > > > an infinite loop. Perhaps we can use some kind of synchronization mechanism > > No, hash_entry cannot be freed or overwritten because the user has > > protect the key with hazard pointers, only when the user reset the > > hazard pointer to NULL, lockdep_unregister_key() can then return and the > > key can be freed. > > > > > between is_dynamic_key() and lockdep_unregister_key() to prevent this kind > > > of racing. 
For example, we can have an atomic counter associated with each > > The hazard pointer I proposed provides the exact synchronization ;-) > > What I am saying is that register_lock_class() is trying to find a newkey > while lockdep_unregister_key() is trying to take out an oldkey (newkey != > oldkey). If they happens in the same hash list, register_lock_class() will > put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr() > call from lockdep_unregister_key() is checking for oldkey which is not the > one saved in the hazard pointer. So lockdep_unregister_key() will return and > the key will be freed and reused while is_dynamic_key() may just have a > reference to the oldkey and trying to access its content which is invalid. I > think this is a possible scenario. > Oh, I see. And yes, the hazard pointers I proposed cannot prevent this race unfortunately. (Well, technically we can still use an extra slot to hold the key in the hash list iteration, but that would generate a lot of stores, so it won't be ideal). But... [...] > > > head of the hashed table. is_dynamic_key() can increment the counter if it > > > is not zero to proceed and lockdep_unregister_key() have to make sure it can > > > safely decrement the counter to 0 before going ahead. Just a thought! > > > Your idea inspires another solution with hazard pointers: we can put the pointer of the hash_head into the hazard pointer slot ;-)

in register_lock_class():

	/* hazptr: protect the key */
	WRITE_ONCE(*key_hazptr, keyhashentry(lock->key));

	/* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */
	smp_mb();

	if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) {
		return NULL;
	}

in lockdep_unregister_key():

	/* Wait until register_lock_class() has finished accessing k->hash_entry. */
	synchronize_lockdep_key_hazptr(keyhashentry(key));

which is more space efficient than a per-hash-list atomic or lock unless you have 1024 or more CPUs. Regards, Boqun > > > Cheers, > > > Longman > ^ permalink raw reply [flat|nested] 35+ messages in thread
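To make the waiting side of this sketch concrete, synchronize_lockdep_key_hazptr() could scan one hazard-pointer slot per CPU. The per-CPU variable and the function body below are illustrative assumptions based on the discussion, not code posted in this thread:

/* One hazard-pointer slot per CPU; hypothetical name. */
static DEFINE_PER_CPU(void *, lockdep_key_hazptr);

/* Wait until no CPU still protects @ptr (here: a hash_head pointer). */
static void synchronize_lockdep_key_hazptr(void *ptr)
{
	int cpu;

	/* Pairs with the smp_mb() after WRITE_ONCE() on the reader side. */
	smp_mb();

	for_each_possible_cpu(cpu) {
		void **slot = per_cpu_ptr(&lockdep_key_hazptr, cpu);

		while (READ_ONCE(*slot) == ptr)
			cpu_relax();
	}
}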
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 19:42 ` Boqun Feng @ 2025-03-25 23:20 ` Waiman Long 2025-03-26 5:25 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-25 23:20 UTC (permalink / raw) To: Boqun Feng, Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/25/25 3:42 PM, Boqun Feng wrote: > On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote: > [...] >>>>> That commit seemed fixing a race between disabling lockdep and >>>>> unregistering key, and most importantly, call zap_class() for the >>>>> unregistered key even if lockdep is disabled (debug_locks = 0). It might >>>>> be related, but I'm not sure that's the reason of putting >>>>> synchronize_rcu() there. Say you want to synchronize between >>>>> /proc/lockdep and lockdep_unregister_key(), and you have >>>>> synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side >>>>> critical section at /proc/lockdep? >>>> I agree that the commit that I mentioned is not relevant to the current >>>> case. You are right that is_dynamic_key() is the only function that is >>>> problematic, the other two are protected by the lockdep_lock. So they are >>>> safe. Anyway, I believe that the actual race happens in the iteration of the >>>> hashed list in is_dynamic_key(). The key that you save in the >>>> lockdep_key_hazptr in your proposed patch should never be the key (dead_key) >>> The key stored in lockdep_key_hazptr is the one that the rest of the >>> function will use after is_dynamic_key() return true. That is, >>> >>> CPU 0 CPU 1 >>> ===== ===== >>> WRITE_ONCE(*lockdep_key_hazptr, key); >>> smp_mb(); >>> >>> is_dynamic_key(): >>> hlist_for_each_entry_rcu(k, hash_head, hash_entry) { >>> if (k == key) { >>> found = true; >>> break; >>> } >>> } >>> lockdep_unregister_key(): >>> hlist_for_each_entry_rcu(k, hash_head, hash_entry) { >>> if (k == key) { >>> hlist_del_rcu(&k->hash_entry); >>> found = true; >>> break; >>> } >>> } >>> >>> smp_mb(); >>> >>> synchronize_lockdep_key_hazptr(): >>> for_each_possible_cpu(cpu) { >>> <wait for the hazptr slot on >>> that CPU to be not equal to >>> the removed key> >>> } >>> >>> >>> , so that if is_dynamic_key() finds a key was in the hash, even though >>> later on the key would be removed by lockdep_unregister_key(), the >>> hazard pointers guarantee lockdep_unregister_key() would wait for the >>> hazard pointer to release. >>> >>>> that is passed to lockdep_unregister_key(). In is_dynamic_key(): >>>> >>>> hlist_for_each_entry_rcu(k, hash_head, hash_entry) { >>>> if (k == key) { >>>> found = true; >>>> break; >>>> } >>>> } >>>> >>>> key != k (dead_key), but before accessing its content to get to hash_entry, >>> It is the dead_key. >>> >>>> an interrupt/NMI can happen. In the mean time, the structure holding the key >>>> is freed and its content can be overwritten with some garbage. When >>>> interrupt/NMI returns, hash_entry can point to anything leading to crash or >>>> an infinite loop. Perhaps we can use some kind of synchronization mechanism >>> No, hash_entry cannot be freed or overwritten because the user has >>> protect the key with hazard pointers, only when the user reset the >>> hazard pointer to NULL, lockdep_unregister_key() can then return and the >>> key can be freed. >>> >>>> between is_dynamic_key() and lockdep_unregister_key() to prevent this kind >>>> of racing. 
For example, we can have an atomic counter associated with each >>> The hazard pointer I proposed provides the exact synchronization ;-) >> What I am saying is that register_lock_class() is trying to find a newkey >> while lockdep_unregister_key() is trying to take out an oldkey (newkey != >> oldkey). If they happens in the same hash list, register_lock_class() will >> put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr() >> call from lockdep_unregister_key() is checking for oldkey which is not the >> one saved in the hazard pointer. So lockdep_unregister_key() will return and >> the key will be freed and reused while is_dynamic_key() may just have a >> reference to the oldkey and trying to access its content which is invalid. I >> think this is a possible scenario. >> > Oh, I see. And yes, the hazard pointers I proposed cannot prevent this > race unfortunately. (Well, technically we can still use an extra slot to > hold the key in the hash list iteration, but that would generates a lot > of stores, so it won't be ideal). But... > > [...] >>>> head of the hashed table. is_dynamic_key() can increment the counter if it >>>> is not zero to proceed and lockdep_unregister_key() have to make sure it can >>>> safely decrement the counter to 0 before going ahead. Just a thought! >>>> > Your idea inspires another solution with hazard pointers, we can > put the pointer of the hash_head into the hazard pointer slot ;-) > > in register_lock_class(): > > /* hazptr: protect the key */ > WRITE_ONCE(*key_hazptr, keyhashentry(lock->key)); > > /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */ > smp_mb(); > > if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) { > return NULL; > } > > in lockdep_unregister_key(): > > /* Wait until register_lock_class() has finished accessing k->hash_entry. */ > synchronize_lockdep_key_hazptr(keyhashentry(key)); > > > which is more space efficient than per hash list atomic or lock unless > you have 1024 or more CPUs. It looks like you are trying hard to find a use case for hazard pointer in the kernel :-) Anyway, that may work. The only problem that I see is the issue of nesting of an interrupt context on top of a task context. It is possible that the first use of a raw_spinlock may happen in an interrupt context. If the interrupt happens when the task has set the hazard pointer and iterating the hash list, the value of the hazard pointer may be overwritten. Alternatively we could have multiple slots for the hazard pointer, but that will make the code more complicated. Or we could disable interrupt before setting the hazard pointer. The solution that I am thinking about is to have a simple unfair rwlock to protect just the hash list iteration. lockdep_unregister_key() and lockdep_register_key() take the write lock with interrupt disabled. While is_dynamic_key() takes the read lock. Nesting in this case isn't a problem and we don't need RCU to protect the iteration process and so the last synchronize_rcu() call isn't needed. The level of contention should be low enough that live lock isn't an issue. Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
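As a rough illustration of the unfair-rwlock alternative above (the names and the atomic encoding, one writer bit plus a reader count, are assumptions rather than posted code), the hash-list protection could look like:

/* Hypothetical: bit 30 marks a writer, the lower bits count readers. */
#define LD_HASH_WRITER	(1 << 30)

static atomic_t ld_hash_lock = ATOMIC_INIT(0);

/* Read side, e.g. around the hash-list walk in is_dynamic_key(). */
static void ld_hash_read_lock(void)
{
	for (;;) {
		while (atomic_read(&ld_hash_lock) & LD_HASH_WRITER)
			cpu_relax();
		if (!(atomic_fetch_add(1, &ld_hash_lock) & LD_HASH_WRITER))
			return;
		atomic_dec(&ld_hash_lock);	/* a writer slipped in, retry */
	}
}

static void ld_hash_read_unlock(void)
{
	atomic_dec(&ld_hash_lock);
}

/* Write side, taken with interrupts disabled when registering/unregistering a key. */
static void ld_hash_write_lock(void)
{
	while (atomic_fetch_or(LD_HASH_WRITER, &ld_hash_lock) & LD_HASH_WRITER)
		cpu_relax();
	while (atomic_read(&ld_hash_lock) & ~LD_HASH_WRITER)
		cpu_relax();			/* wait for readers to drain */
}

static void ld_hash_write_unlock(void)
{
	atomic_andnot(LD_HASH_WRITER, &ld_hash_lock);
}

This keeps insertion and deletion serialized on the writer side while readers only spin when a writer is active, which is the low-contention assumption made above.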
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-25 23:20 ` Waiman Long @ 2025-03-26 5:25 ` Boqun Feng [not found] ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com> 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-26 5:25 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Tue, Mar 25, 2025 at 07:20:44PM -0400, Waiman Long wrote: > On 3/25/25 3:42 PM, Boqun Feng wrote: > > On Tue, Mar 25, 2025 at 03:23:30PM -0400, Waiman Long wrote: > > [...] > > > > > > That commit seemed fixing a race between disabling lockdep and > > > > > > unregistering key, and most importantly, call zap_class() for the > > > > > > unregistered key even if lockdep is disabled (debug_locks = 0). It might > > > > > > be related, but I'm not sure that's the reason of putting > > > > > > synchronize_rcu() there. Say you want to synchronize between > > > > > > /proc/lockdep and lockdep_unregister_key(), and you have > > > > > > synchronize_rcu() in lockdep_unregister_key(), what's the RCU read-side > > > > > > critical section at /proc/lockdep? > > > > > I agree that the commit that I mentioned is not relevant to the current > > > > > case. You are right that is_dynamic_key() is the only function that is > > > > > problematic, the other two are protected by the lockdep_lock. So they are > > > > > safe. Anyway, I believe that the actual race happens in the iteration of the > > > > > hashed list in is_dynamic_key(). The key that you save in the > > > > > lockdep_key_hazptr in your proposed patch should never be the key (dead_key) > > > > The key stored in lockdep_key_hazptr is the one that the rest of the > > > > function will use after is_dynamic_key() return true. That is, > > > > > > > > CPU 0 CPU 1 > > > > ===== ===== > > > > WRITE_ONCE(*lockdep_key_hazptr, key); > > > > smp_mb(); > > > > > > > > is_dynamic_key(): > > > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > > > if (k == key) { > > > > found = true; > > > > break; > > > > } > > > > } > > > > lockdep_unregister_key(): > > > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > > > if (k == key) { > > > > hlist_del_rcu(&k->hash_entry); > > > > found = true; > > > > break; > > > > } > > > > } > > > > > > > > smp_mb(); > > > > > > > > synchronize_lockdep_key_hazptr(): > > > > for_each_possible_cpu(cpu) { > > > > <wait for the hazptr slot on > > > > that CPU to be not equal to > > > > the removed key> > > > > } > > > > > > > > > > > > , so that if is_dynamic_key() finds a key was in the hash, even though > > > > later on the key would be removed by lockdep_unregister_key(), the > > > > hazard pointers guarantee lockdep_unregister_key() would wait for the > > > > hazard pointer to release. > > > > > > > > > that is passed to lockdep_unregister_key(). In is_dynamic_key(): > > > > > > > > > > hlist_for_each_entry_rcu(k, hash_head, hash_entry) { > > > > > if (k == key) { > > > > > found = true; > > > > > break; > > > > > } > > > > > } > > > > > > > > > > key != k (dead_key), but before accessing its content to get to hash_entry, > > > > It is the dead_key. > > > > > > > > > an interrupt/NMI can happen. In the mean time, the structure holding the key > > > > > is freed and its content can be overwritten with some garbage. When > > > > > interrupt/NMI returns, hash_entry can point to anything leading to crash or > > > > > an infinite loop. 
Perhaps we can use some kind of synchronization mechanism > > > > No, hash_entry cannot be freed or overwritten because the user has > > > > protect the key with hazard pointers, only when the user reset the > > > > hazard pointer to NULL, lockdep_unregister_key() can then return and the > > > > key can be freed. > > > > > > > > > between is_dynamic_key() and lockdep_unregister_key() to prevent this kind > > > > > of racing. For example, we can have an atomic counter associated with each > > > > The hazard pointer I proposed provides the exact synchronization ;-) > > > What I am saying is that register_lock_class() is trying to find a newkey > > > while lockdep_unregister_key() is trying to take out an oldkey (newkey != > > > oldkey). If they happens in the same hash list, register_lock_class() will > > > put newkey into the hazard pointer, but synchronize_lockdep_key_hazptr() > > > call from lockdep_unregister_key() is checking for oldkey which is not the > > > one saved in the hazard pointer. So lockdep_unregister_key() will return and > > > the key will be freed and reused while is_dynamic_key() may just have a > > > reference to the oldkey and trying to access its content which is invalid. I > > > think this is a possible scenario. > > > > > Oh, I see. And yes, the hazard pointers I proposed cannot prevent this > > race unfortunately. (Well, technically we can still use an extra slot to > > hold the key in the hash list iteration, but that would generates a lot > > of stores, so it won't be ideal). But... > > > > [...] > > > > > head of the hashed table. is_dynamic_key() can increment the counter if it > > > > > is not zero to proceed and lockdep_unregister_key() have to make sure it can > > > > > safely decrement the counter to 0 before going ahead. Just a thought! > > > > > > > Your idea inspires another solution with hazard pointers, we can > > put the pointer of the hash_head into the hazard pointer slot ;-) > > > > in register_lock_class(): > > > > /* hazptr: protect the key */ > > WRITE_ONCE(*key_hazptr, keyhashentry(lock->key)); > > > > /* Synchronizes with the smp_mb() in synchronize_lockdep_key_hazptr() */ > > smp_mb(); > > > > if (!static_obj(lock->key) && !is_dynamic_key(lock->key)) { > > return NULL; > > } > > > > in lockdep_unregister_key(): > > > > /* Wait until register_lock_class() has finished accessing k->hash_entry. */ > > synchronize_lockdep_key_hazptr(keyhashentry(key)); > > > > > > which is more space efficient than per hash list atomic or lock unless > > you have 1024 or more CPUs. > > It looks like you are trying hard to find a use case for hazard pointer in > the kernel :-) > Well, if it does the job, why not use it ;-) Also this shows how flexible hazard pointers can be. At least when using hazard pointers, the reader side of the hash list iteration is still lockless. Plus, since the synchronization part doesn't need to wait for the RCU readers in the whole system, it will be faster (I tried with the protecting-the-whole-hash-list approach as well, it's the same result on the tc command). This is why I choose to look into hazard pointers. Another mechanism can achieve the similar behavior is SRCU, but SRCU is slightly heavier compared to hazard pointers in this case (of course SRCU has more functionalities). We can provide a lockdep_unregister_key_nosync() without the synchronize_rcu() in it and let users do the synchronization, but it's going to be hard to enforce and review, especially when someone refactors the code and move the free code to somewhere else. 
> Anyway, that may work. The only problem that I see is the issue of nesting > of an interrupt context on top of a task context. It is possible that the > first use of a raw_spinlock may happen in an interrupt context. If the > interrupt happens when the task has set the hazard pointer and iterating the > hash list, the value of the hazard pointer may be overwritten. Alternatively > we could have multiple slots for the hazard pointer, but that will make the > code more complicated. Or we could disable interrupt before setting the > hazard pointer. Or we can use lockdep_recursion:

	preempt_disable();
	lockdep_recursion_inc();
	barrier();

	WRITE_ONCE(*hazptr, ...);

, which should prevent lockdep re-entry from irq. > > The solution that I am thinking about is to have a simple unfair rwlock to > protect just the hash list iteration. lockdep_unregister_key() and > lockdep_register_key() take the write lock with interrupt disabled. While > is_dynamic_key() takes the read lock. Nesting in this case isn't a problem > and we don't need RCU to protect the iteration process and so the last > synchronize_rcu() call isn't needed. The level of contention should be low > enough that live lock isn't an issue. > This could work; one thing, though, is that locks don't compose. Using a hash write_lock in lockdep_unregister_key() will create a lockdep_lock() -> "hash write_lock" dependency, and that means you cannot lockdep_lock() while you're holding a hash read_lock. That's not the case today, but it certainly complicates the locking design inside lockdep, where there's no lockdep to help ;-) Regards, Boqun > Cheers, > Longman > > ^ permalink raw reply [flat|nested] 35+ messages in thread
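Wrapped into a pair of helpers, the lockdep_recursion guard sketched in this mail might look as follows; only the inc/barrier/WRITE_ONCE sequence comes from the mail itself, the helper names and the release path are assumptions:

static void lockdep_key_hazptr_set(void **hazptr, void *ptr)
{
	preempt_disable();
	lockdep_recursion_inc();	/* keep irq/NMI lockdep entry out of here */
	barrier();

	WRITE_ONCE(*hazptr, ptr);
	smp_mb();			/* pairs with smp_mb() in the synchronize side */
}

static void lockdep_key_hazptr_clear(void **hazptr)
{
	smp_mb();			/* order the protected accesses before the release */
	WRITE_ONCE(*hazptr, NULL);

	barrier();
	lockdep_recursion_finish();
	preempt_enable();
}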
[parent not found: <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>]
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization [not found] ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com> @ 2025-03-26 16:40 ` Waiman Long 2025-03-26 16:47 ` Boqun Feng 2025-03-31 17:26 ` Boqun Feng 1 sibling, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-26 16:40 UTC (permalink / raw) To: Waiman Long, Boqun Feng Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/26/25 11:39 AM, Waiman Long wrote: > On 3/26/25 1:25 AM, Boqun Feng wrote: >>> It looks like you are trying hard to find a use case for hazard pointer in >>> the kernel 🙂 >>> >> Well, if it does the job, why not use it 😉 Also this shows how >> flexible hazard pointers can be. >> >> At least when using hazard pointers, the reader side of the hash list >> iteration is still lockless. Plus, since the synchronization part >> doesn't need to wait for the RCU readers in the whole system, it will be >> faster (I tried with the protecting-the-whole-hash-list approach as >> well, it's the same result on the tc command). This is why I choose to >> look into hazard pointers. Another mechanism can achieve the similar >> behavior is SRCU, but SRCU is slightly heavier compared to hazard >> pointers in this case (of course SRCU has more functionalities). >> >> We can provide a lockdep_unregister_key_nosync() without the >> synchronize_rcu() in it and let users do the synchronization, but it's >> going to be hard to enforce and review, especially when someone >> refactors the code and move the free code to somewhere else. > Providing a second API and ask callers to do the right thing is > probably not a good idea and mistake is going to be made sooner or later. >>> Anyway, that may work. The only problem that I see is the issue of nesting >>> of an interrupt context on top of a task context. It is possible that the >>> first use of a raw_spinlock may happen in an interrupt context. If the >>> interrupt happens when the task has set the hazard pointer and iterating the >>> hash list, the value of the hazard pointer may be overwritten. Alternatively >>> we could have multiple slots for the hazard pointer, but that will make the >>> code more complicated. Or we could disable interrupt before setting the >>> hazard pointer. >> Or we can use lockdep_recursion: >> >> preempt_disable(); >> lockdep_recursion_inc(); >> barrier(); >> >> WRITE_ONCE(*hazptr, ...); >> >> , it should prevent the re-entrant of lockdep in irq. > That will probably work. Or we can disable irq. I am fine with both. >>> The solution that I am thinking about is to have a simple unfair rwlock to >>> protect just the hash list iteration. lockdep_unregister_key() and >>> lockdep_register_key() take the write lock with interrupt disabled. While >>> is_dynamic_key() takes the read lock. Nesting in this case isn't a problem >>> and we don't need RCU to protect the iteration process and so the last >>> synchronize_rcu() call isn't needed. The level of contention should be low >>> enough that live lock isn't an issue. >>> >> This could work, one thing though is that locks don't compose. 
Using a >> hash write_lock in lockdep_unregister_key() will create a lockdep_lock() >> -> "hash write_lock" dependency, and that means you cannot >> lockdep_lock() while you're holding a hash read_lock, although it's >> not the case today, but it certainly complicates the locking design >> inside lockdep where there's no lockdep to help 😉 > > Thinking about it more, doing it in a lockless way is probably a good > idea. > If we are using a hazard pointer for synchronization, should we also take off "_rcu" from the list iteration/insertion/deletion macros to avoid the confusion that RCU is being used? Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 16:40 ` Waiman Long @ 2025-03-26 16:47 ` Boqun Feng 2025-03-26 17:02 ` Waiman Long 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-26 16:47 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote: > On 3/26/25 11:39 AM, Waiman Long wrote: > > On 3/26/25 1:25 AM, Boqun Feng wrote: > > > > It looks like you are trying hard to find a use case for hazard pointer in > > > > the kernel 🙂 > > > > > > > Well, if it does the job, why not use it 😉 Also this shows how > > > flexible hazard pointers can be. > > > > > > At least when using hazard pointers, the reader side of the hash list > > > iteration is still lockless. Plus, since the synchronization part > > > doesn't need to wait for the RCU readers in the whole system, it will be > > > faster (I tried with the protecting-the-whole-hash-list approach as > > > well, it's the same result on the tc command). This is why I choose to > > > look into hazard pointers. Another mechanism can achieve the similar > > > behavior is SRCU, but SRCU is slightly heavier compared to hazard > > > pointers in this case (of course SRCU has more functionalities). > > > > > > We can provide a lockdep_unregister_key_nosync() without the > > > synchronize_rcu() in it and let users do the synchronization, but it's > > > going to be hard to enforce and review, especially when someone > > > refactors the code and move the free code to somewhere else. > > Providing a second API and ask callers to do the right thing is probably > > not a good idea and mistake is going to be made sooner or later. > > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > > of an interrupt context on top of a task context. It is possible that the > > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > > interrupt happens when the task has set the hazard pointer and iterating the > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > > we could have multiple slots for the hazard pointer, but that will make the > > > > code more complicated. Or we could disable interrupt before setting the > > > > hazard pointer. > > > Or we can use lockdep_recursion: > > > > > > preempt_disable(); > > > lockdep_recursion_inc(); > > > barrier(); > > > > > > WRITE_ONCE(*hazptr, ...); > > > > > > , it should prevent the re-entrant of lockdep in irq. > > That will probably work. Or we can disable irq. I am fine with both. > > > > The solution that I am thinking about is to have a simple unfair rwlock to > > > > protect just the hash list iteration. lockdep_unregister_key() and > > > > lockdep_register_key() take the write lock with interrupt disabled. While > > > > is_dynamic_key() takes the read lock. Nesting in this case isn't a problem > > > > and we don't need RCU to protect the iteration process and so the last > > > > synchronize_rcu() call isn't needed. The level of contention should be low > > > > enough that live lock isn't an issue. > > > > > > > This could work, one thing though is that locks don't compose. 
Using a > > > hash write_lock in lockdep_unregister_key() will create a lockdep_lock() > > > -> "hash write_lock" dependency, and that means you cannot > > > lockdep_lock() while you're holding a hash read_lock, although it's > > > not the case today, but it certainly complicates the locking design > > > inside lockdep where there's no lockdep to help 😉 > > > > Thinking about it more, doing it in a lockless way is probably a good > > idea. > > > If we are using hazard pointer for synchronization, should we also take off > "_rcu" from the list iteration/insertion/deletion macros to avoid the > confusion that RCU is being used? > We can, but we probably want to introduce a new set of APIs with the suffix "_lockless" or something, because they will still need to work in a lockless fashion similar to RCU list iteration/insertion/deletion. Regards, Boqun > Cheers, > Longman > ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 16:47 ` Boqun Feng @ 2025-03-26 17:02 ` Waiman Long 2025-03-26 17:10 ` Paul E. McKenney 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-26 17:02 UTC (permalink / raw) To: Boqun Feng, Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/26/25 12:47 PM, Boqun Feng wrote: > On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote: >> On 3/26/25 11:39 AM, Waiman Long wrote: >>> On 3/26/25 1:25 AM, Boqun Feng wrote: >>>>> It looks like you are trying hard to find a use case for hazard pointer in >>>>> the kernel 🙂 >>>>> >>>> Well, if it does the job, why not use it 😉 Also this shows how >>>> flexible hazard pointers can be. >>>> >>>> At least when using hazard pointers, the reader side of the hash list >>>> iteration is still lockless. Plus, since the synchronization part >>>> doesn't need to wait for the RCU readers in the whole system, it will be >>>> faster (I tried with the protecting-the-whole-hash-list approach as >>>> well, it's the same result on the tc command). This is why I choose to >>>> look into hazard pointers. Another mechanism can achieve the similar >>>> behavior is SRCU, but SRCU is slightly heavier compared to hazard >>>> pointers in this case (of course SRCU has more functionalities). >>>> >>>> We can provide a lockdep_unregister_key_nosync() without the >>>> synchronize_rcu() in it and let users do the synchronization, but it's >>>> going to be hard to enforce and review, especially when someone >>>> refactors the code and move the free code to somewhere else. >>> Providing a second API and ask callers to do the right thing is probably >>> not a good idea and mistake is going to be made sooner or later. >>>>> Anyway, that may work. The only problem that I see is the issue of nesting >>>>> of an interrupt context on top of a task context. It is possible that the >>>>> first use of a raw_spinlock may happen in an interrupt context. If the >>>>> interrupt happens when the task has set the hazard pointer and iterating the >>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively >>>>> we could have multiple slots for the hazard pointer, but that will make the >>>>> code more complicated. Or we could disable interrupt before setting the >>>>> hazard pointer. >>>> Or we can use lockdep_recursion: >>>> >>>> preempt_disable(); >>>> lockdep_recursion_inc(); >>>> barrier(); >>>> >>>> WRITE_ONCE(*hazptr, ...); >>>> >>>> , it should prevent the re-entrant of lockdep in irq. >>> That will probably work. Or we can disable irq. I am fine with both. >>>>> The solution that I am thinking about is to have a simple unfair rwlock to >>>>> protect just the hash list iteration. lockdep_unregister_key() and >>>>> lockdep_register_key() take the write lock with interrupt disabled. While >>>>> is_dynamic_key() takes the read lock. Nesting in this case isn't a problem >>>>> and we don't need RCU to protect the iteration process and so the last >>>>> synchronize_rcu() call isn't needed. The level of contention should be low >>>>> enough that live lock isn't an issue. >>>>> >>>> This could work, one thing though is that locks don't compose. 
Using a >>>> hash write_lock in lockdep_unregister_key() will create a lockdep_lock() >>>> -> "hash write_lock" dependency, and that means you cannot >>>> lockdep_lock() while you're holding a hash read_lock, although it's >>>> not the case today, but it certainly complicates the locking design >>>> inside lockdep where there's no lockdep to help 😉 >>> Thinking about it more, doing it in a lockless way is probably a good >>> idea. >>> >> If we are using hazard pointer for synchronization, should we also take off >> "_rcu" from the list iteration/insertion/deletion macros to avoid the >> confusion that RCU is being used? >> > We can, but we probably want to introduce a new set of API with suffix > "_lockless" or something because they will still need a lockless fashion > similar to RCU list iteration/insertion/deletion. The lockless part is just the iteration of the list. Insertion and deletion are protected by lockdep_lock(). The current hlist_*_rcu() macros are doing the right things for the lockless use case too. We can either document that RCU is not being used or have some _lockless helpers that just call the _rcu equivalent. Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
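If such _lockless helpers were added, they could be thin aliases of the existing RCU list macros, e.g. (hypothetical names, not an existing kernel API):

/* Same lockless semantics, without implying RCU in the name. */
#define hlist_for_each_entry_lockless(pos, head, member) \
	hlist_for_each_entry_rcu(pos, head, member)
#define hlist_add_head_lockless(n, h)	hlist_add_head_rcu(n, h)
#define hlist_del_lockless(n)		hlist_del_rcu(n)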
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 17:02 ` Waiman Long @ 2025-03-26 17:10 ` Paul E. McKenney 2025-03-26 18:42 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Paul E. McKenney @ 2025-03-26 17:10 UTC (permalink / raw) To: Waiman Long Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote: > > On 3/26/25 12:47 PM, Boqun Feng wrote: > > On Wed, Mar 26, 2025 at 12:40:59PM -0400, Waiman Long wrote: > > > On 3/26/25 11:39 AM, Waiman Long wrote: > > > > On 3/26/25 1:25 AM, Boqun Feng wrote: > > > > > > It looks like you are trying hard to find a use case for hazard pointer in > > > > > > the kernel 🙂 > > > > > > > > > > > Well, if it does the job, why not use it 😉 Also this shows how > > > > > flexible hazard pointers can be. > > > > > > > > > > At least when using hazard pointers, the reader side of the hash list > > > > > iteration is still lockless. Plus, since the synchronization part > > > > > doesn't need to wait for the RCU readers in the whole system, it will be > > > > > faster (I tried with the protecting-the-whole-hash-list approach as > > > > > well, it's the same result on the tc command). This is why I choose to > > > > > look into hazard pointers. Another mechanism can achieve the similar > > > > > behavior is SRCU, but SRCU is slightly heavier compared to hazard > > > > > pointers in this case (of course SRCU has more functionalities). > > > > > > > > > > We can provide a lockdep_unregister_key_nosync() without the > > > > > synchronize_rcu() in it and let users do the synchronization, but it's > > > > > going to be hard to enforce and review, especially when someone > > > > > refactors the code and move the free code to somewhere else. > > > > Providing a second API and ask callers to do the right thing is probably > > > > not a good idea and mistake is going to be made sooner or later. > > > > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > > > > of an interrupt context on top of a task context. It is possible that the > > > > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > > > > interrupt happens when the task has set the hazard pointer and iterating the > > > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > > > > we could have multiple slots for the hazard pointer, but that will make the > > > > > > code more complicated. Or we could disable interrupt before setting the > > > > > > hazard pointer. > > > > > Or we can use lockdep_recursion: > > > > > > > > > > preempt_disable(); > > > > > lockdep_recursion_inc(); > > > > > barrier(); > > > > > > > > > > WRITE_ONCE(*hazptr, ...); > > > > > > > > > > , it should prevent the re-entrant of lockdep in irq. > > > > That will probably work. Or we can disable irq. I am fine with both. > > > > > > The solution that I am thinking about is to have a simple unfair rwlock to > > > > > > protect just the hash list iteration. lockdep_unregister_key() and > > > > > > lockdep_register_key() take the write lock with interrupt disabled. While > > > > > > is_dynamic_key() takes the read lock. Nesting in this case isn't a problem > > > > > > and we don't need RCU to protect the iteration process and so the last > > > > > > synchronize_rcu() call isn't needed. 
The level of contention should be low > > > > > > enough that live lock isn't an issue. > > > > > > > > > > > This could work, one thing though is that locks don't compose. Using a > > > > > hash write_lock in lockdep_unregister_key() will create a lockdep_lock() > > > > > -> "hash write_lock" dependency, and that means you cannot > > > > > lockdep_lock() while you're holding a hash read_lock, although it's > > > > > not the case today, but it certainly complicates the locking design > > > > > inside lockdep where there's no lockdep to help 😉 > > > > Thinking about it more, doing it in a lockless way is probably a good > > > > idea. > > > > > > > If we are using hazard pointer for synchronization, should we also take off > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the > > > confusion that RCU is being used? > > > > > We can, but we probably want to introduce a new set of API with suffix > > "_lockless" or something because they will still need a lockless fashion > > similar to RCU list iteration/insertion/deletion. > > The lockless part is just the iteration of the list. Insertion and deletion > is protected by lockdep_lock(). > > The current hlist_*_rcu() macros are doing the right things for lockless use > case too. We can either document that RCU is not being used or have some > _lockless helpers that just call the _rcu equivalent. We used to have _lockless helper, but we got rid of them. Not necessarily meaning that we should not add them back in, but... ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 17:10 ` Paul E. McKenney @ 2025-03-26 18:42 ` Boqun Feng 2025-03-26 21:37 ` Paul E. McKenney 2025-03-31 16:48 ` Breno Leitao 0 siblings, 2 replies; 35+ messages in thread From: Boqun Feng @ 2025-03-26 18:42 UTC (permalink / raw) To: Paul E. McKenney Cc: Waiman Long, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote: > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote: [...] > > > > > Thinking about it more, doing it in a lockless way is probably a good > > > > > idea. > > > > > > > > > If we are using hazard pointer for synchronization, should we also take off > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the > > > > confusion that RCU is being used? > > > > > > > We can, but we probably want to introduce a new set of API with suffix > > > "_lockless" or something because they will still need a lockless fashion > > > similar to RCU list iteration/insertion/deletion. > > > > The lockless part is just the iteration of the list. Insertion and deletion > > is protected by lockdep_lock(). > > > > The current hlist_*_rcu() macros are doing the right things for lockless use > > case too. We can either document that RCU is not being used or have some > > _lockless helpers that just call the _rcu equivalent. > > We used to have _lockless helper, but we got rid of them. Not necessarily > meaning that we should not add them back in, but... ;-) > I will probably go with using *_rcu() first with some comments, if this "hazard pointers for hash table" is a good idea in other places, we can add *_hazptr() or pick a better name then. Regards, Boqun > Thanx, Paul ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 18:42 ` Boqun Feng @ 2025-03-26 21:37 ` Paul E. McKenney 2025-03-31 16:48 ` Breno Leitao 1 sibling, 0 replies; 35+ messages in thread From: Paul E. McKenney @ 2025-03-26 21:37 UTC (permalink / raw) To: Boqun Feng Cc: Waiman Long, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote: > On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote: > > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote: > [...] > > > > > > Thinking about it more, doing it in a lockless way is probably a good > > > > > > idea. > > > > > > > > > > > If we are using hazard pointer for synchronization, should we also take off > > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the > > > > > confusion that RCU is being used? > > > > > > > > > We can, but we probably want to introduce a new set of API with suffix > > > > "_lockless" or something because they will still need a lockless fashion > > > > similar to RCU list iteration/insertion/deletion. > > > > > > The lockless part is just the iteration of the list. Insertion and deletion > > > is protected by lockdep_lock(). > > > > > > The current hlist_*_rcu() macros are doing the right things for lockless use > > > case too. We can either document that RCU is not being used or have some > > > _lockless helpers that just call the _rcu equivalent. > > > > We used to have _lockless helper, but we got rid of them. Not necessarily > > meaning that we should not add them back in, but... ;-) > > > > I will probably go with using *_rcu() first with some comments, if this > "hazard pointers for hash table" is a good idea in other places, we can > add *_hazptr() or pick a better name then. Works for me! Thanx, Paul ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-26 18:42 ` Boqun Feng 2025-03-26 21:37 ` Paul E. McKenney @ 2025-03-31 16:48 ` Breno Leitao 2025-03-31 17:34 ` Boqun Feng 1 sibling, 1 reply; 35+ messages in thread From: Breno Leitao @ 2025-03-31 16:48 UTC (permalink / raw) To: Boqun Feng Cc: Paul E. McKenney, Waiman Long, Eric Dumazet, Peter Zijlstra, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren Hello Boqun, Waiman On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote: > On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote: > > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote: > [...] > > > > > > Thinking about it more, doing it in a lockless way is probably a good > > > > > > idea. > > > > > > > > > > > If we are using hazard pointer for synchronization, should we also take off > > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the > > > > > confusion that RCU is being used? > > > > > > > > > We can, but we probably want to introduce a new set of API with suffix > > > > "_lockless" or something because they will still need a lockless fashion > > > > similar to RCU list iteration/insertion/deletion. > > > > > > The lockless part is just the iteration of the list. Insertion and deletion > > > is protected by lockdep_lock(). > > > > > > The current hlist_*_rcu() macros are doing the right things for lockless use > > > case too. We can either document that RCU is not being used or have some > > > _lockless helpers that just call the _rcu equivalent. > > > > We used to have _lockless helper, but we got rid of them. Not necessarily > > meaning that we should not add them back in, but... ;-) > > > > I will probably go with using *_rcu() first with some comments, if this > "hazard pointers for hash table" is a good idea in other places, we can > add *_hazptr() or pick a better name then. I am trying to figure out what the next steps are to get this issue solved. Would you mind helping me understand what _rcu() function you are suggesting and what it will replace? Thank you, --breno ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 16:48 ` Breno Leitao @ 2025-03-31 17:34 ` Boqun Feng 0 siblings, 0 replies; 35+ messages in thread From: Boqun Feng @ 2025-03-31 17:34 UTC (permalink / raw) To: Breno Leitao Cc: Paul E. McKenney, Waiman Long, Eric Dumazet, Peter Zijlstra, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Mon, Mar 31, 2025 at 09:48:34AM -0700, Breno Leitao wrote: > Hello Boqun, Waimn > > On Wed, Mar 26, 2025 at 11:42:37AM -0700, Boqun Feng wrote: > > On Wed, Mar 26, 2025 at 10:10:28AM -0700, Paul E. McKenney wrote: > > > On Wed, Mar 26, 2025 at 01:02:12PM -0400, Waiman Long wrote: > > [...] > > > > > > > Thinking about it more, doing it in a lockless way is probably a good > > > > > > > idea. > > > > > > > > > > > > > If we are using hazard pointer for synchronization, should we also take off > > > > > > "_rcu" from the list iteration/insertion/deletion macros to avoid the > > > > > > confusion that RCU is being used? > > > > > > > > > > > We can, but we probably want to introduce a new set of API with suffix > > > > > "_lockless" or something because they will still need a lockless fashion > > > > > similar to RCU list iteration/insertion/deletion. > > > > > > > > The lockless part is just the iteration of the list. Insertion and deletion > > > > is protected by lockdep_lock(). > > > > > > > > The current hlist_*_rcu() macros are doing the right things for lockless use > > > > case too. We can either document that RCU is not being used or have some > > > > _lockless helpers that just call the _rcu equivalent. > > > > > > We used to have _lockless helper, but we got rid of them. Not necessarily > > > meaning that we should not add them back in, but... ;-) > > > > > > > I will probably go with using *_rcu() first with some comments, if this > > "hazard pointers for hash table" is a good idea in other places, we can > > add *_hazptr() or pick a better name then. > > I am trying to figure out what are the next steps to get this issue > solve. > I will send out a serise including introduction of simple hazard pointers and use it in lockdep for this case, hopefully that can resolve your issue. > Would you mind help me to understand what _rcu() fuction you are > suggesting and what will it replace? > The _rcu() functions we are talking about are: hlist_for_each_entry_rcu() in is_dynamic_key(), hlist_add_head_rcu() in lockdep_register_key() and hlist_del_rcu() in lockdep_unregister_key(), because if we move to hazptr, they are technically not protected by RCU. But the implementation of these functions is still correct with hazptr, it's just their names might be confusing, and we may change them later. Hope this helps. Regards, Boqun > Thank you, > --breno ^ permalink raw reply [flat|nested] 35+ messages in thread
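For reference, the hash-list walk in question is the one in is_dynamic_key(); simplified (the static_obj() and debug checks in kernel/locking/lockdep.c are omitted here), it is essentially the loop quoted earlier in the thread:

static bool is_dynamic_key(const struct lock_class_key *key)
{
	struct hlist_head *hash_head = keyhashentry(key);
	struct lock_class_key *k;
	bool found = false;

	rcu_read_lock();
	hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
		if (k == key) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();

	return found;
}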
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization [not found] ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com> 2025-03-26 16:40 ` Waiman Long @ 2025-03-31 17:26 ` Boqun Feng 2025-03-31 17:33 ` Waiman Long 2025-03-31 17:42 ` Eric Dumazet 1 sibling, 2 replies; 35+ messages in thread From: Boqun Feng @ 2025-03-31 17:26 UTC (permalink / raw) To: Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: [...] > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > of an interrupt context on top of a task context. It is possible that the > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > interrupt happens when the task has set the hazard pointer and iterating the > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > we could have multiple slots for the hazard pointer, but that will make the > > > code more complicated. Or we could disable interrupt before setting the > > > hazard pointer. > > Or we can use lockdep_recursion: > > > > preempt_disable(); > > lockdep_recursion_inc(); > > barrier(); > > > > WRITE_ONCE(*hazptr, ...); > > > > , it should prevent the re-entrant of lockdep in irq. > That will probably work. Or we can disable irq. I am fine with both. Disabling irq may not work in this case, because an NMI can also happen and call register_lock_class(). I'm experimenting with a new idea here; it might be better (for general cases), and it has a similar spirit in that we could move the protection scope of a hazard pointer from a key to a hash_list: we can introduce a wildcard address, and whenever we do a synchronize_hazptr(), if the hazptr slot is equal to the wildcard, we treat it as matching any ptr, hence synchronize_hazptr() will still wait until it's zeroed. Not only could this help in the nesting case, it can also be used if the users want to protect multiple things with this simple hazard pointer implementation. Regards, Boqun [..] ^ permalink raw reply [flat|nested] 35+ messages in thread
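In code, the wildcard only changes the wait condition on the synchronize side; a sketch under assumed names (the wildcard constant and the per-CPU slot are not from a posted patch):

/* A reserved address that can never be a real object pointer. */
#define HAZPTR_WILDCARD		((void *)1UL)

static DEFINE_PER_CPU(void *, lockdep_key_hazptr);

static void synchronize_hazptr(void *ptr)
{
	int cpu;

	smp_mb();	/* pairs with the readers' smp_mb() after publishing */

	for_each_possible_cpu(cpu) {
		void **slot = per_cpu_ptr(&lockdep_key_hazptr, cpu);
		void *val;

		/* Wait while this CPU protects @ptr or "anything" (wildcard). */
		while ((val = READ_ONCE(*slot)) == ptr || val == HAZPTR_WILDCARD)
			cpu_relax();
	}
}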
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 17:26 ` Boqun Feng @ 2025-03-31 17:33 ` Waiman Long 2025-03-31 18:33 ` Paul E. McKenney 2025-03-31 17:42 ` Eric Dumazet 1 sibling, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-31 17:33 UTC (permalink / raw) To: Boqun Feng, Waiman Long Cc: Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 3/31/25 1:26 PM, Boqun Feng wrote: > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: > [...] >>>> Anyway, that may work. The only problem that I see is the issue of nesting >>>> of an interrupt context on top of a task context. It is possible that the >>>> first use of a raw_spinlock may happen in an interrupt context. If the >>>> interrupt happens when the task has set the hazard pointer and iterating the >>>> hash list, the value of the hazard pointer may be overwritten. Alternatively >>>> we could have multiple slots for the hazard pointer, but that will make the >>>> code more complicated. Or we could disable interrupt before setting the >>>> hazard pointer. >>> Or we can use lockdep_recursion: >>> >>> preempt_disable(); >>> lockdep_recursion_inc(); >>> barrier(); >>> >>> WRITE_ONCE(*hazptr, ...); >>> >>> , it should prevent the re-entrant of lockdep in irq. >> That will probably work. Or we can disable irq. I am fine with both. > Disabling irq may not work in this case, because an NMI can also happen > and call register_lock_class(). Right, disabling irq doesn't work with NMI. So incrementing the recursion count is likely the way to go and I think it will work even in the NMI case. > > I'm experimenting a new idea here, it might be better (for general > cases), and this has the similar spirit that we could move the > protection scope of a hazard pointer from a key to a hash_list: we can > introduce a wildcard address, and whenever we do a synchronize_hazptr(), > if the hazptr slot equal to wildcard, we treat as it matches to any ptr, > hence synchronize_hazptr() will still wait until it's zero'd. Not only > this could help in the nesting case, it can also be used if the users > want to protect multiple things with this simple hazard pointer > implementation. I think it is a good idea to add a wildcard for the general use case. Setting the hazptr to the list head will be enough for this particular case. Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 17:33 ` Waiman Long @ 2025-03-31 18:33 ` Paul E. McKenney 2025-03-31 18:57 ` Waiman Long 0 siblings, 1 reply; 35+ messages in thread From: Paul E. McKenney @ 2025-03-31 18:33 UTC (permalink / raw) To: Waiman Long Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote: > On 3/31/25 1:26 PM, Boqun Feng wrote: > > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: > > [...] > > > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > > > of an interrupt context on top of a task context. It is possible that the > > > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > > > interrupt happens when the task has set the hazard pointer and iterating the > > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > > > we could have multiple slots for the hazard pointer, but that will make the > > > > > code more complicated. Or we could disable interrupt before setting the > > > > > hazard pointer. > > > > Or we can use lockdep_recursion: > > > > > > > > preempt_disable(); > > > > lockdep_recursion_inc(); > > > > barrier(); > > > > > > > > WRITE_ONCE(*hazptr, ...); > > > > > > > > , it should prevent the re-entrant of lockdep in irq. > > > That will probably work. Or we can disable irq. I am fine with both. > > Disabling irq may not work in this case, because an NMI can also happen > > and call register_lock_class(). > Right, disabling irq doesn't work with NMI. So incrementing the recursion > count is likely the way to go and I think it will work even in the NMI case. > > > > > I'm experimenting a new idea here, it might be better (for general > > cases), and this has the similar spirit that we could move the > > protection scope of a hazard pointer from a key to a hash_list: we can > > introduce a wildcard address, and whenever we do a synchronize_hazptr(), > > if the hazptr slot equal to wildcard, we treat as it matches to any ptr, > > hence synchronize_hazptr() will still wait until it's zero'd. Not only > > this could help in the nesting case, it can also be used if the users > > want to protect multiple things with this simple hazard pointer > > implementation. > > I think it is a good idea to add a wildcard for the general use case. > Setting the hazptr to the list head will be enough for this particular case. Careful! If we enable use of wildcards outside of the special case of synchronize_hazptr(), we give up the small-memory-footprint advantages of hazard pointers. You end up having to wait on all hazard-pointer readers, which was exactly why RCU was troublesome here. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 18:33 ` Paul E. McKenney @ 2025-03-31 18:57 ` Waiman Long 2025-03-31 21:21 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-03-31 18:57 UTC (permalink / raw) To: paulmck, Waiman Long Cc: Boqun Feng, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On 3/31/25 2:33 PM, Paul E. McKenney wrote: > On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote: >> On 3/31/25 1:26 PM, Boqun Feng wrote: >>> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: >>> [...] >>>>>> Anyway, that may work. The only problem that I see is the issue of nesting >>>>>> of an interrupt context on top of a task context. It is possible that the >>>>>> first use of a raw_spinlock may happen in an interrupt context. If the >>>>>> interrupt happens when the task has set the hazard pointer and iterating the >>>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively >>>>>> we could have multiple slots for the hazard pointer, but that will make the >>>>>> code more complicated. Or we could disable interrupt before setting the >>>>>> hazard pointer. >>>>> Or we can use lockdep_recursion: >>>>> >>>>> preempt_disable(); >>>>> lockdep_recursion_inc(); >>>>> barrier(); >>>>> >>>>> WRITE_ONCE(*hazptr, ...); >>>>> >>>>> , it should prevent the re-entrant of lockdep in irq. >>>> That will probably work. Or we can disable irq. I am fine with both. >>> Disabling irq may not work in this case, because an NMI can also happen >>> and call register_lock_class(). >> Right, disabling irq doesn't work with NMI. So incrementing the recursion >> count is likely the way to go and I think it will work even in the NMI case. >> >>> I'm experimenting a new idea here, it might be better (for general >>> cases), and this has the similar spirit that we could move the >>> protection scope of a hazard pointer from a key to a hash_list: we can >>> introduce a wildcard address, and whenever we do a synchronize_hazptr(), >>> if the hazptr slot equal to wildcard, we treat as it matches to any ptr, >>> hence synchronize_hazptr() will still wait until it's zero'd. Not only >>> this could help in the nesting case, it can also be used if the users >>> want to protect multiple things with this simple hazard pointer >>> implementation. >> I think it is a good idea to add a wildcard for the general use case. >> Setting the hazptr to the list head will be enough for this particular case. > Careful! If we enable use of wildcards outside of the special case > of synchronize_hazptr(), we give up the small-memory-footprint advantages > of hazard pointers. You end up having to wait on all hazard-pointer > readers, which was exactly why RCU was troublesome here. ;-) If the plan is to have one global set of hazard pointers for all the possible use cases, supporting wildcard may be a problem. If we allow different sets of hazard pointers for different use cases, it will be less an issue. Anyway, maybe we should skip wildcard for the current case so that we have more time to think through it first. Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 18:57 ` Waiman Long @ 2025-03-31 21:21 ` Boqun Feng 2025-03-31 21:47 ` Waiman Long 0 siblings, 1 reply; 35+ messages in thread From: Boqun Feng @ 2025-03-31 21:21 UTC (permalink / raw) To: Waiman Long Cc: paulmck, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On Mon, Mar 31, 2025 at 02:57:20PM -0400, Waiman Long wrote: > On 3/31/25 2:33 PM, Paul E. McKenney wrote: > > On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote: > > > On 3/31/25 1:26 PM, Boqun Feng wrote: > > > > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: > > > > [...] > > > > > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > > > > > of an interrupt context on top of a task context. It is possible that the > > > > > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > > > > > interrupt happens when the task has set the hazard pointer and iterating the > > > > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > > > > > we could have multiple slots for the hazard pointer, but that will make the > > > > > > > code more complicated. Or we could disable interrupt before setting the > > > > > > > hazard pointer. > > > > > > Or we can use lockdep_recursion: > > > > > > > > > > > > preempt_disable(); > > > > > > lockdep_recursion_inc(); > > > > > > barrier(); > > > > > > > > > > > > WRITE_ONCE(*hazptr, ...); > > > > > > > > > > > > , it should prevent the re-entrant of lockdep in irq. > > > > > That will probably work. Or we can disable irq. I am fine with both. > > > > Disabling irq may not work in this case, because an NMI can also happen > > > > and call register_lock_class(). > > > Right, disabling irq doesn't work with NMI. So incrementing the recursion > > > count is likely the way to go and I think it will work even in the NMI case. > > > > > > > I'm experimenting a new idea here, it might be better (for general > > > > cases), and this has the similar spirit that we could move the > > > > protection scope of a hazard pointer from a key to a hash_list: we can > > > > introduce a wildcard address, and whenever we do a synchronize_hazptr(), > > > > if the hazptr slot equal to wildcard, we treat as it matches to any ptr, > > > > hence synchronize_hazptr() will still wait until it's zero'd. Not only > > > > this could help in the nesting case, it can also be used if the users > > > > want to protect multiple things with this simple hazard pointer > > > > implementation. > > > I think it is a good idea to add a wildcard for the general use case. > > > Setting the hazptr to the list head will be enough for this particular case. > > Careful! If we enable use of wildcards outside of the special case > > of synchronize_hazptr(), we give up the small-memory-footprint advantages > > of hazard pointers. You end up having to wait on all hazard-pointer > > readers, which was exactly why RCU was troublesome here. ;-) Technically, only the hazard-pointer readers that have switched to wildcard mode because multiple hazptr critical sections ;-) > > If the plan is to have one global set of hazard pointers for all the A global set of hazard pointers for all the possible use cases is the current plan (at least it should be when we have fully-featured hazptr [1]). 
Because the hazard pointer value already points to the data to protect,
there is no need to group things into "domain"s.

> possible use cases, supporting wildcard may be a problem. If we allow

I had some off-list discussions with Paul, and I ended up with the idea
of a user-specific wildcard (i.e. different users can have different
wildcards) + one global set of hazard pointers. However, it just occurred
to me that it won't quite work in this simple hazard pointer
implementation (one slot per CPU) :( Because you can have a user A's
hazptr critical section interrupted by a user B's interrupt handler, and
if both A & B are using customized wildcards but they don't know each
other, it's not going to work by setting either wildcard value into the
slot.

To make it clear for the discussion, we have two hazard pointer
implementations:

1. The fully-featured one [1], which allows users to provide memory for
   hazptr slots, so there is no issue with nesting/re-entry etc. And a
   wildcard doesn't make sense in this implementation.

2. The simple variant, which is what I've proposed in this thread; since
   it only has one slot per CPU, either all the users need to prevent
   re-entries or we need a global wildcard. Also, the readers of the
   simple variant need to disable preemption regardless, because there is
   only one hazptr slot to use. That means its read-side critical
   sections should usually be short.

I could try to use the fully-featured one in lockdep; what I would need
to do is create enough hazard_context structures so we have enough slots
for lockdep, and it may or may not need lockdep_recursion to prevent
re-entries. However, I still believe (or I don't have data to show
otherwise) that the simple variant with one slot per CPU + a global
wildcard will work fine in practice.

So what I would like to do is introduce the simple variant as a general
API with a global wildcard (without it, it cannot be a general API,
because one user would have to prevent entering another user's critical
section), and lockdep can use it. We can then monitor the delay of
synchronize_shazptr(), and if the wildcard becomes a problem, move to a
fully-featured hazptr implementation. Sounds like a plan?

[1]: https://lore.kernel.org/lkml/20240917143402.930114-2-boqun.feng@gmail.com/

Regards,
Boqun

> different sets of hazard pointers for different use cases, it will be less
> an issue. Anyway, maybe we should skip wildcard for the current case so that
> we have more time to think through it first.
>
> Cheers,
> Longman
>

^ permalink raw reply	[flat|nested] 35+ messages in thread
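For readers following along, a rough sketch of the "simple variant" being
discussed — one hazard pointer slot per CPU plus a global wildcard — could
look like the code below. Apart from synchronize_shazptr(), which is named
in the thread, every identifier here (SHAZPTR_WILDCARD, shazptr_protect(),
shazptr_clear()) is an illustrative assumption, not the proposed kernel
API, and the sketch omits the pointer re-validation step that a real
hazard pointer scheme performs after publishing:

	/*
	 * Illustrative sketch only -- not the proposed API.  One hazard
	 * pointer slot per CPU; a global wildcard value matches any pointer
	 * and is used when a nested user finds the slot already occupied.
	 * Readers run with preemption disabled, as discussed above.
	 */
	#define SHAZPTR_WILDCARD	((void *)1UL)

	static DEFINE_PER_CPU(void *, shazptr_slot);

	/* Reader: publish the pointer we are about to dereference. */
	static void *shazptr_protect(void *ptr)
	{
		void **slot = this_cpu_ptr(&shazptr_slot);
		void *old = READ_ONCE(*slot);

		/*
		 * The slot is already used by an outer critical section
		 * (we were entered from an IRQ/NMI): degrade to the
		 * wildcard so a concurrent synchronize_shazptr() still
		 * waits for this CPU.
		 */
		WRITE_ONCE(*slot, old ? SHAZPTR_WILDCARD : ptr);
		smp_mb();	/* Pairs with the scan in synchronize_shazptr(). */
		return old;	/* Caller passes this back to shazptr_clear(). */
	}

	static void shazptr_clear(void *old)
	{
		smp_mb();
		WRITE_ONCE(*this_cpu_ptr(&shazptr_slot), old);
	}

	/* Updater: wait until no CPU still protects @ptr (or the wildcard). */
	static void synchronize_shazptr(void *ptr)
	{
		int cpu;

		smp_mb();
		for_each_possible_cpu(cpu) {
			void **slot = per_cpu_ptr(&shazptr_slot, cpu);
			void *v;

			for (;;) {
				v = READ_ONCE(*slot);
				if (v != ptr && v != SHAZPTR_WILDCARD)
					break;
				cpu_relax();
			}
		}
	}

This also makes Paul's concern above concrete: every reader that degraded
to SHAZPTR_WILDCARD delays every synchronize_shazptr() caller, not just
the one whose pointer it actually holds.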
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 21:21 ` Boqun Feng @ 2025-03-31 21:47 ` Waiman Long 0 siblings, 0 replies; 35+ messages in thread From: Waiman Long @ 2025-03-31 21:47 UTC (permalink / raw) To: Boqun Feng, Waiman Long Cc: paulmck, Eric Dumazet, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren On 3/31/25 5:21 PM, Boqun Feng wrote: > On Mon, Mar 31, 2025 at 02:57:20PM -0400, Waiman Long wrote: >> On 3/31/25 2:33 PM, Paul E. McKenney wrote: >>> On Mon, Mar 31, 2025 at 01:33:22PM -0400, Waiman Long wrote: >>>> On 3/31/25 1:26 PM, Boqun Feng wrote: >>>>> On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: >>>>> [...] >>>>>>>> Anyway, that may work. The only problem that I see is the issue of nesting >>>>>>>> of an interrupt context on top of a task context. It is possible that the >>>>>>>> first use of a raw_spinlock may happen in an interrupt context. If the >>>>>>>> interrupt happens when the task has set the hazard pointer and iterating the >>>>>>>> hash list, the value of the hazard pointer may be overwritten. Alternatively >>>>>>>> we could have multiple slots for the hazard pointer, but that will make the >>>>>>>> code more complicated. Or we could disable interrupt before setting the >>>>>>>> hazard pointer. >>>>>>> Or we can use lockdep_recursion: >>>>>>> >>>>>>> preempt_disable(); >>>>>>> lockdep_recursion_inc(); >>>>>>> barrier(); >>>>>>> >>>>>>> WRITE_ONCE(*hazptr, ...); >>>>>>> >>>>>>> , it should prevent the re-entrant of lockdep in irq. >>>>>> That will probably work. Or we can disable irq. I am fine with both. >>>>> Disabling irq may not work in this case, because an NMI can also happen >>>>> and call register_lock_class(). >>>> Right, disabling irq doesn't work with NMI. So incrementing the recursion >>>> count is likely the way to go and I think it will work even in the NMI case. >>>> >>>>> I'm experimenting a new idea here, it might be better (for general >>>>> cases), and this has the similar spirit that we could move the >>>>> protection scope of a hazard pointer from a key to a hash_list: we can >>>>> introduce a wildcard address, and whenever we do a synchronize_hazptr(), >>>>> if the hazptr slot equal to wildcard, we treat as it matches to any ptr, >>>>> hence synchronize_hazptr() will still wait until it's zero'd. Not only >>>>> this could help in the nesting case, it can also be used if the users >>>>> want to protect multiple things with this simple hazard pointer >>>>> implementation. >>>> I think it is a good idea to add a wildcard for the general use case. >>>> Setting the hazptr to the list head will be enough for this particular case. >>> Careful! If we enable use of wildcards outside of the special case >>> of synchronize_hazptr(), we give up the small-memory-footprint advantages >>> of hazard pointers. You end up having to wait on all hazard-pointer >>> readers, which was exactly why RCU was troublesome here. ;-) > Technically, only the hazard-pointer readers that have switched to > wildcard mode because multiple hazptr critical sections ;-) > >> If the plan is to have one global set of hazard pointers for all the > A global set of hazard pointers for all the possible use cases is the > current plan (at least it should be when we have fully-featured hazptr > [1]). Because the hazard pointer value already points the the data to > protect, so no need to group things into "domain"s. 
> >> possible use cases, supporting wildcard may be a problem. If we allow > I had some off-list discussions with Paul, and I ended up with the idea > of user-specific wildcard (i.e. different users can have different > wildcards) + one global set of hazard pointers. However, it just occured > to me that it won'd quite work in this simple hazard pointer > implementation (one slot per-CPU) :( Because you can have a user A's > hazptr critical interrupted by a user B's interrupt handler, and if both > A & B are using customized wildcard but they don't know each other, it's > not going to work by setting either wildcard value into the slot. > > To make it clear for the discussion, we have two hazard pointer > implementations: > > 1. The fully-featured one [1], which allow users to provide memory for > hazptr slots, so no issue about nesting/re-entry etc. And wildcard > doesn't make sense in this implemenation. > > 2. The simple variant, which is what I've proposed in this thread, and > since it only has one slot per CPU, either all the users need to > prevent the re-entries or we need a global wildcard. Also the readers > of the simple variant need to disable preemption regardlessly because > it only has one hazptr slot to use. That means its read-side critical > section should be short usually. > > I could try to use the fully-featured one in lockdep, what I need to do > is creating enough hazard_context so we have enough slots for lockdep > and may or may not need lockdep_recursion to prevent reentries. However, > I still believe (or I don't have data to show otherwise) that the simple > variant with one slot per CPU + global wildcard will work fine in > practice. > > So what I would like to do is introducing the simple variant as a > general API with a global wildcard (because without it, it cannot be a > general API because one user have to prevent entering another user's > critical section), and lockdep can use it. And we can monitor the > delay of synchronize_shazptr() and if wildcard becomes a problem, move > to a fully-featured hazptr implementation. Sounds like a plan? > > [1]: https://lore.kernel.org/lkml/20240917143402.930114-2-boqun.feng@gmail.com/ Thank for the detailed explanation. I am looking forward to your new hazptr patch series. Cheers, Longman > > Regards, > Boqun > >> different sets of hazard pointers for different use cases, it will be less >> an issue. Anyway, maybe we should skip wildcard for the current case so that >> we have more time to think through it first. >> >> Cheers, >> Longman >> ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-31 17:26 ` Boqun Feng 2025-03-31 17:33 ` Waiman Long @ 2025-03-31 17:42 ` Eric Dumazet 1 sibling, 0 replies; 35+ messages in thread From: Eric Dumazet @ 2025-03-31 17:42 UTC (permalink / raw) To: Boqun Feng Cc: Waiman Long, Peter Zijlstra, Breno Leitao, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Mon, Mar 31, 2025 at 7:26 PM Boqun Feng <boqun.feng@gmail.com> wrote: > > On Wed, Mar 26, 2025 at 11:39:49AM -0400, Waiman Long wrote: > [...] > > > > Anyway, that may work. The only problem that I see is the issue of nesting > > > > of an interrupt context on top of a task context. It is possible that the > > > > first use of a raw_spinlock may happen in an interrupt context. If the > > > > interrupt happens when the task has set the hazard pointer and iterating the > > > > hash list, the value of the hazard pointer may be overwritten. Alternatively > > > > we could have multiple slots for the hazard pointer, but that will make the > > > > code more complicated. Or we could disable interrupt before setting the > > > > hazard pointer. > > > Or we can use lockdep_recursion: > > > > > > preempt_disable(); > > > lockdep_recursion_inc(); > > > barrier(); > > > > > > WRITE_ONCE(*hazptr, ...); > > > > > > , it should prevent the re-entrant of lockdep in irq. > > That will probably work. Or we can disable irq. I am fine with both. > > Disabling irq may not work in this case, because an NMI can also happen > and call register_lock_class(). > > I'm experimenting a new idea here, it might be better (for general > cases), and this has the similar spirit that we could move the > protection scope of a hazard pointer from a key to a hash_list: we can > introduce a wildcard address, and whenever we do a synchronize_hazptr(), > if the hazptr slot equal to wildcard, we treat as it matches to any ptr, > hence synchronize_hazptr() will still wait until it's zero'd. Not only > this could help in the nesting case, it can also be used if the users > want to protect multiple things with this simple hazard pointer > implementation. For the record this was my suggestion. Breno, what do you think ? Feel free to grab, test, add comments and a nice changelog, thanks ! 
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 67964dc4db952ea11d4b88554383ea0ec5946ef9..fb56bcb887ca163525f003cb7880c76511d166e7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -117,7 +117,12 @@ do {								\
 } while (0)

 extern void lockdep_register_key(struct lock_class_key *key);
-extern void lockdep_unregister_key(struct lock_class_key *key);
+
+extern void __lockdep_unregister_key(struct lock_class_key *key, bool sync);
+static inline void lockdep_unregister_key(struct lock_class_key *key)
+{
+	return __lockdep_unregister_key(key, true);
+}

 /*
  * These methods are used by specific locking variants (spinlocks,
@@ -372,8 +377,12 @@ static inline void lockdep_register_key(struct lock_class_key *key)
 {
 }

+static inline void __lockdep_unregister_key(struct lock_class_key *key, bool sync)
+{
+}
 static inline void lockdep_unregister_key(struct lock_class_key *key)
 {
+	return __lockdep_unregister_key(key, true);
 }

 #define lockdep_depth(tsk)	(0)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b15757e6362615aeecb1629f8e60375e13a87e6d..3b754c7fdaf19887734b1ee37f7c058f8f751a4e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6574,7 +6574,7 @@ void lockdep_reset_lock(struct lockdep_map *lock)
  * key irrespective of debug_locks to avoid potential invalid access to freed
  * memory in lock_class entry.
  */
-void lockdep_unregister_key(struct lock_class_key *key)
+void lockdep_unregister_key(struct lock_class_key *key, bool sync)
 {
 	struct hlist_head *hash_head = keyhashentry(key);
 	struct lock_class_key *k;
@@ -6610,8 +6610,11 @@ void lockdep_unregister_key(struct lock_class_key *key)
 	if (need_callback)
 		call_rcu(&delayed_free.rcu_head, free_zapped_rcu);

-	/* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
-	synchronize_rcu();
+	/* Wait until is_dynamic_key() has finished accessing k->hash_entry.
+	 * @sync is false for callers dealing with the sync themselves.
+	 */
+	if (sync)
+		synchronize_rcu();
 }
 EXPORT_SYMBOL_GPL(lockdep_unregister_key);

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 14ab2f4c190a1e201dd1788b413a06e799a829f2..9affef187237a13559642d435e1a196a909f75f9 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -988,8 +988,8 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	return sch;

 errout1:
-	lockdep_unregister_key(&sch->root_lock_key);
-	kfree(sch);
+	__lockdep_unregister_key(&sch->root_lock_key, false);
+	kfree_rcu(sch, rcu);
 errout:
 	return ERR_PTR(err);
 }
@@ -1077,7 +1077,7 @@ static void __qdisc_destroy(struct Qdisc *qdisc)
 	if (ops->destroy)
 		ops->destroy(qdisc);

-	lockdep_unregister_key(&qdisc->root_lock_key);
+	__lockdep_unregister_key(&qdisc->root_lock_key, false);

 	module_put(ops->owner);
 	netdev_put(dev, &qdisc->dev_tracker);

^ permalink raw reply	[flat|nested] 35+ messages in thread
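To spell out why the sch_generic.c hunks above can pass sync == false: the
grace period only has to cover is_dynamic_key() walking the key hash list,
so deferring the free of the object that embeds the key (kfree_rcu() of
the Qdisc) covers the same window without blocking the caller. The snippet
below is a simplified, hypothetical rendering of that reader side — the
real is_dynamic_key() in kernel/locking/lockdep.c has additional checks,
and the function name here is invented for illustration:

	/*
	 * Simplified sketch of the reader the grace period protects:
	 * the key hash list is walked under rcu_read_lock(), touching
	 * key->hash_entry.  As long as the memory embedding the key is
	 * only freed after a grace period (e.g. via kfree_rcu()), the
	 * walk stays safe even if the unregister path itself returns
	 * without calling synchronize_rcu().
	 */
	static bool is_dynamic_key_sketch(const struct lock_class_key *key)
	{
		struct hlist_head *hash_head = keyhashentry(key);
		struct lock_class_key *k;
		bool found = false;

		rcu_read_lock();
		hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
			if (k == key) {
				found = true;
				break;
			}
		}
		rcu_read_unlock();

		return found;
	}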
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-03-21 9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao 2025-03-21 10:37 ` Eric Dumazet 2025-03-24 12:12 ` Peter Zijlstra @ 2025-07-09 10:00 ` Breno Leitao 2025-07-09 13:57 ` Waiman Long 2 siblings, 1 reply; 35+ messages in thread From: Breno Leitao @ 2025-07-09 10:00 UTC (permalink / raw) To: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren, Paul E. McKenney Hello Waiman, Boqun, On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: > lockdep_unregister_key() is called from critical code paths, including > sections where rtnl_lock() is held. For example, when replacing a qdisc > in a network device, network egress traffic is disabled while > __qdisc_destroy() is called for every network queue. > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > which gets blocked waiting for synchronize_rcu() to complete. > > For example, a simple tc command to replace a qdisc could take 13 > seconds: > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > real 0m13.195s > user 0m0.001s > sys 0m2.746s > > During this time, network egress is completely frozen while waiting for > RCU synchronization. > > Use synchronize_rcu_expedited() instead to minimize the impact on > critical operations like network connectivity changes. > > This improves 10x the function call to tc, when replacing the qdisc for > a network card. > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > real 0m1.789s > user 0m0.000s > sys 0m1.613s Can I have this landed as a workaround for the problem above, while hazard pointers doesn't get merged? This is affecting some systems that runs the Linus' upstream kernel with some debug flags enabled, and I would like to have they unblocked. Once hazard pointer lands, this will be reverted. Is this a fair approach? Thanks for your help, --breno ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-07-09 10:00 ` Breno Leitao @ 2025-07-09 13:57 ` Waiman Long 2025-07-09 14:57 ` Boqun Feng 0 siblings, 1 reply; 35+ messages in thread From: Waiman Long @ 2025-07-09 13:57 UTC (permalink / raw) To: Breno Leitao, Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng Cc: aeh, linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On 7/9/25 6:00 AM, Breno Leitao wrote: > Hello Waiman, Boqun, > > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: >> lockdep_unregister_key() is called from critical code paths, including >> sections where rtnl_lock() is held. For example, when replacing a qdisc >> in a network device, network egress traffic is disabled while >> __qdisc_destroy() is called for every network queue. >> >> If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), >> which gets blocked waiting for synchronize_rcu() to complete. >> >> For example, a simple tc command to replace a qdisc could take 13 >> seconds: >> >> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq >> real 0m13.195s >> user 0m0.001s >> sys 0m2.746s >> >> During this time, network egress is completely frozen while waiting for >> RCU synchronization. >> >> Use synchronize_rcu_expedited() instead to minimize the impact on >> critical operations like network connectivity changes. >> >> This improves 10x the function call to tc, when replacing the qdisc for >> a network card. >> >> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq >> real 0m1.789s >> user 0m0.000s >> sys 0m1.613s > Can I have this landed as a workaround for the problem above, while > hazard pointers doesn't get merged? > > This is affecting some systems that runs the Linus' upstream kernel with > some debug flags enabled, and I would like to have they unblocked. > > Once hazard pointer lands, this will be reverted. Is this a fair > approach? > > Thanks for your help, > --breno I am fine with this patch going in as a temporary workaround. Boqun, what do you think? Cheers, Longman ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization 2025-07-09 13:57 ` Waiman Long @ 2025-07-09 14:57 ` Boqun Feng 0 siblings, 0 replies; 35+ messages in thread From: Boqun Feng @ 2025-07-09 14:57 UTC (permalink / raw) To: Waiman Long Cc: Breno Leitao, Peter Zijlstra, Ingo Molnar, Will Deacon, aeh, linux-kernel, netdev, edumazet, jhs, kernel-team, Erik Lundgren, Paul E. McKenney On Wed, Jul 09, 2025 at 09:57:44AM -0400, Waiman Long wrote: > On 7/9/25 6:00 AM, Breno Leitao wrote: > > Hello Waiman, Boqun, > > > > On Fri, Mar 21, 2025 at 02:30:49AM -0700, Breno Leitao wrote: > > > lockdep_unregister_key() is called from critical code paths, including > > > sections where rtnl_lock() is held. For example, when replacing a qdisc > > > in a network device, network egress traffic is disabled while > > > __qdisc_destroy() is called for every network queue. > > > > > > If lockdep is enabled, __qdisc_destroy() calls lockdep_unregister_key(), > > > which gets blocked waiting for synchronize_rcu() to complete. > > > > > > For example, a simple tc command to replace a qdisc could take 13 > > > seconds: > > > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > > real 0m13.195s > > > user 0m0.001s > > > sys 0m2.746s > > > > > > During this time, network egress is completely frozen while waiting for > > > RCU synchronization. > > > > > > Use synchronize_rcu_expedited() instead to minimize the impact on > > > critical operations like network connectivity changes. > > > > > > This improves 10x the function call to tc, when replacing the qdisc for > > > a network card. > > > > > > # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq > > > real 0m1.789s > > > user 0m0.000s > > > sys 0m1.613s > > Can I have this landed as a workaround for the problem above, while > > hazard pointers doesn't get merged? > > > > This is affecting some systems that runs the Linus' upstream kernel with > > some debug flags enabled, and I would like to have they unblocked. > > > > Once hazard pointer lands, this will be reverted. Is this a fair > > approach? > > > > Thanks for your help, > > --breno > > I am fine with this patch going in as a temporary workaround. Boqun, what do > you think? > Seems good to me. We can add a "// TODO" comment to call out it's a temporary workaround and should be replaced. Although, I believe Peter had some some concern about IPI, I would like to get his opinion on using synchronize_rcu_expedited() as a temporary solution. Peter? Regards, Boqun > Cheers, > Longman > ^ permalink raw reply [flat|nested] 35+ messages in thread
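For reference, the interim change being agreed on here would amount to
something like the following in lockdep_unregister_key() — a sketch of
the idea with the suggested TODO marker, not the exact committed hunk:

	/*
	 * Wait until is_dynamic_key() has finished accessing k->hash_entry.
	 *
	 * TODO: temporary workaround; replace with the hazard-pointer based
	 * scheme discussed in this thread once it lands.  Expedited so that
	 * callers such as __qdisc_destroy() under rtnl_lock() are not
	 * stalled for a full normal grace period.
	 */
	synchronize_rcu_expedited();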
Thread overview: 35+ messages
2025-03-21 9:30 [PATCH] lockdep: Speed up lockdep_unregister_key() with expedited RCU synchronization Breno Leitao
2025-03-21 10:37 ` Eric Dumazet
2025-03-21 14:22 ` Breno Leitao
2025-03-24 12:12 ` Peter Zijlstra
2025-03-24 12:23 ` Eric Dumazet
2025-03-24 12:24 ` Eric Dumazet
2025-03-24 19:21 ` Boqun Feng
2025-03-24 19:30 ` Boqun Feng
2025-03-25 0:47 ` Boqun Feng
2025-03-25 1:56 ` Waiman Long
2025-03-25 3:41 ` Boqun Feng
[not found] ` <934d794b-7ebc-422c-b4fe-3e658a2e5e7a@redhat.com>
2025-03-25 14:57 ` Waiman Long
2025-03-25 18:45 ` Boqun Feng
2025-03-25 19:23 ` Waiman Long
2025-03-25 19:42 ` Boqun Feng
2025-03-25 23:20 ` Waiman Long
2025-03-26 5:25 ` Boqun Feng
[not found] ` <df237702-55c3-466b-b51e-f3fe46ae03ba@redhat.com>
2025-03-26 16:40 ` Waiman Long
2025-03-26 16:47 ` Boqun Feng
2025-03-26 17:02 ` Waiman Long
2025-03-26 17:10 ` Paul E. McKenney
2025-03-26 18:42 ` Boqun Feng
2025-03-26 21:37 ` Paul E. McKenney
2025-03-31 16:48 ` Breno Leitao
2025-03-31 17:34 ` Boqun Feng
2025-03-31 17:26 ` Boqun Feng
2025-03-31 17:33 ` Waiman Long
2025-03-31 18:33 ` Paul E. McKenney
2025-03-31 18:57 ` Waiman Long
2025-03-31 21:21 ` Boqun Feng
2025-03-31 21:47 ` Waiman Long
2025-03-31 17:42 ` Eric Dumazet
2025-07-09 10:00 ` Breno Leitao
2025-07-09 13:57 ` Waiman Long
2025-07-09 14:57 ` Boqun Feng