* [PATCH 0/8] Introduce simple hazard pointers for lockdep
@ 2025-06-25 3:10 Boqun Feng
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
` (9 more replies)
0 siblings, 10 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Hi,
This is the official first version of simple hazard pointers following
the RFC:
https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/
I rebased it onto v6.16-rc3 and hope to get more feedback this time.
Thanks a lot to Breno Leitao for trying the RFC out and sharing the numbers.
I did an extra comparison this time, between the shazptr solution and
the synchronize_rcu_expedited() solution. In my test, during a run of
100 "tc qdisc replace" invocations:
* IPI rate with the shazptr solution: ~14 per second per core.
* IPI rate with synchronize_rcu_expedited(): ~140 per second per core.
(IPI results were from the 'CAL' line in /proc/interrupts)
This shows that while both solutions provide a similar speedup, the
shazptr solution avoids introducing a high IPI rate, unlike
synchronize_rcu_expedited().
Feedback is welcome, and please let me know if there is any concern or
suggestion. Thanks!
Regards,
Boqun
--------------------------------------
Please find the old performance below:
On my system (a 96-CPU VM), the results of:
time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
are (with lockdep enabled):
(without the patchset)
real 0m1.039s
user 0m0.001s
sys 0m0.069s
(with the patchset)
real 0m0.053s
user 0m0.000s
sys 0m0.051s
i.e. almost 20x speed-up.
For other comparisons between RCU and shazptr, here are the rcuscale
results (using the default configuration from
tools/testing/selftests/rcutorture/bin/kvm.sh):
RCU:
Average grace-period duration: 7470.02 microseconds
Minimum grace-period duration: 3981.6
50th percentile grace-period duration: 6002.73
90th percentile grace-period duration: 7008.93
99th percentile grace-period duration: 10015
Maximum grace-period duration: 142228
shazptr:
Average grace-period duration: 0.845825 microseconds
Minimum grace-period duration: 0.199
50th percentile grace-period duration: 0.585
90th percentile grace-period duration: 1.656
99th percentile grace-period duration: 3.872
Maximum grace-period duration: 3049.05
shazptr (skip_synchronize_self_scan=1, i.e. always wake up the scan
kthread):
Average grace-period duration: 467.861 microseconds
Minimum grace-period duration: 92.913
50th percentile grace-period duration: 440.691
90th percentile grace-period duration: 460.623
99th percentile grace-period duration: 650.068
Maximum grace-period duration: 5775.46
shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD):
Average grace-period duration: 599.569 microseconds
Minimum grace-period duration: 1.432
50th percentile grace-period duration: 582.631
90th percentile grace-period duration: 781.704
99th percentile grace-period duration: 1160.26
Maximum grace-period duration: 6727.53
shazptr_wildcard (skip_synchronize_self_scan=1):
Average grace-period duration: 460.466 microseconds
Minimum grace-period duration: 303.546
50th percentile grace-period duration: 424.334
90th percentile grace-period duration: 482.637
99th percentile grace-period duration: 600.214
Maximum grace-period duration: 4126.94
Boqun Feng (8):
Introduce simple hazard pointers
shazptr: Add refscale test
shazptr: Add refscale test for wildcard
shazptr: Avoid synchronize_shazptr() busy waiting
shazptr: Allow skipping self scan in synchronize_shazptr()
rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL
rcuscale: Add tests for simple hazard pointers
locking/lockdep: Use shazptr to protect the key hashlist
include/linux/shazptr.h | 73 +++++++++
kernel/locking/Makefile | 2 +-
kernel/locking/lockdep.c | 11 +-
kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++
kernel/rcu/rcuscale.c | 60 +++++++-
kernel/rcu/refscale.c | 77 ++++++++++
6 files changed, 534 insertions(+), 7 deletions(-)
create mode 100644 include/linux/shazptr.h
create mode 100644 kernel/locking/shazptr.c
--
2.39.5 (Apple Git-154)
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 10:00 ` Peter Zijlstra
` (2 more replies)
2025-06-25 3:10 ` [PATCH 2/8] shazptr: Add refscale test Boqun Feng
` (8 subsequent siblings)
9 siblings, 3 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
As its name suggests, simple hazard pointers (shazptr) is a
simplification of hazard pointers [1]: it has only one hazard pointer
slot per-CPU and is targeted for simple use cases where the read-side
already has preemption disabled. It trades the full feature set of a
normal hazard pointer implementation (multiple slots, dynamic slot
allocation, etc.) for simplicity in that narrow use scenario.
Since there's only one slot per-CPU, shazptr read-side critical section
nesting is a problem that needs to be resolved, because at the very
least, interrupts and NMIs can introduce nested shazptr read-side
critical sections. A SHAZPTR_WILDCARD is introduced to resolve this:
SHAZPTR_WILDCARD is a special address value that blocks *all* shazptr
waiters. In the interrupt-induced nesting case (i.e. an interrupt
happens while the per-CPU hazard pointer slot is in use and the
interrupt handler tries to acquire a hazard pointer itself), the inner
critical section switches the value of the hazard pointer slot to
SHAZPTR_WILDCARD and lets the outer critical section eventually zero the
slot. The SHAZPTR_WILDCARD still provides the correct protection because
it blocks all the waiters.
It's true that once the wildcard mechanism is activated, the shazptr
mechanism may be downgraded to something similar to RCU (and probably
with a worse implementation), which generally has a longer wait time and
a larger memory footprint compared to a typical hazard pointer
implementation. However, that can only happen with a lot of users using
hazard pointers, and at that point it's reasonable to introduce the
fully-featured hazard pointer implementation [2] and switch users to it.
Note that shazptr_protect() may be added later; the current potential
usage doesn't require it, and a shazptr_acquire(), which installs the
protected value into the hazard pointer slot and provides the smp_mb(),
is enough for now.
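As a rough usage sketch (mirroring the lockdep usage at the end of this
series; "bucket", "obj" and do_something_with() are made-up names for
illustration), where the protected address is a stable hash-list head:
	/* Reader, with preemption already disabled. */
	guard(shazptr)(bucket);
	hlist_for_each_entry_rcu(obj, bucket, node)
		do_something_with(obj);	/* @obj cannot be freed here. */
	/* Updater, removing and freeing an entry. */
	hlist_del_rcu(&obj->node);
	synchronize_shazptr(bucket);	/* Wait for readers of @bucket. */
	kfree(obj);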
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Link: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/ [2]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
include/linux/shazptr.h | 73 ++++++++++++++++++++++++++++++++++++++++
kernel/locking/Makefile | 2 +-
kernel/locking/shazptr.c | 29 ++++++++++++++++
3 files changed, 103 insertions(+), 1 deletion(-)
create mode 100644 include/linux/shazptr.h
create mode 100644 kernel/locking/shazptr.c
diff --git a/include/linux/shazptr.h b/include/linux/shazptr.h
new file mode 100644
index 000000000000..287cd04b4be9
--- /dev/null
+++ b/include/linux/shazptr.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Simple hazard pointers
+ *
+ * Copyright (c) 2025, Microsoft Corporation.
+ *
+ * Author: Boqun Feng <boqun.feng@gmail.com>
+ *
+ * A simple variant of hazard pointers: users must ensure that preemption is
+ * already disabled when calling shazptr_acquire() to protect an address. If
+ * one shazptr_acquire() is called after another shazptr_acquire() has been
+ * called, but before the corresponding shazptr_clear() has been called, the
+ * later shazptr_acquire() must be cleared first.
+ *
+ * The most suitable usage is when only one address needs to be protected in a
+ * preemption-disabled critical section.
+ */
+
+#ifndef _LINUX_SHAZPTR_H
+#define _LINUX_SHAZPTR_H
+
+#include <linux/cleanup.h>
+#include <linux/percpu.h>
+
+/* Make ULONG_MAX the wildcard value */
+#define SHAZPTR_WILDCARD ((void *)(ULONG_MAX))
+
+DECLARE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
+
+/* Represent a held hazard pointer slot */
+struct shazptr_guard {
+ void **slot;
+ bool use_wildcard;
+};
+
+/*
+ * Acquire a hazptr slot and begin the hazard pointer critical section.
+ *
+ * Must be called with preemption disabled, and preemption must remain disabled
+ * until shazptr_clear().
+ */
+static inline struct shazptr_guard shazptr_acquire(void *ptr)
+{
+ struct shazptr_guard guard = {
+ /* Preemption is disabled. */
+ .slot = this_cpu_ptr(&shazptr_slots),
+ .use_wildcard = false,
+ };
+
+ if (likely(!READ_ONCE(*guard.slot))) {
+ WRITE_ONCE(*guard.slot, ptr);
+ } else {
+ guard.use_wildcard = true;
+ WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
+ }
+
+ smp_mb(); /* Synchronize with smp_mb() at synchronize_shazptr(). */
+
+ return guard;
+}
+
+static inline void shazptr_clear(struct shazptr_guard guard)
+{
+ /* Only clear the slot when the outermost guard is released */
+ if (likely(!guard.use_wildcard))
+ smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
+}
+
+void synchronize_shazptr(void *ptr);
+
+DEFINE_CLASS(shazptr, struct shazptr_guard, shazptr_clear(_T),
+ shazptr_acquire(ptr), void *ptr);
+#endif
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index a114949eeed5..1517076c98ec 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
# and is generally not a function of system call inputs.
KCOV_INSTRUMENT := n
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o shazptr.o
# Avoid recursion lockdep -> sanitizer -> ... -> lockdep & improve performance.
KASAN_SANITIZE_lockdep.o := n
diff --git a/kernel/locking/shazptr.c b/kernel/locking/shazptr.c
new file mode 100644
index 000000000000..991fd1a05cfd
--- /dev/null
+++ b/kernel/locking/shazptr.c
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Simple hazard pointers
+ *
+ * Copyright (c) 2025, Microsoft Corporation.
+ *
+ * Author: Boqun Feng <boqun.feng@gmail.com>
+ */
+
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+#include <linux/shazptr.h>
+
+DEFINE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
+EXPORT_PER_CPU_SYMBOL_GPL(shazptr_slots);
+
+void synchronize_shazptr(void *ptr)
+{
+ int cpu;
+
+ smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
+ for_each_possible_cpu(cpu) {
+ void **slot = per_cpu_ptr(&shazptr_slots, cpu);
+ /* Pair with smp_store_release() in shazptr_clear(). */
+ smp_cond_load_acquire(slot,
+ VAL != ptr && VAL != SHAZPTR_WILDCARD);
+ }
+}
+EXPORT_SYMBOL_GPL(synchronize_shazptr);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 2/8] shazptr: Add refscale test
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 10:02 ` Peter Zijlstra
2025-06-25 3:10 ` [PATCH 3/8] shazptr: Add refscale test for wildcard Boqun Feng
` (7 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Add the refscale test for shazptr to measure the reader side
performance.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/rcu/refscale.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index f11a7c2af778..154520e4ee4c 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -29,6 +29,7 @@
#include <linux/reboot.h>
#include <linux/sched.h>
#include <linux/seq_buf.h>
+#include <linux/shazptr.h>
#include <linux/spinlock.h>
#include <linux/smp.h>
#include <linux/stat.h>
@@ -890,6 +891,43 @@ static const struct ref_scale_ops typesafe_seqlock_ops = {
.name = "typesafe_seqlock"
};
+static void ref_shazptr_read_section(const int nloops)
+{
+ int i;
+
+ for (i = nloops; i >= 0; i--) {
+ preempt_disable();
+ { guard(shazptr)(ref_shazptr_read_section); }
+ preempt_enable();
+ }
+}
+
+static void ref_shazptr_delay_section(const int nloops, const int udl, const int ndl)
+{
+ int i;
+
+ for (i = nloops; i >= 0; i--) {
+ preempt_disable();
+ {
+ guard(shazptr)(ref_shazptr_delay_section);
+ un_delay(udl, ndl);
+ }
+ preempt_enable();
+ }
+}
+
+static bool ref_shazptr_init(void)
+{
+ return true;
+}
+
+static const struct ref_scale_ops shazptr_ops = {
+ .init = ref_shazptr_init,
+ .readsection = ref_shazptr_read_section,
+ .delaysection = ref_shazptr_delay_section,
+ .name = "shazptr"
+};
+
static void rcu_scale_one_reader(void)
{
if (readdelay <= 0)
@@ -1197,6 +1235,7 @@ ref_scale_init(void)
&refcnt_ops, &rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops,
&acqrel_ops, &sched_clock_ops, &clock_ops, &jiffies_ops,
&typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops,
+ &shazptr_ops,
};
if (!torture_init_begin(scale_type, verbose))
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 3/8] shazptr: Add refscale test for wildcard
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
2025-06-25 3:10 ` [PATCH 2/8] shazptr: Add refscale test Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 10:03 ` Peter Zijlstra
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting Boqun Feng
` (6 subsequent siblings)
9 siblings, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Add the refscale test for the shazptr wildcard, which starts another
shazptr critical section inside an existing one to measure the
reader-side performance when the wildcard logic is triggered.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/rcu/refscale.c | 40 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 39 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
index 154520e4ee4c..fdbb4a2c91fe 100644
--- a/kernel/rcu/refscale.c
+++ b/kernel/rcu/refscale.c
@@ -928,6 +928,44 @@ static const struct ref_scale_ops shazptr_ops = {
.name = "shazptr"
};
+static void ref_shazptr_wc_read_section(const int nloops)
+{
+ int i;
+
+ for (i = nloops; i >= 0; i--) {
+ preempt_disable();
+ {
+ guard(shazptr)(ref_shazptr_read_section);
+ /* Trigger wildcard logic */
+ guard(shazptr)(ref_shazptr_wc_read_section);
+ }
+ preempt_enable();
+ }
+}
+
+static void ref_shazptr_wc_delay_section(const int nloops, const int udl, const int ndl)
+{
+ int i;
+
+ for (i = nloops; i >= 0; i--) {
+ preempt_disable();
+ {
+ guard(shazptr)(ref_shazptr_delay_section);
+ /* Trigger wildcard logic */
+ guard(shazptr)(ref_shazptr_wc_delay_section);
+ un_delay(udl, ndl);
+ }
+ preempt_enable();
+ }
+}
+
+static const struct ref_scale_ops shazptr_wildcard_ops = {
+ .init = ref_shazptr_init,
+ .readsection = ref_shazptr_wc_read_section,
+ .delaysection = ref_shazptr_wc_delay_section,
+ .name = "shazptr_wildcard"
+};
+
static void rcu_scale_one_reader(void)
{
if (readdelay <= 0)
@@ -1235,7 +1273,7 @@ ref_scale_init(void)
&refcnt_ops, &rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops,
&acqrel_ops, &sched_clock_ops, &clock_ops, &jiffies_ops,
&typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops,
- &shazptr_ops,
+ &shazptr_ops, &shazptr_wildcard_ops,
};
if (!torture_init_begin(scale_type, verbose))
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (2 preceding siblings ...)
2025-06-25 3:10 ` [PATCH 3/8] shazptr: Add refscale test for wildcard Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 11:40 ` Peter Zijlstra
` (2 more replies)
2025-06-25 3:10 ` [PATCH 5/8] shazptr: Allow skipping self scan in synchronize_shazptr() Boqun Feng
` (5 subsequent siblings)
9 siblings, 3 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
For a general-purpose hazard pointer implementation, always busy
waiting is not an option. It may benefit some special workloads, but
overall it hurts system performance when more and more users begin to
call synchronize_shazptr(). Therefore, avoid busy waiting for hazard
pointer slot changes by using a scan kthread: each synchronize_shazptr()
queues itself if a quick scan shows it is blocked by some slots.
A simple optimization is done inside the scan: each
synchronize_shazptr() tracks which CPUs (or CPU groups, if nr_cpu_ids >
BITS_PER_LONG) are blocking it, and the scan function updates this
information for each synchronize_shazptr() (via its shazptr_wait)
individually. In this way, synchronize_shazptr() doesn't need to wait
for a single scan that shows all slots as non-blocking; it is enough
that every blocking slot has been observed as non-blocking at least
once across scans.
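As an illustration (the numbers are only an example): with nr_cpu_ids =
256 and BITS_PER_LONG = 64, cpu_grp_size = DIV_ROUND_UP(256, 64) = 4,
so CPU n maps to bit (n / 4) in blocking_grp_mask. If a waiter starts
with bits 0 and 2 set (some of CPUs 0-3 and 8-11 are blocking it), one
scan may observe CPUs 0-3 as non-blocking and clear bit 0, and a later
scan may clear bit 2; the waiter completes once its blocking_grp_mask
reaches zero, even if no single scan observed all slots as non-blocking
at the same time.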
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/locking/shazptr.c | 277 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 276 insertions(+), 1 deletion(-)
diff --git a/kernel/locking/shazptr.c b/kernel/locking/shazptr.c
index 991fd1a05cfd..a8559cb559f8 100644
--- a/kernel/locking/shazptr.c
+++ b/kernel/locking/shazptr.c
@@ -7,18 +7,243 @@
* Author: Boqun Feng <boqun.feng@gmail.com>
*/
+#define pr_fmt(fmt) "shazptr: " fmt
+
#include <linux/atomic.h>
#include <linux/cpumask.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
#include <linux/shazptr.h>
+#include <linux/slab.h>
+#include <linux/sort.h>
DEFINE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
EXPORT_PER_CPU_SYMBOL_GPL(shazptr_slots);
-void synchronize_shazptr(void *ptr)
+/* Wait structure for synchronize_shazptr(). */
+struct shazptr_wait {
+ struct list_head list;
+ /* Which groups of CPUs are blocking. */
+ unsigned long blocking_grp_mask;
+ void *ptr;
+ struct completion done;
+};
+
+/* Snapshot for hazptr slot. */
+struct shazptr_snapshot {
+ unsigned long ptr;
+ unsigned long grp_mask;
+};
+
+static inline int
+shazptr_snapshot_cmp(const void *a, const void *b)
+{
+ const struct shazptr_snapshot *snap_a = (struct shazptr_snapshot *)a;
+ const struct shazptr_snapshot *snap_b = (struct shazptr_snapshot *)b;
+
+ if (snap_a->ptr > snap_b->ptr)
+ return 1;
+ else if (snap_a->ptr < snap_b->ptr)
+ return -1;
+ else
+ return 0;
+}
+
+/* *In-place* merge @n snapshots together based on ->ptr and accumulate the ->grp_mask. */
+static int shazptr_snapshot_merge(struct shazptr_snapshot *snaps, int n)
+{
+ int new, i;
+
+ /* Sort first. */
+ sort(snaps, n, sizeof(*snaps), shazptr_snapshot_cmp, NULL);
+
+ new = 0;
+
+ /* Skip NULLs. */
+ for (i = 0; i < n; i++) {
+ if (snaps[i].ptr)
+ break;
+ }
+
+ while (i < n) {
+ /* Start with a new address. */
+ snaps[new] = snaps[i];
+
+ for (; i < n; i++) {
+ /* Merge if the next one has the same address. */
+ if (snaps[new].ptr == snaps[i].ptr) {
+ snaps[new].grp_mask |= snaps[i].grp_mask;
+ } else
+ break;
+ }
+
+ /*
+ * Either the end has been reached or need to start with a new
+ * record.
+ */
+ new++;
+ }
+
+ return new;
+}
+
+/*
+ * Calculate which group is still blocking @ptr, this assumes the @snaps is
+ * already merged.
+ */
+static unsigned long
+shazptr_snapshot_blocking_grp_mask(struct shazptr_snapshot *snaps,
+ int n, void *ptr)
+{
+ unsigned long mask = 0;
+
+ if (!n)
+ return mask;
+ else if (snaps[n-1].ptr == (unsigned long)SHAZPTR_WILDCARD) {
+ /*
+ * Take SHAZPTR_WILDCARD slots, which is ULONG_MAX, into
+ * consideration if any.
+ */
+ mask = snaps[n-1].grp_mask;
+ }
+
+ /* TODO: binary search if n is big. */
+ for (int i = 0; i < n; i++) {
+ if (snaps[i].ptr == (unsigned long)ptr) {
+ mask |= snaps[i].grp_mask;
+ break;
+ }
+ }
+
+ return mask;
+}
+
+/* Scan structure for synchronize_shazptr(). */
+struct shazptr_scan {
+ /* The scan kthread */
+ struct task_struct *thread;
+
+ /* Wait queue for the scan kthread */
+ struct swait_queue_head wq;
+
+ /* Whether the scan kthread has been scheduled to scan */
+ bool scheduled;
+
+ /* The lock protecting ->queued and ->scheduled */
+ struct mutex lock;
+
+ /* List of queued synchronize_shazptr() request. */
+ struct list_head queued;
+
+ int cpu_grp_size;
+
+ /* List of scanning synchronize_shazptr() request. */
+ struct list_head scanning;
+
+ /* Buffer used for hazptr slot scan, nr_cpu_ids slots*/
+ struct shazptr_snapshot* snaps;
+};
+
+static struct shazptr_scan shazptr_scan;
+
+static void shazptr_do_scan(struct shazptr_scan *scan)
+{
+ int cpu;
+ int snaps_len;
+ struct shazptr_wait *curr, *next;
+
+ scoped_guard(mutex, &scan->lock) {
+ /* Move from ->queued to ->scanning. */
+ list_splice_tail_init(&scan->queued, &scan->scanning);
+ }
+
+ memset(scan->snaps, 0, sizeof(struct shazptr_snapshot) * nr_cpu_ids);
+
+ for_each_possible_cpu(cpu) {
+ void **slot = per_cpu_ptr(&shazptr_slots, cpu);
+ void *val;
+
+ /* Pair with smp_store_release() in shazptr_clear(). */
+ val = smp_load_acquire(slot);
+
+ scan->snaps[cpu].ptr = (unsigned long)val;
+ scan->snaps[cpu].grp_mask = 1UL << (cpu / scan->cpu_grp_size);
+ }
+
+ snaps_len = shazptr_snapshot_merge(scan->snaps, nr_cpu_ids);
+
+ /* Only one thread can access ->scanning, so can be lockless. */
+ list_for_each_entry_safe(curr, next, &scan->scanning, list) {
+ /* Accumulate the shazptr slot scan result. */
+ curr->blocking_grp_mask &=
+ shazptr_snapshot_blocking_grp_mask(scan->snaps,
+ snaps_len,
+ curr->ptr);
+
+ if (curr->blocking_grp_mask == 0) {
+ /* All slots have been observed as non-blocking at least once. */
+ list_del(&curr->list);
+ complete(&curr->done);
+ }
+ }
+}
+
+static int __noreturn shazptr_scan_kthread(void *unused)
+{
+ for (;;) {
+ swait_event_idle_exclusive(shazptr_scan.wq,
+ READ_ONCE(shazptr_scan.scheduled));
+
+ shazptr_do_scan(&shazptr_scan);
+
+ scoped_guard(mutex, &shazptr_scan.lock) {
+ if (list_empty(&shazptr_scan.queued) &&
+ list_empty(&shazptr_scan.scanning))
+ shazptr_scan.scheduled = false;
+ }
+ }
+}
+
+static int __init shazptr_scan_init(void)
+{
+ struct shazptr_scan *scan = &shazptr_scan;
+ struct task_struct *t;
+
+ init_swait_queue_head(&scan->wq);
+ mutex_init(&scan->lock);
+ INIT_LIST_HEAD(&scan->queued);
+ INIT_LIST_HEAD(&scan->scanning);
+ scan->scheduled = false;
+
+ /* Group CPUs into at most BITS_PER_LONG groups. */
+ scan->cpu_grp_size = DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG);
+
+ scan->snaps = kcalloc(nr_cpu_ids, sizeof(scan->snaps[0]), GFP_KERNEL);
+
+ if (scan->snaps) {
+ t = kthread_run(shazptr_scan_kthread, NULL, "shazptr_scan");
+ if (!IS_ERR(t)) {
+ smp_store_release(&scan->thread, t);
+ /* Kthread creation succeeds */
+ return 0;
+ } else {
+ kfree(scan->snaps);
+ }
+ }
+
+ pr_info("Failed to create the scan thread, only busy waits\n");
+ return 0;
+}
+core_initcall(shazptr_scan_init);
+
+static void synchronize_shazptr_busywait(void *ptr)
{
int cpu;
smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
+
for_each_possible_cpu(cpu) {
void **slot = per_cpu_ptr(&shazptr_slots, cpu);
/* Pair with smp_store_release() in shazptr_clear(). */
@@ -26,4 +251,54 @@ void synchronize_shazptr(void *ptr)
VAL != ptr && VAL != SHAZPTR_WILDCARD);
}
}
+
+static void synchronize_shazptr_normal(void *ptr)
+{
+ int cpu;
+ unsigned long blocking_grp_mask = 0;
+
+ smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
+
+ for_each_possible_cpu(cpu) {
+ void **slot = per_cpu_ptr(&shazptr_slots, cpu);
+ void *val;
+
+ /* Pair with smp_store_release() in shazptr_clear(). */
+ val = smp_load_acquire(slot);
+
+ if (val == ptr || val == SHAZPTR_WILDCARD)
+ blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
+ }
+
+ /* Found blocking slots, prepare to wait. */
+ if (blocking_grp_mask) {
+ struct shazptr_scan *scan = &shazptr_scan;
+ struct shazptr_wait wait = {
+ .blocking_grp_mask = blocking_grp_mask,
+ };
+
+ INIT_LIST_HEAD(&wait.list);
+ init_completion(&wait.done);
+
+ scoped_guard(mutex, &scan->lock) {
+ list_add_tail(&wait.list, &scan->queued);
+
+ if (!scan->scheduled) {
+ WRITE_ONCE(scan->scheduled, true);
+ swake_up_one(&shazptr_scan.wq);
+ }
+ }
+
+ wait_for_completion(&wait.done);
+ }
+}
+
+void synchronize_shazptr(void *ptr)
+{
+ /* Busy waiting if the scan kthread has not been created. */
+ if (!smp_load_acquire(&shazptr_scan.thread))
+ synchronize_shazptr_busywait(ptr);
+ else
+ synchronize_shazptr_normal(ptr);
+}
EXPORT_SYMBOL_GPL(synchronize_shazptr);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 5/8] shazptr: Allow skipping self scan in synchronize_shazptr()
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (3 preceding siblings ...)
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 3:10 ` [PATCH 6/8] rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL Boqun Feng
` (4 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Add a module parameter for shazptr to allow skipping the self scan in
synchronize_shazptr(). This can force every synchronize_shazptr() to use
the shazptr scan kthread, which helps with testing the scan kthread.
Another reason users may want to set this parameter is to reduce the
self-scan CPU cost in synchronize_shazptr().
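With MODULE_PARAM_PREFIX set to "shazptr.", this should be settable as
"shazptr.skip_synchronize_self_scan=1" on the kernel command line, and,
since the parameter is 0644, at runtime via
/sys/module/shazptr/parameters/skip_synchronize_self_scan.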
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/locking/shazptr.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/shazptr.c b/kernel/locking/shazptr.c
index a8559cb559f8..b3f7e8390eb2 100644
--- a/kernel/locking/shazptr.c
+++ b/kernel/locking/shazptr.c
@@ -14,11 +14,17 @@
#include <linux/completion.h>
#include <linux/kthread.h>
#include <linux/list.h>
+#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/shazptr.h>
#include <linux/slab.h>
#include <linux/sort.h>
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "shazptr."
+
DEFINE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
EXPORT_PER_CPU_SYMBOL_GPL(shazptr_slots);
@@ -252,6 +258,10 @@ static void synchronize_shazptr_busywait(void *ptr)
}
}
+/* Disabled by default. */
+static int skip_synchronize_self_scan;
+module_param(skip_synchronize_self_scan, int, 0644);
+
static void synchronize_shazptr_normal(void *ptr)
{
int cpu;
@@ -259,15 +269,19 @@ static void synchronize_shazptr_normal(void *ptr)
smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
- for_each_possible_cpu(cpu) {
- void **slot = per_cpu_ptr(&shazptr_slots, cpu);
- void *val;
+ if (unlikely(skip_synchronize_self_scan)) {
+ blocking_grp_mask = ~0UL;
+ } else {
+ for_each_possible_cpu(cpu) {
+ void **slot = per_cpu_ptr(&shazptr_slots, cpu);
+ void *val;
- /* Pair with smp_store_release() in shazptr_clear(). */
- val = smp_load_acquire(slot);
+ /* Pair with smp_store_release() in shazptr_clear(). */
+ val = smp_load_acquire(slot);
- if (val == ptr || val == SHAZPTR_WILDCARD)
- blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
+ if (val == ptr || val == SHAZPTR_WILDCARD)
+ blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
+ }
}
/* Found blocking slots, prepare to wait. */
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 6/8] rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (4 preceding siblings ...)
2025-06-25 3:10 ` [PATCH 5/8] shazptr: Allow skipping self scan in synchronize_shazptr() Boqun Feng
@ 2025-06-25 3:10 ` Boqun Feng
2025-06-25 3:11 ` [PATCH 7/8] rcuscale: Add tests for simple hazard pointers Boqun Feng
` (3 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:10 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
For synchronization mechanisms similar to RCU, there may be no "grace
period" concept (e.g. hazard pointers), therefore allow
rcu_scale_ops::get_gp_seq to be a NULL pointer for these cases, and
simply treat the started and finished grace-period counts as 0.
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/rcu/rcuscale.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index b521d0455992..45413a73d61e 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -568,8 +568,10 @@ rcu_scale_writer(void *arg)
if (gp_exp) {
b_rcu_gp_test_started =
cur_ops->exp_completed() / 2;
- } else {
+ } else if (cur_ops->get_gp_seq) {
b_rcu_gp_test_started = cur_ops->get_gp_seq();
+ } else {
+ b_rcu_gp_test_started = 0;
}
}
@@ -625,9 +627,11 @@ rcu_scale_writer(void *arg)
if (gp_exp) {
b_rcu_gp_test_finished =
cur_ops->exp_completed() / 2;
- } else {
+ } else if (cur_ops->get_gp_seq) {
b_rcu_gp_test_finished =
cur_ops->get_gp_seq();
+ } else {
+ b_rcu_gp_test_finished = 0;
}
if (shutdown) {
smp_mb(); /* Assign before wake. */
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 7/8] rcuscale: Add tests for simple hazard pointers
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (5 preceding siblings ...)
2025-06-25 3:10 ` [PATCH 6/8] rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL Boqun Feng
@ 2025-06-25 3:11 ` Boqun Feng
2025-06-25 3:11 ` [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist Boqun Feng
` (2 subsequent siblings)
9 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:11 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Add two rcu_scale_ops to include tests for simple hazard pointers
(shazptr). One uses evenly distributed readers, and the other uses
all-WILDCARD readers. These show the best-case and worst-case scenarios
for the synchronization time of simple hazard pointers.
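For example, the new ops should be selectable with something along the
lines of (assuming the usual rcuscale scale_type module parameter):
	tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale \
		--bootargs "rcuscale.scale_type=shazptr"
and "rcuscale.scale_type=shazptr_wildcard" for the wildcard variant.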
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/rcu/rcuscale.c | 52 ++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 51 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 45413a73d61e..357431bf802b 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -32,6 +32,7 @@
#include <linux/freezer.h>
#include <linux/cpu.h>
#include <linux/delay.h>
+#include <linux/shazptr.h>
#include <linux/stat.h>
#include <linux/srcu.h>
#include <linux/slab.h>
@@ -429,6 +430,54 @@ static struct rcu_scale_ops tasks_tracing_ops = {
#endif // #else // #ifdef CONFIG_TASKS_TRACE_RCU
+static int shazptr_scale_read_lock(void)
+{
+ long cpu = raw_smp_processor_id();
+
+ /* Use cpu + 1 as the key */
+ guard(shazptr)((void *)(cpu + 1));
+
+ return 0;
+}
+
+static int shazptr_scale_wc_read_lock(void)
+{
+ guard(shazptr)(SHAZPTR_WILDCARD);
+
+ return 0;
+}
+
+
+static void shazptr_scale_read_unlock(int idx)
+{
+ /* Do nothing, it's OK since readers are doing back-to-back lock+unlock*/
+}
+
+static void shazptr_scale_sync(void)
+{
+ long cpu = raw_smp_processor_id();
+
+ synchronize_shazptr((void *)(cpu + 1));
+}
+
+static struct rcu_scale_ops shazptr_ops = {
+ .ptype = RCU_FLAVOR,
+ .readlock = shazptr_scale_read_lock,
+ .readunlock = shazptr_scale_read_unlock,
+ .sync = shazptr_scale_sync,
+ .exp_sync = shazptr_scale_sync,
+ .name = "shazptr"
+};
+
+static struct rcu_scale_ops shazptr_wc_ops = {
+ .ptype = RCU_FLAVOR,
+ .readlock = shazptr_scale_wc_read_lock,
+ .readunlock = shazptr_scale_read_unlock,
+ .sync = shazptr_scale_sync,
+ .exp_sync = shazptr_scale_sync,
+ .name = "shazptr_wildcard"
+};
+
static unsigned long rcuscale_seq_diff(unsigned long new, unsigned long old)
{
if (!cur_ops->gp_diff)
@@ -1090,7 +1139,8 @@ rcu_scale_init(void)
long i;
long j;
static struct rcu_scale_ops *scale_ops[] = {
- &rcu_ops, &srcu_ops, &srcud_ops, TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS
+ &rcu_ops, &srcu_ops, &srcud_ops, &shazptr_ops, &shazptr_wc_ops,
+ TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS
};
if (!torture_init_begin(scale_type, verbose))
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (6 preceding siblings ...)
2025-06-25 3:11 ` [PATCH 7/8] rcuscale: Add tests for simple hazard pointers Boqun Feng
@ 2025-06-25 3:11 ` Boqun Feng
2025-06-25 11:59 ` Peter Zijlstra
2025-07-10 14:06 ` Breno Leitao
2025-06-25 12:05 ` [PATCH 0/8] Introduce simple hazard pointers for lockdep Christoph Hellwig
2025-06-25 12:25 ` Mathieu Desnoyers
9 siblings, 2 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 3:11 UTC (permalink / raw)
To: linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Erik Lundgren and Breno Leitao reported [1] a case where
lockdep_unregister_key() can be called from time-critical code paths
where rtnl_lock() may be held, and the synchronize_rcu() in it can slow
down operations such as using tc to replace a qdisc in a network device.
In fact, the synchronize_rcu() in lockdep_unregister_key() only waits
for all is_dynamic_key() callers to finish before a key is removed from
the key hashlist, so we can use shazptr to protect the hashlist as well.
Compared to the proposed solution, which replaces synchronize_rcu() with
synchronize_rcu_expedited(), using shazptr here can achieve the same or
better synchronization time without the need to send IPIs. Hence use
shazptr here.
Reported-by: Erik Lundgren <elundgren@meta.com>
Reported-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
---
kernel/locking/lockdep.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index dd2bbf73718b..5c205dd425f8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -58,6 +58,7 @@
#include <linux/context_tracking.h>
#include <linux/console.h>
#include <linux/kasan.h>
+#include <linux/shazptr.h>
#include <asm/sections.h>
@@ -1267,14 +1268,18 @@ static bool is_dynamic_key(const struct lock_class_key *key)
hash_head = keyhashentry(key);
- rcu_read_lock();
+ /* Need preemption disable for using shazptr. */
+ guard(preempt)();
+
+ /* Protect the list search with shazptr. */
+ guard(shazptr)(hash_head);
+
hlist_for_each_entry_rcu(k, hash_head, hash_entry) {
if (k == key) {
found = true;
break;
}
}
- rcu_read_unlock();
return found;
}
@@ -6620,7 +6625,7 @@ void lockdep_unregister_key(struct lock_class_key *key)
call_rcu(&delayed_free.rcu_head, free_zapped_rcu);
/* Wait until is_dynamic_key() has finished accessing k->hash_entry. */
- synchronize_rcu();
+ synchronize_shazptr(keyhashentry(key));
}
EXPORT_SYMBOL_GPL(lockdep_unregister_key);
--
2.39.5 (Apple Git-154)
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
@ 2025-06-25 10:00 ` Peter Zijlstra
2025-06-25 14:25 ` Mathieu Desnoyers
2025-06-25 15:52 ` Waiman Long
2 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 10:00 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:10:54PM -0700, Boqun Feng wrote:
> As its name suggests, simple hazard pointers (shazptr) is a
> simplification of hazard pointers [1]: it has only one hazard pointer
> slot per-CPU and is targeted for simple use cases where the read-side
> already has preemption disabled. It trades the full feature set of a
> normal hazard pointer implementation (multiple slots, dynamic slot
> allocation, etc.) for simplicity in that narrow use scenario.
>
> Since there's only one slot per-CPU, shazptr read-side critical section
> nesting is a problem that needs to be resolved, because at the very
> least, interrupts and NMIs can introduce nested shazptr read-side
> critical sections. A SHAZPTR_WILDCARD is introduced to resolve this:
> SHAZPTR_WILDCARD is a special address value that blocks *all* shazptr
> waiters. In the interrupt-induced nesting case (i.e. an interrupt
> happens while the per-CPU hazard pointer slot is in use and the
> interrupt handler tries to acquire a hazard pointer itself), the inner
> critical section switches the value of the hazard pointer slot to
> SHAZPTR_WILDCARD and lets the outer critical section eventually zero the
> slot. The SHAZPTR_WILDCARD still provides the correct protection because
> it blocks all the waiters.
Don't we typically name such a thing a tombstone?
> It's true that once the wildcard mechanism is activated, the shazptr
> mechanism may be downgraded to something similar to RCU (and probably
> with a worse implementation), which generally has a longer wait time and
> a larger memory footprint compared to a typical hazard pointer
> implementation. However, that can only happen with a lot of users using
> hazard pointers, and at that point it's reasonable to introduce the
> fully-featured hazard pointer implementation [2] and switch users to it.
>
> Note that shazptr_protect() may be added later; the current potential
> usage doesn't require it, and a shazptr_acquire(), which installs the
> protected value into the hazard pointer slot and provides the smp_mb(),
> is enough for now.
>
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> lock-free objects," in IEEE Transactions on Parallel and
> Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
>
> Link: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/ [2]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/shazptr.h | 73 ++++++++++++++++++++++++++++++++++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/shazptr.c | 29 ++++++++++++++++
> 3 files changed, 103 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/shazptr.h
> create mode 100644 kernel/locking/shazptr.c
>
> diff --git a/include/linux/shazptr.h b/include/linux/shazptr.h
> new file mode 100644
> index 000000000000..287cd04b4be9
> --- /dev/null
> +++ b/include/linux/shazptr.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Simple hazard pointers
> + *
> + * Copyright (c) 2025, Microsoft Corporation.
> + *
> + * Author: Boqun Feng <boqun.feng@gmail.com>
> + *
> + * A simple variant of hazard pointers: users must ensure that preemption is
> + * already disabled when calling shazptr_acquire() to protect an address. If
> + * one shazptr_acquire() is called after another shazptr_acquire() has been
> + * called, but before the corresponding shazptr_clear() has been called, the
> + * later shazptr_acquire() must be cleared first.
> + *
> + * The most suitable usage is when only one address needs to be protected in a
> + * preemption-disabled critical section.
It might be useful to have some example code included here to illustrate
how this is supposed to be used etc.
> + */
> +
> +#ifndef _LINUX_SHAZPTR_H
> +#define _LINUX_SHAZPTR_H
> +
> +#include <linux/cleanup.h>
> +#include <linux/percpu.h>
> +
> +/* Make ULONG_MAX the wildcard value */
> +#define SHAZPTR_WILDCARD ((void *)(ULONG_MAX))
Right, I typically write that like: ((void *)-1L) or ((void *)~0UL)
> +
> +DECLARE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
> +
> +/* Represent a held hazard pointer slot */
> +struct shazptr_guard {
> + void **slot;
> + bool use_wildcard;
> +};
Natural alignment ensures the LSB of that pointer is 0, which is enough
space to stick that bool in, no?
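I.e., something like (untested sketch):
	struct shazptr_guard { unsigned long val; };	/* slot | use_wildcard */
	guard.val = (unsigned long)slot | use_wildcard;	/* acquire */
	slot = (void **)(guard.val & ~1UL);		/* clear */
	use_wildcard = guard.val & 1UL;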
> +
> +/*
> + * Acquire a hazptr slot and begin the hazard pointer critical section.
> + *
> + * Must be called with preemption disabled, and preemption must remain disabled
> + * until shazptr_clear().
> + */
> +static inline struct shazptr_guard shazptr_acquire(void *ptr)
> +{
> + struct shazptr_guard guard = {
> + /* Preemption is disabled. */
> + .slot = this_cpu_ptr(&shazptr_slots),
What you're trying to say with that comment is that: this_cpu_ptr(),
will complain if preemption is not already disabled, and as such this
verifies the assumption?
You can also add:
lockdep_assert_preemption_disabled();
at the start of this function and then all these comments can go in the
bin, no?
> + .use_wildcard = false,
> + };
> +
> + if (likely(!READ_ONCE(*guard.slot))) {
> + WRITE_ONCE(*guard.slot, ptr);
> + } else {
> + guard.use_wildcard = true;
> + WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
> + }
> +
> + smp_mb(); /* Synchronize with smp_mb() at synchronize_shazptr(). */
> +
> + return guard;
> +}
> +
> +static inline void shazptr_clear(struct shazptr_guard guard)
> +{
> + /* Only clear the slot when the outermost guard is released */
> + if (likely(!guard.use_wildcard))
> + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
> +}
> +
> +void synchronize_shazptr(void *ptr);
> +
> +DEFINE_CLASS(shazptr, struct shazptr_guard, shazptr_clear(_T),
> + shazptr_acquire(ptr), void *ptr);
> +#endif
> diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
> index a114949eeed5..1517076c98ec 100644
> --- a/kernel/locking/Makefile
> +++ b/kernel/locking/Makefile
> @@ -3,7 +3,7 @@
> # and is generally not a function of system call inputs.
> KCOV_INSTRUMENT := n
>
> -obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
> +obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o shazptr.o
>
> # Avoid recursion lockdep -> sanitizer -> ... -> lockdep & improve performance.
> KASAN_SANITIZE_lockdep.o := n
> diff --git a/kernel/locking/shazptr.c b/kernel/locking/shazptr.c
> new file mode 100644
> index 000000000000..991fd1a05cfd
> --- /dev/null
> +++ b/kernel/locking/shazptr.c
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Simple hazard pointers
> + *
> + * Copyright (c) 2025, Microsoft Corporation.
> + *
> + * Author: Boqun Feng <boqun.feng@gmail.com>
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/cpumask.h>
> +#include <linux/shazptr.h>
> +
> +DEFINE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
> +EXPORT_PER_CPU_SYMBOL_GPL(shazptr_slots);
> +
> +void synchronize_shazptr(void *ptr)
> +{
> + int cpu;
lockdep_assert_preemption_enabled();
> +
> + smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
> + for_each_possible_cpu(cpu) {
> + void **slot = per_cpu_ptr(&shazptr_slots, cpu);
> + /* Pair with smp_store_release() in shazptr_clear(). */
> + smp_cond_load_acquire(slot,
> + VAL != ptr && VAL != SHAZPTR_WILDCARD);
> + }
> +}
> +EXPORT_SYMBOL_GPL(synchronize_shazptr);
> --
> 2.39.5 (Apple Git-154)
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 2/8] shazptr: Add refscale test
2025-06-25 3:10 ` [PATCH 2/8] shazptr: Add refscale test Boqun Feng
@ 2025-06-25 10:02 ` Peter Zijlstra
0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 10:02 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:10:55PM -0700, Boqun Feng wrote:
> Add the refscale test for shazptr to measure the reader side
> performance.
>
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> kernel/rcu/refscale.c | 39 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/rcu/refscale.c b/kernel/rcu/refscale.c
> index f11a7c2af778..154520e4ee4c 100644
> --- a/kernel/rcu/refscale.c
> +++ b/kernel/rcu/refscale.c
> @@ -29,6 +29,7 @@
> #include <linux/reboot.h>
> #include <linux/sched.h>
> #include <linux/seq_buf.h>
> +#include <linux/shazptr.h>
> #include <linux/spinlock.h>
> #include <linux/smp.h>
> #include <linux/stat.h>
> @@ -890,6 +891,43 @@ static const struct ref_scale_ops typesafe_seqlock_ops = {
> .name = "typesafe_seqlock"
> };
>
> +static void ref_shazptr_read_section(const int nloops)
> +{
> + int i;
> +
> + for (i = nloops; i >= 0; i--) {
> + preempt_disable();
> + { guard(shazptr)(ref_shazptr_read_section); }
> + preempt_enable();
scoped_guard (preempt)
guard(shazptr)(ref_shazptr_read_section);
> + }
> +}
> +
> +static void ref_shazptr_delay_section(const int nloops, const int udl, const int ndl)
> +{
> + int i;
> +
> + for (i = nloops; i >= 0; i--) {
> + preempt_disable();
> + {
> + guard(shazptr)(ref_shazptr_delay_section);
> + un_delay(udl, ndl);
> + }
> + preempt_enable();
scoped_guard (preempt) {
guard(shazptr)(ref_shazptr_delay_section);
un_delay(udl, ndl);
}
> + }
> +}
> +
> +static bool ref_shazptr_init(void)
> +{
> + return true;
> +}
> +
> +static const struct ref_scale_ops shazptr_ops = {
> + .init = ref_shazptr_init,
> + .readsection = ref_shazptr_read_section,
> + .delaysection = ref_shazptr_delay_section,
> + .name = "shazptr"
> +};
> +
> static void rcu_scale_one_reader(void)
> {
> if (readdelay <= 0)
> @@ -1197,6 +1235,7 @@ ref_scale_init(void)
> &refcnt_ops, &rwlock_ops, &rwsem_ops, &lock_ops, &lock_irq_ops,
> &acqrel_ops, &sched_clock_ops, &clock_ops, &jiffies_ops,
> &typesafe_ref_ops, &typesafe_lock_ops, &typesafe_seqlock_ops,
> + &shazptr_ops,
> };
>
> if (!torture_init_begin(scale_type, verbose))
> --
> 2.39.5 (Apple Git-154)
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 3/8] shazptr: Add refscale test for wildcard
2025-06-25 3:10 ` [PATCH 3/8] shazptr: Add refscale test for wildcard Boqun Feng
@ 2025-06-25 10:03 ` Peter Zijlstra
0 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 10:03 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:10:56PM -0700, Boqun Feng wrote:
> +static void ref_shazptr_wc_read_section(const int nloops)
> +{
> + int i;
> +
> + for (i = nloops; i >= 0; i--) {
> + preempt_disable();
> + {
> + guard(shazptr)(ref_shazptr_read_section);
> + /* Trigger wildcard logic */
> + guard(shazptr)(ref_shazptr_wc_read_section);
> + }
> + preempt_enable();
> + }
> +}
> +
> +static void ref_shazptr_wc_delay_section(const int nloops, const int udl, const int ndl)
> +{
> + int i;
> +
> + for (i = nloops; i >= 0; i--) {
> + preempt_disable();
> + {
> + guard(shazptr)(ref_shazptr_delay_section);
> + /* Trigger wildcard logic */
> + guard(shazptr)(ref_shazptr_wc_delay_section);
> + un_delay(udl, ndl);
> + }
> + preempt_enable();
> + }
> +}
Same as the last one, scoped_guard (preempt) can replace the preempt
things and the scope.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting Boqun Feng
@ 2025-06-25 11:40 ` Peter Zijlstra
2025-06-25 11:56 ` Peter Zijlstra
2025-06-25 13:56 ` Frederic Weisbecker
2 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 11:40 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:10:57PM -0700, Boqun Feng wrote:
> +/* Scan structure for synchronize_shazptr(). */
> +struct shazptr_scan {
> + /* The scan kthread */
> + struct task_struct *thread;
> +
> + /* Wait queue for the scan kthread */
> + struct swait_queue_head wq;
> +
> + /* Whether the scan kthread has been scheduled to scan */
> + bool scheduled;
> +
> + /* The lock protecting ->queued and ->scheduled */
> + struct mutex lock;
> +
> + /* List of queued synchronize_shazptr() request. */
> + struct list_head queued;
> +
> + int cpu_grp_size;
> +
> + /* List of scanning synchronize_shazptr() request. */
> + struct list_head scanning;
> +
> + /* Buffer used for hazptr slot scan, nr_cpu_ids slots*/
> + struct shazptr_snapshot* snaps;
> +};
I find this style very hard to read, also the order of things is weird.
struct shazptr_scan {
struct task_struct *thread;
struct swait_queue_head wq;
struct list_head scanning;
struct mutex lock;
struct list_head queued; /* __guarded_by(lock) */
bool scheduled; /* __guarded_by(lock) */
struct shazptr_snapshot snaps[0] __counted_by(nr_cpu_ids);
};
(the __guarded_by() thing will come with Thread-Safety support that
Google is still cooking in clang)
> +static struct shazptr_scan shazptr_scan;
And then make this a pointer, and allocate the whole thing as a single
data structure with however much snaps data you need.
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shazptr() busy waiting Boqun Feng
2025-06-25 11:40 ` Peter Zijlstra
@ 2025-06-25 11:56 ` Peter Zijlstra
2025-06-25 13:56 ` Frederic Weisbecker
2 siblings, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 11:56 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
Response is a bit weird because non-linear editing..
On Tue, Jun 24, 2025 at 08:10:57PM -0700, Boqun Feng wrote:
> + /* Whether the scan kthread has been scheduled to scan */
> + bool scheduled;
> +static int __noreturn shazptr_scan_kthread(void *unused)
> +{
> + for (;;) {
> + swait_event_idle_exclusive(shazptr_scan.wq,
> + READ_ONCE(shazptr_scan.scheduled));
This seems weird; why use a whole wait-queue, in exclusive mode no less,
for something that is one known thread.
Also, I think this thing needs to be FREEZABLE, otherwise suspend might
have issues.
Why not just write it like:
for (;;) {
set_current_state(TASK_IDLE | TASK_FREEZABLE);
if (!list_empty(&scan->queue))
break;
schedule();
}
__set_current_state(TASK_RUNNABLE);
for (;;) {
scoped_guard (mutex, scan->lock) {
if (list_empty(scan->queued) &&
list_empty(scan->scanning))
break;
}
shazptr_do_scan(scan);
}
> + shazptr_do_scan(&shazptr_scan);
> +
> + scoped_guard(mutex, &shazptr_scan.lock) {
> + if (list_empty(&shazptr_scan.queued) &&
> + list_empty(&shazptr_scan.scanning))
> + shazptr_scan.scheduled = false;
This condition, why can't we directly use this condition instead of
scheduled?
> + }
> + }
> +}
> +static void synchronize_shazptr_normal(void *ptr)
> +{
> +
> + /* Found blocking slots, prepare to wait. */
> + if (blocking_grp_mask) {
> + struct shazptr_scan *scan = &shazptr_scan;
> + struct shazptr_wait wait = {
> + .blocking_grp_mask = blocking_grp_mask,
> + };
> +
> + INIT_LIST_HEAD(&wait.list);
> + init_completion(&wait.done);
> +
> + scoped_guard(mutex, &scan->lock) {
> + list_add_tail(&wait.list, &scan->queued);
> +
> + if (!scan->scheduled) {
> + WRITE_ONCE(scan->scheduled, true);
> + swake_up_one(&shazptr_scan.wq);
> + }
Or perhaps; just write this like:
bool was_empty = list_empty(&scan->queued);
list_add_tail(&wait.list, &scan->queued);
if (was_empty)
wake_up_process(scan->thread);
> + }
> +
> + wait_for_completion(&wait.done);
> + }
> +}
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist
2025-06-25 3:11 ` [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist Boqun Feng
@ 2025-06-25 11:59 ` Peter Zijlstra
2025-06-25 14:18 ` Boqun Feng
2025-07-10 14:06 ` Breno Leitao
1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2025-06-25 11:59 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:11:01PM -0700, Boqun Feng wrote:
> + /* Need preemption disable for using shazptr. */
> + guard(preempt)();
> +
> + /* Protect the list search with shazptr. */
> + guard(shazptr)(hash_head);
OK, this is the end of the series, and so far every single user is doing
both a preempt and a shazptr guard. Why can't we simplify this and have
the shazptr guard imply preempt-disable?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (7 preceding siblings ...)
2025-06-25 3:11 ` [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist Boqun Feng
@ 2025-06-25 12:05 ` Christoph Hellwig
2025-06-25 14:08 ` Boqun Feng
2025-06-25 12:25 ` Mathieu Desnoyers
9 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2025-06-25 12:05 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Tue, Jun 24, 2025 at 08:10:53PM -0700, Boqun Feng wrote:
> Hi,
>
> This is the official first version of simple hazard pointers following
> the RFC:
Can you please put an explanation of what hazard pointers are
prominently into this cover letter?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
` (8 preceding siblings ...)
2025-06-25 12:05 ` [PATCH 0/8] Introduce simple hazard pointers for lockdep Christoph Hellwig
@ 2025-06-25 12:25 ` Mathieu Desnoyers
2025-06-25 13:21 ` Boqun Feng
9 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-06-25 12:25 UTC (permalink / raw)
To: Boqun Feng, linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Breno Leitao, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On 2025-06-24 23:10, Boqun Feng wrote:
> Hi,
>
> This is the official first version of simple hazard pointers following
> the RFC:
>
> https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/
>
> I rebase it onto v6.16-rc3 and hope to get more feedback this time.
>
> Thanks a lot for Breno Leitao to try the RFC out and share the numbers.
>
> I did an extra comparison this time, between the shazptr solution and
> the synchronize_rcu_expedited() solution. In my test, during a 100 times
> "tc qdisc replace" run:
>
> * IPI rate with the shazptr solution: ~14 per second per core.
> * IPI rate with synchronize_rcu_expedited(): ~140 per second per core.
>
> (IPI results were from the 'CAL' line in /proc/interrupt)
>
> This shows that while both solutions have the similar speedup, shazptr
> solution avoids the introduce of high IPI rate compared to
> synchronize_rcu_expedited().
>
> Feedback is welcome and please let know if there is any concern or
> suggestion. Thanks!
Hi Boqun,
What is unclear to me is what is the delta wrt:
https://lore.kernel.org/lkml/20241008135034.1982519-4-mathieu.desnoyers@efficios.com/
and whether this helper against compiler optimizations would still be needed here:
https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
Thanks,
Mathieu
>
> Regards,
> Boqun
>
> --------------------------------------
> Please find the old performance below:
>
> On my system (a 96-cpu VMs), the results of:
>
> time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
>
> are (with lockdep enabled):
>
> (without the patchset)
> real 0m1.039s
> user 0m0.001s
> sys 0m0.069s
>
> (with the patchset)
> real 0m0.053s
> user 0m0.000s
> sys 0m0.051s
>
> i.e. almost 20x speed-up.
>
> Other comparisons between RCU and shazptr, the rcuscale results (using
> default configuration from
> tools/testing/selftests/rcutorture/bin/kvm.sh):
>
> RCU:
>
> Average grace-period duration: 7470.02 microseconds
> Minimum grace-period duration: 3981.6
> 50th percentile grace-period duration: 6002.73
> 90th percentile grace-period duration: 7008.93
> 99th percentile grace-period duration: 10015
> Maximum grace-period duration: 142228
>
> shazptr:
>
> Average grace-period duration: 0.845825 microseconds
> Minimum grace-period duration: 0.199
> 50th percentile grace-period duration: 0.585
> 90th percentile grace-period duration: 1.656
> 99th percentile grace-period duration: 3.872
> Maximum grace-period duration: 3049.05
>
> shazptr (skip_synchronize_self_scan=1, i.e. always let scan kthread to
> wakeup):
>
> Average grace-period duration: 467.861 microseconds
> Minimum grace-period duration: 92.913
> 50th percentile grace-period duration: 440.691
> 90th percentile grace-period duration: 460.623
> 99th percentile grace-period duration: 650.068
> Maximum grace-period duration: 5775.46
>
> shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD):
>
> Average grace-period duration: 599.569 microseconds
> Minimum grace-period duration: 1.432
> 50th percentile grace-period duration: 582.631
> 90th percentile grace-period duration: 781.704
> 99th percentile grace-period duration: 1160.26
> Maximum grace-period duration: 6727.53
>
> shazptr_wildcard (skip_synchronize_self_scan=1):
>
> Average grace-period duration: 460.466 microseconds
> Minimum grace-period duration: 303.546
> 50th percentile grace-period duration: 424.334
> 90th percentile grace-period duration: 482.637
> 99th percentile grace-period duration: 600.214
> Maximum grace-period duration: 4126.94
>
> Boqun Feng (8):
> Introduce simple hazard pointers
> shazptr: Add refscale test
> shazptr: Add refscale test for wildcard
> shazptr: Avoid synchronize_shaptr() busy waiting
> shazptr: Allow skip self scan in synchronize_shaptr()
> rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL
> rcuscale: Add tests for simple hazard pointers
> locking/lockdep: Use shazptr to protect the key hashlist
>
> include/linux/shazptr.h | 73 +++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/lockdep.c | 11 +-
> kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++
> kernel/rcu/rcuscale.c | 60 +++++++-
> kernel/rcu/refscale.c | 77 ++++++++++
> 6 files changed, 534 insertions(+), 7 deletions(-)
> create mode 100644 include/linux/shazptr.h
> create mode 100644 kernel/locking/shazptr.c
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-25 12:25 ` Mathieu Desnoyers
@ 2025-06-25 13:21 ` Boqun Feng
0 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 13:21 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Breno Leitao, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On Wed, Jun 25, 2025 at 08:25:52AM -0400, Mathieu Desnoyers wrote:
> On 2025-06-24 23:10, Boqun Feng wrote:
> > Hi,
> >
> > This is the official first version of simple hazard pointers following
> > the RFC:
> >
> > https://lore.kernel.org/lkml/20250414060055.341516-1-boqun.feng@gmail.com/
> >
> > I rebase it onto v6.16-rc3 and hope to get more feedback this time.
> >
> > Thanks a lot for Breno Leitao to try the RFC out and share the numbers.
> >
> > I did an extra comparison this time, between the shazptr solution and
> > the synchronize_rcu_expedited() solution. In my test, during a 100 times
> > "tc qdisc replace" run:
> >
> > * IPI rate with the shazptr solution: ~14 per second per core.
> > * IPI rate with synchronize_rcu_expedited(): ~140 per second per core.
> >
> > (IPI results were from the 'CAL' line in /proc/interrupt)
> >
> > This shows that while both solutions have the similar speedup, shazptr
> > solution avoids the introduce of high IPI rate compared to
> > synchronize_rcu_expedited().
> >
> > Feedback is welcome and please let know if there is any concern or
> > suggestion. Thanks!
>
> Hi Boqun,
>
> What is unclear to me is what is the delta wrt:
>
> https://lore.kernel.org/lkml/20241008135034.1982519-4-mathieu.desnoyers@efficios.com/
>
shazptr is closer to the general hazptr I proposed earlier:
https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/
It's aimed at being a global facility, therefore no hazptr_domain is
needed, plus it supports non-busy-waiting synchronize_shazptr() from the
beginning.
> and whether this helper against compiler optimizations would still be needed here:
>
> https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
>
For the current user, no, but eventually we are going to need it.
Regards,
Boqun
> Thanks,
>
> Mathieu
>
> >
> > Regards,
> > Boqun
> >
> > --------------------------------------
> > Please find the old performance below:
> >
> > On my system (a 96-cpu VMs), the results of:
> >
> > time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> >
> > are (with lockdep enabled):
> >
> > (without the patchset)
> > real 0m1.039s
> > user 0m0.001s
> > sys 0m0.069s
> >
> > (with the patchset)
> > real 0m0.053s
> > user 0m0.000s
> > sys 0m0.051s
> >
> > i.e. almost 20x speed-up.
> >
> > Other comparisons between RCU and shazptr, the rcuscale results (using
> > default configuration from
> > tools/testing/selftests/rcutorture/bin/kvm.sh):
> >
> > RCU:
> >
> > Average grace-period duration: 7470.02 microseconds
> > Minimum grace-period duration: 3981.6
> > 50th percentile grace-period duration: 6002.73
> > 90th percentile grace-period duration: 7008.93
> > 99th percentile grace-period duration: 10015
> > Maximum grace-period duration: 142228
> >
> > shazptr:
> >
> > Average grace-period duration: 0.845825 microseconds
> > Minimum grace-period duration: 0.199
> > 50th percentile grace-period duration: 0.585
> > 90th percentile grace-period duration: 1.656
> > 99th percentile grace-period duration: 3.872
> > Maximum grace-period duration: 3049.05
> >
> > shazptr (skip_synchronize_self_scan=1, i.e. always let scan kthread to
> > wakeup):
> >
> > Average grace-period duration: 467.861 microseconds
> > Minimum grace-period duration: 92.913
> > 50th percentile grace-period duration: 440.691
> > 90th percentile grace-period duration: 460.623
> > 99th percentile grace-period duration: 650.068
> > Maximum grace-period duration: 5775.46
> >
> > shazptr_wildcard (i.e. readers always use SHAZPTR_WILDCARD):
> >
> > Average grace-period duration: 599.569 microseconds
> > Minimum grace-period duration: 1.432
> > 50th percentile grace-period duration: 582.631
> > 90th percentile grace-period duration: 781.704
> > 99th percentile grace-period duration: 1160.26
> > Maximum grace-period duration: 6727.53
> >
> > shazptr_wildcard (skip_synchronize_self_scan=1):
> >
> > Average grace-period duration: 460.466 microseconds
> > Minimum grace-period duration: 303.546
> > 50th percentile grace-period duration: 424.334
> > 90th percentile grace-period duration: 482.637
> > 99th percentile grace-period duration: 600.214
> > Maximum grace-period duration: 4126.94
> >
> > Boqun Feng (8):
> > Introduce simple hazard pointers
> > shazptr: Add refscale test
> > shazptr: Add refscale test for wildcard
> > shazptr: Avoid synchronize_shaptr() busy waiting
> > shazptr: Allow skip self scan in synchronize_shaptr()
> > rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL
> > rcuscale: Add tests for simple hazard pointers
> > locking/lockdep: Use shazptr to protect the key hashlist
> >
> > include/linux/shazptr.h | 73 +++++++++
> > kernel/locking/Makefile | 2 +-
> > kernel/locking/lockdep.c | 11 +-
> > kernel/locking/shazptr.c | 318 +++++++++++++++++++++++++++++++++++++++
> > kernel/rcu/rcuscale.c | 60 +++++++-
> > kernel/rcu/refscale.c | 77 ++++++++++
> > 6 files changed, 534 insertions(+), 7 deletions(-)
> > create mode 100644 include/linux/shazptr.h
> > create mode 100644 kernel/locking/shazptr.c
> >
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] shazptr: Avoid synchronize_shaptr() busy waiting
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shaptr() busy waiting Boqun Feng
2025-06-25 11:40 ` Peter Zijlstra
2025-06-25 11:56 ` Peter Zijlstra
@ 2025-06-25 13:56 ` Frederic Weisbecker
2025-06-25 15:24 ` Boqun Feng
2 siblings, 1 reply; 34+ messages in thread
From: Frederic Weisbecker @ 2025-06-25 13:56 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Neeraj Upadhyay, Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
Le Tue, Jun 24, 2025 at 08:10:57PM -0700, Boqun Feng a écrit :
> +static void synchronize_shazptr_normal(void *ptr)
> +{
> + int cpu;
> + unsigned long blocking_grp_mask = 0;
> +
> + smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
> +
> + for_each_possible_cpu(cpu) {
> + void **slot = per_cpu_ptr(&shazptr_slots, cpu);
> + void *val;
> +
> + /* Pair with smp_store_release() in shazptr_clear(). */
> + val = smp_load_acquire(slot);
> +
> + if (val == ptr || val == SHAZPTR_WILDCARD)
> + blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
> + }
> +
> + /* Found blocking slots, prepare to wait. */
> + if (blocking_grp_mask) {
synchronize_rcu() here would be enough since all users have preemption disabled.
But I guess this defeats the performance purpose? (If so this might need a
comment somewhere).
I guess blocking_grp_mask is to avoid allocating a cpumask (again for
performance purposes? So I guess synchronize_shazptr_normal() has some perf
expectations?)
One possibility is to have the ptr contained in:
	struct hazptr {
		void *ptr;
		struct cpumask scan_mask;
	};

And then the caller could simply scan those remaining CPUs itself without
relying on the kthread.

But I'm sure there are good reasons for not doing that :-)
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-25 12:05 ` [PATCH 0/8] Introduce simple hazard pointers for lockdep Christoph Hellwig
@ 2025-06-25 14:08 ` Boqun Feng
2025-06-26 10:16 ` Christoph Hellwig
0 siblings, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 14:08 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Wed, Jun 25, 2025 at 05:05:11AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 24, 2025 at 08:10:53PM -0700, Boqun Feng wrote:
> > Hi,
> >
> > This is the official first version of simple hazard pointers following
> > the RFC:
>
> Can you please put an explanation of what hazard pointers are
> prominently into this cover letter?
>
Sure, I will put one in the future version; here is the gist:

Hazard pointers provide similar synchronization behavior to RCU:
readers are cheap, and updaters need to wait for existing readers to go
away before they can free the objects.

The difference between hazard pointers and RCU is that instead of
waiting for a grace period, during which all readers have to exit their
RCU read-side critical sections, the updaters of hazard pointers only
need to wait for the readers that are accessing the objects they are
about to free. For example, if we have 2 readers accessing different
objects and 1 updater freeing one of them:
using RCU:

    Reader 1                        Reader 2                        Updater
    ========                        ========                        =======
    rcu_read_lock();
    r = rcu_dereference(a);
                                    rcu_read_lock();
                                    r = rcu_dereference(b);
                                                                    synchronize_rcu();
    rcu_read_unlock();
                                    rcu_read_unlock();
                                                                    <synchronize_rcu() returns>
                                                                    free(a);
The updater will need to wait for reader 2 to finish before it can
free 'a'; however, when using hazard pointers:
    Reader 1                        Reader 2                        Updater
    ========                        ========                        =======
    g = shazptr_acquire(a);
                                    g = shazptr_acquire(b);
                                                                    synchronize_shazptr(a);
    shazptr_clear(g);
                                                                    <synchronize_shazptr(a) returns>
                                                                    free(a);
                                    shazptr_clear(g); // <- updater doesn't
                                                      //    need to wait for
                                                      //    this.
The updater's wait can finish immediately if no one is accessing 'a';
in other words, it doesn't need to wait for reader 2.

This means that, for a particular workload, hazard pointers may have a
smaller memory footprint and less updater wait time compared to RCU,
while still having similar performance on the reader side.
That being said, it does come with some cost: in the general case,
readers need to provide their own hazard pointer slots (allocating
memory). And in the simple hazard pointer implementation in this series,
although readers don't need to provide their own hazard pointer slots,
they need to disable preemption to use the hazard pointer, and the
performance would degrade (probably to that of a naive SRCU
implementation) if they want to protect multiple objects in one
read-side critical section.
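As a rough sketch of how this simple version is meant to be used (the
'struct entry' type and the find_entry()/remove_entry() names below are
made up for illustration; the guard and synchronize primitives are the
ones from this series), a reader publishes the address it is about to
traverse, and the removal path waits only for readers that published
that same address:

	#include <linux/shazptr.h>
	#include <linux/preempt.h>
	#include <linux/rculist.h>
	#include <linux/slab.h>

	struct entry {
		struct hlist_node node;
		void *key;
	};

	static bool find_entry(struct hlist_head *head, void *key)
	{
		struct entry *e;

		guard(preempt)();	/* shazptr requires preemption disabled */
		guard(shazptr)(head);	/* publish the address being protected */

		hlist_for_each_entry(e, head, node)
			if (e->key == key)
				return true;

		return false;
	}

	static void remove_entry(struct hlist_head *head, struct entry *e)
	{
		hlist_del_rcu(&e->node);	/* unpublish @e from the list first */
		synchronize_shazptr(head);	/* wait only for readers of this list */
		kfree(e);
	}

This mirrors what the lockdep patch at the end of the series does with
the key hashlist.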
Regards,
Boqun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist
2025-06-25 11:59 ` Peter Zijlstra
@ 2025-06-25 14:18 ` Boqun Feng
0 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 14:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: linux-kernel, rcu, lkmm, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Wed, Jun 25, 2025 at 01:59:29PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 24, 2025 at 08:11:01PM -0700, Boqun Feng wrote:
>
> > + /* Need preemption disable for using shazptr. */
> > + guard(preempt)();
> > +
> > + /* Protect the list search with shazptr. */
> > + guard(shazptr)(hash_head);
>
> OK, this is the end of the series, and so far every single user is doing
> both a preempt and a shazptr guard. Why can't we simplify this and have
> the shazptr guard imply preempt-disable?
You're right. The background story is that, in the beginning, the hazard
pointer protection was placed at the callsites of is_dynamic_key(): one
in register_lock_class() and one in lockdep_init_map_type(), and in
register_lock_class() I could use the fact that it's called with irqs
disabled to save the preempt-disable.

So given the current users, it makes sense to fold the preempt disabling
into the shazptr guard.
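Something like the following should do it (a rough sketch using the
cleanup.h DEFINE_CLASS() helper; not necessarily how the guard ends up
being defined, and assuming shazptr_acquire()/shazptr_clear() keep their
current signatures):

	DEFINE_CLASS(shazptr, struct shazptr_guard,
		     /* destructor: drop the hazard pointer, then re-enable preemption */
		     ({ shazptr_clear(_T); preempt_enable(); }),
		     /* constructor: disable preemption before taking the per-CPU slot */
		     ({ preempt_disable(); shazptr_acquire(ptr); }),
		     void *ptr);

With that, users could drop the explicit guard(preempt)() and keep only
guard(shazptr)(ptr).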
Regards,
Boqun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
2025-06-25 10:00 ` Peter Zijlstra
@ 2025-06-25 14:25 ` Mathieu Desnoyers
2025-06-25 15:05 ` Boqun Feng
2025-06-25 15:52 ` Waiman Long
2 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-06-25 14:25 UTC (permalink / raw)
To: Boqun Feng, linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Breno Leitao, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On 2025-06-24 23:10, Boqun Feng wrote:
[...]
> +
> +static inline void shazptr_clear(struct shazptr_guard guard)
> +{
> + /* Only clear the slot when the outermost guard is released */
> + if (likely(!guard.use_wildcard))
> + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
How is the wildcard ever cleared?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 14:25 ` Mathieu Desnoyers
@ 2025-06-25 15:05 ` Boqun Feng
0 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 15:05 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Breno Leitao, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On Wed, Jun 25, 2025 at 10:25:23AM -0400, Mathieu Desnoyers wrote:
> On 2025-06-24 23:10, Boqun Feng wrote:
> [...]
> > +
> > +static inline void shazptr_clear(struct shazptr_guard guard)
> > +{
> > + /* Only clear the slot when the outermost guard is released */
> > + if (likely(!guard.use_wildcard))
> > + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
>
> How is the wildcard ever cleared ?
>
The outermost shazptr_guard will have .use_wildcard set to false, so it
will clear the wildcard. E.g.:

	g1 = shazptr_acquire(a); // g1.use_wildcard is false;
				 // this cpu's hazptr slot is 'a'.

	g2 = shazptr_acquire(b); // g2.use_wildcard is true;
				 // this cpu's hazptr slot becomes
				 // WILDCARD.

	shazptr_clear(g2); // does nothing
	shazptr_clear(g1); // clears this cpu's hazptr slot to NULL.
Regards,
Boqun
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] shazptr: Avoid synchronize_shaptr() busy waiting
2025-06-25 13:56 ` Frederic Weisbecker
@ 2025-06-25 15:24 ` Boqun Feng
2025-06-26 13:45 ` Frederic Weisbecker
0 siblings, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 15:24 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Neeraj Upadhyay, Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
On Wed, Jun 25, 2025 at 03:56:05PM +0200, Frederic Weisbecker wrote:
> Le Tue, Jun 24, 2025 at 08:10:57PM -0700, Boqun Feng a écrit :
> > +static void synchronize_shazptr_normal(void *ptr)
> > +{
> > + int cpu;
> > + unsigned long blocking_grp_mask = 0;
> > +
> > + smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
> > +
> > + for_each_possible_cpu(cpu) {
> > + void **slot = per_cpu_ptr(&shazptr_slots, cpu);
> > + void *val;
> > +
> > + /* Pair with smp_store_release() in shazptr_clear(). */
> > + val = smp_load_acquire(slot);
> > +
> > + if (val == ptr || val == SHAZPTR_WILDCARD)
> > + blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
> > + }
> > +
> > + /* Found blocking slots, prepare to wait. */
> > + if (blocking_grp_mask) {
>
> synchronize_rcu() here would be enough since all users have preemption disabled.
> But I guess this defeats the performance purpose? (If so this might need a
> comment somewhere).
>
synchronize_shazptr_normal() cannot wait for a whole grace period,
because the point of hazard pointers is to avoid waiting for unrelated
readers.
> I guess blocking_grp_mask is to avoid allocating a cpumask (again for
> performance purpose? So I guess synchronize_shazptr_normal() has some perf
If we are talking about {k,v}malloc allocation:
synchronize_shazptr_normal() would mostly be used in cleanup/free paths,
similar to synchronize_rcu(), therefore I would like to avoid "allocating
memory to free memory".
> expectations?)
>
> One possibility is to have the ptr contained in:
>
> struct hazptr {
> void *ptr;
> struct cpumask scan_mask
> };
>
You mean updaters passing a `struct hazptr *` into
synchronize_shazptr_normal()? That may be a good idea; if multiple
updaters can share the same `struct hazptr *`, we can add that later,
but...
> And then the caller could simply scan itself those remaining CPUs without
> relying on the kthread.
.. this is a bad idea. Sure, we can always burn some CPU time to scan,
but local optimization doesn't mean global optimization: if, in the
future, we have a lot of synchronize_shazptr_normal()s happening at
the same time, the self busy-waiting scan would become problematic.
Regards,
Boqun
>
> But I'm sure there are good reasons for now doing that :-)
>
> Thanks.
>
> --
> Frederic Weisbecker
> SUSE Labs
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
2025-06-25 10:00 ` Peter Zijlstra
2025-06-25 14:25 ` Mathieu Desnoyers
@ 2025-06-25 15:52 ` Waiman Long
2025-06-25 16:09 ` Boqun Feng
2 siblings, 1 reply; 34+ messages in thread
From: Waiman Long @ 2025-06-25 15:52 UTC (permalink / raw)
To: Boqun Feng, linux-kernel, rcu, lkmm
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Davidlohr Bueso,
Paul E. McKenney, Josh Triplett, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
On 6/24/25 11:10 PM, Boqun Feng wrote:
> As its name suggests, simple hazard pointers (shazptr) is a
> simplification of hazard pointers [1]: it has only one hazard pointer
> slot per-CPU and is targeted for simple use cases where the read-side
> already has preemption disabled. It's a trade-off between full features
> of a normal hazard pointer implementation (multiple slots, dynamic slot
> allocation, etc.) and the simple use scenario.
>
> Since there's only one slot per-CPU, shazptr read-side critical
> section nesting is a problem that needs to be resolved, because at the
> very least, interrupts and NMIs can introduce nested shazptr read-side
> critical sections. A SHAZPTR_WILDCARD is introduced to resolve this:
> SHAZPTR_WILDCARD is a special address value that blocks *all* shazptr
> waiters. In an interrupt-causing shazptr read-side critical section
> nesting case (i.e. an interrupt happens while the per-CPU hazard pointer
> slot being used and tries to acquire a hazard pointer itself), the inner
> critical section will switch the value of the hazard pointer slot into
> SHAZPTR_WILDCARD, and let the outer critical section eventually zero the
> slot. The SHAZPTR_WILDCARD still provide the correct protection because
> it blocks all the waiters.
>
> It's true that once the wildcard mechanism is activated, the shazptr
> mechanism may be downgraded to something similar to RCU (and probably
> with a worse implementation), which generally has longer wait time and
> larger memory footprint compared to a typical hazard pointer
> implementation. However, that can only happen with a lot of users using
> hazard pointers, and then it's reasonable to introduce the
> fully-featured hazard pointer implementation [2] and switch users to it.
>
> Note that shazptr_protect() may be added later; the current potential
> usage doesn't require it, and a shazptr_acquire(), which installs the
> protected value into the hazard pointer slot and provides the smp_mb(), is
> enough for now.
>
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> lock-free objects," in IEEE Transactions on Parallel and
> Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
>
> Link: https://lore.kernel.org/lkml/20240917143402.930114-1-boqun.feng@gmail.com/ [2]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
> ---
> include/linux/shazptr.h | 73 ++++++++++++++++++++++++++++++++++++++++
> kernel/locking/Makefile | 2 +-
> kernel/locking/shazptr.c | 29 ++++++++++++++++
> 3 files changed, 103 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/shazptr.h
> create mode 100644 kernel/locking/shazptr.c
>
> diff --git a/include/linux/shazptr.h b/include/linux/shazptr.h
> new file mode 100644
> index 000000000000..287cd04b4be9
> --- /dev/null
> +++ b/include/linux/shazptr.h
> @@ -0,0 +1,73 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Simple hazard pointers
> + *
> + * Copyright (c) 2025, Microsoft Corporation.
> + *
> + * Author: Boqun Feng <boqun.feng@gmail.com>
> + *
> + * A simple variant of hazard pointers, whose users must ensure that
> + * preemption is already disabled when calling shazptr_acquire() to protect
> + * an address. If one shazptr_acquire() is called after another
> + * shazptr_acquire() without the corresponding shazptr_clear() having been
> + * called, the later shazptr_acquire() must be cleared first.
> + *
> + * The most suitable usage is when only one address needs to be protected in a
> + * preemption-disabled critical section.
> + */
> +
> +#ifndef _LINUX_SHAZPTR_H
> +#define _LINUX_SHAZPTR_H
> +
> +#include <linux/cleanup.h>
> +#include <linux/percpu.h>
> +
> +/* Make ULONG_MAX the wildcard value */
> +#define SHAZPTR_WILDCARD ((void *)(ULONG_MAX))
> +
> +DECLARE_PER_CPU_SHARED_ALIGNED(void *, shazptr_slots);
> +
> +/* Represent a held hazard pointer slot */
> +struct shazptr_guard {
> + void **slot;
> + bool use_wildcard;
> +};
> +
> +/*
> + * Acquire a hazptr slot and begin the hazard pointer critical section.
> + *
> + * Must be called with preemption disabled, and preemption must remain disabled
> + * until shazptr_clear().
> + */
> +static inline struct shazptr_guard shazptr_acquire(void *ptr)
> +{
> + struct shazptr_guard guard = {
> + /* Preemption is disabled. */
> + .slot = this_cpu_ptr(&shazptr_slots),
> + .use_wildcard = false,
> + };
> +
> + if (likely(!READ_ONCE(*guard.slot))) {
> + WRITE_ONCE(*guard.slot, ptr);
> + } else {
> + guard.use_wildcard = true;
> + WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
> + }
Is it correct to assume that shazptr cannot be used in a mixed-context
environment on the same CPU, like a task context and an interrupt context
trying to acquire it simultaneously, because the current check isn't
atomic with respect to that?
> +
> + smp_mb(); /* Synchronize with smp_mb() at synchronize_shazptr(). */
> +
> + return guard;
> +}
> +
> +static inline void shazptr_clear(struct shazptr_guard guard)
> +{
> + /* Only clear the slot when the outermost guard is released */
> + if (likely(!guard.use_wildcard))
> + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
> +}
Would it be better to name it shazptr_release() to conform with our
current locking conventions?
Cheers,
Longman
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 15:52 ` Waiman Long
@ 2025-06-25 16:09 ` Boqun Feng
2025-06-25 17:47 ` Waiman Long
0 siblings, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-25 16:09 UTC (permalink / raw)
To: Waiman Long
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Wed, Jun 25, 2025 at 11:52:04AM -0400, Waiman Long wrote:
[...]
> > +/*
> > + * Acquire a hazptr slot and begin the hazard pointer critical section.
> > + *
> > + * Must be called with preemption disabled, and preemption must remain disabled
> > + * until shazptr_clear().
> > + */
> > +static inline struct shazptr_guard shazptr_acquire(void *ptr)
> > +{
> > + struct shazptr_guard guard = {
> > + /* Preemption is disabled. */
> > + .slot = this_cpu_ptr(&shazptr_slots),
> > + .use_wildcard = false,
> > + };
> > +
> > + if (likely(!READ_ONCE(*guard.slot))) {
> > + WRITE_ONCE(*guard.slot, ptr);
> > + } else {
> > + guard.use_wildcard = true;
> > + WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
> > + }
> Is it correct to assume that shazptr cannot be used in a mixed context
> environment on the same CPU like a task context and an interrupt context
> trying to acquire it simultaneously because the current check isn't atomic
> with respect to that?
I think the current implementation actually supports mixed-context usage;
let's see (assuming we start in a task context):
	if (likely(!READ_ONCE(*guard.slot))) {

If an interrupt happens here, it's fine because the slot is still empty,
as long as the interrupt will eventually clear the slot.

		WRITE_ONCE(*guard.slot, ptr);

If an interrupt happens here, it's fine because the interrupt would
notice that the slot is already occupied, hence the interrupt will use a
wildcard, and because it uses a wildcard, it won't clear the slot after it
returns. However, the task context's shazptr_clear() will eventually
clear the slot because its guard's .use_wildcard is false.

	} else {

If an interrupt happens here, it's fine because of the same: the interrupt
will use the wildcard, it will not clear the slot, and some
shazptr_clear() in the task context will eventually clear it.

		guard.use_wildcard = true;
		WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);

If an interrupt happens here, it's fine because of the same.

	}
It's similar to why rcu_read_lock() can be just a non-atomic inc.
> > +
> > + smp_mb(); /* Synchronize with smp_mb() at synchronize_shazptr(). */
> > +
> > + return guard;
> > +}
> > +
> > +static inline void shazptr_clear(struct shazptr_guard guard)
> > +{
> > + /* Only clear the slot when the outermost guard is released */
> > + if (likely(!guard.use_wildcard))
> > + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
> > +}
>
> Is it better to name it shazptr_release() to be conformant with our current
> locking convention?
>
Maybe, but I will need to think about slot reuse between
shazptr_acquire() and shazptr_release(): in the general hazptr API,
you can hazptr_alloc() a slot, use it, hazptr_clear() it, and then
use it again, eventually hazptr_free()-ing it. I would like to keep both
hazptr APIs consistent as well. Thanks!
Regards,
Boqun
> Cheers,
> Longman
>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 1/8] Introduce simple hazard pointers
2025-06-25 16:09 ` Boqun Feng
@ 2025-06-25 17:47 ` Waiman Long
0 siblings, 0 replies; 34+ messages in thread
From: Waiman Long @ 2025-06-25 17:47 UTC (permalink / raw)
To: Boqun Feng, Waiman Long
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On 6/25/25 12:09 PM, Boqun Feng wrote:
> On Wed, Jun 25, 2025 at 11:52:04AM -0400, Waiman Long wrote:
> [...]
>>> +/*
>>> + * Acquire a hazptr slot and begin the hazard pointer critical section.
>>> + *
>>> + * Must be called with preemption disabled, and preemption must remain disabled
>>> + * until shazptr_clear().
>>> + */
>>> +static inline struct shazptr_guard shazptr_acquire(void *ptr)
>>> +{
>>> + struct shazptr_guard guard = {
>>> + /* Preemption is disabled. */
>>> + .slot = this_cpu_ptr(&shazptr_slots),
>>> + .use_wildcard = false,
>>> + };
>>> +
>>> + if (likely(!READ_ONCE(*guard.slot))) {
>>> + WRITE_ONCE(*guard.slot, ptr);
>>> + } else {
>>> + guard.use_wildcard = true;
>>> + WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
>>> + }
>> Is it correct to assume that shazptr cannot be used in a mixed context
>> environment on the same CPU like a task context and an interrupt context
>> trying to acquire it simultaneously because the current check isn't atomic
>> with respect to that?
> I think the current implementation actually support mixed context usage,
> let see (assuming we start in a task context):
>
> if (likely(!READ_ONCE(*guard.slot))) {
>
> if an interrupt happens here, it's fine because the slot is still empty,
> as long as the interrupt will eventually clear the slot.
>
> WRITE_ONCE(*guard.slot, ptr);
>
> if an interrupt happens here, it's fine because the interrupt would
> notice that the slot is already occupied, hence the interrupt will use a
> wildcard, and because it uses a wild, it won't clear the slot after it
> returns. However the task context's shazptr_clear() will eventually
> clear the slot because its guard's .use_wildcard is false.
>
> } else {
>
> if an interrupt happens here, it's fine because of the same: interrupt
> will use wildcard, and it will not clear the slot, and some
> shazptr_clear() in the task context will eventually clear it.
>
> guard.use_wildcard = true;
> WRITE_ONCE(*guard.slot, SHAZPTR_WILDCARD);
>
> if an interrupt happens here, it's fine because of the same.
>
> }
>
>
> It's similar to why rcu_read_lock() can be just a non-atomic inc.
You are right.
>
>>> +
>>> + smp_mb(); /* Synchronize with smp_mb() at synchronize_shazptr(). */
>>> +
>>> + return guard;
>>> +}
>>> +
>>> +static inline void shazptr_clear(struct shazptr_guard guard)
>>> +{
>>> + /* Only clear the slot when the outermost guard is released */
>>> + if (likely(!guard.use_wildcard))
>>> + smp_store_release(guard.slot, NULL); /* Pair with ACQUIRE at synchronize_shazptr() */
>>> +}
>> Is it better to name it shazptr_release() to be conformant with our current
>> locking convention?
>>
> Maybe, but I will need to think about slot reusing between
> shazptr_acquire() and shazptr_release(), in the general hazptr API,
> you can hazptr_alloc() a slot, use it and hazptr_clear() and then
> use it again, eventually hazptr_free(). I would like to keep both hazptr
> APIs consistent as well. Thanks!
Thanks for the explanation. Maybe we can reuse the general hazptr API
names (alloc/clear/free) even though shazptr_free() will be a no-op for
now. It's just that the current acquire/clear naming looks odd to me.
Thanks,
Longman
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-25 14:08 ` Boqun Feng
@ 2025-06-26 10:16 ` Christoph Hellwig
2025-06-26 13:45 ` Mathieu Desnoyers
2025-06-26 15:47 ` Boqun Feng
0 siblings, 2 replies; 34+ messages in thread
From: Christoph Hellwig @ 2025-06-26 10:16 UTC (permalink / raw)
To: Boqun Feng
Cc: Christoph Hellwig, linux-kernel, rcu, lkmm, Peter Zijlstra,
Ingo Molnar, Will Deacon, Waiman Long, Davidlohr Bueso,
Paul E. McKenney, Josh Triplett, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote:
> Sure, I will put one for the future version, here is the gist:
Thanks a lot!
> The updater's wait can finish immediately if no one is accessing 'a', in
> other words it doesn't need to wait for reader 2.
So basically it is the RCU concept, but limited to protecting exactly
one pointer update per critical section, with no ability for the reader
to e.g. acquire a refcount on the object pointed to by that pointer?
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-26 10:16 ` Christoph Hellwig
@ 2025-06-26 13:45 ` Mathieu Desnoyers
2025-06-26 15:47 ` Boqun Feng
1 sibling, 0 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2025-06-26 13:45 UTC (permalink / raw)
To: Christoph Hellwig, Boqun Feng
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Breno Leitao, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On 2025-06-26 06:16, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote:
>> Sure, I will put one for the future version, here is the gist:
>
> Thanks a lot!
>
>> The updater's wait can finish immediately if no one is accessing 'a', in
>> other words it doesn't need to wait for reader 2.
>
> So basically it is the RCU concept, but limited to protecting exactly
> one pointer update per critical section with no ability for the read
> to e.g. acquire a refcount on the objected pointed to by that pointer?
FWIW, hazard pointers can be chained with other existence guarantee
mechanisms. I've done prototypes that use hazard pointers chained
with a reference counter in the object to implement something similar to
smart pointers. Let me know if you are interested in the details.
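The basic shape is something like this (a rough sketch against the
shazptr API from this series; 'struct obj' and the obj_get()/obj_put()/
obj_retire() names are hypothetical, and the real prototypes are not
in-kernel code):

	#include <linux/shazptr.h>
	#include <linux/refcount.h>
	#include <linux/slab.h>

	struct obj {
		refcount_t ref;
		/* ... payload ... */
	};

	/* Take a long-lived reference on the object currently published in @slot. */
	static struct obj *obj_get(struct obj **slot)
	{
		struct shazptr_guard g;
		struct obj *p;

		guard(preempt)();		/* shazptr requires preemption disabled */

		p = READ_ONCE(*slot);
		if (!p)
			return NULL;

		g = shazptr_acquire(p);			/* publish @p, implies smp_mb() */
		if (READ_ONCE(*slot) != p ||		/* @p got unpublished, bail out */
		    !refcount_inc_not_zero(&p->ref))	/* object already dying */
			p = NULL;
		shazptr_clear(g);			/* the refcount now pins @p */

		return p;
	}

	static void obj_put(struct obj *p)
	{
		if (refcount_dec_and_test(&p->ref))
			kfree(p);
	}

	/* Update side: unpublish, flush hazard-pointer readers, drop the ref. */
	static void obj_retire(struct obj **slot, struct obj *p)
	{
		WRITE_ONCE(*slot, NULL);	/* no new lookup will find @p */
		synchronize_shazptr(p);		/* wait out obj_get() still pinning @p */
		obj_put(p);
	}

The hazard pointer only needs to cover the short window in which the
reference is taken; after that, the refcount carries the existence
guarantee on its own.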
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 4/8] shazptr: Avoid synchronize_shaptr() busy waiting
2025-06-25 15:24 ` Boqun Feng
@ 2025-06-26 13:45 ` Frederic Weisbecker
0 siblings, 0 replies; 34+ messages in thread
From: Frederic Weisbecker @ 2025-06-26 13:45 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Neeraj Upadhyay, Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
Le Wed, Jun 25, 2025 at 08:24:53AM -0700, Boqun Feng a écrit :
> On Wed, Jun 25, 2025 at 03:56:05PM +0200, Frederic Weisbecker wrote:
> > Le Tue, Jun 24, 2025 at 08:10:57PM -0700, Boqun Feng a écrit :
> > > +static void synchronize_shazptr_normal(void *ptr)
> > > +{
> > > + int cpu;
> > > + unsigned long blocking_grp_mask = 0;
> > > +
> > > + smp_mb(); /* Synchronize with the smp_mb() in shazptr_acquire(). */
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + void **slot = per_cpu_ptr(&shazptr_slots, cpu);
> > > + void *val;
> > > +
> > > + /* Pair with smp_store_release() in shazptr_clear(). */
> > > + val = smp_load_acquire(slot);
> > > +
> > > + if (val == ptr || val == SHAZPTR_WILDCARD)
> > > + blocking_grp_mask |= 1UL << (cpu / shazptr_scan.cpu_grp_size);
> > > + }
> > > +
> > > + /* Found blocking slots, prepare to wait. */
> > > + if (blocking_grp_mask) {
> >
> > synchronize_rcu() here would be enough since all users have preemption disabled.
> > But I guess this defeats the performance purpose? (If so this might need a
> > comment somewhere).
> >
>
> synchronize_shazptr_normal() cannot wait for a whole grace period,
> because the point of hazard pointers is to avoid waiting for unrelated
> readers.
Fair enough!
>
> > I guess blocking_grp_mask is to avoid allocating a cpumask (again for
> > performance purpose? So I guess synchronize_shazptr_normal() has some perf
>
> If we are talking about {k,v}malloc allocation:
> synchronize_shazptr_normal() would mostly be used in cleanup/free path
> similar to synchronize_rcu(), therefor I would like to avoid "allocating
> memory to free memory".
Good point!
>
> > expectations?)
> >
> > One possibility is to have the ptr contained in:
> >
> > struct hazptr {
> > void *ptr;
> > struct cpumask scan_mask
> > };
> >
>
> You mean updaters passing a `struct hazptr *` into
> synchronize_shazptr_normal()? That may be a good idea, if multiple
> updaters can share the same `struct hazptr *`, we can add that later,
> but...
>
> > And then the caller could simply scan itself those remaining CPUs without
> > relying on the kthread.
>
> .. this is a bad idea, sure, we can always burn some CPU time to scan,
> but local optimization doesn't mean global optimization, if in the
> future, we have a lots of synchronize_shazptr_normal()s happening at
> the same time, the self busy-waiting scan would become problematic.
Ok.
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-26 10:16 ` Christoph Hellwig
2025-06-26 13:45 ` Mathieu Desnoyers
@ 2025-06-26 15:47 ` Boqun Feng
2025-06-27 2:56 ` Paul E. McKenney
1 sibling, 1 reply; 34+ messages in thread
From: Boqun Feng @ 2025-06-26 15:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, Breno Leitao, aeh, netdev, edumazet, jhs,
kernel-team, Erik Lundgren
On Thu, Jun 26, 2025 at 03:16:49AM -0700, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote:
> > Sure, I will put one for the future version, here is the gist:
>
> Thanks a lot!
>
> > The updater's wait can finish immediately if no one is accessing 'a', in
> > other words it doesn't need to wait for reader 2.
>
> So basically it is the RCU concept, but limited to protecting exactly
> one pointer update per critical section with no ability for the read
> to e.g. acquire a refcount on the objected pointed to by that pointer?
For the current simple hazard pointer, yes. But simple hazard pointers
are easy to extend to support reading:
    { gp is a global pointer }

    Reader                                          Updater
    ======                                          =======
    g = shazptr_acquire(p):
      WRITE_ONCE(*this_cpu_ptr(slot), gp);
      smp_mb();

    if (READ_ONCE(gp) == *this_cpu_ptr(slot)) {
        // still being protected.
        <can read gp here>
                                                    to_free = READ_ONCE(gp);
                                                    WRITE_ONCE(gp, new);
                                                    synchronize_shazptr(to_free):
                                                      smp_mb();
                                                      // wait on the slot of reader
                                                      // CPU being 0.
                                                      READ_ONCE(per_cpu(reader, slot));
    }

    shazptr_clear(g):
      WRITE_ONCE(*this_cpu_ptr(slot), NULL); // unblock synchronize_shazptr()
Usually the shazptr_acquire() + "pointer comparison"* combination is called
shazptr_try_protect().
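In code, such a helper would look roughly like this (sketch only, with a
made-up signature, and ignoring the compiler-optimization subtleties the
footnote below refers to):

	static inline bool shazptr_try_protect(struct shazptr_guard *g, void **pp,
					       void **protected)
	{
		void *p = READ_ONCE(*pp);

		*g = shazptr_acquire(p);	/* publish @p, implies smp_mb() */
		if (READ_ONCE(*pp) != p) {	/* @p got unpublished, bail out */
			shazptr_clear(*g);
			return false;
		}

		*protected = p;			/* safe to dereference until shazptr_clear() */
		return true;
	}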
I will add a document about this in the next version along with other
bits of hazard pointers.
[*]: The pointer comparison is a more complicated topic, but Mathieu has
figured out how to do it correctly:
https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
Regards,
Boqun
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 0/8] Introduce simple hazard pointers for lockdep
2025-06-26 15:47 ` Boqun Feng
@ 2025-06-27 2:56 ` Paul E. McKenney
0 siblings, 0 replies; 34+ messages in thread
From: Paul E. McKenney @ 2025-06-27 2:56 UTC (permalink / raw)
To: Boqun Feng
Cc: Christoph Hellwig, linux-kernel, rcu, lkmm, Peter Zijlstra,
Ingo Molnar, Will Deacon, Waiman Long, Davidlohr Bueso,
Josh Triplett, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Uladzislau Rezki, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Zqiang, Breno Leitao, aeh,
netdev, edumazet, jhs, kernel-team, Erik Lundgren
On Thu, Jun 26, 2025 at 08:47:25AM -0700, Boqun Feng wrote:
> On Thu, Jun 26, 2025 at 03:16:49AM -0700, Christoph Hellwig wrote:
> > On Wed, Jun 25, 2025 at 07:08:57AM -0700, Boqun Feng wrote:
> > > Sure, I will put one for the future version, here is the gist:
> >
> > Thanks a lot!
> >
> > > The updater's wait can finish immediately if no one is accessing 'a', in
> > > other words it doesn't need to wait for reader 2.
> >
> > So basically it is the RCU concept, but limited to protecting exactly
> > one pointer update per critical section with no ability for the read
> > to e.g. acquire a refcount on the objected pointed to by that pointer?
>
> For the current simple hazard pointer, yes. But simple hazard pointers
> is easily to extend so support reading:
>
> { gp is a global pointer }
>
> Reader Updater
> ====== =======
> g = shazptr_acquire(p):
> WRITE_ONCE(*this_cpu_ptr(slot), gp);
> smp_mb();
>
> if (READ_ONCE(gp) == *this_cpu_ptr(slot)) {
> // still being protected.
> <can read gp here>
> to_free = READ_ONCE(gp);
> WRITE_ONCE(gp, new);
> synchronize_shazptr(to_free):
> smp_mb();
> // wait on the slot of reader
> // CPU being 0.
> READ_ONCE(per_cpu(reader, slot));
> }
>
> shazptr_clear(g):
> WRITE_ONCE(*this_cpu_ptr(slot), NULL); // unblock synchronize_shazptr()
>
>
> Usually the shazptr_acqurie() + "pointer comparison"* is called
> shazptr_try_protect().
>
> I will add a document about this in the next version along with other
> bits of hazard pointers.
>
> [*]: The pointer comparison is more complicated topic, but Mathieu has
> figured out how to do it correctly:
>
> https://lore.kernel.org/lkml/20241008135034.1982519-2-mathieu.desnoyers@efficios.com/
It might be helpful to add that, at a high level, hazard pointers
are a scalable replacement for reference counting. At a similarly
high level, RCU is a scalable replacement for reader-writer locking.
At lower levels, there is considerable overlap in applicability, so that
you can use RCU to replace many reference-counting use cases and hazard
pointers to replace many reader-writer-locking use cases.
Plus, as both Mathieu and Boqun pointed out, both RCU and hazard pointers
can be combined with other synchronization mechanisms, including each
other.
Thanx, Paul
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist
2025-06-25 3:11 ` [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist Boqun Feng
2025-06-25 11:59 ` Peter Zijlstra
@ 2025-07-10 14:06 ` Breno Leitao
2025-07-11 2:31 ` Boqun Feng
1 sibling, 1 reply; 34+ messages in thread
From: Breno Leitao @ 2025-07-10 14:06 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
Hello Boqun,
On Tue, Jun 24, 2025 at 08:11:01PM -0700, Boqun Feng wrote:
> Erik Lundgren and Breno Leitao reported [1] a case where
> lockdep_unregister_key() can be called from time-critical code paths
> where rtnl_lock() may be held. And the synchronize_rcu() in it can slow
> down operations such as using tc to replace a qdisc in a network device.
>
> In fact, the synchronize_rcu() in lockdep_unregister_key() is there to wait
> for all is_dynamic_key() callers to finish so that removing a key from the
> key hashlist is safe, and we can use shazptr to protect the hashlist as well.
>
> Compared to the proposed solution which replaces synchronize_rcu() with
> synchronize_rcu_expedited(), using shazptr here can achieve the
> same/better synchronization time without the need to send IPI. Hence use
> shazptr here.
>
> Reported-by: Erik Lundgren <elundgren@meta.com>
> Reported-by: Breno Leitao <leitao@debian.org>
> Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]
> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
First of all, thanks for working to fix the original issue. I've been
able to test this on my host, and the gain is impressive.
Before:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
real 0m13.195s
user 0m0.001s
sys 0m2.746s
With your patch:
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
real 0m0.135s
user 0m0.002s
sys 0m0.116s
# time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
real 0m0.127s
user 0m0.001s
sys 0m0.112s
Please add the following to the series:
Tested-by: Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist
2025-07-10 14:06 ` Breno Leitao
@ 2025-07-11 2:31 ` Boqun Feng
0 siblings, 0 replies; 34+ messages in thread
From: Boqun Feng @ 2025-07-11 2:31 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-kernel, rcu, lkmm, Peter Zijlstra, Ingo Molnar, Will Deacon,
Waiman Long, Davidlohr Bueso, Paul E. McKenney, Josh Triplett,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Uladzislau Rezki, Steven Rostedt, Mathieu Desnoyers,
Lai Jiangshan, Zqiang, aeh, netdev, edumazet, jhs, kernel-team,
Erik Lundgren
On Thu, Jul 10, 2025 at 07:06:09AM -0700, Breno Leitao wrote:
> Hello Boqun,
>
> On Tue, Jun 24, 2025 at 08:11:01PM -0700, Boqun Feng wrote:
> > Erik Lundgren and Breno Leitao reported [1] a case where
> > lockdep_unregister_key() can be called from time critical code pathes
> > where rntl_lock() may be held. And the synchronize_rcu() in it can slow
> > down operations such as using tc to replace a qdisc in a network device.
> >
> > In fact the synchronize_rcu() in lockdep_unregister_key() is to wait for
> > all is_dynamic_key() callers to finish so that removing a key from the
> > key hashlist, and we can use shazptr to protect the hashlist as well.
> >
> > Compared to the proposed solution which replaces synchronize_rcu() with
> > synchronize_rcu_expedited(), using shazptr here can achieve the
> > same/better synchronization time without the need to send IPI. Hence use
> > shazptr here.
> >
> > Reported-by: Erik Lundgren <elundgren@meta.com>
> > Reported-by: Breno Leitao <leitao@debian.org>
> > Link: https://lore.kernel.org/all/20250321-lockdep-v1-1-78b732d195fb@debian.org/ [1]
> > Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
>
> First of all, thanks for working to fix the origianl issue. I've been
> able to test this in my host, and the gain is impressive.
>
> Before:
>
> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> real 0m13.195s
> user 0m0.001s
> sys 0m2.746s
>
> With your patch:
>
> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1234: mq
> real 0m0.135s
> user 0m0.002s
> sys 0m0.116s
>
> # time /usr/sbin/tc qdisc replace dev eth0 root handle 0x1: mq
> real 0m0.127s
> user 0m0.001s
> sys 0m0.112s
>
> Please add the following to the series:
>
> Tested-by: Breno Leitao <leitao@debian.org>
Thanks! Will apply ;-)
Regards,
Boqun
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2025-07-11 2:31 UTC | newest]
Thread overview: 34+ messages
2025-06-25 3:10 [PATCH 0/8] Introduce simple hazard pointers for lockdep Boqun Feng
2025-06-25 3:10 ` [PATCH 1/8] Introduce simple hazard pointers Boqun Feng
2025-06-25 10:00 ` Peter Zijlstra
2025-06-25 14:25 ` Mathieu Desnoyers
2025-06-25 15:05 ` Boqun Feng
2025-06-25 15:52 ` Waiman Long
2025-06-25 16:09 ` Boqun Feng
2025-06-25 17:47 ` Waiman Long
2025-06-25 3:10 ` [PATCH 2/8] shazptr: Add refscale test Boqun Feng
2025-06-25 10:02 ` Peter Zijlstra
2025-06-25 3:10 ` [PATCH 3/8] shazptr: Add refscale test for wildcard Boqun Feng
2025-06-25 10:03 ` Peter Zijlstra
2025-06-25 3:10 ` [PATCH 4/8] shazptr: Avoid synchronize_shaptr() busy waiting Boqun Feng
2025-06-25 11:40 ` Peter Zijlstra
2025-06-25 11:56 ` Peter Zijlstra
2025-06-25 13:56 ` Frederic Weisbecker
2025-06-25 15:24 ` Boqun Feng
2025-06-26 13:45 ` Frederic Weisbecker
2025-06-25 3:10 ` [PATCH 5/8] shazptr: Allow skip self scan in synchronize_shaptr() Boqun Feng
2025-06-25 3:10 ` [PATCH 6/8] rcuscale: Allow rcu_scale_ops::get_gp_seq to be NULL Boqun Feng
2025-06-25 3:11 ` [PATCH 7/8] rcuscale: Add tests for simple hazard pointers Boqun Feng
2025-06-25 3:11 ` [PATCH 8/8] locking/lockdep: Use shazptr to protect the key hashlist Boqun Feng
2025-06-25 11:59 ` Peter Zijlstra
2025-06-25 14:18 ` Boqun Feng
2025-07-10 14:06 ` Breno Leitao
2025-07-11 2:31 ` Boqun Feng
2025-06-25 12:05 ` [PATCH 0/8] Introduce simple hazard pointers for lockdep Christoph Hellwig
2025-06-25 14:08 ` Boqun Feng
2025-06-26 10:16 ` Christoph Hellwig
2025-06-26 13:45 ` Mathieu Desnoyers
2025-06-26 15:47 ` Boqun Feng
2025-06-27 2:56 ` Paul E. McKenney
2025-06-25 12:25 ` Mathieu Desnoyers
2025-06-25 13:21 ` Boqun Feng