* [RFC PATCH v2 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency
2024-10-04 18:27 [RFC PATCH v2 0/4] sched+mm: Track lazy active mm existence with hazard pointers Mathieu Desnoyers
@ 2024-10-04 18:27 ` Mathieu Desnoyers
2024-10-04 18:27 ` [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 18:27 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Gary Guo, Nikita Popov, llvm
Compiler CSE and SSA GVN optimizations can cause the address dependency
of addresses returned by rcu_dereference to be lost when comparing those
pointers with either constants or previously loaded pointers.
Introduce ptr_eq() to compare two addresses while preserving the address
dependencies for later use of the address. It should be used when
comparing an address returned by rcu_dereference().
This is needed to prevent the compiler CSE and SSA GVN optimizations
from using @a (or @b) in places where the source refers to @b (or @a)
based on the fact that after the comparison, the two are known to be
equal, which does not preserve address dependencies and allows the
following misordering speculations:
- If @b is a constant, the compiler can issue the loads which depend
on @a before loading @a.
- If @b is a register populated by a prior load, weakly-ordered
CPUs can speculate loads which depend on @a before loading @a.
The same logic applies with @a and @b swapped.
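The effect can be sketched in standalone userspace C (an approximation using the same empty-asm trick as the kernel's OPTIMIZER_HIDE_VAR(); HIDE_VAR is a local stand-in, not a kernel API):

```c
#include <assert.h>

/*
 * Standalone sketch of ptr_eq(): the empty asm makes the compiler
 * forget everything it knows about @a and @b, so a successful
 * comparison cannot be used by CSE/GVN to substitute one pointer
 * for the other in later accesses.
 */
#define HIDE_VAR(var) __asm__ ("" : "=r" (var) : "0" (var))

static inline int ptr_eq(const volatile void *a, const volatile void *b)
{
	HIDE_VAR(a);
	HIDE_VAR(b);
	return a == b;
}
```

After ptr_eq(a, b) returns true, the compiler can no longer prove the two pointers identical, so later dereferences of @a keep their address dependency.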
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Boqun Feng <boqun.feng@gmail.com>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Acked-by: "Paul E. McKenney" <paulmck@kernel.org>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
Cc: Nikita Popov <github@npopov.com>
Cc: llvm@lists.linux.dev
---
Changes since v0:
- Include feedback from Alan Stern.
---
include/linux/compiler.h | 63 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 2df665fa2964..75a378ae7af1 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -186,6 +186,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
__asm__ ("" : "=r" (var) : "0" (var))
#endif
+/*
+ * Compare two addresses while preserving the address dependencies for
+ * later use of the address. It should be used when comparing an address
+ * returned by rcu_dereference().
+ *
+ * This is needed to prevent the compiler CSE and SSA GVN optimizations
+ * from using @a (or @b) in places where the source refers to @b (or @a)
+ * based on the fact that after the comparison, the two are known to be
+ * equal, which does not preserve address dependencies and allows the
+ * following misordering speculations:
+ *
+ * - If @b is a constant, the compiler can issue the loads which depend
+ * on @a before loading @a.
+ * - If @b is a register populated by a prior load, weakly-ordered
+ * CPUs can speculate loads which depend on @a before loading @a.
+ *
+ * The same logic applies with @a and @b swapped.
+ *
+ * Return value: true if pointers are equal, false otherwise.
+ *
+ * The compiler barrier() is ineffective at fixing this issue. It does
+ * not prevent the compiler CSE from losing the address dependency:
+ *
+ * int fct_2_volatile_barriers(void)
+ * {
+ * int *a, *b;
+ *
+ * do {
+ * a = READ_ONCE(p);
+ * asm volatile ("" : : : "memory");
+ * b = READ_ONCE(p);
+ * } while (a != b);
+ * asm volatile ("" : : : "memory"); <-- barrier()
+ * return *b;
+ * }
+ *
+ * With gcc 14.2 (arm64):
+ *
+ * fct_2_volatile_barriers:
+ * adrp x0, .LANCHOR0
+ * add x0, x0, :lo12:.LANCHOR0
+ * .L2:
+ * ldr x1, [x0] <-- x1 populated by first load.
+ * ldr x2, [x0]
+ * cmp x1, x2
+ * bne .L2
+ * ldr w0, [x1] <-- x1 is used for access which should depend on b.
+ * ret
+ *
+ * On weakly-ordered architectures, this lets CPU speculation use the
+ * result from the first load to speculate "ldr w0, [x1]" before
+ * "ldr x2, [x0]".
+ * Based on the RCU documentation, the control dependency does not
+ * prevent the CPU from speculating loads.
+ */
+static __always_inline
+int ptr_eq(const volatile void *a, const volatile void *b)
+{
+ OPTIMIZER_HIDE_VAR(a);
+ OPTIMIZER_HIDE_VAR(b);
+ return a == b;
+}
+
#define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
/**
--
2.39.2
* [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq()
2024-10-04 18:27 [RFC PATCH v2 0/4] sched+mm: Track lazy active mm existence with hazard pointers Mathieu Desnoyers
2024-10-04 18:27 ` [RFC PATCH v2 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency Mathieu Desnoyers
@ 2024-10-04 18:27 ` Mathieu Desnoyers
2024-10-04 21:15 ` Joel Fernandes
2024-10-06 19:52 ` David Laight
2024-10-04 18:27 ` [RFC PATCH v2 3/4] hp: Implement Hazard Pointers Mathieu Desnoyers
2024-10-04 18:27 ` [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence Mathieu Desnoyers
3 siblings, 2 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 18:27 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm, Gary Guo, Nikita Popov, llvm
Refer to ptr_eq() in the rcu_dereference() documentation.
ptr_eq() is a mechanism that preserves address dependencies when
comparing pointers, and should be favored when comparing a pointer
obtained from rcu_dereference() against another pointer.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Gary Guo <gary@garyguo.net>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
Cc: Nikita Popov <github@npopov.com>
Cc: llvm@lists.linux.dev
---
Changes since v0:
- Include feedback from Alan Stern.
Changes since v1:
- Include feedback from Paul E. McKenney.
---
Documentation/RCU/rcu_dereference.rst | 38 +++++++++++++++++++++++----
1 file changed, 33 insertions(+), 5 deletions(-)
diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst
index 2524dcdadde2..de6175bf430f 100644
--- a/Documentation/RCU/rcu_dereference.rst
+++ b/Documentation/RCU/rcu_dereference.rst
@@ -104,11 +104,12 @@ readers working properly:
after such branches, but can speculate loads, which can again
result in misordering bugs.
-- Be very careful about comparing pointers obtained from
- rcu_dereference() against non-NULL values. As Linus Torvalds
- explained, if the two pointers are equal, the compiler could
- substitute the pointer you are comparing against for the pointer
- obtained from rcu_dereference(). For example::
+- Use operations that preserve address dependencies (such as
+ "ptr_eq()") to compare pointers obtained from rcu_dereference()
+ against non-NULL pointers. As Linus Torvalds explained, if the
+ two pointers are equal, the compiler could substitute the
+ pointer you are comparing against for the pointer obtained from
+ rcu_dereference(). For example::
p = rcu_dereference(gp);
if (p == &default_struct)
@@ -125,6 +126,29 @@ readers working properly:
On ARM and Power hardware, the load from "default_struct.a"
can now be speculated, such that it might happen before the
rcu_dereference(). This could result in bugs due to misordering.
+ Performing the comparison with "ptr_eq()" ensures the compiler
+ does not perform such transformation.
+
+ If the comparison is against another pointer, the compiler is
+ allowed to use either pointer for the following accesses, which
+ loses the address dependency and allows weakly-ordered
+ architectures such as ARM and PowerPC to speculate the
+ address-dependent load before rcu_dereference(). For example::
+
+ p1 = READ_ONCE(gp);
+ p2 = rcu_dereference(gp);
+ if (p1 == p2) /* BUGGY!!! */
+ do_default(p2->a);
+
+ The compiler can use p1->a rather than p2->a, destroying the
+ address dependency. Performing the comparison with "ptr_eq()"
+ ensures the compiler preserves the address dependencies.
+ Corrected code::
+
+ p1 = READ_ONCE(gp);
+ p2 = rcu_dereference(gp);
+ if (ptr_eq(p1, p2))
+ do_default(p2->a);
However, comparisons are OK in the following cases:
@@ -204,6 +228,10 @@ readers working properly:
comparison will provide exactly the information that the
compiler needs to deduce the value of the pointer.
+ When in doubt, use operations that preserve address dependencies
+ (such as "ptr_eq()") to compare pointers obtained from
+ rcu_dereference() against non-NULL pointers.
+
- Disable any value-speculation optimizations that your compiler
might provide, especially if you are making use of feedback-based
optimizations that take data collected from prior runs. Such
--
2.39.2
* Re: [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq()
2024-10-04 18:27 ` [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
@ 2024-10-04 21:15 ` Joel Fernandes
2024-10-06 19:52 ` David Laight
1 sibling, 0 replies; 26+ messages in thread
From: Joel Fernandes @ 2024-10-04 21:15 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm, Gary Guo,
Nikita Popov, llvm
On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> Refer to ptr_eq() in the rcu_dereference() documentation.
>
> ptr_eq() is a mechanism that preserves address dependencies when
> comparing pointers, and should be favored when comparing a pointer
> obtained from rcu_dereference() against another pointer.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Acked-by: Alan Stern <stern@rowland.harvard.edu>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Cc: John Stultz <jstultz@google.com>
> Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Joel Fernandes <joel@joelfernandes.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
thanks,
- Joel
* RE: [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq()
2024-10-04 18:27 ` [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
2024-10-04 21:15 ` Joel Fernandes
@ 2024-10-06 19:52 ` David Laight
2024-10-06 20:39 ` Paul E. McKenney
2024-10-07 11:01 ` Sebastian Andrzej Siewior
1 sibling, 2 replies; 26+ messages in thread
From: David Laight @ 2024-10-06 19:52 UTC (permalink / raw)
To: 'Mathieu Desnoyers', Boqun Feng
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael@gmail.com, Mateusz Guzik,
Jonas Oberhauser, rcu@vger.kernel.org, linux-mm@kvack.org,
lkmm@lists.linux.dev, Gary Guo, Nikita Popov,
llvm@lists.linux.dev
From: Mathieu Desnoyers
> Sent: 04 October 2024 19:28
>
> Refer to ptr_eq() in the rcu_dereference() documentation.
>
> ptr_eq() is a mechanism that preserves address dependencies when
> comparing pointers, and should be favored when comparing a pointer
> obtained from rcu_dereference() against another pointer.
Why does this ever really matter for rcu?
The check just ensures that any speculative load uses
a specific one of the pointers when they differ.
This can only matter if you care about the side effects
of the speculative load.
But rcu is all about (things like) lockless list following.
So you need to wait until it is impossible for another
execution context to have a reference to some memory
before actually completely invalidating it (ie kfree()).
And that 50 line comment is pointless.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
* Re: [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq()
2024-10-06 19:52 ` David Laight
@ 2024-10-06 20:39 ` Paul E. McKenney
2024-10-07 11:01 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 26+ messages in thread
From: Paul E. McKenney @ 2024-10-06 20:39 UTC (permalink / raw)
To: David Laight
Cc: 'Mathieu Desnoyers', Boqun Feng,
linux-kernel@vger.kernel.org, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael@gmail.com,
Mateusz Guzik, Jonas Oberhauser, rcu@vger.kernel.org,
linux-mm@kvack.org, lkmm@lists.linux.dev, Gary Guo, Nikita Popov,
llvm@lists.linux.dev
On Sun, Oct 06, 2024 at 07:52:49PM +0000, David Laight wrote:
> From: Mathieu Desnoyers
> > Sent: 04 October 2024 19:28
> >
> > Refer to ptr_eq() in the rcu_dereference() documentation.
> >
> > ptr_eq() is a mechanism that preserves address dependencies when
> > comparing pointers, and should be favored when comparing a pointer
> > obtained from rcu_dereference() against another pointer.
>
> Why does this ever really matter for rcu?
>
> The check just ensure that any speculative load uses
> a specific one of the pointers when they are different.
> This can only matter if you care about the side effects
> of the speculative load.
It can matter when there are static variables used during OOM. The code
must check the pointer against the addresses of the static variables
to work out how to free them, which can enable the compiler to do
specialization optimizations that destroy the address dependencies.
Which is OK if those static variables are compile-time initialized or
some such. But not if they get new values each time they go into the
list, as this can result in the reader seeing pre-initialization garbage.
An admittedly rare but very real use case.
Thanx, Paul
> But rcu is all about (things like) lockless list following.
> So you need to wait until it is impossible for another
> execution context to have a reference to some memory
> before actually completely invalidating it (ie kfree()).
>
> And that 50 line comment is pointless.
>
> David
>
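Paul's static-fallback scenario, as a compilable userspace sketch (the names and the rcu_dereference() stub are hypothetical simplifications; in the kernel, a plain comparison against &default_struct is what lets the compiler break the address dependency unless ptr_eq() is used):

```c
#include <assert.h>

/* Userspace stand-ins for the kernel primitives (simplified). */
#define rcu_dereference(p) (*(volatile __typeof__(p) *)&(p))
#define OPTIMIZER_HIDE_VAR(var) __asm__ ("" : "=r" (var) : "0" (var))

static inline int ptr_eq(const volatile void *a, const volatile void *b)
{
	OPTIMIZER_HIDE_VAR(a);
	OPTIMIZER_HIDE_VAR(b);
	return a == b;
}

struct foo { int a; };

/* Static fallback object, re-initialized each time it is reused. */
static struct foo default_struct = { .a = 42 };
static struct foo *gp = &default_struct;

static int reader(void)
{
	struct foo *p = rcu_dereference(gp);

	/*
	 * With a plain "p == &default_struct", the compiler may read
	 * default_struct.a directly, possibly before the pointer load,
	 * observing pre-initialization garbage. ptr_eq() keeps the
	 * load going through p, preserving the dependency.
	 */
	if (ptr_eq(p, &default_struct))
		return p->a;
	return p ? p->a : -1;
}
```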
* Re: RE: [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq()
2024-10-06 19:52 ` David Laight
2024-10-06 20:39 ` Paul E. McKenney
@ 2024-10-07 11:01 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 26+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-10-07 11:01 UTC (permalink / raw)
To: David Laight
Cc: 'Mathieu Desnoyers', Boqun Feng,
linux-kernel@vger.kernel.org, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Paul E. McKenney, Will Deacon, Alan Stern,
John Stultz, Neeraj Upadhyay, Frederic Weisbecker, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, Steven Rostedt, Lai Jiangshan,
Zqiang, Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael@gmail.com, Mateusz Guzik,
Jonas Oberhauser, rcu@vger.kernel.org, linux-mm@kvack.org,
lkmm@lists.linux.dev, Gary Guo, Nikita Popov,
llvm@lists.linux.dev
On 2024-10-06 19:52:49 [+0000], David Laight wrote:
> From: Mathieu Desnoyers
> > Sent: 04 October 2024 19:28
> >
> > Refer to ptr_eq() in the rcu_dereference() documentation.
> >
> > ptr_eq() is a mechanism that preserves address dependencies when
> > comparing pointers, and should be favored when comparing a pointer
> > obtained from rcu_dereference() against another pointer.
>
> Why does this ever really matter for rcu?
>
> The check just ensure that any speculative load uses
> a specific one of the pointers when they are different.
> This can only matter if you care about the side effects
> of the speculative load.
>
> But rcu is all about (things like) lockless list following.
> So you need to wait until it is impossible for another
> execution context to have a reference to some memory
> before actually completely invalidating it (ie kfree()).
Not always.
Non-RCU code would have locking with a barrier to ensure a reload.
RCU would not have the barrier. Assuming the pointer points to a
struct, the compiler could load an element through the first pointer
and keep using it after it has ensured the pointers are equal. However,
based on the logic, the content could have changed in the meantime, and
the compiler would keep the previous value rather than load the new
one.
> David
Sebastian
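The transformation Sebastian describes, sketched with a hypothetical struct (the comment marks the substitution the compiler is allowed to make once the pointers compare equal):

```c
#include <assert.h>

struct foo { int a; };

/*
 * After the comparison succeeds, the compiler knows p1 == p2 and may
 * return the value previously loaded through p1 instead of performing
 * a fresh load through p2; if p2 came from rcu_dereference(), the
 * reader can then observe a stale value.
 */
static int reader(struct foo *p1, struct foo *p2)
{
	int v = p1->a;		/* earlier load through p1 */

	if (p1 == p2)
		return p2->a;	/* may be folded to "return v" */
	return v;
}
```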
* [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-04 18:27 [RFC PATCH v2 0/4] sched+mm: Track lazy active mm existence with hazard pointers Mathieu Desnoyers
2024-10-04 18:27 ` [RFC PATCH v2 1/4] compiler.h: Introduce ptr_eq() to preserve address dependency Mathieu Desnoyers
2024-10-04 18:27 ` [RFC PATCH v2 2/4] Documentation: RCU: Refer to ptr_eq() Mathieu Desnoyers
@ 2024-10-04 18:27 ` Mathieu Desnoyers
2024-10-04 21:25 ` Joel Fernandes
` (2 more replies)
2024-10-04 18:27 ` [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence Mathieu Desnoyers
3 siblings, 3 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 18:27 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm
This API provides existence guarantees of objects through Hazard
Pointers (HP). This minimalist implementation is intended for use
with preemption disabled, but can be extended further as needed.
Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
across the entire system.
Its main benefit over RCU is that it allows fast reclaim of
HP-protected pointers without needing to wait for a grace period.
It also allows the hazard pointer scan to call a user-defined callback
to retire a hazard pointer slot immediately if needed. This callback
may, for instance, issue an IPI to the relevant CPU.
There are a few possible use-cases for this in the Linux kernel:
- Improve performance of mm_count by replacing lazy active mm by HP.
- Guarantee object existence on pointer dereference to use refcount:
- replace locking used for that purpose in some drivers,
- replace RCU + inc_not_zero pattern,
- rtmutex: Improve situations where locks need to be taken in
reverse dependency chain order by guaranteeing existence of
first and second locks in traversal order, allowing them to be
locked in the correct order (which is reverse from traversal
order) rather than try-lock+retry on nested lock.
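The slot lifecycle can be emulated in single-threaded userspace (a toy with one slot and no memory barriers, illustrating only the hp_allocate()/hp_retire() semantics of the API in this patch):

```c
#include <assert.h>
#include <stddef.h>

/* Toy single-slot versions of the structures from this patch. */
struct hp_slot { void *addr; };
struct hp_ctx { struct hp_slot *slot; void *addr; };

/* Publish @addr in @slot; fails if the slot is already in use. */
static struct hp_ctx hp_allocate(struct hp_slot *slot, void *addr)
{
	struct hp_ctx ctx = { NULL, NULL };

	if (!addr || slot->addr)
		return ctx;
	slot->addr = addr;	/* a concurrent hp_scan() would now wait */
	ctx.slot = slot;
	ctx.addr = addr;
	return ctx;
}

/* Unpublish: after this, reclaim of the object may proceed. */
static void hp_retire(struct hp_ctx ctx)
{
	ctx.slot->addr = NULL;
}
```

In the real implementation, hp_scan() busy-waits until no slot still publishes the address being reclaimed, which is what turns this per-CPU publication into an existence guarantee.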
References:
[1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
lock-free objects," in IEEE Transactions on Parallel and
Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/
Link: https://lpc.events/event/18/contributions/1731/
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
Changes since v0:
- Remove slot variable from hp_dereference_allocate().
---
include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
kernel/Makefile | 2 +-
kernel/hp.c | 46 +++++++++++++
3 files changed, 205 insertions(+), 1 deletion(-)
create mode 100644 include/linux/hp.h
create mode 100644 kernel/hp.c
diff --git a/include/linux/hp.h b/include/linux/hp.h
new file mode 100644
index 000000000000..e85fc4365ea2
--- /dev/null
+++ b/include/linux/hp.h
@@ -0,0 +1,158 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+#ifndef _LINUX_HP_H
+#define _LINUX_HP_H
+
+/*
+ * HP: Hazard Pointers
+ *
+ * This API provides existence guarantees of objects through hazard
+ * pointers.
+ *
+ * It uses a fixed number of hazard pointer slots (nr_cpus) across the
+ * entire system for each HP domain.
+ *
+ * Its main benefit over RCU is that it allows fast reclaim of
+ * HP-protected pointers without needing to wait for a grace period.
+ *
+ * It also allows the hazard pointer scan to call a user-defined callback
+ * to retire a hazard pointer slot immediately if needed. This callback
+ * may, for instance, issue an IPI to the relevant CPU.
+ *
+ * References:
+ *
+ * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
+ * lock-free objects," in IEEE Transactions on Parallel and
+ * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
+ */
+
+#include <linux/rcupdate.h>
+
+/*
+ * Hazard pointer slot.
+ */
+struct hp_slot {
+ void *addr;
+};
+
+/*
+ * Hazard pointer context, returned by hp_use().
+ */
+struct hp_ctx {
+ struct hp_slot *slot;
+ void *addr;
+};
+
+/*
+ * hp_scan: Scan hazard pointer domain for @addr.
+ *
+ * Scan hazard pointer domain for @addr.
+ * If @retire_cb is NULL, wait to observe that each slot contains a value
+ * that differs from @addr.
+ * If @retire_cb is non-NULL, invoke @callback for each slot containing
+ * @addr.
+ */
+void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
+ void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
+
+/* Get the hazard pointer context address (may be NULL). */
+static inline
+void *hp_ctx_addr(struct hp_ctx ctx)
+{
+ return ctx.addr;
+}
+
+/*
+ * hp_allocate: Allocate a hazard pointer.
+ *
+ * Allocate a hazard pointer slot for @addr. The object existence should
+ * be guaranteed by the caller. Expects to be called from preempt
+ * disable context.
+ *
+ * Returns a hazard pointer context.
+ */
+static inline
+struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
+{
+ struct hp_slot *slot;
+ struct hp_ctx ctx;
+
+ if (!addr)
+ goto fail;
+ slot = this_cpu_ptr(percpu_slots);
+ /*
+ * A single hazard pointer slot per CPU is available currently.
+ * Other hazard pointer domains can eventually have a different
+ * configuration.
+ */
+ if (READ_ONCE(slot->addr))
+ goto fail;
+ WRITE_ONCE(slot->addr, addr); /* Store B */
+ ctx.slot = slot;
+ ctx.addr = addr;
+ return ctx;
+
+fail:
+ ctx.slot = NULL;
+ ctx.addr = NULL;
+ return ctx;
+}
+
+/*
+ * hp_dereference_allocate: Dereference and allocate a hazard pointer.
+ *
+ * Returns a hazard pointer context. Expects to be called from preempt
+ * disable context.
+ */
+static inline
+struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
+{
+ void *addr, *addr2;
+ struct hp_ctx ctx;
+
+ addr = READ_ONCE(*addr_p);
+retry:
+ ctx = hp_allocate(percpu_slots, addr);
+ if (!hp_ctx_addr(ctx))
+ goto fail;
+ /* Memory ordering: Store B before Load A. */
+ smp_mb();
+ /*
+ * Use RCU dereference without lockdep checks, because
+ * lockdep is not aware of HP guarantees.
+ */
+ addr2 = rcu_access_pointer(*addr_p); /* Load A */
+ /*
+ * If @addr_p content has changed since the first load,
+ * clear the hazard pointer and try again.
+ */
+ if (!ptr_eq(addr2, addr)) {
+ WRITE_ONCE(ctx.slot->addr, NULL);
+ if (!addr2)
+ goto fail;
+ addr = addr2;
+ goto retry;
+ }
+ /*
+ * Use addr2 loaded from rcu_access_pointer() to preserve
+ * address dependency ordering.
+ */
+ ctx.addr = addr2;
+ return ctx;
+
+fail:
+ ctx.slot = NULL;
+ ctx.addr = NULL;
+ return ctx;
+}
+
+/* Retire the hazard pointer in @ctx. */
+static inline
+void hp_retire(const struct hp_ctx ctx)
+{
+ smp_store_release(&ctx.slot->addr, NULL);
+}
+
+#endif /* _LINUX_HP_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 3c13240dfc9f..ec16de96fa80 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
cpu.o exit.o softirq.o resource.o \
sysctl.o capability.o ptrace.o user.o \
signal.o sys.o umh.o workqueue.o pid.o task_work.o \
- extable.o params.o \
+ extable.o params.o hp.o \
kthread.o sys_ni.o nsproxy.o \
notifier.o ksysfs.o cred.o reboot.o \
async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
diff --git a/kernel/hp.c b/kernel/hp.c
new file mode 100644
index 000000000000..b2447bf15300
--- /dev/null
+++ b/kernel/hp.c
@@ -0,0 +1,46 @@
+// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
+//
+// SPDX-License-Identifier: LGPL-2.1-or-later
+
+/*
+ * HP: Hazard Pointers
+ */
+
+#include <linux/hp.h>
+#include <linux/percpu.h>
+
+/*
+ * hp_scan: Scan hazard pointer domain for @addr.
+ *
+ * Scan hazard pointer domain for @addr.
+ * If @retire_cb is non-NULL, invoke @callback for each slot containing
+ * @addr.
+ * Wait to observe that each slot contains a value that differs from
+ * @addr before returning.
+ */
+void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
+ void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
+{
+ int cpu;
+
+ /*
+ * Store A precedes hp_scan(): it unpublishes addr (sets it to
+ * NULL or to a different value), and thus hides it from hazard
+ * pointer readers.
+ */
+
+ if (!addr)
+ return;
+ /* Memory ordering: Store A before Load B. */
+ smp_mb();
+ /* Scan all CPUs slots. */
+ for_each_possible_cpu(cpu) {
+ struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
+
+ if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
+ retire_cb(cpu, slot, addr);
+ /* Busy-wait if node is found. */
+ while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
+ cpu_relax();
+ }
+}
--
2.39.2
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-04 18:27 ` [RFC PATCH v2 3/4] hp: Implement Hazard Pointers Mathieu Desnoyers
@ 2024-10-04 21:25 ` Joel Fernandes
2024-10-05 12:05 ` Mathieu Desnoyers
2024-10-05 11:19 ` Frederic Weisbecker
2024-10-05 16:04 ` Peter Zijlstra
2 siblings, 1 reply; 26+ messages in thread
From: Joel Fernandes @ 2024-10-04 21:25 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> This API provides existence guarantees of objects through Hazard
> Pointers (HP). This minimalist implementation is specific to use
> with preemption disabled, but can be extended further as needed.
>
> Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
> across the entire system.
>
> Its main benefit over RCU is that it allows fast reclaim of
> HP-protected pointers without needing to wait for a grace period.
>
> It also allows the hazard pointer scan to call a user-defined callback
> to retire a hazard pointer slot immediately if needed. This callback
> may, for instance, issue an IPI to the relevant CPU.
>
> There are a few possible use-cases for this in the Linux kernel:
>
> - Improve performance of mm_count by replacing lazy active mm by HP.
> - Guarantee object existence on pointer dereference to use refcount:
> - replace locking used for that purpose in some drivers,
> - replace RCU + inc_not_zero pattern,
> - rtmutex: Improve situations where locks need to be taken in
> reverse dependency chain order by guaranteeing existence of
> first and second locks in traversal order, allowing them to be
> locked in the correct order (which is reverse from traversal
> order) rather than try-lock+retry on nested lock.
>
> References:
>
> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> lock-free objects," in IEEE Transactions on Parallel and
> Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
[ ... ]
> ---
> Changes since v0:
> - Remove slot variable from hp_dereference_allocate().
> ---
> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
> kernel/Makefile | 2 +-
> kernel/hp.c | 46 +++++++++++++
Just a housekeeping comment: ISTR Linus frowning on adding bodies of C
code to header files (like hp_dereference_allocate()). I understand the
rationale may be that these functions are meant to be inlined, but do
all of them have to be? Such headers also hurt code browsing in tools
like clangd: clangd can't independently compile header files, since it
relies on the compiler to generate and extract the AST it uses for
superior code browsing/completion.
Also, have you looked at the benefits of inlining for hp.h?
hp_dereference_allocate() seems large enough that inlining may not
matter much, but I haven't compiled it and looked at the asm myself.
Will continue staring at the code.
thanks,
- Joel
> 3 files changed, 205 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/hp.h
> create mode 100644 kernel/hp.c
>
> diff --git a/include/linux/hp.h b/include/linux/hp.h
> new file mode 100644
> index 000000000000..e85fc4365ea2
> --- /dev/null
> +++ b/include/linux/hp.h
> @@ -0,0 +1,158 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +#ifndef _LINUX_HP_H
> +#define _LINUX_HP_H
> +
> +/*
> + * HP: Hazard Pointers
> + *
> + * This API provides existence guarantees of objects through hazard
> + * pointers.
> + *
> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
> + * entire system for each HP domain.
> + *
> + * Its main benefit over RCU is that it allows fast reclaim of
> + * HP-protected pointers without needing to wait for a grace period.
> + *
> + * It also allows the hazard pointer scan to call a user-defined callback
> + * to retire a hazard pointer slot immediately if needed. This callback
> + * may, for instance, issue an IPI to the relevant CPU.
> + *
> + * References:
> + *
> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> + * lock-free objects," in IEEE Transactions on Parallel and
> + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> + */
> +
> +#include <linux/rcupdate.h>
> +
> +/*
> + * Hazard pointer slot.
> + */
> +struct hp_slot {
> + void *addr;
> +};
> +
> +/*
> + * Hazard pointer context, returned by hp_allocate().
> + */
> +struct hp_ctx {
> + struct hp_slot *slot;
> + void *addr;
> +};
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is NULL, wait to observe that each slot contains a value
> + * that differs from @addr.
> + * If @retire_cb is non-NULL, invoke @retire_cb for each slot containing
> + * @addr.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
> +
> +/* Get the hazard pointer context address (may be NULL). */
> +static inline
> +void *hp_ctx_addr(struct hp_ctx ctx)
> +{
> + return ctx.addr;
> +}
> +
> +/*
> + * hp_allocate: Allocate a hazard pointer.
> + *
> + * Allocate a hazard pointer slot for @addr. The object existence should
> + * be guaranteed by the caller. Expects to be called from preempt
> + * disable context.
> + *
> + * Returns a hazard pointer context.
> + */
> +static inline
> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
> +{
> + struct hp_slot *slot;
> + struct hp_ctx ctx;
> +
> + if (!addr)
> + goto fail;
> + slot = this_cpu_ptr(percpu_slots);
> + /*
> + * A single hazard pointer slot per CPU is available currently.
> + * Other hazard pointer domains can eventually have a different
> + * configuration.
> + */
> + if (READ_ONCE(slot->addr))
> + goto fail;
> + WRITE_ONCE(slot->addr, addr); /* Store B */
> + ctx.slot = slot;
> + ctx.addr = addr;
> + return ctx;
> +
> +fail:
> + ctx.slot = NULL;
> + ctx.addr = NULL;
> + return ctx;
> +}
> +
> +/*
> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> + *
> + * Returns a hazard pointer context. Expects to be called from preempt
> + * disable context.
> + */
> +static inline
> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
> +{
> + void *addr, *addr2;
> + struct hp_ctx ctx;
> +
> + addr = READ_ONCE(*addr_p);
> +retry:
> + ctx = hp_allocate(percpu_slots, addr);
> + if (!hp_ctx_addr(ctx))
> + goto fail;
> + /* Memory ordering: Store B before Load A. */
> + smp_mb();
> + /*
> + * Use RCU dereference without lockdep checks, because
> + * lockdep is not aware of HP guarantees.
> + */
> + addr2 = rcu_access_pointer(*addr_p); /* Load A */
> + /*
> + * If @addr_p content has changed since the first load,
> + * clear the hazard pointer and try again.
> + */
> + if (!ptr_eq(addr2, addr)) {
> + WRITE_ONCE(ctx.slot->addr, NULL);
> + if (!addr2)
> + goto fail;
> + addr = addr2;
> + goto retry;
> + }
> + /*
> + * Use addr2 loaded from rcu_access_pointer() to preserve
> + * address dependency ordering.
> + */
> + ctx.addr = addr2;
> + return ctx;
> +
> +fail:
> + ctx.slot = NULL;
> + ctx.addr = NULL;
> + return ctx;
> +}
> +
> +/* Retire the hazard pointer in @ctx. */
> +static inline
> +void hp_retire(const struct hp_ctx ctx)
> +{
> + smp_store_release(&ctx.slot->addr, NULL);
> +}
> +
> +#endif /* _LINUX_HP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..ec16de96fa80 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> cpu.o exit.o softirq.o resource.o \
> sysctl.o capability.o ptrace.o user.o \
> signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> - extable.o params.o \
> + extable.o params.o hp.o \
> kthread.o sys_ni.o nsproxy.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hp.c b/kernel/hp.c
> new file mode 100644
> index 000000000000..b2447bf15300
> --- /dev/null
> +++ b/kernel/hp.c
> @@ -0,0 +1,46 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * HP: Hazard Pointers
> + */
> +
> +#include <linux/hp.h>
> +#include <linux/percpu.h>
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is non-NULL, invoke @retire_cb for each slot containing
> + * @addr.
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> + int cpu;
> +
> + /*
> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> + * NULL or to a different value), and thus hides it from hazard
> + * pointer readers.
> + */
> +
> + if (!addr)
> + return;
> + /* Memory ordering: Store A before Load B. */
> + smp_mb();
> + /* Scan all CPUs slots. */
> + for_each_possible_cpu(cpu) {
> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> + retire_cb(cpu, slot, addr);
> + /* Busy-wait if node is found. */
> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
> + cpu_relax();
> + }
> +}
> --
> 2.39.2
>
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-04 21:25 ` Joel Fernandes
@ 2024-10-05 12:05 ` Mathieu Desnoyers
0 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-05 12:05 UTC (permalink / raw)
To: Joel Fernandes
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Josh Triplett, Uladzislau Rezki,
Steven Rostedt, Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long,
Mark Rutland, Thomas Gleixner, Vlastimil Babka, maged.michael,
Mateusz Guzik, Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-04 23:25, Joel Fernandes wrote:
> On Fri, Oct 4, 2024 at 2:29 PM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> This API provides existence guarantees of objects through Hazard
>> Pointers (HP). This minimalist implementation is specific to use
>> with preemption disabled, but can be extended further as needed.
>>
>> Each HP domain defines a fixed number of hazard pointer slots (nr_cpus)
>> across the entire system.
>>
>> Its main benefit over RCU is that it allows fast reclaim of
>> HP-protected pointers without needing to wait for a grace period.
>>
>> It also allows the hazard pointer scan to call a user-defined callback
>> to retire a hazard pointer slot immediately if needed. This callback
>> may, for instance, issue an IPI to the relevant CPU.
>>
>> There are a few possible use-cases for this in the Linux kernel:
>>
>> - Improve performance of mm_count by replacing lazy active mm by HP.
>> - Guarantee object existence on pointer dereference to use refcount:
>> - replace locking used for that purpose in some drivers,
>> - replace RCU + inc_not_zero pattern,
>> - rtmutex: Improve situations where locks need to be taken in
>> reverse dependency chain order by guaranteeing existence of
>> first and second locks in traversal order, allowing them to be
>> locked in the correct order (which is reverse from traversal
>> order) rather than try-lock+retry on nested lock.
>>
>> References:
>>
>> [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>> lock-free objects," in IEEE Transactions on Parallel and
>> Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> [ ... ]
>> ---
>> Changes since v0:
>> - Remove slot variable from hp_dereference_allocate().
>> ---
>> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>> kernel/Makefile | 2 +-
>> kernel/hp.c | 46 +++++++++++++
>
> Just a housekeeping comment, ISTR Linus looking down on adding bodies
> of C code to header files (like hp_dereference_allocate). I understand
> maybe the rationale is that the functions included are inlined. But do
> all of them have to be inlined? Such headers also hurt code browsing
> capabilities in code browsers like clangd. clangd doesn't understand
> header files because it can't independently compile them -- it uses
> the compiler to generate and extract the AST for superior code
> browsing/completion.
>
> Also have you looked at the benefits of inlining for hp.h?
> hp_dereference_allocate() seems large enough that inlining may not
> matter much, but I haven't compiled it and looked at the asm myself.
Here is a comparison in userspace:
* With "hp dereference allocate" inlined:
test_hpref_benchmark (smp_mb) nr_reads 1994298193 nr_writes 22293162 nr_ops 2016591355
test_hpref_benchmark (barrier/membarrier) nr_reads 15208690879 nr_writes 1893785 nr_ops 15210584664
* With "hp dereference allocate" implemented as a function call:
test_hpref_benchmark (smp_mb) nr_reads 1558924716 nr_writes 14261028 nr_ops 1573185744
test_hpref_benchmark (barrier/membarrier) nr_reads 5881131707 nr_writes 2005140 nr_ops 5883136847
So when using symmetric memory barriers between hp_allocate()/hp_scan(),
the function call introduces a roughly 20% slowdown.
It's worse in the asymmetric barrier/membarrier case, where it
introduces a 61% slowdown.
Given that the overhead is noticeable, I am tempted to leave the hazard
pointer allocate/retire as inline functions.
About code browsers like clangd, I would recommend improving the tooling
rather than altering the design of the code based on current tooling
limitations.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-04 18:27 ` [RFC PATCH v2 3/4] hp: Implement Hazard Pointers Mathieu Desnoyers
2024-10-04 21:25 ` Joel Fernandes
@ 2024-10-05 11:19 ` Frederic Weisbecker
2024-10-05 11:42 ` Mathieu Desnoyers
2024-10-05 16:04 ` Peter Zijlstra
2 siblings, 1 reply; 26+ messages in thread
From: Frederic Weisbecker @ 2024-10-05 11:19 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> + int cpu;
> +
> + /*
> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> + * NULL or to a different value), and thus hides it from hazard
> + * pointer readers.
> + */
> +
> + if (!addr)
> + return;
> + /* Memory ordering: Store A before Load B. */
> + smp_mb();
> + /* Scan all CPUs slots. */
> + for_each_possible_cpu(cpu) {
> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> + retire_cb(cpu, slot, addr);
> + /* Busy-wait if node is found. */
> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
> + cpu_relax();
You agree that having a single possible per-cpu pointer per context and
a busy-waiting update-side pointer release can't be a general-purpose
hazard pointer implementation, right? :-)
Thanks.
> + }
> +}
> --
> 2.39.2
>
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 11:19 ` Frederic Weisbecker
@ 2024-10-05 11:42 ` Mathieu Desnoyers
0 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-05 11:42 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-05 13:19, Frederic Weisbecker wrote:
> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>> +{
>> + int cpu;
>> +
>> + /*
>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>> + * NULL or to a different value), and thus hides it from hazard
>> + * pointer readers.
>> + */
>> +
>> + if (!addr)
>> + return;
>> + /* Memory ordering: Store A before Load B. */
>> + smp_mb();
>> + /* Scan all CPUs slots. */
>> + for_each_possible_cpu(cpu) {
>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>> +
>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
>> + retire_cb(cpu, slot, addr);
>> + /* Busy-wait if node is found. */
>> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
>> + cpu_relax();
>
> You agree that having a single possible per-cpu pointer per context and a busy
> waiting update side pointer release can't be a general purpose hazard pointer
> implementation, right? :-)
Of course. This is a minimalist implementation, which can be extended in
various ways, some of which I've already implemented as a POC in userspace:
- Increase the number of per-cpu slots available,
- Distinguish between current scan depth target and available
per-cpu slots,
- Fall-back to reference counter when slots are full,
- Allow scanning for a range of addresses (useful for type-safe
memory),
- Allow scanning for a set of hazard pointers (scan batching)
using Bloom filters to probabilistically speed up the comparison
(not implemented yet).
- Implement a queued blocking wait/wakeup when HP scan must wait
(not implemented yet).
- Implement a HP-to-refcount promotion triggered by the HP scan
callback to promote hazard pointers which would be blocked on
to a reference count increment. (not implemented yet)
- Use hazard pointers + refcount to implement smart pointers, which
could be useful for Rust. (not implemented yet)
But my general approach is to wait until the use-cases justify adding
features.
Although if you are curious about any of the points listed above,
just ask and I'll be happy to discuss them in more depth.
Thanks,
Mathieu
>
> Thanks.
>
>> + }
>> +}
>> --
>> 2.39.2
>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-04 18:27 ` [RFC PATCH v2 3/4] hp: Implement Hazard Pointers Mathieu Desnoyers
2024-10-04 21:25 ` Joel Fernandes
2024-10-05 11:19 ` Frederic Weisbecker
@ 2024-10-05 16:04 ` Peter Zijlstra
2024-10-05 16:07 ` Peter Zijlstra
2024-10-05 18:50 ` Mathieu Desnoyers
2 siblings, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2024-10-05 16:04 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
> kernel/Makefile | 2 +-
> kernel/hp.c | 46 +++++++++++++
> 3 files changed, 205 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/hp.h
> create mode 100644 kernel/hp.c
>
> diff --git a/include/linux/hp.h b/include/linux/hp.h
> new file mode 100644
> index 000000000000..e85fc4365ea2
> --- /dev/null
> +++ b/include/linux/hp.h
> @@ -0,0 +1,158 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +#ifndef _LINUX_HP_H
> +#define _LINUX_HP_H
> +
> +/*
> + * HP: Hazard Pointers
> + *
> + * This API provides existence guarantees of objects through hazard
> + * pointers.
> + *
> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
> + * entire system for each HP domain.
> + *
> + * Its main benefit over RCU is that it allows fast reclaim of
> + * HP-protected pointers without needing to wait for a grace period.
> + *
> + * It also allows the hazard pointer scan to call a user-defined callback
> + * to retire a hazard pointer slot immediately if needed. This callback
> + * may, for instance, issue an IPI to the relevant CPU.
> + *
> + * References:
> + *
> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
> + * lock-free objects," in IEEE Transactions on Parallel and
> + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
> + */
> +
> +#include <linux/rcupdate.h>
> +
> +/*
> + * Hazard pointer slot.
> + */
> +struct hp_slot {
> + void *addr;
> +};
> +
> +/*
> + * Hazard pointer context, returned by hp_allocate().
> + */
> +struct hp_ctx {
> + struct hp_slot *slot;
> + void *addr;
> +};
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is NULL, wait to observe that each slot contains a value
> + * that differs from @addr.
> + * If @retire_cb is non-NULL, invoke @retire_cb for each slot containing
> + * @addr.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
struct hp_domain {
struct hp_slot __percpu *slots
};
might clarify things a wee little.
> +
> +/* Get the hazard pointer context address (may be NULL). */
> +static inline
> +void *hp_ctx_addr(struct hp_ctx ctx)
> +{
> + return ctx.addr;
> +}
From where I'm sitting this seems like superfluous fluff, what's wrong
with ctx.addr ?
> +/*
> + * hp_allocate: Allocate a hazard pointer.
> + *
> + * Allocate a hazard pointer slot for @addr. The object existence should
> + * be guaranteed by the caller. Expects to be called from preempt
> + * disable context.
> + *
> + * Returns a hazard pointer context.
So you made the WTF'o'meter crack, this here function does not allocate
nothing. Naming is bad. At best this is something like
try-set-hazard-pointer or somesuch.
> + */
> +static inline
> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
> +{
> + struct hp_slot *slot;
> + struct hp_ctx ctx;
> +
> + if (!addr)
> + goto fail;
> + slot = this_cpu_ptr(percpu_slots);
> + /*
> + * A single hazard pointer slot per CPU is available currently.
> + * Other hazard pointer domains can eventually have a different
> + * configuration.
> + */
> + if (READ_ONCE(slot->addr))
> + goto fail;
> + WRITE_ONCE(slot->addr, addr); /* Store B */
> + ctx.slot = slot;
> + ctx.addr = addr;
> + return ctx;
> +
> +fail:
> + ctx.slot = NULL;
> + ctx.addr = NULL;
> + return ctx;
> +}
> +
> +/*
> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> + *
> + * Returns a hazard pointer context. Expects to be called from preempt
> + * disable context.
> + */
More terrible naming. Same as above, but additionally, I would expect a
'dereference' to actually dereference the pointer and have a return
value of the dereferenced type.
This function seems to double-check and update the hp_ctx thing. I'm not
at all sure yet wtf this is doing -- and the total lack of comments
isn't helping.
> +static inline
> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
> +{
> + void *addr, *addr2;
> + struct hp_ctx ctx;
> +
> + addr = READ_ONCE(*addr_p);
> +retry:
> + ctx = hp_allocate(percpu_slots, addr);
> + if (!hp_ctx_addr(ctx))
> + goto fail;
> + /* Memory ordering: Store B before Load A. */
> + smp_mb();
> + /*
> + * Use RCU dereference without lockdep checks, because
> + * lockdep is not aware of HP guarantees.
> + */
> + addr2 = rcu_access_pointer(*addr_p); /* Load A */
> + /*
> + * If @addr_p content has changed since the first load,
> + * clear the hazard pointer and try again.
> + */
> + if (!ptr_eq(addr2, addr)) {
> + WRITE_ONCE(ctx.slot->addr, NULL);
> + if (!addr2)
> + goto fail;
> + addr = addr2;
> + goto retry;
> + }
> + /*
> + * Use addr2 loaded from rcu_access_pointer() to preserve
> + * address dependency ordering.
> + */
> + ctx.addr = addr2;
> + return ctx;
> +
> +fail:
> + ctx.slot = NULL;
> + ctx.addr = NULL;
> + return ctx;
> +}
> +
> +/* Retire the hazard pointer in @ctx. */
> +static inline
> +void hp_retire(const struct hp_ctx ctx)
> +{
> + smp_store_release(&ctx.slot->addr, NULL);
> +}
> +
> +#endif /* _LINUX_HP_H */
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 3c13240dfc9f..ec16de96fa80 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
> cpu.o exit.o softirq.o resource.o \
> sysctl.o capability.o ptrace.o user.o \
> signal.o sys.o umh.o workqueue.o pid.o task_work.o \
> - extable.o params.o \
> + extable.o params.o hp.o \
> kthread.o sys_ni.o nsproxy.o \
> notifier.o ksysfs.o cred.o reboot.o \
> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
> diff --git a/kernel/hp.c b/kernel/hp.c
> new file mode 100644
> index 000000000000..b2447bf15300
> --- /dev/null
> +++ b/kernel/hp.c
> @@ -0,0 +1,46 @@
> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> +//
> +// SPDX-License-Identifier: LGPL-2.1-or-later
> +
> +/*
> + * HP: Hazard Pointers
> + */
> +
> +#include <linux/hp.h>
> +#include <linux/percpu.h>
> +
> +/*
> + * hp_scan: Scan hazard pointer domain for @addr.
> + *
> + * Scan hazard pointer domain for @addr.
> + * If @retire_cb is non-NULL, invoke @retire_cb for each slot containing
> + * @addr.
> + * Wait to observe that each slot contains a value that differs from
> + * @addr before returning.
> + */
> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> +{
> + int cpu;
> +
> + /*
> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> + * NULL or to a different value), and thus hides it from hazard
> + * pointer readers.
> + */
> +
> + if (!addr)
> + return;
> + /* Memory ordering: Store A before Load B. */
> + smp_mb();
> + /* Scan all CPUs slots. */
> + for_each_possible_cpu(cpu) {
> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> +
> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> + retire_cb(cpu, slot, addr);
Is retire_cb allowed to cmpxchg the thing?
> + /* Busy-wait if node is found. */
> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
> + cpu_relax();
This really should be using smp_cond_load_acquire()
> + }
> +}
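[Editorial note: for reference, the busy-wait loop rewritten with
smp_cond_load_acquire(), as suggested, might look like this (sketch
against the posted code, untested):]

```c
/* Busy-wait until the slot no longer contains @addr. */
smp_cond_load_acquire(&slot->addr, VAL != addr);
```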
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 16:04 ` Peter Zijlstra
@ 2024-10-05 16:07 ` Peter Zijlstra
2024-10-05 18:56 ` Mathieu Desnoyers
2024-10-05 18:50 ` Mathieu Desnoyers
1 sibling, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2024-10-05 16:07 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > +{
> > + int cpu;
> > +
> > + /*
> > + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > + * NULL or to a different value), and thus hides it from hazard
> > + * pointer readers.
> > + */
This should probably assert we're in a preemptible context. Otherwise
people will start using this in non-preemptible context and then we get
to unfuck things later.
> > +
> > + if (!addr)
> > + return;
> > + /* Memory ordering: Store A before Load B. */
> > + smp_mb();
> > + /* Scan all CPUs slots. */
> > + for_each_possible_cpu(cpu) {
> > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > +
> > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> > + retire_cb(cpu, slot, addr);
>
> Is retire_cb allowed to cmpxchg the thing?
>
> > + /* Busy-wait if node is found. */
> > + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
> > + cpu_relax();
>
> This really should be using smp_cond_load_acquire()
>
> > + }
> > +}
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 16:07 ` Peter Zijlstra
@ 2024-10-05 18:56 ` Mathieu Desnoyers
2024-10-07 10:42 ` Peter Zijlstra
0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-05 18:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-05 18:07, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
>> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>
>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>> +{
>>> + int cpu;
>>> +
>>> + /*
>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>> + * NULL or to a different value), and thus hides it from hazard
>>> + * pointer readers.
>>> + */
>
> This should probably assert we're in a preemptible context. Otherwise
> people will start using this in non-preemptible context and then we get
> to unfuck things later.
Something like this ?
+ /* Should only be called from preemptible context. */
+ WARN_ON_ONCE(in_atomic());
>
>>> +
>>> + if (!addr)
>>> + return;
>>> + /* Memory ordering: Store A before Load B. */
>>> + smp_mb();
>>> + /* Scan all CPUs slots. */
>>> + for_each_possible_cpu(cpu) {
>>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>> +
>>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
>>> + retire_cb(cpu, slot, addr);
>>
>> Is retire_cb allowed to cmpxchg the thing?
Renaming retire_cb to "on_match_cb". Whatever the callback does needs to
be done with knowledge of the slot user (e.g. IPI).
>>
>>> + /* Busy-wait if node is found. */
>>> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
>>> + cpu_relax();
>>
>> This really should be using smp_cond_load_acquire()
Done,
Thanks,
Mathieu
>>
>>> + }
>>> +}
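For illustration, the waiting semantics that smp_cond_load_acquire() replaces can be sketched in userspace with C11 atomics, with a thread standing in for the hazard pointer reader. The hp_slot layout and the single-reader scenario here are simplified assumptions, not the proposed kernel code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct hp_slot {
	_Atomic(void *) addr;
};

/*
 * Scan-side wait: spin until @slot no longer holds @addr.  In the kernel
 * this collapses to smp_cond_load_acquire(&slot->addr, VAL != addr).
 */
static void wait_for_remove(struct hp_slot *slot, void *addr)
{
	while (atomic_load_explicit(&slot->addr, memory_order_acquire) == addr)
		;	/* kernel: cpu_relax() */
}

/* Reader side: clear the post with release semantics (smp_store_release). */
static void *reader_remove(void *arg)
{
	struct hp_slot *slot = arg;

	atomic_store_explicit(&slot->addr, NULL, memory_order_release);
	return NULL;
}

static struct hp_slot slot;
static int obj;

int demo(void)
{
	pthread_t t;

	atomic_store_explicit(&slot.addr, &obj, memory_order_relaxed);
	if (pthread_create(&t, NULL, reader_remove, &slot))
		return 0;
	wait_for_remove(&slot, &obj);	/* returns once the post is gone */
	pthread_join(t, NULL);
	return atomic_load_explicit(&slot.addr, memory_order_relaxed) == NULL;
}
```

The acquire load pairs with the reader's release store, so once the scanner observes the slot cleared, the reader's critical section is ordered before the reclaim that follows.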
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 18:56 ` Mathieu Desnoyers
@ 2024-10-07 10:42 ` Peter Zijlstra
2024-10-07 13:22 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2024-10-07 10:42 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote:
> On 2024-10-05 18:07, Peter Zijlstra wrote:
> > On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
> >
> > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > +{
> > > > + int cpu;
> > > > +
> > > > + /*
> > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > + * NULL or to a different value), and thus hides it from hazard
> > > > + * pointer readers.
> > > > + */
> >
> > This should probably assert we're in a preemptible context. Otherwise
> > people will start using this in non-preemptible context and then we get
> > to unfuck things later.
>
> Something like this ?
>
> + /* Should only be called from preemptible context. */
> + WARN_ON_ONCE(in_atomic());
lockdep_assert_preemption_enabled();
that also checks local IRQ state IIRC.
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-07 10:42 ` Peter Zijlstra
@ 2024-10-07 13:22 ` Mathieu Desnoyers
0 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-07 13:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-07 12:42, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 02:56:26PM -0400, Mathieu Desnoyers wrote:
>> On 2024-10-05 18:07, Peter Zijlstra wrote:
>>> On Sat, Oct 05, 2024 at 06:04:44PM +0200, Peter Zijlstra wrote:
>>>> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>>>
>>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>>> +{
>>>>> + int cpu;
>>>>> +
>>>>> + /*
>>>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>>> + * NULL or to a different value), and thus hides it from hazard
>>>>> + * pointer readers.
>>>>> + */
>>>
>>> This should probably assert we're in a preemptible context. Otherwise
>>> people will start using this in non-preemptible context and then we get
>>> to unfuck things later.
>>
>> Something like this ?
>>
>> + /* Should only be called from preemptible context. */
>> + WARN_ON_ONCE(in_atomic());
>
> lockdep_assert_preemption_enabled();
>
> that also checks local IRQ state IIRC.
I'll use this instead, thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 16:04 ` Peter Zijlstra
2024-10-05 16:07 ` Peter Zijlstra
@ 2024-10-05 18:50 ` Mathieu Desnoyers
2024-10-07 10:40 ` Peter Zijlstra
1 sibling, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-05 18:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-05 18:04, Peter Zijlstra wrote:
> On Fri, Oct 04, 2024 at 02:27:33PM -0400, Mathieu Desnoyers wrote:
>> include/linux/hp.h | 158 +++++++++++++++++++++++++++++++++++++++++++++
>> kernel/Makefile | 2 +-
>> kernel/hp.c | 46 +++++++++++++
>> 3 files changed, 205 insertions(+), 1 deletion(-)
>> create mode 100644 include/linux/hp.h
>> create mode 100644 kernel/hp.c
>>
>> diff --git a/include/linux/hp.h b/include/linux/hp.h
>> new file mode 100644
>> index 000000000000..e85fc4365ea2
>> --- /dev/null
>> +++ b/include/linux/hp.h
>> @@ -0,0 +1,158 @@
>> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> +//
>> +// SPDX-License-Identifier: LGPL-2.1-or-later
>> +
>> +#ifndef _LINUX_HP_H
>> +#define _LINUX_HP_H
>> +
>> +/*
>> + * HP: Hazard Pointers
>> + *
>> + * This API provides existence guarantees of objects through hazard
>> + * pointers.
>> + *
>> + * It uses a fixed number of hazard pointer slots (nr_cpus) across the
>> + * entire system for each HP domain.
>> + *
>> + * Its main benefit over RCU is that it allows fast reclaim of
>> + * HP-protected pointers without needing to wait for a grace period.
>> + *
>> + * It also allows the hazard pointer scan to call a user-defined callback
>> + * to retire a hazard pointer slot immediately if needed. This callback
>> + * may, for instance, issue an IPI to the relevant CPU.
>> + *
>> + * References:
>> + *
>> + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for
>> + * lock-free objects," in IEEE Transactions on Parallel and
>> + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004
>> + */
>> +
>> +#include <linux/rcupdate.h>
>> +
>> +/*
>> + * Hazard pointer slot.
>> + */
>> +struct hp_slot {
>> + void *addr;
>> +};
>> +
>> +/*
>> + * Hazard pointer context, returned by hp_use().
>> + */
>> +struct hp_ctx {
>> + struct hp_slot *slot;
>> + void *addr;
>> +};
>> +
>> +/*
>> + * hp_scan: Scan hazard pointer domain for @addr.
>> + *
>> + * Scan hazard pointer domain for @addr.
>> + * If @retire_cb is NULL, wait to observe that each slot contains a value
>> + * that differs from @addr.
>> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
>> + * @addr.
>> + */
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr));
>
> struct hp_domain {
> struct hp_slot __percpu *slots
> };
>
> might clarify things a wee little.
Good point. This introduces:
#define DECLARE_HP_DOMAIN(domain) \
	extern struct hp_domain domain

#define DEFINE_HP_DOMAIN(domain) \
	static DEFINE_PER_CPU(struct hp_slot, __ ## domain ## _slots); \
	struct hp_domain domain = { \
		.percpu_slots = &__ ## domain ## _slots, \
	}
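A userspace sketch of how these macros could be exercised, with a plain array standing in for DEFINE_PER_CPU. The slot count, the hp_domain layout, and the test_domain name are illustrative assumptions:

```c
#include <assert.h>

/* Model: one hazard-pointer slot per "CPU"; 4 CPUs stand in for nr_cpus. */
#define NR_CPUS_MODEL 4

struct hp_slot {
	void *addr;
};

struct hp_domain {
	struct hp_slot *percpu_slots;	/* kernel: struct hp_slot __percpu * */
};

/* Userspace stand-ins for the proposed DECLARE/DEFINE_HP_DOMAIN macros. */
#define DECLARE_HP_DOMAIN(domain) \
	extern struct hp_domain domain

#define DEFINE_HP_DOMAIN(domain) \
	static struct hp_slot __ ## domain ## _slots[NR_CPUS_MODEL]; \
	struct hp_domain domain = { \
		.percpu_slots = __ ## domain ## _slots, \
	}

DEFINE_HP_DOMAIN(test_domain);
```

Callers then pass &test_domain to hp_scan() and friends instead of a raw __percpu slot pointer, which keeps the slot storage private to the domain definition.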
>
>> +
>> +/* Get the hazard pointer context address (may be NULL). */
>> +static inline
>> +void *hp_ctx_addr(struct hp_ctx ctx)
>> +{
>> + return ctx.addr;
>> +}
>
> From where I'm sitting this seems like superfluous fluff, what's wrong
> with ctx.addr ?
I'm OK removing the accessor and just using ctx.addr.
>
>> +/*
>> + * hp_allocate: Allocate a hazard pointer.
>> + *
>> + * Allocate a hazard pointer slot for @addr. The object existence should
>> + * be guaranteed by the caller. Expects to be called from preempt
>> + * disable context.
>> + *
>> + * Returns a hazard pointer context.
>
> So you made the WTF'o'meter crack, this here function does not allocate
> nothing. Naming is bad. At best this is something like
> try-set-hazard-pointer or somesuch.
I went with the naming from the 2004 paper from Maged Michael, but I
agree it could be clearer.
I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
"posting" the intent to use a pointer (as in on a metaphorical billboard),
and removing it when it's done.
>
>> + */
>> +static inline
>> +struct hp_ctx hp_allocate(struct hp_slot __percpu *percpu_slots, void *addr)
>> +{
>> + struct hp_slot *slot;
>> + struct hp_ctx ctx;
>> +
>> + if (!addr)
>> + goto fail;
>> + slot = this_cpu_ptr(percpu_slots);
>> + /*
>> + * A single hazard pointer slot per CPU is available currently.
>> + * Other hazard pointer domains can eventually have a different
>> + * configuration.
>> + */
>> + if (READ_ONCE(slot->addr))
>> + goto fail;
>> + WRITE_ONCE(slot->addr, addr); /* Store B */
>> + ctx.slot = slot;
>> + ctx.addr = addr;
>> + return ctx;
>> +
>> +fail:
>> + ctx.slot = NULL;
>> + ctx.addr = NULL;
>> + return ctx;
>> +}
>> +
>> +/*
>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>> + *
>> + * Returns a hazard pointer context. Expects to be called from preempt
>> + * disable context.
>> + */
>
> More terrible naming. Same as above, but additionally, I would expect a
> 'dereference' to actually dereference the pointer and have a return
> value of the dereferenced type.
hp_dereference_try_post() ?
>
> This function seems to double check and update the hp_ctx thing. I'm not
> at all sure yet wtf this is doing -- and the total lack of comments
> aren't helping.
The hp_ctx contains the outputs.
The function loads *addr_p to then try_post it into a HP slot. On success,
it re-reads the *addr_p (with address dependency) and if it still matches,
use that as output address pointer.
I'm planning to remove hp_ctx, and just have:
/*
* hp_try_post: Try to post a hazard pointer.
*
* Post a hazard pointer slot for @addr. The object existence should
* be guaranteed by the caller. Expects to be called from preempt
* disable context.
*
* Returns true if post succeeds, false otherwise.
*/
static inline
bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
[...]
/*
* hp_dereference_try_post: Dereference and try to post a hazard pointer.
*
* Returns the protected pointer on success, NULL on failure. Expects to
* be called from preempt disable context.
*/
static inline
void *__hp_dereference_try_post(struct hp_domain *hp_domain,
void * const * addr_p, struct hp_slot **_slot)
[...]
#define hp_dereference_try_post(domain, p, slot_p) \
((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
/* Clear the hazard pointer in @slot. */
static inline
void hp_remove(struct hp_slot *slot)
[...]
>
>> +static inline
>> +struct hp_ctx hp_dereference_allocate(struct hp_slot __percpu *percpu_slots, void * const * addr_p)
>> +{
>> + void *addr, *addr2;
>> + struct hp_ctx ctx;
>> +
>> + addr = READ_ONCE(*addr_p);
>> +retry:
>> + ctx = hp_allocate(percpu_slots, addr);
>> + if (!hp_ctx_addr(ctx))
>> + goto fail;
>> + /* Memory ordering: Store B before Load A. */
>> + smp_mb();
>> + /*
>> + * Use RCU dereference without lockdep checks, because
>> + * lockdep is not aware of HP guarantees.
>> + */
>> + addr2 = rcu_access_pointer(*addr_p); /* Load A */
>> + /*
>> + * If @addr_p content has changed since the first load,
>> + * clear the hazard pointer and try again.
>> + */
>> + if (!ptr_eq(addr2, addr)) {
>> + WRITE_ONCE(ctx.slot->addr, NULL);
>> + if (!addr2)
>> + goto fail;
>> + addr = addr2;
>> + goto retry;
>> + }
>> + /*
>> + * Use addr2 loaded from rcu_access_pointer() to preserve
>> + * address dependency ordering.
>> + */
>> + ctx.addr = addr2;
>> + return ctx;
>> +
>> +fail:
>> + ctx.slot = NULL;
>> + ctx.addr = NULL;
>> + return ctx;
>> +}
>> +
>> +/* Retire the hazard pointer in @ctx. */
>> +static inline
>> +void hp_retire(const struct hp_ctx ctx)
>> +{
>> + smp_store_release(&ctx.slot->addr, NULL);
>> +}
>> +
>> +#endif /* _LINUX_HP_H */
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 3c13240dfc9f..ec16de96fa80 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \
>> cpu.o exit.o softirq.o resource.o \
>> sysctl.o capability.o ptrace.o user.o \
>> signal.o sys.o umh.o workqueue.o pid.o task_work.o \
>> - extable.o params.o \
>> + extable.o params.o hp.o \
>> kthread.o sys_ni.o nsproxy.o \
>> notifier.o ksysfs.o cred.o reboot.o \
>> async.o range.o smpboot.o ucount.o regset.o ksyms_common.o
>> diff --git a/kernel/hp.c b/kernel/hp.c
>> new file mode 100644
>> index 000000000000..b2447bf15300
>> --- /dev/null
>> +++ b/kernel/hp.c
>> @@ -0,0 +1,46 @@
>> +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>> +//
>> +// SPDX-License-Identifier: LGPL-2.1-or-later
>> +
>> +/*
>> + * HP: Hazard Pointers
>> + */
>> +
>> +#include <linux/hp.h>
>> +#include <linux/percpu.h>
>> +
>> +/*
>> + * hp_scan: Scan hazard pointer domain for @addr.
>> + *
>> + * Scan hazard pointer domain for @addr.
>> + * If @retire_cb is non-NULL, invoke @callback for each slot containing
>> + * @addr.
>> + * Wait to observe that each slot contains a value that differs from
>> + * @addr before returning.
>> + */
>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>> +{
>> + int cpu;
>> +
>> + /*
>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>> + * NULL or to a different value), and thus hides it from hazard
>> + * pointer readers.
>> + */
>> +
>> + if (!addr)
>> + return;
>> + /* Memory ordering: Store A before Load B. */
>> + smp_mb();
>> + /* Scan all CPUs slots. */
>> + for_each_possible_cpu(cpu) {
>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>> +
>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
>> + retire_cb(cpu, slot, addr);
>
> Is retire_cb allowed to cmpxchg the thing?
It could, but we'd need to make sure the slot is not re-used by another
hp_try_post() before the current user removes its own post. It would
need to synchronize with the current HP user (e.g. with IPI).
I've actually renamed retire_cb to "on_match_cb".
>
>> + /* Busy-wait if node is found. */
>> + while ((smp_load_acquire(&slot->addr)) == addr) /* Load B */
>> + cpu_relax();
>
> This really should be using smp_cond_load_acquire()
Good point,
Thanks,
Mathieu
>
>> + }
>> +}
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-05 18:50 ` Mathieu Desnoyers
@ 2024-10-07 10:40 ` Peter Zijlstra
2024-10-07 14:50 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: Peter Zijlstra @ 2024-10-07 10:40 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> On 2024-10-05 18:04, Peter Zijlstra wrote:
> > > +/*
> > > + * hp_allocate: Allocate a hazard pointer.
> > > + *
> > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > + * be guaranteed by the caller. Expects to be called from preempt
> > > + * disable context.
> > > + *
> > > + * Returns a hazard pointer context.
> >
> > So you made the WTF'o'meter crack, this here function does not allocate
> > nothing. Naming is bad. At best this is something like
> > try-set-hazard-pointer or somesuch.
>
> I went with the naming from the 2004 paper from Maged Michael, but I
> agree it could be clearer.
>
> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> "posting" the intent to use a pointer (as in on a metaphorical billboard),
> and removing it when it's done.
For RCU we've taken to using the word: 'publish', no?
> > > +/*
> > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > + *
> > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > + * disable context.
> > > + */
> >
> > More terrible naming. Same as above, but additionally, I would expect a
> > 'dereference' to actually dereference the pointer and have a return
> > value of the dereferenced type.
>
> hp_dereference_try_post() ?
>
> >
> > This function seems to double check and update the hp_ctx thing. I'm not
> > at all sure yet wtf this is doing -- and the total lack of comments
> > aren't helping.
>
> The hp_ctx contains the outputs.
>
> The function loads *addr_p to then try_post it into a HP slot. On success,
> it re-reads the *addr_p (with address dependency) and if it still matches,
> use that as output address pointer.
>
> I'm planning to remove hp_ctx, and just have:
>
> /*
> * hp_try_post: Try to post a hazard pointer.
> *
> * Post a hazard pointer slot for @addr. The object existence should
> * be guaranteed by the caller. Expects to be called from preempt
> * disable context.
> *
> * Returns true if post succeeds, false otherwise.
> */
> static inline
> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> [...]
>
> /*
> * hp_dereference_try_post: Dereference and try to post a hazard pointer.
> *
> * Returns a hazard pointer context. Expects to be called from preempt
> * disable context.
> */
> static inline
> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
> void * const * addr_p, struct hp_slot **_slot)
> [...]
>
> #define hp_dereference_try_post(domain, p, slot_p) \
> ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
This will compile, but do the wrong thing when p is a regular pointer, no?
>
> /* Clear the hazard pointer in @slot. */
> static inline
> void hp_remove(struct hp_slot *slot)
> [...]
Differently weird, but better I suppose :-)
> > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > +{
> > > + int cpu;
> > > +
> > > + /*
> > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > + * NULL or to a different value), and thus hides it from hazard
> > > + * pointer readers.
> > > + */
> > > +
> > > + if (!addr)
> > > + return;
> > > + /* Memory ordering: Store A before Load B. */
> > > + smp_mb();
> > > + /* Scan all CPUs slots. */
> > > + for_each_possible_cpu(cpu) {
> > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > +
> > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> > > + retire_cb(cpu, slot, addr);
> >
> > Is retire_cb allowed to cmpxchg the thing?
>
> It could, but we'd need to make sure the slot is not re-used by another
> hp_try_post() before the current user removes its own post. It would
> need to synchronize with the current HP user (e.g. with IPI).
>
> I've actually renamed retire_cb to "on_match_cb".
Hmm, I think I see. Would it make sense to pass the expected addr to
hp_remove() and double check we don't NULL out something unexpected? --
maybe just for a DEBUG option.
I'm always seeing the NOHZ_FULL guys hating on this :-)
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-07 10:40 ` Peter Zijlstra
@ 2024-10-07 14:50 ` Mathieu Desnoyers
2024-10-07 18:18 ` Paul E. McKenney
0 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-07 14:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Linus Torvalds, Andrew Morton,
Nicholas Piggin, Michael Ellerman, Greg Kroah-Hartman,
Sebastian Andrzej Siewior, Paul E. McKenney, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-07 12:40, Peter Zijlstra wrote:
> On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
>> On 2024-10-05 18:04, Peter Zijlstra wrote:
>
>
>>>> +/*
>>>> + * hp_allocate: Allocate a hazard pointer.
>>>> + *
>>>> + * Allocate a hazard pointer slot for @addr. The object existence should
>>>> + * be guaranteed by the caller. Expects to be called from preempt
>>>> + * disable context.
>>>> + *
>>>> + * Returns a hazard pointer context.
>>>
>>> So you made the WTF'o'meter crack, this here function does not allocate
>>> nothing. Naming is bad. At best this is something like
>>> try-set-hazard-pointer or somesuch.
>>
>> I went with the naming from the 2004 paper from Maged Michael, but I
>> agree it could be clearer.
>>
>> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
>> "posting" the intent to use a pointer (as in on a metaphorical billboard),
>> and removing it when it's done.
>
> For RCU we've taken to using the word: 'publish', no?
I'm so glad you suggest this, because it turns out that from all
the possible words you could choose from, 'publish' is probably the
most actively confusing. I'll explain.
Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
through a simple example:
[ Note: I've renamed the HP dereference try_post to HP load try_post
based on further discussion below. ]
*** RCU ***

* Dereference RCU-protected pointer:

  rcu_read_lock();          // [ Begin read transaction ]
  l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
  if (l_p)
          [ use *l_p ... ]
  rcu_read_unlock();        // [ End read transaction ]

* Publish @addr:  addr = kmalloc();
                  init(addr);
                  rcu_assign_pointer(p, addr);

* Reclaim @addr:  rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
                  synchronize_rcu();           // Wait for all pre-existing
                                               // read transactions to complete.
                  kfree(addr);

*** Hazard Pointers ***

* Load and post a HP-protected pointer:

  l_p = hp_load_try_post(domain, &p, &slot);
  if (l_p) {
          [ use *l_p ... ]
          hp_remove(&slot, l_p);
  }

* Publish @addr:  addr = kmalloc();
                  init(addr);
                  rcu_assign_pointer(p, addr);

* Reclaim @addr:  rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
                  hp_scan(domain, addr, NULL);
                  kfree(addr);
Both HP and RCU have publication guarantees, which can in fact be
implemented in the same way (e.g. rcu_assign_pointer paired with
something that respects address dependencies ordering). A stronger
implementation of this would be pairing a store-release with a
load-acquire: it works, but it would add needless overhead on
weakly-ordered CPUs.
How the two mechanisms differ is in how they track when it is
safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
and makes sure that all pre-existing transactions are gone before
synchronize_rcu() is allowed to complete. HP does this by tracking
"posted" pointer slots with a HP domain. As long as hp_scan observes
that HP readers are showing interest in @addr, it will wait.
One notable difference between RCU and HP is that HP knows exactly
which pointer is blocking progress, and from which CPU (at least
with my per-CPU HP domain implementation). Therefore, it is possible
for HP to issue an IPI and make sure the HP user either completes its
use of the pointer quickly, or stops using it right away (e.g. making
the active mm use idle mm instead).
One strength of RCU is that it can track use of a whole set of RCU
pointers just by tracking reader transaction begin/end, but this is
also one of its weaknesses: a long reader transaction can postpone
completion of grace period for a long time and increase the memory
footprint. In comparison, HP can immediately complete as soon as the
pointer it is scanning for is gone. Even better, it can send an IPI
to the belated CPU and abort use of the pointer using a callback.
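The post/scan/remove lifecycle can be modeled in plain userspace C, single-threaded and without memory barriers. The two-slot array, the on_match callback signature, and the first-free-slot posting policy are all illustrative assumptions, not the proposed kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 2	/* stand-in for one slot per possible CPU */

struct hp_slot { void *addr; };

static struct hp_slot slots[NSLOTS];

/* Post @addr in the first free slot; the kernel version uses this CPU's slot. */
static struct hp_slot *hp_try_post(void *addr)
{
	for (int i = 0; i < NSLOTS; i++) {
		if (!slots[i].addr) {
			slots[i].addr = addr;
			return &slots[i];
		}
	}
	return NULL;
}

static void hp_remove(struct hp_slot *slot)
{
	slot->addr = NULL;
}

/*
 * Scan: invoke @on_match for each slot still holding @addr.  The real
 * hp_scan() additionally busy-waits until no slot holds @addr.
 */
static int hp_scan(void *addr, void (*on_match)(struct hp_slot *slot))
{
	int matches = 0;

	for (int i = 0; i < NSLOTS; i++) {
		if (slots[i].addr == addr) {
			matches++;
			if (on_match)
				on_match(&slots[i]);
		}
	}
	return matches;
}

/* Model of an IPI-driven forced removal done by the on_match callback. */
static void force_remove(struct hp_slot *slot)
{
	hp_remove(slot);
}

static int obj_storage;

int demo(void)
{
	void *obj = &obj_storage;
	struct hp_slot *slot = hp_try_post(obj);

	if (!slot)
		return -1;
	/* Reclaim side: the scan finds the post and forces removal. */
	int found = hp_scan(obj, force_remove);
	/* A second scan must find nothing: the object is now safe to free. */
	return found == 1 && hp_scan(obj, NULL) == 0;
}
```

In the real implementation the callback cannot simply clear the slot as force_remove() does here; as discussed above, it must first synchronize with the CPU that posted the pointer, for instance via IPI.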
>
>
>>>> +/*
>>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>>>> + *
>>>> + * Returns a hazard pointer context. Expects to be called from preempt
>>>> + * disable context.
>>>> + */
>>>
>>> More terrible naming. Same as above, but additionally, I would expect a
>>> 'dereference' to actually dereference the pointer and have a return
>>> value of the dereferenced type.
>>
>> hp_dereference_try_post() ?
>>
>>>
>>> This function seems to double check and update the hp_ctx thing. I'm not
>>> at all sure yet wtf this is doing -- and the total lack of comments
>>> aren't helping.
>>
>> The hp_ctx contains the outputs.
>>
>> The function loads *addr_p to then try_post it into a HP slot. On success,
>> it re-reads the *addr_p (with address dependency) and if it still matches,
>> use that as output address pointer.
>>
>> I'm planning to remove hp_ctx, and just have:
>>
>> /*
>> * hp_try_post: Try to post a hazard pointer.
>> *
>> * Post a hazard pointer slot for @addr. The object existence should
>> * be guaranteed by the caller. Expects to be called from preempt
>> * disable context.
>> *
>> * Returns true if post succeeds, false otherwise.
>> */
>> static inline
>> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
>> [...]
>>
>> /*
>> * hp_dereference_try_post: Dereference and try to post a hazard pointer.
>> *
>> * Returns a hazard pointer context. Expects to be called from preempt
>> * disable context.
>> */
>> static inline
>> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
>> void * const * addr_p, struct hp_slot **_slot)
>> [...]
>>
>> #define hp_dereference_try_post(domain, p, slot_p) \
>> ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
>
> This will compile, but do the wrong thing when p is a regular pointer, no?
Right, at least in some cases the compiler may not complain, and people used to
rcu_dereference() will expect that "p" is the pointer to load rather than the
address of that pointer. This would be unexpected.
I must admit that passing the address holding the pointer to load rather than
the pointer to load itself makes it much less troublesome in terms of macro
layers. But perhaps this is another example where we should wander away from the
beaten path and use a word different from "dereference" here. E.g.:
/*
* Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
* to generate a compile error if addr_p is not a pointer to a pointer.
*/
#define hp_load_try_post(domain, addr_p, slot_p) \
((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
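The typeof comma-expression trick can be demonstrated in userspace with GCC/Clang __typeof__; __hp_load_try_post() is modeled here as a trivial loader (no hazard pointer domain or slot), which is an assumption for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Trivial stand-in for __hp_load_try_post(): just load *addr_p. */
static void *__hp_load_try_post(void * const *addr_p)
{
	return *addr_p;
}

/*
 * (void)**(addr_p) fails to compile unless addr_p is a pointer to a
 * pointer; the comma expression's type is then *(addr_p), so the cast
 * hands the caller's pointer type back without a manual cast.
 */
#define hp_load_try_post(addr_p) \
	((__typeof__((void)**(addr_p), *(addr_p))) \
	 __hp_load_try_post((void * const *)(addr_p)))

struct obj { int x; };

static struct obj o = { .x = 42 };
static struct obj *p = &o;

int demo(void)
{
	struct obj *l_p = hp_load_try_post(&p);	/* typed result, no cast */

	/* hp_load_try_post(p) would be rejected: *p is not a pointer. */
	return l_p == &o && l_p->x == 42;
}
```

Passing a plain pointer instead of a pointer-to-pointer turns into a hard compile error rather than a silently wrong load, which addresses the concern about callers used to rcu_dereference().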
>
>>
>> /* Clear the hazard pointer in @slot. */
>> static inline
>> void hp_remove(struct hp_slot *slot)
>> [...]
>
> Differently weird, but better I suppose :-)
If you find a better word than "remove" to pair with "post", I'm all in :)
>
>
>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>> +{
>>>> + int cpu;
>>>> +
>>>> + /*
>>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>> + * NULL or to a different value), and thus hides it from hazard
>>>> + * pointer readers.
>>>> + */
>>>> +
>>>> + if (!addr)
>>>> + return;
>>>> + /* Memory ordering: Store A before Load B. */
>>>> + smp_mb();
>>>> + /* Scan all CPUs slots. */
>>>> + for_each_possible_cpu(cpu) {
>>>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>>> +
>>>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
>>>> + retire_cb(cpu, slot, addr);
>>>
>>> Is retire_cb allowed to cmpxchg the thing?
>>
>> It could, but we'd need to make sure the slot is not re-used by another
>> hp_try_post() before the current user removes its own post. It would
>> need to synchronize with the current HP user (e.g. with IPI).
>>
>> I've actually renamed retire_cb to "on_match_cb".
>
> Hmm, I think I see. Would it make sense to pass the expected addr to
> hp_remove() and double check we don't NULL out something unexpected? --
> maybe just for a DEBUG option.
>
> I'm always seeing the NOHZ_FULL guys hating on this :-)
That's a fair point. Sure, we can do this as an extra safety net. For now I
will just make the check always present, we can always move it to a debug
option later.
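A userspace sketch of that safety net: hp_remove() takes the address the caller posted and warns if the slot holds something else. WARN_ON_ONCE is modeled here with a flag and stderr, and the hp_slot layout is a simplified assumption; the kernel version would use smp_store_release() plus the real WARN_ON_ONCE():

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct hp_slot { void *addr; };

static int warned;

/* Userspace stand-in for WARN_ON_ONCE(). */
#define MODEL_WARN_ON_ONCE(cond) \
	do { \
		if ((cond) && !warned) { \
			warned = 1; \
			fprintf(stderr, "hp_remove: unexpected slot content\n"); \
		} \
	} while (0)

/*
 * Clear the hazard pointer in @slot, sanity-checking that it still holds
 * the address the caller posted (kernel: WARN_ON_ONCE + smp_store_release).
 */
static void hp_remove(struct hp_slot *slot, void *addr)
{
	MODEL_WARN_ON_ONCE(slot->addr != addr);
	slot->addr = NULL;
}

static struct hp_slot slot;
static int obj;

int demo(void)
{
	slot.addr = &obj;
	hp_remove(&slot, &obj);		/* matching address: no warning */
	int ok = !warned && slot.addr == NULL;

	slot.addr = &obj;
	hp_remove(&slot, NULL);		/* mismatch: the warning fires */
	return ok && warned;
}
```

This catches the case Peter raised, where a removal would NULL out a slot that a concurrent on_match callback already repurposed.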
And now I notice that hp_remove is also used for CPU hotplug (grep
matches for cpuhp_remove_state()). I wonder if we should go for something
more grep-friendly than "hp_", e.g. "hazptr_", and rename hp.h to hazptr.h?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-07 14:50 ` Mathieu Desnoyers
@ 2024-10-07 18:18 ` Paul E. McKenney
2024-10-07 19:06 ` Paul E. McKenney
0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2024-10-07 18:18 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Boqun Feng, linux-kernel, Linus Torvalds,
Andrew Morton, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
> On 2024-10-07 12:40, Peter Zijlstra wrote:
> > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> > > On 2024-10-05 18:04, Peter Zijlstra wrote:
> >
> >
> > > > > +/*
> > > > > + * hp_allocate: Allocate a hazard pointer.
> > > > > + *
> > > > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > > > + * be guaranteed by the caller. Expects to be called from preempt
> > > > > + * disable context.
> > > > > + *
> > > > > + * Returns a hazard pointer context.
> > > >
> > > > So you made the WTF'o'meter crack, this here function does not allocate
> > > > nothing. Naming is bad. At best this is something like
> > > > try-set-hazard-pointer or somesuch.
> > >
> > > I went with the naming from the 2004 paper from Maged Michael, but I
> > > agree it could be clearer.
> > >
> > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> > > "posting" the intent to use a pointer (as in on a metaphorical billboard),
> > > and removing it when it's done.
> >
> > For RCU we've taken to using the word: 'publish', no?
>
> I'm so glad you suggest this, because it turns out that from all
> the possible words you could choose from, 'publish' is probably the
> most actively confusing. I'll explain.
>
> Let me first do a 10'000 feet comparison of RCU vs Hazard Pointers
> through a simple example:
>
> [ Note: I've renamed the HP dereference try_post to HP load try_post
> based on further discussion below. ]
>
> *** RCU ***
>
> * Dereference RCU-protected pointer:
>
>   rcu_read_lock();          // [ Begin read transaction ]
>   l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
>   if (l_p)
>           [ use *l_p ... ]
>   rcu_read_unlock();        // [ End read transaction ]
>
> * Publish @addr:  addr = kmalloc();
>                   init(addr);
>                   rcu_assign_pointer(p, addr);
>
> * Reclaim @addr:  rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>                   synchronize_rcu();           // Wait for all pre-existing
>                                                // read transactions to complete.
>                   kfree(addr);
>
>
> *** Hazard Pointers ***
>
> * Load and post a HP-protected pointer:
>
>   l_p = hp_load_try_post(domain, &p, &slot);
>   if (l_p) {
>           [ use *l_p ... ]
>           hp_remove(&slot, l_p);
>   }
>
> * Publish @addr:  addr = kmalloc();
>                   init(addr);
>                   rcu_assign_pointer(p, addr);
>
> * Reclaim @addr:  rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>                   hp_scan(domain, addr, NULL);
>                   kfree(addr);
>
> Both HP and RCU have publication guarantees, which can in fact be
> implemented in the same way (e.g. rcu_assign_pointer paired with
> something that respects address dependencies ordering). A stronger
> implementation of this would be pairing a store-release with a
> load-acquire: it works, but it would add needless overhead on
> weakly-ordered CPUs.
>
> How the two mechanisms differ is in how they track when it is
> safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
> and makes sure that all pre-existing transactions are gone before
> synchronize_rcu() is allowed to complete. HP does this by tracking
> "posted" pointer slots with a HP domain. As long as hp_scan observes
> that HP readers are showing interest in @addr, it will wait.
>
> One notable difference between RCU and HP is that HP knows exactly
> which pointer is blocking progress, and from which CPU (at least
> with my per-CPU HP domain implementation). Therefore, it is possible
> for HP to issue an IPI and make sure the HP user either completes its
> use of the pointer quickly, or stops using it right away (e.g. making
> the active mm use idle mm instead).
>
> One strength of RCU is that it can track use of a whole set of RCU
> pointers just by tracking reader transaction begin/end, but this is
> also one of its weaknesses: a long reader transaction can postpone
> completion of grace period for a long time and increase the memory
> footprint. In comparison, HP can immediately complete as soon as the
> pointer it is scanning for is gone. Even better, it can send an IPI
> to the belated CPU and abort use of the pointer using a callback.
Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".
This all sounds like arguments *for* use of the term "publish" for
hazard pointers rather than against it. What am I missing here?
Thanx, Paul
> > > > > +/*
> > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > > > + *
> > > > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > > > + * disable context.
> > > > > + */
> > > >
> > > > More terrible naming. Same as above, but additionally, I would expect a
> > > > 'dereference' to actually dereference the pointer and have a return
> > > > value of the dereferenced type.
> > >
> > > hp_dereference_try_post() ?
> > >
> > > >
> > > > This function seems to double check and update the hp_ctx thing. I'm not
> > > > at all sure yet wtf this is doing -- and the total lack of comments
> > > > aren't helping.
> > >
> > > The hp_ctx contains the outputs.
> > >
> > > The function loads *addr_p to then try_post it into a HP slot. On success,
> > > it re-reads the *addr_p (with address dependency) and if it still matches,
> > > use that as output address pointer.
> > >
> > > I'm planning to remove hp_ctx, and just have:
> > >
> > > /*
> > > * hp_try_post: Try to post a hazard pointer.
> > > *
> > > * Post a hazard pointer slot for @addr. The object existence should
> > > * be guaranteed by the caller. Expects to be called from preempt
> > > * disable context.
> > > *
> > > * Returns true if post succeeds, false otherwise.
> > > */
> > > static inline
> > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> > > [...]
> > >
> > > /*
> > > * hp_dereference_try_post: Dereference and try to post a hazard pointer.
> > > *
> > > * Returns a hazard pointer context. Expects to be called from preempt
> > > * disable context.
> > > */
> > > static inline
> > > void *__hp_dereference_try_post(struct hp_domain *hp_domain,
> > > void * const * addr_p, struct hp_slot **_slot)
> > > [...]
> > >
> > > #define hp_dereference_try_post(domain, p, slot_p) \
> > > ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
> >
> > This will compile, but do the wrong thing when p is a regular pointer, no?
>
> Right, at least in some cases the compiler may not complain, and people used to
> rcu_dereference() will expect that "p" is the pointer to load rather than the
> address of that pointer. This would be unexpected.
>
> I must admit that passing the address holding the pointer to load rather than
> the pointer to load itself makes it much less troublesome in terms of macro
> layers. But perhaps this is another example where we should wander away from the
> beaten path and use a word different from "dereference" here. E.g.:
>
> /*
> * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
> * to generate a compile error if addr_p is not a pointer to a pointer.
> */
> #define hp_load_try_post(domain, addr_p, slot_p) \
> ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
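[Editor's illustration: a standalone, hypothetical macro (userspace GCC/Clang C, not the proposed kernel macro) showing how the comma-expression typeof check behaves.]

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the type check sketched above: the
 * (void)**(addr_p) operand only compiles when addr_p is a pointer
 * to a pointer, and the comma expression gives the whole cast the
 * type of *(addr_p).
 */
#define HP_CHECK_PP(addr_p) \
	((__typeof__((void)**(addr_p), *(addr_p)))(*(addr_p)))
```

With `int **pp`, `HP_CHECK_PP(pp)` yields an `int *`; passing a plain `int *` fails to compile because `**` cannot be applied to it.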
>
> >
> > >
> > > /* Clear the hazard pointer in @slot. */
> > > static inline
> > > void hp_remove(struct hp_slot *slot)
> > > [...]
> >
> > Differently weird, but better I suppose :-)
>
> If you find a better word than "remove" to pair with "post", I'm all in :)
>
> >
> >
> > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > > +{
> > > > > + int cpu;
> > > > > +
> > > > > + /*
> > > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > > + * NULL or to a different value), and thus hides it from hazard
> > > > > + * pointer readers.
> > > > > + */
> > > > > +
> > > > > + if (!addr)
> > > > > + return;
> > > > > + /* Memory ordering: Store A before Load B. */
> > > > > + smp_mb();
> > > > > + /* Scan all CPUs slots. */
> > > > > + for_each_possible_cpu(cpu) {
> > > > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > > > +
> > > > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> > > > > + retire_cb(cpu, slot, addr);
> > > >
> > > > Is retire_cb allowed to cmpxchg the thing?
> > >
> > > It could, but we'd need to make sure the slot is not re-used by another
> > > hp_try_post() before the current user removes its own post. It would
> > > need to synchronize with the current HP user (e.g. with IPI).
> > >
> > > I've actually renamed retire_cb to "on_match_cb".
> >
> > Hmm, I think I see. Would it make sense to pass the expected addr to
> > hp_remove() and double check we don't NULL out something unexpected? --
> > maybe just for a DEBUG option.
> >
> > I'm always seeing the NOHZ_FULL guys hating on this :-)
>
> That's a fair point. Sure, we can do this as an extra safety net. For now I
> will just make the check always present, we can always move it to a debug
> option later.
>
> And now I notice that hp_remove is also used for CPU hotplug (grep
> matches for cpuhp_remove_state()). I wonder if we should go for something
> more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
>
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-07 18:18 ` Paul E. McKenney
@ 2024-10-07 19:06 ` Paul E. McKenney
2024-10-07 19:08 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: Paul E. McKenney @ 2024-10-07 19:06 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: Peter Zijlstra, Boqun Feng, linux-kernel, Linus Torvalds,
Andrew Morton, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
> > On 2024-10-07 12:40, Peter Zijlstra wrote:
> > > On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
> > > > On 2024-10-05 18:04, Peter Zijlstra wrote:
> > >
> > >
> > > > > > +/*
> > > > > > + * hp_allocate: Allocate a hazard pointer.
> > > > > > + *
> > > > > > + * Allocate a hazard pointer slot for @addr. The object existence should
> > > > > > + * be guaranteed by the caller. Expects to be called from preempt
> > > > > > + * disable context.
> > > > > > + *
> > > > > > + * Returns a hazard pointer context.
> > > > >
> > > > > So you made the WTF'o'meter crack, this here function does not allocate
> > > > > nothing. Naming is bad. At best this is something like
> > > > > try-set-hazard-pointer or somesuch.
> > > >
> > > > I went with the naming from the 2004 paper from Maged Michael, but I
> > > > agree it could be clearer.
> > > >
> > > > I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
> > > > "posting" the intent to use a pointer (as in on a metaphorical billboard),
> > > > and removing it when it's done.
> > >
> > > For RCU we've taken to using the word: 'publish', no?
> >
> > I'm so glad you suggest this, because it turns out that from all
> > the possible words you could choose from, 'publish' is probably the
> > most actively confusing. I'll explain.
> >
> > Let me first do a 10,000-foot comparison of RCU vs Hazard Pointers
> > through a simple example:
> >
> > [ Note: I've renamed the HP dereference try_post to HP load try_post
> > based on further discussion below. ]
> >
> > *** RCU ***
> >
> > * Dereference RCU-protected pointer:
> > rcu_read_lock(); // [ Begin read transaction ]
> > l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
> > if (l_p)
> > [ use *l_p ...]
> > rcu_read_unlock(); // [ End read transaction ]
> >
> > * Publish @addr: addr = kmalloc();
> > init(addr);
> > rcu_assign_pointer(p, addr);
> >
> > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
> > synchronize_rcu(); // Wait for all pre-existing
> > // read transactions to complete.
> > kfree(addr);
> >
> >
> > *** Hazard Pointers ***
> >
> > * Load and post a HP-protected pointer:
> > l_p = hp_load_try_post(domain, &p, &slot);
> > if (l_p) {
> > [ use *l_p ...]
> > hp_remove(&slot, l_p);
> > }
> >
> > * Publish @addr: addr = kmalloc();
> > init(addr);
> > rcu_assign_pointer(p, addr);
> >
> > * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
> > hp_scan(domain, addr, NULL);
> > kfree(addr);
> >
> > Both HP and RCU have publication guarantees, which can in fact be
> > implemented in the same way (e.g. rcu_assign_pointer paired with
> > something that respects address dependencies ordering). A stronger
> > implementation of this would be pairing a store-release with a
> > load-acquire: it works, but it would add needless overhead on
> > weakly-ordered CPUs.
> >
> > How the two mechanisms differ is in how they track when it is
> > safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
> > and makes sure that all pre-existing transactions are gone before
> > synchronize_rcu() is allowed to complete. HP does this by tracking
> > "posted" pointer slots with a HP domain. As long as hp_scan observes
> > that HP readers are showing interest in @addr, it will wait.
> >
> > One notable difference between RCU and HP is that HP knows exactly
> > which pointer is blocking progress, and from which CPU (at least
> > with my per-CPU HP domain implementation). Therefore, it is possible
> > for HP to issue an IPI and make sure the HP user either completes its
> > use of the pointer quickly, or stops using it right away (e.g. making
> > the active mm use idle mm instead).
> >
> > One strength of RCU is that it can track use of a whole set of RCU
> > pointers just by tracking reader transaction begin/end, but this is
> > also one of its weaknesses: a long reader transaction can postpone
> > completion of grace period for a long time and increase the memory
> > footprint. In comparison, HP can immediately complete as soon as the
> > pointer it is scanning for is gone. Even better, it can send an IPI
> > to the belated CPU and abort use of the pointer using a callback.
>
> Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".
>
> This all sounds like arguments *for* use of the term "publish" for
> hazard pointers rather than against it. What am I missing here?
OK, one thing that I was missing is that this was not about the
counterpart to rcu_assign_pointer(), for which I believe "publish" makes
a lot of sense, but rather about the setting of a hazard pointer. Here,
"protect" is the traditional term of power, which has served users well
for some years.
Thanx, Paul
> > > > > > +/*
> > > > > > + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
> > > > > > + *
> > > > > > + * Returns a hazard pointer context. Expects to be called from preempt
> > > > > > + * disable context.
> > > > > > + */
> > > > >
> > > > > More terrible naming. Same as above, but additionally, I would expect a
> > > > > 'dereference' to actually dereference the pointer and have a return
> > > > > value of the dereferenced type.
> > > >
> > > > hp_dereference_try_post() ?
> > > >
> > > > >
> > > > > This function seems to double check and update the hp_ctx thing. I'm not
> > > > > at all sure yet wtf this is doing -- and the total lack of comments
> > > > > aren't helping.
> > > >
> > > > The hp_ctx contains the outputs.
> > > >
> > > > The function loads *addr_p to then try_post it into a HP slot. On success,
> > > > it re-reads the *addr_p (with address dependency) and if it still matches,
> > > > use that as output address pointer.
> > > >
> > > > I'm planning to remove hp_ctx, and just have:
> > > >
> > > > /*
> > > > * hp_try_post: Try to post a hazard pointer.
> > > > *
> > > > * Post a hazard pointer slot for @addr. The object existence should
> > > > * be guaranteed by the caller. Expects to be called from preempt
> > > > * disable context.
> > > > *
> > > > * Returns true if post succeeds, false otherwise.
> > > > */
> > > > static inline
> > > > bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
> > > > [...]
> > > >
> > > > /*
> > > > * hp_dereference_try_post: Dereference and try to post a hazard pointer.
> > > > *
> > > > * Returns a hazard pointer context. Expects to be called from preempt
> > > > * disable context.
> > > > */
> > > > static inline
> > > > void *__hp_dereference_try_post(struct hp_domain *hp_domain,
> > > > void * const * addr_p, struct hp_slot **_slot)
> > > > [...]
> > > >
> > > > #define hp_dereference_try_post(domain, p, slot_p) \
> > > > ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
> > >
> > > This will compile, but do the wrong thing when p is a regular pointer, no?
> >
> > Right, at least in some cases the compiler may not complain, and people used to
> > rcu_dereference() will expect that "p" is the pointer to load rather than the
> > address of that pointer. This would be unexpected.
> >
> > I must admit that passing the address holding the pointer to load rather than
> > the pointer to load itself makes it much less troublesome in terms of macro
> > layers. But perhaps this is another example where we should wander away from the
> > beaten path and use a word different from "dereference" here. E.g.:
> >
> > /*
> > * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
> > * to generate a compile error if addr_p is not a pointer to a pointer.
> > */
> > #define hp_load_try_post(domain, addr_p, slot_p) \
> > ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
> >
> > >
> > > >
> > > > /* Clear the hazard pointer in @slot. */
> > > > static inline
> > > > void hp_remove(struct hp_slot *slot)
> > > > [...]
> > >
> > > Differently weird, but better I suppose :-)
> >
> > If you find a better word than "remove" to pair with "post", I'm all in :)
> >
> > >
> > >
> > > > > > +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
> > > > > > + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
> > > > > > +{
> > > > > > + int cpu;
> > > > > > +
> > > > > > + /*
> > > > > > + * Store A precedes hp_scan(): it unpublishes addr (sets it to
> > > > > > + * NULL or to a different value), and thus hides it from hazard
> > > > > > + * pointer readers.
> > > > > > + */
> > > > > > +
> > > > > > + if (!addr)
> > > > > > + return;
> > > > > > + /* Memory ordering: Store A before Load B. */
> > > > > > + smp_mb();
> > > > > > + /* Scan all CPUs slots. */
> > > > > > + for_each_possible_cpu(cpu) {
> > > > > > + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
> > > > > > +
> > > > > > + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
> > > > > > + retire_cb(cpu, slot, addr);
> > > > >
> > > > > Is retire_cb allowed to cmpxchg the thing?
> > > >
> > > > It could, but we'd need to make sure the slot is not re-used by another
> > > > hp_try_post() before the current user removes its own post. It would
> > > > need to synchronize with the current HP user (e.g. with IPI).
> > > >
> > > > I've actually renamed retire_cb to "on_match_cb".
> > >
> > > Hmm, I think I see. Would it make sense to pass the expected addr to
> > > hp_remove() and double check we don't NULL out something unexpected? --
> > > maybe just for a DEBUG option.
> > >
> > > I'm always seeing the NOHZ_FULL guys hating on this :-)
> >
> > That's a fair point. Sure, we can do this as an extra safety net. For now I
> > will just make the check always present, we can always move it to a debug
> > option later.
> >
> > And now I notice that hp_remove is also used for CPU hotplug (grep
> > matches for cpuhp_remove_state()). I wonder if we should go for something
> > more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
> >
> > Thanks,
> >
> > Mathieu
> >
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > https://www.efficios.com
> >
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v2 3/4] hp: Implement Hazard Pointers
2024-10-07 19:06 ` Paul E. McKenney
@ 2024-10-07 19:08 ` Mathieu Desnoyers
0 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-07 19:08 UTC (permalink / raw)
To: paulmck
Cc: Peter Zijlstra, Boqun Feng, linux-kernel, Linus Torvalds,
Andrew Morton, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Will Deacon,
Alan Stern, John Stultz, Neeraj Upadhyay, Frederic Weisbecker,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Steven Rostedt,
Lai Jiangshan, Zqiang, Ingo Molnar, Waiman Long, Mark Rutland,
Thomas Gleixner, Vlastimil Babka, maged.michael, Mateusz Guzik,
Jonas Oberhauser, rcu, linux-mm, lkmm
On 2024-10-07 21:06, Paul E. McKenney wrote:
> On Mon, Oct 07, 2024 at 11:18:14AM -0700, Paul E. McKenney wrote:
>> On Mon, Oct 07, 2024 at 10:50:46AM -0400, Mathieu Desnoyers wrote:
>>> On 2024-10-07 12:40, Peter Zijlstra wrote:
>>>> On Sat, Oct 05, 2024 at 02:50:17PM -0400, Mathieu Desnoyers wrote:
>>>>> On 2024-10-05 18:04, Peter Zijlstra wrote:
>>>>
>>>>
>>>>>>> +/*
>>>>>>> + * hp_allocate: Allocate a hazard pointer.
>>>>>>> + *
>>>>>>> + * Allocate a hazard pointer slot for @addr. The object existence should
>>>>>>> + * be guaranteed by the caller. Expects to be called from preempt
>>>>>>> + * disable context.
>>>>>>> + *
>>>>>>> + * Returns a hazard pointer context.
>>>>>>
>>>>>> So you made the WTF'o'meter crack, this here function does not allocate
>>>>>> nothing. Naming is bad. At best this is something like
>>>>>> try-set-hazard-pointer or somesuch.
>>>>>
>>>>> I went with the naming from the 2004 paper from Maged Michael, but I
>>>>> agree it could be clearer.
>>>>>
>>>>> I'm tempted to go for "hp_try_post()" and "hp_remove()", basically
>>>>> "posting" the intent to use a pointer (as in on a metaphorical billboard),
>>>>> and removing it when it's done.
>>>>
>>>> For RCU we've taken to using the word: 'publish', no?
>>>
>>> I'm so glad you suggest this, because it turns out that from all
>>> the possible words you could choose from, 'publish' is probably the
>>> most actively confusing. I'll explain.
>>>
>>> Let me first do a 10,000-foot comparison of RCU vs Hazard Pointers
>>> through a simple example:
>>>
>>> [ Note: I've renamed the HP dereference try_post to HP load try_post
>>> based on further discussion below. ]
>>>
>>> *** RCU ***
>>>
>>> * Dereference RCU-protected pointer:
>>> rcu_read_lock(); // [ Begin read transaction ]
>>> l_p = rcu_dereference(p); // [ Load p: @addr or NULL ]
>>> if (l_p)
>>> [ use *l_p ...]
>>> rcu_read_unlock(); // [ End read transaction ]
>>>
>>> * Publish @addr: addr = kmalloc();
>>> init(addr);
>>> rcu_assign_pointer(p, addr);
>>>
>>> * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>>> synchronize_rcu(); // Wait for all pre-existing
>>> // read transactions to complete.
>>> kfree(addr);
>>>
>>>
>>> *** Hazard Pointers ***
>>>
>>> * Load and post a HP-protected pointer:
>>> l_p = hp_load_try_post(domain, &p, &slot);
>>> if (l_p) {
>>> [ use *l_p ...]
>>> hp_remove(&slot, l_p);
>>> }
>>>
>>> * Publish @addr: addr = kmalloc();
>>> init(addr);
>>> rcu_assign_pointer(p, addr);
>>>
>>> * Reclaim @addr: rcu_assign_pointer(p, NULL); // [ Unpublish @addr ]
>>> hp_scan(domain, addr, NULL);
>>> kfree(addr);
>>>
>>> Both HP and RCU have publication guarantees, which can in fact be
>>> implemented in the same way (e.g. rcu_assign_pointer paired with
>>> something that respects address dependencies ordering). A stronger
>>> implementation of this would be pairing a store-release with a
>>> load-acquire: it works, but it would add needless overhead on
>>> weakly-ordered CPUs.
>>>
>>> How the two mechanisms differ is in how they track when it is
>>> safe to reclaim @addr. RCU tracks reader "transactions" begin/end,
>>> and makes sure that all pre-existing transactions are gone before
>>> synchronize_rcu() is allowed to complete. HP does this by tracking
>>> "posted" pointer slots with a HP domain. As long as hp_scan observes
>>> that HP readers are showing interest in @addr, it will wait.
>>>
>>> One notable difference between RCU and HP is that HP knows exactly
>>> which pointer is blocking progress, and from which CPU (at least
>>> with my per-CPU HP domain implementation). Therefore, it is possible
>>> for HP to issue an IPI and make sure the HP user either completes its
>>> use of the pointer quickly, or stops using it right away (e.g. making
>>> the active mm use idle mm instead).
>>>
>>> One strength of RCU is that it can track use of a whole set of RCU
>>> pointers just by tracking reader transaction begin/end, but this is
>>> also one of its weaknesses: a long reader transaction can postpone
>>> completion of grace period for a long time and increase the memory
>>> footprint. In comparison, HP can immediately complete as soon as the
>>> pointer it is scanning for is gone. Even better, it can send an IPI
>>> to the belated CPU and abort use of the pointer using a callback.
>>
>> Plus, in contrast to hazard pointers, rcu_dereference() cannot say "no".
>>
>> This all sounds like arguments *for* use of the term "publish" for
>> hazard pointers rather than against it. What am I missing here?
>
> OK, one thing that I was missing is that this was not about the
> counterpart to rcu_assign_pointer(), for which I believe "publish" makes
> a lot of sense, but rather about the setting of a hazard pointer. Here,
> "protect" is the traditional term of power, which has served users well
> for some years.
After some reading of the C++ specification wording used for hazard
pointers, I am indeed tempted to go for "try_protect()" and
"retire()" to minimize confusion.
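
[Editor's illustration: a minimal userspace sketch, with C11 atomics, of what a try_protect()/retire() pairing could look like. Names, signatures, and the single-slot layout are assumptions for illustration only, not the proposed kernel API.]

```c
#include <stdatomic.h>
#include <stddef.h>

/* One hazard pointer slot; a real domain would have one per CPU. */
struct hazptr_slot {
	_Atomic(void *) addr;
};

/*
 * Try to protect the object currently pointed to by *addr_p.
 * Returns the protected pointer, or NULL if it changed under us
 * (i.e. protection can say "no", unlike rcu_dereference()).
 */
static void *hazptr_try_protect(struct hazptr_slot *slot,
				void * _Atomic *addr_p)
{
	void *p = atomic_load_explicit(addr_p, memory_order_relaxed);

	if (!p)
		return NULL;
	/* Post the pointer; full ordering pairs with the scan's barrier. */
	atomic_store_explicit(&slot->addr, p, memory_order_seq_cst);
	/*
	 * Re-check: if *addr_p still equals p, any reclaimer scanning
	 * after unpublishing is guaranteed to observe our slot.
	 */
	if (atomic_load_explicit(addr_p, memory_order_acquire) != p) {
		atomic_store_explicit(&slot->addr, NULL,
				      memory_order_release);
		return NULL;
	}
	return p;
}

/* Retire the hazard pointer: clear the slot when done with the object. */
static void hazptr_retire(struct hazptr_slot *slot)
{
	atomic_store_explicit(&slot->addr, NULL, memory_order_release);
}
```

The re-check after posting the slot is what lets try_protect() fail: if the pointer was unpublished in the meantime, the protection attempt is abandoned rather than returning a stale pointer.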
Thanks,
Mathieu
>
> Thanx, Paul
>
>>>>>>> +/*
>>>>>>> + * hp_dereference_allocate: Dereference and allocate a hazard pointer.
>>>>>>> + *
>>>>>>> + * Returns a hazard pointer context. Expects to be called from preempt
>>>>>>> + * disable context.
>>>>>>> + */
>>>>>>
>>>>>> More terrible naming. Same as above, but additionally, I would expect a
>>>>>> 'dereference' to actually dereference the pointer and have a return
>>>>>> value of the dereferenced type.
>>>>>
>>>>> hp_dereference_try_post() ?
>>>>>
>>>>>>
>>>>>> This function seems to double check and update the hp_ctx thing. I'm not
>>>>>> at all sure yet wtf this is doing -- and the total lack of comments
>>>>>> aren't helping.
>>>>>
>>>>> The hp_ctx contains the outputs.
>>>>>
>>>>> The function loads *addr_p to then try_post it into a HP slot. On success,
>>>>> it re-reads the *addr_p (with address dependency) and if it still matches,
>>>>> use that as output address pointer.
>>>>>
>>>>> I'm planning to remove hp_ctx, and just have:
>>>>>
>>>>> /*
>>>>> * hp_try_post: Try to post a hazard pointer.
>>>>> *
>>>>> * Post a hazard pointer slot for @addr. The object existence should
>>>>> * be guaranteed by the caller. Expects to be called from preempt
>>>>> * disable context.
>>>>> *
>>>>> * Returns true if post succeeds, false otherwise.
>>>>> */
>>>>> static inline
>>>>> bool hp_try_post(struct hp_domain *hp_domain, void *addr, struct hp_slot **_slot)
>>>>> [...]
>>>>>
>>>>> /*
>>>>> * hp_dereference_try_post: Dereference and try to post a hazard pointer.
>>>>> *
>>>>> * Returns a hazard pointer context. Expects to be called from preempt
>>>>> * disable context.
>>>>> */
>>>>> static inline
>>>>> void *__hp_dereference_try_post(struct hp_domain *hp_domain,
>>>>> void * const * addr_p, struct hp_slot **_slot)
>>>>> [...]
>>>>>
>>>>> #define hp_dereference_try_post(domain, p, slot_p) \
>>>>> ((__typeof__(*(p))) __hp_dereference_try_post(domain, (void * const *) p, slot_p))
>>>>
>>>> This will compile, but do the wrong thing when p is a regular pointer, no?
>>>
>>> Right, at least in some cases the compiler may not complain, and people used to
>>> rcu_dereference() will expect that "p" is the pointer to load rather than the
>>> address of that pointer. This would be unexpected.
>>>
>>> I must admit that passing the address holding the pointer to load rather than
>>> the pointer to load itself makes it much less troublesome in terms of macro
>>> layers. But perhaps this is another example where we should wander away from the
>>> beaten path and use a word different from "dereference" here. E.g.:
>>>
>>> /*
>>> * Use a comma expression within typeof: __typeof__((void)**(addr_p), *(addr_p))
>>> * to generate a compile error if addr_p is not a pointer to a pointer.
>>> */
>>> #define hp_load_try_post(domain, addr_p, slot_p) \
>>> ((__typeof__((void)**(addr_p), *(addr_p))) __hp_load_try_post(domain, (void * const *) (addr_p), slot_p))
>>>
>>>>
>>>>>
>>>>> /* Clear the hazard pointer in @slot. */
>>>>> static inline
>>>>> void hp_remove(struct hp_slot *slot)
>>>>> [...]
>>>>
>>>> Differently weird, but better I suppose :-)
>>>
>>> If you find a better word than "remove" to pair with "post", I'm all in :)
>>>
>>>>
>>>>
>>>>>>> +void hp_scan(struct hp_slot __percpu *percpu_slots, void *addr,
>>>>>>> + void (*retire_cb)(int cpu, struct hp_slot *slot, void *addr))
>>>>>>> +{
>>>>>>> + int cpu;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * Store A precedes hp_scan(): it unpublishes addr (sets it to
>>>>>>> + * NULL or to a different value), and thus hides it from hazard
>>>>>>> + * pointer readers.
>>>>>>> + */
>>>>>>> +
>>>>>>> + if (!addr)
>>>>>>> + return;
>>>>>>> + /* Memory ordering: Store A before Load B. */
>>>>>>> + smp_mb();
>>>>>>> + /* Scan all CPUs slots. */
>>>>>>> + for_each_possible_cpu(cpu) {
>>>>>>> + struct hp_slot *slot = per_cpu_ptr(percpu_slots, cpu);
>>>>>>> +
>>>>>>> + if (retire_cb && smp_load_acquire(&slot->addr) == addr) /* Load B */
>>>>>>> + retire_cb(cpu, slot, addr);
>>>>>>
>>>>>> Is retire_cb allowed to cmpxchg the thing?
>>>>>
>>>>> It could, but we'd need to make sure the slot is not re-used by another
>>>>> hp_try_post() before the current user removes its own post. It would
>>>>> need to synchronize with the current HP user (e.g. with IPI).
>>>>>
>>>>> I've actually renamed retire_cb to "on_match_cb".
>>>>
>>>> Hmm, I think I see. Would it make sense to pass the expected addr to
>>>> hp_remove() and double check we don't NULL out something unexpected? --
>>>> maybe just for a DEBUG option.
>>>>
>>>> I'm always seeing the NOHZ_FULL guys hating on this :-)
>>>
>>> That's a fair point. Sure, we can do this as an extra safety net. For now I
>>> will just make the check always present, we can always move it to a debug
>>> option later.
>>>
>>> And now I notice that hp_remove is also used for CPU hotplug (grep
>>> matches for cpuhp_remove_state()). I wonder if we should go for something
>>> more grep-friendly than "hp_", e.g. "hazptr_" and rename hp.h to hazptr.h ?
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>> --
>>> Mathieu Desnoyers
>>> EfficiOS Inc.
>>> https://www.efficios.com
>>>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 26+ messages in thread
* [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
2024-10-04 18:27 [RFC PATCH v2 0/4] sched+mm: Track lazy active mm existence with hazard pointers Mathieu Desnoyers
` (2 preceding siblings ...)
2024-10-04 18:27 ` [RFC PATCH v2 3/4] hp: Implement Hazard Pointers Mathieu Desnoyers
@ 2024-10-04 18:27 ` Mathieu Desnoyers
2024-10-07 14:50 ` kernel test robot
3 siblings, 1 reply; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-04 18:27 UTC (permalink / raw)
To: Boqun Feng
Cc: linux-kernel, Mathieu Desnoyers, Linus Torvalds, Andrew Morton,
Peter Zijlstra, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Alan Stern, John Stultz, Neeraj Upadhyay,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, maged.michael, Mateusz Guzik, Jonas Oberhauser,
rcu, linux-mm, lkmm
Replace lazy active mm existence tracking with hazard pointers. This
removes the following implementations and their associated config
options:
- MMU_LAZY_TLB_REFCOUNT
- MMU_LAZY_TLB_SHOOTDOWN
- This removes the call_rcu delayed mm drop for RT.
It leverages the fact that each CPU only ever has at most one
lazy active mm. This makes it a very good fit for a hazard pointer
domain implemented with one hazard pointer slot per CPU.
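
[Editor's illustration: a simplified userspace model of the single-slot-per-CPU idea, not the patch's code; the names and the fixed NR_CPUS are assumptions for illustration.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/*
 * One hazard-pointer slot per CPU suffices because a CPU has at
 * most one lazy active mm at any given time.
 */
static _Atomic(void *) lazy_mm_slot[NR_CPUS];

/* Context-switch path: post the lazy mm in this CPU's slot. */
static void mmgrab_lazy_tlb_sketch(int cpu, void *mm)
{
	atomic_store_explicit(&lazy_mm_slot[cpu], mm,
			      memory_order_seq_cst);
}

/* Done using the lazy mm: clear this CPU's slot. */
static void mmdrop_lazy_tlb_sketch(int cpu)
{
	atomic_store_explicit(&lazy_mm_slot[cpu], NULL,
			      memory_order_release);
}

/*
 * Reclaim side (the hp_scan analogue): the mm may be freed only
 * once no CPU's slot still holds it.
 */
static bool mm_still_in_use(void *mm)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load_explicit(&lazy_mm_slot[cpu],
					 memory_order_acquire) == mm)
			return true;
	return false;
}
```

Because the scan knows exactly which CPU's slot blocks reclaim, an IPI can be sent to that CPU to hasten or abort its use of the mm, as described in the thread above.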
* Benchmarks:
will-it-scale context_switch1_threads
nr threads (-t)   speedup
              1     -0.2%
              2     +0.4%
              3     +0.2%
              6     +0.6%
             12     +0.8%
             24       +3%
             48      +12%
             96      +21%
            192      +28%
            384       +4%
            768     -0.6%
Methodology: Each test is the average of 20 iterations. Use median
result of 3 test runs.
Test hardware:
CPU(s): 384
On-line CPU(s) list: 0-383
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU(s) scaling MHz: 100%
CPU max MHz: 3709.0000
CPU min MHz: 400.0000
BogoMIPS: 4799.75
Memory: 768 GB ram.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: John Stultz <jstultz@google.com>
Cc: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Zqiang <qiang.zhang1211@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: maged.michael@gmail.com
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Jonas Oberhauser <jonas.oberhauser@huaweicloud.com>
Cc: rcu@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: lkmm@lists.linux.dev
---
Documentation/mm/active_mm.rst | 9 ++--
arch/Kconfig | 32 -------------
arch/powerpc/Kconfig | 1 -
arch/powerpc/mm/book3s64/radix_tlb.c | 23 +--------
include/linux/mm_types.h | 3 --
include/linux/sched/mm.h | 71 +++++++++++-----------------
kernel/exit.c | 4 +-
kernel/fork.c | 47 +++++-------------
kernel/sched/sched.h | 8 +---
lib/Kconfig.debug | 10 ----
10 files changed, 49 insertions(+), 159 deletions(-)
diff --git a/Documentation/mm/active_mm.rst b/Documentation/mm/active_mm.rst
index d096fc091e23..c225cac49c30 100644
--- a/Documentation/mm/active_mm.rst
+++ b/Documentation/mm/active_mm.rst
@@ -2,11 +2,10 @@
Active MM
=========
-Note, the mm_count refcount may no longer include the "lazy" users
-(running tasks with ->active_mm == mm && ->mm == NULL) on kernels
-with CONFIG_MMU_LAZY_TLB_REFCOUNT=n. Taking and releasing these lazy
-references must be done with mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
-helpers, which abstract this config option.
+Note, the mm_count refcount no longer includes the "lazy" users (running
+tasks with ->active_mm == mm && ->mm == NULL). Taking and releasing these
+lazy references must be done with the mmgrab_lazy_tlb() and mmdrop_lazy_tlb()
+helpers, which are implemented with hazard pointers.
::
diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..d4261935f8dc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -475,38 +475,6 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
irqs disabled over activate_mm. Architectures that do IPI based TLB
shootdowns should enable this.
-# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
-# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
-# to/from kernel threads when the same mm is running on a lot of CPUs (a large
-# multi-threaded application), by reducing contention on the mm refcount.
-#
-# This can be disabled if the architecture ensures no CPUs are using an mm as a
-# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
-# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
-# final exit(2) TLB flush, for example.
-#
-# To implement this, an arch *must*:
-# Ensure the _lazy_tlb variants of mmgrab/mmdrop are used when manipulating
-# the lazy tlb reference of a kthread's ->active_mm (non-arch code has been
-# converted already).
-config MMU_LAZY_TLB_REFCOUNT
- def_bool y
- depends on !MMU_LAZY_TLB_SHOOTDOWN
-
-# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
-# mm as a lazy tlb beyond its last reference count, by shooting down these
-# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
-# be using the mm as a lazy tlb, so that they may switch themselves to using
-# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
-# may be using mm as a lazy tlb mm.
-#
-# To implement this, an arch *must*:
-# - At the time of the final mmdrop of the mm, ensure mm_cpumask(mm) contains
-# at least all possible CPUs in which the mm is lazy.
-# - It must meet the requirements for MMU_LAZY_TLB_REFCOUNT=n (see above).
-config MMU_LAZY_TLB_SHOOTDOWN
- bool
-
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d7b09b064a8a..b1e25e75baab 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -291,7 +291,6 @@ config PPC
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_MERGE_VMAS
- select MMU_LAZY_TLB_SHOOTDOWN if PPC_BOOK3S_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE if PPC64 || NOT_COHERENT_CACHE
select NEED_PER_CPU_EMBED_FIRST_CHUNK if PPC64
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 9e1f6558d026..ff0d4f28cf52 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -1197,28 +1197,7 @@ void radix__tlb_flush(struct mmu_gather *tlb)
* See the comment for radix in arch_exit_mmap().
*/
if (tlb->fullmm) {
- if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
- /*
- * Shootdown based lazy tlb mm refcounting means we
- * have to IPI everyone in the mm_cpumask anyway soon
- * when the mm goes away, so might as well do it as
- * part of the final flush now.
- *
- * If lazy shootdown was improved to reduce IPIs (e.g.,
- * by batching), then it may end up being better to use
- * tlbies here instead.
- */
- preempt_disable();
-
- smp_mb(); /* see radix__flush_tlb_mm */
- exit_flush_lazy_tlbs(mm);
- __flush_all_mm(mm, true);
-
- preempt_enable();
- } else {
- __flush_all_mm(mm, true);
- }
-
+ __flush_all_mm(mm, true);
} else if ( (psize = radix_get_mmu_psize(page_size)) == -1) {
if (!tlb->freed_tables)
radix__flush_tlb_mm(mm);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..db5f13554485 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -975,9 +975,6 @@ struct mm_struct {
atomic_t tlb_flush_batched;
#endif
struct uprobes_state uprobes_state;
-#ifdef CONFIG_PREEMPT_RT
- struct rcu_head delayed_drop;
-#endif
#ifdef CONFIG_HUGETLB_PAGE
atomic_long_t hugetlb_usage;
#endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..0fecd1a3311d 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -9,6 +9,10 @@
#include <linux/gfp.h>
#include <linux/sync_core.h>
#include <linux/sched/coredump.h>
+#include <linux/hp.h>
+
+/* Sched lazy mm hazard pointer domain. */
+DECLARE_PER_CPU(struct hp_slot, hp_domain_sched_lazy_mm);
/*
* Routines for handling mm_structs
@@ -55,61 +59,42 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
}
-#ifdef CONFIG_PREEMPT_RT
-/*
- * RCU callback for delayed mm drop. Not strictly RCU, but call_rcu() is
- * by far the least expensive way to do that.
- */
-static inline void __mmdrop_delayed(struct rcu_head *rhp)
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
- struct mm_struct *mm = container_of(rhp, struct mm_struct, delayed_drop);
+ struct hp_ctx ctx;
- __mmdrop(mm);
-}
+ /*
+ * mmgrab_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment in finish_task_switch() which relies on this.
+ */
+ smp_mb();
-/*
- * Invoked from finish_task_switch(). Delegates the heavy lifting on RT
- * kernels via RCU.
- */
-static inline void mmdrop_sched(struct mm_struct *mm)
-{
- /* Provides a full memory barrier. See mmdrop() */
- if (atomic_dec_and_test(&mm->mm_count))
- call_rcu(&mm->delayed_drop, __mmdrop_delayed);
-}
-#else
-static inline void mmdrop_sched(struct mm_struct *mm)
-{
- mmdrop(mm);
-}
-#endif
+ /*
+ * The caller guarantees existence of mm. Allocate a hazard
+ * pointer to chain this existence guarantee to a hazard
+ * pointer.
+ */
+ ctx = hp_allocate(&hp_domain_sched_lazy_mm, mm);
-/* Helpers for lazy TLB mm refcounting */
-static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
-{
- if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
- mmgrab(mm);
+ /* There is only a single lazy mm per CPU at any time. */
+ WARN_ON_ONCE(!hp_ctx_addr(ctx));
}
static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
- if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
- mmdrop(mm);
- } else {
- /*
- * mmdrop_lazy_tlb must provide a full memory barrier, see the
- * membarrier comment finish_task_switch which relies on this.
- */
- smp_mb();
- }
+ /*
+ * mmdrop_lazy_tlb must provide a full memory barrier, see the
+ * membarrier comment in finish_task_switch() which relies on this.
+ */
+ smp_mb();
+ WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL);
}
static inline void mmdrop_lazy_tlb_sched(struct mm_struct *mm)
{
- if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
- mmdrop_sched(mm);
- else
- smp_mb(); /* see mmdrop_lazy_tlb() above */
+ smp_mb(); /* see mmdrop_lazy_tlb() above */
+ WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL);
}
/**
diff --git a/kernel/exit.c b/kernel/exit.c
index 7430852a8571..cb4ace06c0f0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -545,8 +545,6 @@ static void exit_mm(void)
if (!mm)
return;
mmap_read_lock(mm);
- mmgrab_lazy_tlb(mm);
- BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
/*
@@ -561,6 +559,8 @@ static void exit_mm(void)
*/
smp_mb__after_spinlock();
local_irq_disable();
+ mmgrab_lazy_tlb(mm);
+ BUG_ON(mm != current->active_mm);
current->mm = NULL;
membarrier_update_current_mm(NULL);
enter_lazy_tlb(mm, current);
diff --git a/kernel/fork.c b/kernel/fork.c
index cc760491f201..42c652ec39b5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -149,6 +149,9 @@ DEFINE_PER_CPU(unsigned long, process_counts) = 0;
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */
+/* Sched lazy mm hazard pointer domain. */
+DEFINE_PER_CPU(struct hp_slot, hp_domain_sched_lazy_mm);
+
#ifdef CONFIG_PROVE_RCU
int lockdep_tasklist_lock_is_held(void)
{
@@ -855,50 +858,24 @@ static void do_shoot_lazy_tlb(void *arg)
WARN_ON_ONCE(current->mm);
current->active_mm = &init_mm;
switch_mm(mm, &init_mm, current);
+ WRITE_ONCE(this_cpu_ptr(&hp_domain_sched_lazy_mm)->addr, NULL);
}
}
-static void cleanup_lazy_tlbs(struct mm_struct *mm)
+static void retire_lazy_mm_hp(int cpu, struct hp_slot *slot, void *addr)
{
- if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
- /*
- * In this case, lazy tlb mms are refounted and would not reach
- * __mmdrop until all CPUs have switched away and mmdrop()ed.
- */
- return;
- }
+ smp_call_function_single(cpu, do_shoot_lazy_tlb, addr, 1);
+ smp_call_function_single(cpu, do_check_lazy_tlb, addr, 1);
+}
+static void cleanup_lazy_tlbs(struct mm_struct *mm)
+{
/*
- * Lazy mm shootdown does not refcount "lazy tlb mm" usage, rather it
- * requires lazy mm users to switch to another mm when the refcount
+ * Require lazy mm users to switch to another mm when the refcount
* drops to zero, before the mm is freed. This requires IPIs here to
* switch kernel threads to init_mm.
- *
- * archs that use IPIs to flush TLBs can piggy-back that lazy tlb mm
- * switch with the final userspace teardown TLB flush which leaves the
- * mm lazy on this CPU but no others, reducing the need for additional
- * IPIs here. There are cases where a final IPI is still required here,
- * such as the final mmdrop being performed on a different CPU than the
- * one exiting, or kernel threads using the mm when userspace exits.
- *
- * IPI overheads have not found to be expensive, but they could be
- * reduced in a number of possible ways, for example (roughly
- * increasing order of complexity):
- * - The last lazy reference created by exit_mm() could instead switch
- * to init_mm, however it's probable this will run on the same CPU
- * immediately afterwards, so this may not reduce IPIs much.
- * - A batch of mms requiring IPIs could be gathered and freed at once.
- * - CPUs store active_mm where it can be remotely checked without a
- * lock, to filter out false-positives in the cpumask.
- * - After mm_users or mm_count reaches zero, switching away from the
- * mm could clear mm_cpumask to reduce some IPIs, perhaps together
- * with some batching or delaying of the final IPIs.
- * - A delayed freeing and RCU-like quiescing sequence based on mm
- * switching to avoid IPIs completely.
*/
- on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
- if (IS_ENABLED(CONFIG_DEBUG_VM_SHOOT_LAZIES))
- on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+ hp_scan(&hp_domain_sched_lazy_mm, mm, retire_lazy_mm_hp);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c36cc680361..d883c2aa3518 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3527,12 +3527,8 @@ static inline void switch_mm_cid(struct rq *rq,
if (!next->mm) { // to kernel
/*
* user -> kernel transition does not guarantee a barrier, but
- * we can use the fact that it performs an atomic operation in
- * mmgrab().
- */
- if (prev->mm) // from user
- smp_mb__after_mmgrab();
- /*
+ * we can use the fact that mmgrab() has a full barrier.
+ *
* kernel -> kernel transition does not change rq->curr->mm
* state. It stays NULL.
*/
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a30c03a66172..1cb9dab361c9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -803,16 +803,6 @@ config DEBUG_VM
If unsure, say N.
-config DEBUG_VM_SHOOT_LAZIES
- bool "Debug MMU_LAZY_TLB_SHOOTDOWN implementation"
- depends on DEBUG_VM
- depends on MMU_LAZY_TLB_SHOOTDOWN
- help
- Enable additional IPIs that ensure lazy tlb mm references are removed
- before the mm is freed.
-
- If unsure, say N.
-
config DEBUG_VM_MAPLE_TREE
bool "Debug VM maple trees"
depends on DEBUG_VM
--
2.39.2
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
2024-10-04 18:27 ` [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence Mathieu Desnoyers
@ 2024-10-07 14:50 ` kernel test robot
2024-10-07 15:05 ` Mathieu Desnoyers
0 siblings, 1 reply; 26+ messages in thread
From: kernel test robot @ 2024-10-07 14:50 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: oe-lkp, lkp, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Peter Zijlstra, Boqun Feng, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, Mateusz Guzik, Jonas Oberhauser, linux-doc,
linux-kernel, linuxppc-dev, linux-mm, Mathieu Desnoyers,
maged.michael, rcu, lkmm, oliver.sang
Hello,
kernel test robot noticed "BUG:using_smp_processor_id()in_preemptible" on:
commit: efef4da3b19cadf4beb45079a05643a77821de79 ("[RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence")
url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241005-023027
base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
patch link: https://lore.kernel.org/all/20241004182734.1761555-5-mathieu.desnoyers@efficios.com/
patch subject: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
in testcase: boot
compiler: gcc-12
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+--------------------------------------------+------------+------------+
| | 75b478bf10 | efef4da3b1 |
+--------------------------------------------+------------+------------+
| BUG:using_smp_processor_id()in_preemptible | 0 | 12 |
+--------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202410072229.18756716-oliver.sang@intel.com
[ 6.336856][ T48] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/48
[ 6.338102][ T48] caller is debug_smp_processor_id (lib/smp_processor_id.c:61)
[ 6.338809][ T48] CPU: 0 UID: 0 PID: 48 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5
[ 6.339929][ T48] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 6.341119][ T48] Call Trace:
[ 6.341504][ T48] dump_stack_lvl (lib/dump_stack.c:123)
[ 6.342057][ T48] dump_stack (lib/dump_stack.c:130)
[ 6.342065][ T48] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53)
[ 6.342065][ T48] debug_smp_processor_id (lib/smp_processor_id.c:61)
[ 6.342065][ T48] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017)
[ 6.342065][ T48] ? would_dump (fs/exec.c:1409)
[ 6.342065][ T48] begin_new_exec (fs/exec.c:1280)
[ 6.342065][ T48] ? load_elf_phdrs (fs/binfmt_elf.c:534)
[ 6.342065][ T48] load_elf_binary (fs/binfmt_elf.c:996)
[ 6.342065][ T48] ? get_lock_stats (kernel/locking/lockdep.c:339)
[ 6.342065][ T48] ? search_binary_handler (fs/exec.c:1752)
[ 6.342065][ T48] search_binary_handler (fs/exec.c:1752)
[ 6.342065][ T48] exec_binprm (fs/exec.c:1795)
[ 6.342065][ T48] bprm_execve (fs/exec.c:1846 fs/exec.c:1821)
[ 6.342065][ T48] kernel_execve (fs/exec.c:2012)
[ 6.342065][ T48] call_usermodehelper_exec_async (kernel/umh.c:110)
[ 6.342065][ T48] ? umh_complete (kernel/umh.c:65)
[ 6.342065][ T48] ret_from_fork (arch/x86/kernel/process.c:153)
[ 6.342065][ T48] ? umh_complete (kernel/umh.c:65)
[ 6.342065][ T48] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 6.342065][ T48] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
[ 6.352140][ T1] ppdev: user-space parallel port driver
[ 6.353841][ T1] HSI/SSI char device loaded
[ 6.354238][ T1] e1000: Intel(R) PRO/1000 Network Driver
[ 6.354673][ T1] e1000: Copyright (c) 1999-2006 Intel Corporation.
[ 6.650009][ T1] ACPI: _SB_.LNKC: Enabled at IRQ 11
[ 6.968868][ T1] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 52:54:00:12:34:56
[ 6.969500][ T1] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection
[ 6.970506][ T49] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/49
[ 6.971191][ T49] caller is debug_smp_processor_id (lib/smp_processor_id.c:61)
[ 6.971650][ T49] CPU: 0 UID: 0 PID: 49 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5
[ 6.972365][ T49] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 6.973115][ T49] Call Trace:
[ 6.973356][ T49] dump_stack_lvl (lib/dump_stack.c:123)
[ 6.973686][ T49] dump_stack (lib/dump_stack.c:130)
[ 6.973983][ T49] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53)
[ 6.974328][ T49] debug_smp_processor_id (lib/smp_processor_id.c:61)
[ 6.974328][ T49] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017)
[ 6.974328][ T49] ? would_dump (fs/exec.c:1409)
[ 6.974328][ T49] begin_new_exec (fs/exec.c:1280)
[ 6.974328][ T49] ? load_elf_phdrs (fs/binfmt_elf.c:534)
[ 6.974328][ T49] load_elf_binary (fs/binfmt_elf.c:996)
[ 6.974328][ T49] ? get_lock_stats (kernel/locking/lockdep.c:339)
[ 6.974328][ T49] ? search_binary_handler (fs/exec.c:1752)
[ 6.974328][ T49] search_binary_handler (fs/exec.c:1752)
[ 6.974328][ T49] exec_binprm (fs/exec.c:1795)
[ 6.974328][ T49] bprm_execve (fs/exec.c:1846 fs/exec.c:1821)
[ 6.974328][ T49] kernel_execve (fs/exec.c:2012)
[ 6.974328][ T49] call_usermodehelper_exec_async (kernel/umh.c:110)
[ 6.974328][ T49] ? umh_complete (kernel/umh.c:65)
[ 6.974328][ T49] ret_from_fork (arch/x86/kernel/process.c:153)
[ 6.974328][ T49] ? umh_complete (kernel/umh.c:65)
[ 6.974328][ T49] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 6.974328][ T49] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20241007/202410072229.18756716-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
2024-10-07 14:50 ` kernel test robot
@ 2024-10-07 15:05 ` Mathieu Desnoyers
0 siblings, 0 replies; 26+ messages in thread
From: Mathieu Desnoyers @ 2024-10-07 15:05 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, Nicholas Piggin, Michael Ellerman,
Greg Kroah-Hartman, Sebastian Andrzej Siewior, Paul E. McKenney,
Will Deacon, Peter Zijlstra, Boqun Feng, Alan Stern, John Stultz,
Neeraj Upadhyay, Linus Torvalds, Andrew Morton,
Frederic Weisbecker, Joel Fernandes, Josh Triplett,
Uladzislau Rezki, Steven Rostedt, Lai Jiangshan, Zqiang,
Ingo Molnar, Waiman Long, Mark Rutland, Thomas Gleixner,
Vlastimil Babka, Mateusz Guzik, Jonas Oberhauser, linux-doc,
linux-kernel, linuxppc-dev, linux-mm, maged.michael, rcu, lkmm
On 2024-10-07 16:50, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed "BUG:using_smp_processor_id()in_preemptible" on:
>
> commit: efef4da3b19cadf4beb45079a05643a77821de79 ("[RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence")
> url: https://github.com/intel-lab-lkp/linux/commits/Mathieu-Desnoyers/compiler-h-Introduce-ptr_eq-to-preserve-address-dependency/20241005-023027
> base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
> patch link: https://lore.kernel.org/all/20241004182734.1761555-5-mathieu.desnoyers@efficios.com/
> patch subject: [RFC PATCH v2 4/4] sched+mm: Use hazard pointers to track lazy active mm existence
>
This should do the trick:
- WRITE_ONCE(this_cpu_ptr(hp_domain_sched_lazy_mm.percpu_slots)->addr, NULL);
+ this_cpu_write(hp_domain_sched_lazy_mm.percpu_slots->addr, NULL);
I'll update the patch for the next round.
Thanks,
Mathieu
> in testcase: boot
>
> compiler: gcc-12
> test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
> +--------------------------------------------+------------+------------+
> | | 75b478bf10 | efef4da3b1 |
> +--------------------------------------------+------------+------------+
> | BUG:using_smp_processor_id()in_preemptible | 0 | 12 |
> +--------------------------------------------+------------+------------+
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202410072229.18756716-oliver.sang@intel.com
>
>
> [ 6.336856][ T48] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/48
> [ 6.338102][ T48] caller is debug_smp_processor_id (lib/smp_processor_id.c:61)
> [ 6.338809][ T48] CPU: 0 UID: 0 PID: 48 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5
> [ 6.339929][ T48] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 6.341119][ T48] Call Trace:
> [ 6.341504][ T48] dump_stack_lvl (lib/dump_stack.c:123)
> [ 6.342057][ T48] dump_stack (lib/dump_stack.c:130)
> [ 6.342065][ T48] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53)
> [ 6.342065][ T48] debug_smp_processor_id (lib/smp_processor_id.c:61)
> [ 6.342065][ T48] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017)
> [ 6.342065][ T48] ? would_dump (fs/exec.c:1409)
> [ 6.342065][ T48] begin_new_exec (fs/exec.c:1280)
> [ 6.342065][ T48] ? load_elf_phdrs (fs/binfmt_elf.c:534)
> [ 6.342065][ T48] load_elf_binary (fs/binfmt_elf.c:996)
> [ 6.342065][ T48] ? get_lock_stats (kernel/locking/lockdep.c:339)
> [ 6.342065][ T48] ? search_binary_handler (fs/exec.c:1752)
> [ 6.342065][ T48] search_binary_handler (fs/exec.c:1752)
> [ 6.342065][ T48] exec_binprm (fs/exec.c:1795)
> [ 6.342065][ T48] bprm_execve (fs/exec.c:1846 fs/exec.c:1821)
> [ 6.342065][ T48] kernel_execve (fs/exec.c:2012)
> [ 6.342065][ T48] call_usermodehelper_exec_async (kernel/umh.c:110)
> [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65)
> [ 6.342065][ T48] ret_from_fork (arch/x86/kernel/process.c:153)
> [ 6.342065][ T48] ? umh_complete (kernel/umh.c:65)
> [ 6.342065][ T48] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 6.342065][ T48] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
> [ 6.352140][ T1] ppdev: user-space parallel port driver
> [ 6.353841][ T1] HSI/SSI char device loaded
> [ 6.354238][ T1] e1000: Intel(R) PRO/1000 Network Driver
> [ 6.354673][ T1] e1000: Copyright (c) 1999-2006 Intel Corporation.
> [ 6.650009][ T1] ACPI: _SB_.LNKC: Enabled at IRQ 11
> [ 6.968868][ T1] e1000 0000:00:03.0 eth0: (PCI:33MHz:32-bit) 52:54:00:12:34:56
> [ 6.969500][ T1] e1000 0000:00:03.0 eth0: Intel(R) PRO/1000 Network Connection
> [ 6.970506][ T49] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/u10:1/49
> [ 6.971191][ T49] caller is debug_smp_processor_id (lib/smp_processor_id.c:61)
> [ 6.971650][ T49] CPU: 0 UID: 0 PID: 49 Comm: kworker/u10:1 Not tainted 6.12.0-rc1-00004-gefef4da3b19c #5
> [ 6.972365][ T49] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 6.973115][ T49] Call Trace:
> [ 6.973356][ T49] dump_stack_lvl (lib/dump_stack.c:123)
> [ 6.973686][ T49] dump_stack (lib/dump_stack.c:130)
> [ 6.973983][ T49] check_preemption_disabled (arch/x86/include/asm/preempt.h:84 lib/smp_processor_id.c:53)
> [ 6.974328][ T49] debug_smp_processor_id (lib/smp_processor_id.c:61)
> [ 6.974328][ T49] exec_mmap (include/linux/sched/mm.h:91 fs/exec.c:1017)
> [ 6.974328][ T49] ? would_dump (fs/exec.c:1409)
> [ 6.974328][ T49] begin_new_exec (fs/exec.c:1280)
> [ 6.974328][ T49] ? load_elf_phdrs (fs/binfmt_elf.c:534)
> [ 6.974328][ T49] load_elf_binary (fs/binfmt_elf.c:996)
> [ 6.974328][ T49] ? get_lock_stats (kernel/locking/lockdep.c:339)
> [ 6.974328][ T49] ? search_binary_handler (fs/exec.c:1752)
> [ 6.974328][ T49] search_binary_handler (fs/exec.c:1752)
> [ 6.974328][ T49] exec_binprm (fs/exec.c:1795)
> [ 6.974328][ T49] bprm_execve (fs/exec.c:1846 fs/exec.c:1821)
> [ 6.974328][ T49] kernel_execve (fs/exec.c:2012)
> [ 6.974328][ T49] call_usermodehelper_exec_async (kernel/umh.c:110)
> [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65)
> [ 6.974328][ T49] ret_from_fork (arch/x86/kernel/process.c:153)
> [ 6.974328][ T49] ? umh_complete (kernel/umh.c:65)
> [ 6.974328][ T49] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 6.974328][ T49] entry_INT80_32 (arch/x86/entry/entry_32.S:944)
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20241007/202410072229.18756716-oliver.sang@intel.com
>
>
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 26+ messages in thread