* Handling large numbers of hazard pointers @ 2025-12-13 22:31 Paul E. McKenney 2025-12-14 2:38 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2025-12-13 22:31 UTC (permalink / raw) To: rcu; +Cc: boqun.feng, mark.rutland, mathieu.desnoyers, roman.gushchin Hello! I didn't do a good job of answering the "what about large numbers of hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers Conference yesterday, so please allow me to at least start on the path towards fixing that problem. Also, there were a couple of people participating whose email addresses I don't know, so please feel free to CC them. The trick is that in real workloads to date, although there might be millions of hazard pointers, there will typically only be a few active per CPU at a given time. This of course suggests a per-CPU data structure tracking the active ones. Allocating a hazard pointer grabs an unused one from this array, or, if all entries are in use, takes memory provided by the caller and links it into an overflow list. Either way, it returns a pointer to the hazard pointer that is now visible to updaters. When done, the caller calls a function that marks the array-entry as unused or removes the element from the list, as the case may be. Because hazard pointers can migrate among CPUs, full synchronization is required when operating on the array and the overflow list. And either way, the caller is responsible for allocating and freeing the backup hazard-pointer structure that will be used in case of overflow. And also either way, the updater need only deal with hazard pointers that are currently in use. Cue Mark telling me that his use case can have millions of active hazard pointers? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 12+ messages in thread
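[Editorial note] Paul's scheme above — a per-CPU slot array, a caller-provided backup node, an overflow list, and full synchronization on both — can be made concrete with a small userspace model. Everything below (hp_acquire, hp_release, struct hp_cpu, struct hp_overflow, a pthread mutex standing in for kernel synchronization) is invented for illustration, not an existing kernel API:

```c
/*
 * Toy userspace model of the scheme described above; not kernel code.
 * A fixed per-CPU array of hazard-pointer slots, plus caller-supplied
 * overflow nodes linked into a list when the array is full.
 */
#include <stddef.h>
#include <pthread.h>

#define HP_SLOTS_PER_CPU 8

struct hp_overflow {                /* caller-provided backup storage */
	void *ptr;
	struct hp_overflow *next;
};

struct hp_cpu {
	void *slots[HP_SLOTS_PER_CPU];  /* NULL means unused */
	struct hp_overflow *overflow;   /* list of in-use backup nodes */
	pthread_mutex_t lock;           /* models "full synchronization" */
};

/*
 * Returns the slot now visible to updaters: either an array entry, or
 * the caller's backup node once it is linked into the overflow list.
 */
static void **hp_acquire(struct hp_cpu *cpu, struct hp_overflow *backup,
			 void *obj)
{
	pthread_mutex_lock(&cpu->lock);
	for (int i = 0; i < HP_SLOTS_PER_CPU; i++) {
		if (!cpu->slots[i]) {
			cpu->slots[i] = obj;
			pthread_mutex_unlock(&cpu->lock);
			return &cpu->slots[i];
		}
	}
	/* Array full: fall back to the caller-provided node. */
	backup->ptr = obj;
	backup->next = cpu->overflow;
	cpu->overflow = backup;
	pthread_mutex_unlock(&cpu->lock);
	return &backup->ptr;
}

/* Marks an array entry unused, or unlinks an overflow node. */
static void hp_release(struct hp_cpu *cpu, void **slot)
{
	pthread_mutex_lock(&cpu->lock);
	if (slot >= cpu->slots && slot < cpu->slots + HP_SLOTS_PER_CPU) {
		*slot = NULL;
	} else {
		struct hp_overflow *node = (struct hp_overflow *)
			((char *)slot - offsetof(struct hp_overflow, ptr));
		for (struct hp_overflow **p = &cpu->overflow; *p;
		     p = &(*p)->next) {
			if (*p == node) {
				*p = node->next;
				break;
			}
		}
	}
	pthread_mutex_unlock(&cpu->lock);
}
```

An updater scanning for readers would then inspect only the non-NULL slot entries and the overflow list of each CPU, i.e. only the hazard pointers currently in use.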
* Re: Handling large numbers of hazard pointers 2025-12-13 22:31 Handling large numbers of hazard pointers Paul E. McKenney @ 2025-12-14 2:38 ` Mathieu Desnoyers 2025-12-14 6:37 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-14 2:38 UTC (permalink / raw) To: paulmck, rcu; +Cc: boqun.feng, mark.rutland, roman.gushchin On 2025-12-13 17:31, Paul E. McKenney wrote: > Hello! > > I didn't do a good job of answering the "what about large numbers of > hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers > Conference yesterday, so please allow me to at least start on the path > towards fixing that problem. > > Also, there were a couple of people participating whose email addresses > I don't know, so please feel free to CC them. > > The trick is that in real workloads to date, although there might be > millions of hazard pointers, there will typically only be a few active > per CPU at a given time. This of course suggests a per-CPU data structure > tracking the active ones. Allocating a hazard pointer grabs an unused one > from this array, or, if all entries are in use, takes memory provided by > the caller and links it into an overflow list. Either way, it returns a > pointer to the hazard pointer that is now visible to updaters. When done, > the caller calls a function that marks the array-entry as unused or > removes the element from the list, as the case may be. Because hazard > pointers can migrate among CPUs, full synchronization is required when > operating on the array and the overflow list. > > And either way, the caller is responsible for allocating and freeing the > backup hazard-pointer structure that will be used in case of overflow. > And also either way, the updater need only deal with hazard pointers > that are currently in use. OK, so let me state how I see the fundamental problem you are trying to address, and detail a possible solution. 
* Problem Statement

Assuming we go for an array of hazard pointer slots per CPU to cover the fast path (common case), we still need to handle the overflow scenario, where more hazard pointers are accessed concurrently for a given CPU than the array size, whether due to preemption, nested interrupts, or simply nested calls.

* Possible Solution

Requiring the HP caller to allocate backup space is clearly something that would cover all scenarios. My worry is that tracking this backup space allocation may be cumbersome for the user, especially if it requires heap allocation.

Where the backup space can be allocated will likely depend on how long the HP will be accessed. My working hypothesis here (let me know if I'm wrong) is that most of those HP users will complete their access within the same stack frame where the HP was acquired. This is the primary use-case I would like to make sure is convenient.

For that use-case the users can simply allocate room on their stack frame for the backup HP slot. The requirement here is that they clear the HP slot before the end of the current stack frame. If there is enough room in the per-CPU array, they use that; otherwise, they add the backup slot from their stack into the backup slot list. When they are done, if they used a backup slot, they need to remove it from the list.

There could still be room for more advanced use-cases where the backup slots are allocated on the heap, but I suspect that it would make the API trickier to use and should be reserved for use-cases that really require it.

Thoughts?

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
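[Editorial note] The stack-frame discipline described above might be sketched as a minimal model. All names below (hp_protect, hp_unprotect, struct hp_backup) are invented for illustration; it is single-threaded with one fast-path slot, and synchronization is omitted for brevity:

```c
/*
 * Minimal single-threaded model of the stack-backed backup slot.
 * The backup slot is a local variable, published only when the
 * fast-path slot is taken, and must be withdrawn before its stack
 * frame is popped.
 */
#include <stdbool.h>
#include <stddef.h>

struct hp_backup {
	void *ptr;
	struct hp_backup *next;
	bool on_list;
};

struct hp_percpu {
	void *slot;                 /* one fast-path slot, for brevity */
	struct hp_backup *backups;  /* overflow ("backup slot") list */
};

static void *hp_protect(struct hp_percpu *pc, struct hp_backup *bk, void *obj)
{
	if (!pc->slot) {            /* fast path: per-CPU slot */
		pc->slot = obj;
		bk->on_list = false;
	} else {                    /* slow path: publish the stack backup */
		bk->ptr = obj;
		bk->next = pc->backups;
		pc->backups = bk;
		bk->on_list = true;
	}
	return obj;
}

static void hp_unprotect(struct hp_percpu *pc, struct hp_backup *bk, void *obj)
{
	if (!bk->on_list && pc->slot == obj) {
		pc->slot = NULL;
	} else {
		for (struct hp_backup **p = &pc->backups; *p; p = &(*p)->next) {
			if (*p == bk) {
				*p = bk->next;
				break;
			}
		}
	}
}

/* Caller pattern: backup slot on the stack, cleared before return. */
static int reader(struct hp_percpu *pc, void *obj)
{
	struct hp_backup bk;        /* lives only for this frame */
	void *p = hp_protect(pc, &bk, obj);

	/* ... critical section using p ... */
	hp_unprotect(pc, &bk, p);
	return p != NULL;
}
```

The key invariant is the one stated above: a backup slot that was chained must be unchained before the frame holding it returns.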
* Re: Handling large numbers of hazard pointers 2025-12-14 2:38 ` Mathieu Desnoyers @ 2025-12-14 6:37 ` Joel Fernandes 2025-12-14 7:14 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 6:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 14, 2025, at 11:38 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2025-12-13 17:31, Paul E. McKenney wrote: >> Hello! >> I didn't do a good job of answering the "what about large numbers of >> hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers >> Conference yesterday, so please allow me to at least start on the path >> towards fixing that problem. >> Also, there were a couple of people participating whose email addresses >> I don't know, so please feel free to CC them. >> The trick is that in real workloads to date, although there might be >> millions of hazard pointers, there will typically only be a few active >> per CPU at a given time. This of course suggests a per-CPU data structure >> tracking the active ones. Allocating a hazard pointer grabs an unused one >> from this array, or, if all entries are in use, takes memory provided by >> the caller and links it into an overflow list. Either way, it returns a >> pointer to the hazard pointer that is now visible to updaters. When done, >> the caller calls a function that marks the array-entry as unused or >> removes the element from the list, as the case may be. Because hazard >> pointers can migrate among CPUs, full synchronization is required when >> operating on the array and the overflow list. >> And either way, the caller is responsible for allocating and freeing the >> backup hazard-pointer structure that will be used in case of overflow. >> And also either way, the updater need only deal with hazard pointers >> that are currently in use. 
> OK, so let me state how I see the fundamental problem you are trying > to address, and detail a possible solution. > > * Problem Statement > > Assuming we go for an array of hazard pointer slots per CPU to cover > the fast path (common case), we still need to handle the > overflow scenario, where more hazard pointers are accessed > concurrently for a given CPU than the array size, either due to > preemption, nested interrupts, or simply nested calls. > > * Possible Solution > > Requiring the HP caller to allocate backup space is clearly something > that would cover all scenarios. My worry is that tracking this backup > space allocation may be cumbersome for the user, especially if this > requires heap allocation. > > Where the backup space can be allocated will likely depend on how long > the HP will be accessed. My working hypothesis here (let me know if > I'm wrong) is that a most of those HP users will complete their access > within the same stack frame where the HP was acquired. This is the > primary use-case I would like to make sure is convenient. > > For that use-case the users can simply allocate room on their > stack frame for the backup HP slot. The requirement here is that they > clear the HP slot before the end of the current stack frame. > If there is enough room in the per-CPU array, they use that, else > they add the backup slot from their stack into the backup slot > list. When they are done, if they used a backup slot, they need > to remove it from the list. > > There could still be room for more advanced use-cases where the > backup slots are allocated on the heap, but I suspect that it would > make the API trickier to use and should be reserved for use-cases > that really require it. > > Thoughts ? This sounds fine to me. 
We have had similar issues with kfree_rcu(): running out of preallocated memory there meant we just triggered a slow path (synchronize), but for hazard pointers I am not sure what such a slow path would be, since this problem appears to be on the reader side.

Perhaps we can also preallocate overflow nodes in the API itself, and tap into those in case of overflow? The user need not provide their own storage for overflow purposes then, I think. And perhaps this preallocated pool of overflow nodes could also be common to all hazptr users. Ideally, the API user would not need to deal with overflow at all; it would work transparently behind the scenes.

I think we need not worry too much about the preallocated overflow nodes themselves running out, because that is no different from reserved memory needed in atomic context, which needs a minimum anyway, right? And we have a bounded number of CPUs and a bounded number of contexts, so the number of *active* nodes required at any given time should also be bounded?

Thoughts?

Thanks.

> > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
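[Editorial note] The bounded, shared preallocated pool suggested above could be modeled with a simple freelist. Names and sizes below are invented; a real implementation would need locking or per-CPU partitioning, and would size the pool from the actual CPU and context counts:

```c
/*
 * Toy model of a shared, preallocated pool of overflow nodes, sized by
 * CPUs times nesting contexts so it is bounded by construction.
 * Single-threaded for brevity; a real pool would need synchronization.
 */
#include <stddef.h>

#define NR_CPUS_MODEL 4
#define NR_CTX_MODEL  3   /* e.g. task, softirq, hardirq */
#define POOL_SIZE (NR_CPUS_MODEL * NR_CTX_MODEL)

struct hp_node {
	void *ptr;
	struct hp_node *next;
};

static struct hp_node pool_mem[POOL_SIZE];
static struct hp_node *freelist;

static void pool_init(void)
{
	freelist = NULL;
	for (int i = 0; i < POOL_SIZE; i++) {
		pool_mem[i].next = freelist;
		freelist = &pool_mem[i];
	}
}

/* Returns NULL only if the bounded pool is exhausted. */
static struct hp_node *pool_get(void)
{
	struct hp_node *n = freelist;

	if (n)
		freelist = n->next;
	return n;
}

static void pool_put(struct hp_node *n)
{
	n->next = freelist;
	freelist = n;
}
```

Mathieu's question in the follow-up — what readers should do when pool_get() returns NULL — is exactly the case this sketch leaves open.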
* Re: Handling large numbers of hazard pointers 2025-12-14 6:37 ` Joel Fernandes @ 2025-12-14 7:14 ` Joel Fernandes 2025-12-14 13:43 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 7:14 UTC (permalink / raw) To: Mathieu Desnoyers Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 14, 2025, at 3:37 PM, Joel Fernandes <joelagnelf@nvidia.com> wrote: > > > >>> On Dec 14, 2025, at 11:38 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2025-12-13 17:31, Paul E. McKenney wrote: >>> Hello! >>> I didn't do a good job of answering the "what about large numbers of >>> hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers >>> Conference yesterday, so please allow me to at least start on the path >>> towards fixing that problem. >>> Also, there were a couple of people participating whose email addresses >>> I don't know, so please feel free to CC them. >>> The trick is that in real workloads to date, although there might be >>> millions of hazard pointers, there will typically only be a few active >>> per CPU at a given time. This of course suggests a per-CPU data structure >>> tracking the active ones. Allocating a hazard pointer grabs an unused one >>> from this array, or, if all entries are in use, takes memory provided by >>> the caller and links it into an overflow list. Either way, it returns a >>> pointer to the hazard pointer that is now visible to updaters. When done, >>> the caller calls a function that marks the array-entry as unused or >>> removes the element from the list, as the case may be. Because hazard >>> pointers can migrate among CPUs, full synchronization is required when >>> operating on the array and the overflow list. >>> And either way, the caller is responsible for allocating and freeing the >>> backup hazard-pointer structure that will be used in case of overflow. 
>>> And also either way, the updater need only deal with hazard pointers >>> that are currently in use. >> OK, so let me state how I see the fundamental problem you are trying >> to address, and detail a possible solution. >> >> * Problem Statement >> >> Assuming we go for an array of hazard pointer slots per CPU to cover >> the fast path (common case), we still need to handle the >> overflow scenario, where more hazard pointers are accessed >> concurrently for a given CPU than the array size, either due to >> preemption, nested interrupts, or simply nested calls. >> >> * Possible Solution >> >> Requiring the HP caller to allocate backup space is clearly something >> that would cover all scenarios. My worry is that tracking this backup >> space allocation may be cumbersome for the user, especially if this >> requires heap allocation. >> >> Where the backup space can be allocated will likely depend on how long >> the HP will be accessed. My working hypothesis here (let me know if >> I'm wrong) is that a most of those HP users will complete their access >> within the same stack frame where the HP was acquired. This is the >> primary use-case I would like to make sure is convenient. >> >> For that use-case the users can simply allocate room on their >> stack frame for the backup HP slot. The requirement here is that they >> clear the HP slot before the end of the current stack frame. >> If there is enough room in the per-CPU array, they use that, else >> they add the backup slot from their stack into the backup slot >> list. When they are done, if they used a backup slot, they need >> to remove it from the list. >> >> There could still be room for more advanced use-cases where the >> backup slots are allocated on the heap, but I suspect that it would >> make the API trickier to use and should be reserved for use-cases >> that really require it. >> >> Thoughts ? > > This sounds fine to me. 
> > We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. > > Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. > > I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? > > Thoughts? > > Thanks.

By the way, I wanted to emphasize my PoV that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers, I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage, IMHO (or we should make allocation internal to the API, as I mentioned).

Thanks.

> > > > > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 7:14 ` Joel Fernandes @ 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng 2025-12-14 22:36 ` Joel Fernandes 0 siblings, 2 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-14 13:43 UTC (permalink / raw) To: Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 02:14, Joel Fernandes wrote: > [...] >> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >> >> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. With the preallocation approach, how can we guarantee that we preallocate enough, and if we cannot provide that guarantee, how should readers behave when they reach that limit ? >> >> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? So the "simple" implementation presented by Boqun at LPC2025 was limited to preemption disabled context, but in general there is really no need to require hazptr users to disable preemption for the entire duration of their hazptr protected critical section. 
This means that tasks holding a hazptr can be preempted or blocked, which significantly increases the number of concurrently active hazptrs we need to preallocate.

So I'd rather remove the preempt-disable constraint across the entire hazptr critical section if it's not too much trouble. And it appears that backup slot preallocation is the only fundamental reason that would motivate this constraint so far.

>... > By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned).

I think the solution I have in mind takes care of the negative aspects of the preallocation requirement. Here is how it would be used (see [1] for the current prototype). The "backup slot" is within the struct hazptr_ctx, so from a user perspective, it's under the hood. The user just has to provide a context variable on its stack.

(untested example below)

void *ptr;

void fct(void)
{
	struct hazptr_ctx ctx;
	void *addr;

	addr = hazptr_load_try_protect(&ctx, &ptr);
	if (!addr)
		return;
	[critical section]
	hazptr_release(&ctx, addr);
}

Thanks,

Mathieu

[1] https://github.com/compudj/linux-dev/commit/5e33fc13f950929e630a60ef44249e7b0978e80f

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 13:43 ` Mathieu Desnoyers @ 2025-12-14 20:49 ` Boqun Feng 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-14 22:36 ` Joel Fernandes 1 sibling, 1 reply; 12+ messages in thread From: Boqun Feng @ 2025-12-14 20:49 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On Sun, Dec 14, 2025 at 08:43:18AM -0500, Mathieu Desnoyers wrote: [...] > > ... By the way I wanted to emphasize my PoV, that requiring storage > > provided by the user seems to negate one of the big benefits of hazard

TBH, if users don't want to do the extra allocation, they probably want to just use RCU ;-) And even if we can hide the allocation behind the API, the performance penalty is still there when the "allocation" happens. For example, in Mathieu's proposal below, when this CPU's slot is in use (which should be normal if we allow preemptible hazptr readers), every hazptr_try_protect() will become a mutex acquisition, which means we are back to locking in the common case.

In general, I don't think allocating or filling in the backup slot during _try_protect() is a good idea, unless we can avoid the cost in the common case.

> > pointers I think. Unless there is a solid use case for it, we should > > probably not require the user to provide separate storage IMHO (or make > > allocation internal to the api as I mentioned). > I think the solution I have in mind takes care of the negative aspects > of preallocation requirement. Here is how it would be used (see > [1] for the current prototype). The "backup slot" is within the

A few issues in the prototype:

* hazptr_load_try_protect_percpu() needs to either be called in a preemption-disabled context or disable preemption itself.

* In __hazptr_load_try_protect(), it seems to me that we shouldn't return after hazptr_chain_backup_slot() has been called, but rather continue; otherwise we would miss the updater changes.
* As I mentioned above, since we allow hazptr readers to be preempted, when we have multiple users it's likely that the per-CPU slot is in use most of the time, and other try_protect()s would always go into the slow path, which downgrades us to normal locking.

Regards,

Boqun

> struct hazptr_ctx, so from a user perspective, it's under the hood. > > The user just has to provide a context variable on its stack. > > (untested example below) > > void *ptr; > > void fct(void) > { > struct hazptr_ctx ctx; > void *addr; > > addr = hazptr_load_try_protect(&ctx, &ptr); > if (!addr) > return; > [critical section] > hazptr_release(&ctx, addr); > } > > Thanks, > > Mathieu > > [1] https://github.com/compudj/linux-dev/commit/5e33fc13f950929e630a60ef44249e7b0978e80f > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 20:49 ` Boqun Feng @ 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-15 21:11 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 19:34 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 15:49, Boqun Feng wrote: [...] > A few issues in the prototype: > See [1] for the updated commit. > * hazptr_load_try_protect_percpu() needs to either be called in > preemption disabling context or disable preemption itself. Good point, fixed. Added a "guard(preempt)()" within the scope of a for loop so preemption is disabled for each attempt, but not across attempts. > > * In __hazptr_load_try_protect(), seems to me that we shouldn't return > after hazptr_chain_backup_slot() being called, but rather continue, > otherwise we would miss the updater changes. Good point, fixed. > > * As I mentioned above, since we allow preemption during hazptr readers, > then when we have multiple users, it's likely that the per-CPU slot is > used at most of the time, and other try_protect()s would always go > into the slow path, which downgrade us to normal locking.

Fair point. I've modified my implementation to have 8 slots per CPU (rather than one). The array fits within a single cache line. That should allow a few preempted tasks to keep hold of a hazard pointer without requiring use of the backup slot.

As for the overflow list, I've made the following changes:

- It now has per-CPU overflow lists, so the fallback relies on per-CPU locking rather than a global lock.
- List add/del is protected with a raw spinlock rather than a mutex.
- A generation counter (64-bit integer) is introduced for each overflow list, allowing piecewise iteration of the list by synchronize. The generation counter is read when beginning list iteration.
Each time the raw spinlock is released and taken again to continue list iteration, the generation counter is checked, and the entire list traversal is retried if it does not match the counter value read at the beginning of traversal.

Strictly speaking, the main downside of the generation counter approach is that a steady stream of list modifications could cause a synchronize to retry endlessly. I'm open to ideas on how to improve this.

Feedback is welcome!

[1] https://github.com/compudj/linux-dev/commit/730616cd2989b710b585144ac10fc34b4fd641ea

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
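[Editorial note] The piecewise, generation-checked traversal described above might be modeled like this in userspace. A pthread mutex stands in for the raw spinlock, all names are invented, and 'chunk' bounds how many nodes are scanned per lock hold:

```c
/*
 * Userspace model of the generation-counter scheme: the list is
 * scanned in bounded chunks, the lock is dropped between chunks, and
 * the scan restarts from the head if the list was modified meanwhile.
 */
#include <stdint.h>
#include <stdbool.h>
#include <pthread.h>

struct node {
	void *ptr;
	struct node *next;
};

struct ovf_list {
	pthread_mutex_t lock;  /* stands in for a raw spinlock */
	uint64_t gen;          /* bumped on every add/del */
	struct node *head;
};

static void ovf_add(struct ovf_list *l, struct node *n)
{
	pthread_mutex_lock(&l->lock);
	n->next = l->head;
	l->head = n;
	l->gen++;
	pthread_mutex_unlock(&l->lock);
}

/*
 * Returns true iff some node currently holds 'target'. Scans at most
 * 'chunk' nodes per lock hold; retries from scratch on gen change.
 */
static bool ovf_scan_for(struct ovf_list *l, void *target, int chunk)
{
retry:
	pthread_mutex_lock(&l->lock);
	uint64_t start_gen = l->gen;
	struct node *n = l->head;
	int i = 0;

	while (n) {
		if (n->ptr == target) {
			pthread_mutex_unlock(&l->lock);
			return true;
		}
		n = n->next;
		if (++i == chunk && n) {
			/* Drop the lock to bound hold time... */
			pthread_mutex_unlock(&l->lock);
			pthread_mutex_lock(&l->lock);
			/* ...and restart if the list changed, since 'n'
			 * may have been removed meanwhile. */
			if (l->gen != start_gen) {
				pthread_mutex_unlock(&l->lock);
				goto retry;
			}
			i = 0;
		}
	}
	pthread_mutex_unlock(&l->lock);
	return false;
}
```

As noted above, a steady stream of adds/removals keeps bumping gen, so a scan with a small chunk size can in principle retry forever; that is the open starvation question.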
* Re: Handling large numbers of hazard pointers 2025-12-15 19:34 ` Mathieu Desnoyers @ 2025-12-15 21:11 ` Mathieu Desnoyers 0 siblings, 0 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 21:11 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-15 14:34, Mathieu Desnoyers wrote: [...] > Feedback is welcome! Note: here is an updated commit which renames the "hazptr_load_try_protect" to "hazptr_acquire". The change of name makes sense now that the memory for the backup slot is already allocated on the caller stack. The hazard pointer acquire operation cannot fail anymore. The loaded pointer value change scenario is handled with a retry. https://github.com/compudj/linux-dev/commit/cd9e8721a2409d7ab651c9a191f35a48fcbc4e9b Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
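[Editorial note] The cannot-fail acquire with retry-on-pointer-change described above can be modeled with C11 atomics. hazptr_acquire_model is an invented name, and finding or chaining the actual slot (the part the prototype does) is omitted here:

```c
/*
 * Model of the retry-until-stable publication step behind an acquire
 * that cannot fail: publish the loaded pointer, re-load the source,
 * and retry until both agree, so a concurrent updater either sees the
 * published hazard pointer or forces another iteration.
 */
#include <stdatomic.h>

static void *hazptr_acquire_model(_Atomic(void *) *src,
				  _Atomic(void *) *slot)
{
	void *p = atomic_load(src);

	for (;;) {
		atomic_store(slot, p);       /* publish to updaters */
		void *q = atomic_load(src);  /* re-check the source */
		if (q == p)
			return p;            /* stable: protected */
		p = q;                       /* changed: retry */
	}
}
```

The return value is whatever pointer was stable at publication time (possibly NULL if the source was cleared), so the operation itself never fails.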
* Re: Handling large numbers of hazard pointers 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng @ 2025-12-14 22:36 ` Joel Fernandes 2025-12-14 23:26 ` Joel Fernandes 1 sibling, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 22:36 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Joel Fernandes, paulmck, rcu, boqun.feng, mark.rutland, roman.gushchin > On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2025-12-14 02:14, Joel Fernandes wrote: > [...] >>> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >>> >>> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. > > With the preallocation approach, how can we guarantee that we > preallocate enough, and if we cannot provide that guarantee, > how should readers behave when they reach that limit ? Sure, see below. >>> >>> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? 
> > So the "simple" implementation presented by Boqun at LPC2025 was limited > to preemption disabled context, but in general there is really no need > to require hazptr users to disable preemption for the entire duration of > their hazptr protected critical section. This means that tasks holding a > hazptr can be preempted or blocked, which increases significantly the > number of concurrently active hazptr we need to preallocate. > > So I'd rather remove the preempt disable constraint across the entire > hazptr crtitical section if it's not too much trouble. And it appears > that backup slots preallocation is the only fundamental reason that > would motivate this constraint so far.

We can remove the preempt-disable constraint and still not require stack allocation by simply embedding overflow nodes and any other context in task_struct, right? This is similar to how preemptible RCU tracks readers.

>> ... By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned). > I think the solution I have in mind takes care of the negative aspects > of preallocation requirement. Here is how it would be used (see > [1] for the current prototype). The "backup slot" is within the > struct hazptr_ctx, so from a user perspective, it's under the hood. > > The user just has to provide a context variable on its stack. > > (untested example below) > > void *ptr; > > void fct(void) > { > struct hazptr_ctx ctx; > void *addr; > > addr = hazptr_load_try_protect(&ctx, &ptr); > if (!addr) > return; > [critical section] > hazptr_release(&ctx, addr);

Correct me if I am wrong, but I am not seeing why a stack ctx is required if the ctx and any space required can be placed in the task_struct (and hidden from the user).
For interrupts, we can just use per-CPU storage split by context, like qspinlock does, I think (unless we are going with the wildcard proposals). Thoughts? Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
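[Editorial note] The task_struct-embedding idea suggested above might be modeled like this. The struct names, the HP_PER_TASK bound, and the helpers are purely hypothetical; the point is only that per-task slots make reader storage invisible to the caller while keeping the total bounded by tasks times nesting depth:

```c
/*
 * Toy model: a small, fixed set of hazptr slots embedded in each
 * task's control block, so readers never need caller storage.
 */
#include <stddef.h>

#define HP_PER_TASK 2  /* bounded nesting per task, by assumption */

struct task_hazptr {
	void *slot[HP_PER_TASK];
};

struct task_model {    /* stands in for struct task_struct */
	struct task_hazptr hp;
};

/*
 * Returns a slot index, or -1 if the per-task bound is exceeded
 * (i.e. nesting deeper than the design allows).
 */
static int task_hp_set(struct task_model *t, void *obj)
{
	for (int i = 0; i < HP_PER_TASK; i++) {
		if (!t->hp.slot[i]) {
			t->hp.slot[i] = obj;
			return i;
		}
	}
	return -1;
}

static void task_hp_clear(struct task_model *t, int idx)
{
	t->hp.slot[idx] = NULL;
}
```

As the follow-up message notes, this trades the per-CPU scan for a per-task scan on the updater side, and still needs a separate answer for interrupt contexts.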
* Re: Handling large numbers of hazard pointers 2025-12-14 22:36 ` Joel Fernandes @ 2025-12-14 23:26 ` Joel Fernandes 2025-12-15 19:38 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 23:26 UTC (permalink / raw) To: Joel Fernandes Cc: Mathieu Desnoyers, paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > >> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >> >> On 2025-12-14 02:14, Joel Fernandes wrote: >> [...] >>>> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >>>> >>>> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. >> >> With the preallocation approach, how can we guarantee that we >> preallocate enough, and if we cannot provide that guarantee, >> how should readers behave when they reach that limit ? > > Sure, see below. > >>>> >>>> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? 
>> >> So the "simple" implementation presented by Boqun at LPC2025 was limited >> to preemption disabled context, but in general there is really no need >> to require hazptr users to disable preemption for the entire duration of >> their hazptr protected critical section. This means that tasks holding a >> hazptr can be preempted or blocked, which increases significantly the >> number of concurrently active hazptr we need to preallocate. >> >> So I'd rather remove the preempt disable constraint across the entire >> hazptr crtitical section if it's not too much trouble. And it appears >> that backup slots preallocation is the only fundamental reason that >> would motivate this constraint so far. > > We can remove the preempt disable constraint and still not require stack allocation by simply embedding overflow nodes and any other context, in task_struct right? This is similiar to how preemptible-rcu tracks readers. > >>> ... By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned). >> I think the solution I have in mind takes care of the negative aspects >> of preallocation requirement. Here is how it would be used (see >> [1] for the current prototype). The "backup slot" is within the >> struct hazptr_ctx, so from a user perspective, it's under the hood. >> >> The user just has to provide a context variable on its stack. 
>> >> (untested example below) >> >> void *ptr; >> >> void fct(void) >> { >> struct hazptr_ctx ctx; >> void *addr; >> >> addr = hazptr_load_try_protect(&ctx, &ptr); >> if (!addr) >> return; >> [critical section] >> hazptr_release(&ctx, addr); > > Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx and any space required can be placed in the task struct (and hidden from the user). For interrupts, we can just use per-CPU storage split by context like qspinlock does, I think (unless we are going with the wildcard proposals). Oh, I missed the case of different threads requiring protection of different types. Maybe it will work if hazard pointer sections cannot be nested. Or we can wildcard them or something on preemption. I am going through all the proposals again. - Joel > > Thoughts? > > Thanks. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 23:26 ` Joel Fernandes @ 2025-12-15 19:38 ` Mathieu Desnoyers 2025-12-17 7:54 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 19:38 UTC (permalink / raw) To: Joel Fernandes, Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 18:26, Joel Fernandes wrote: > > >> On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: >> >> >> >>> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2025-12-14 02:14, Joel Fernandes wrote: >>> [...] >>>>> We have had similar issues with kfree_rcu(): running out of preallocated memory there meant we just triggered a slow path (synchronize), but for hazard pointers I am not sure what such a slow path would be, since this problem appears to be on the reader side. >>>>> >>>>> Perhaps we can also preallocate overflow nodes in the API itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this preallocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not make the API user deal with overflow at all, so that it transparently works behind the scenes. >>> >>> With the preallocation approach, how can we guarantee that we >>> preallocate enough, and if we cannot provide that guarantee, >>> how should readers behave when they reach that limit ? >> >> Sure, see below. >> >>>>> >>>>> I think we need not worry too much about the preallocated overflow nodes themselves running out, because that is no different from reserved memory needed in atomic context, which needs a minimum reserve anyway, right? And we have a bounded number of CPUs and a bounded number of contexts, so the number of *active* nodes required at any given time should also be bounded?
>>> >>> So the "simple" implementation presented by Boqun at LPC2025 was limited >>> to preemption-disabled context, but in general there is really no need >>> to require hazptr users to disable preemption for the entire duration of >>> their hazptr-protected critical section. This means that tasks holding a >>> hazptr can be preempted or blocked, which significantly increases the >>> number of concurrently active hazptrs we need to preallocate. >>> >>> So I'd rather remove the preempt-disable constraint across the entire >>> hazptr critical section if it's not too much trouble. And it appears >>> that backup-slot preallocation is the only fundamental reason that >>> would motivate this constraint so far. >> >> We can remove the preempt-disable constraint and still not require stack allocation by simply embedding overflow nodes, and any other context, in task_struct, right? This is similar to how preemptible RCU tracks readers. This misses the nested hazard pointer use case, e.g. one function keeps a hazard pointer active while calling another function which itself uses a hazard pointer. Keeping the backup slot on the caller stack frame takes care of nesting, which gets tricky if the backup slot is in the task struct. >> >>>> ... By the way, I wanted to emphasize my PoV that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers, I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the API, as I mentioned). >>> I think the solution I have in mind takes care of the negative aspects >>> of the preallocation requirement. Here is how it would be used (see >>> [1] for the current prototype). The "backup slot" is within the >>> struct hazptr_ctx, so from a user perspective, it's under the hood. >>> >>> The user just has to provide a context variable on its stack.
>>> >>> (untested example below) >>> >>> void *ptr; >>> >>> void fct(void) >>> { >>> struct hazptr_ctx ctx; >>> void *addr; >>> >>> addr = hazptr_load_try_protect(&ctx, &ptr); >>> if (!addr) >>> return; >>> [critical section] >>> hazptr_release(&ctx, addr); >> >> Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx and any space required can be placed in the task struct (and hidden from the user). For interrupts, we can just use per-CPU storage split by context like qspinlock does, I think (unless we are going with the wildcard proposals). > > Oh, I missed the case of different threads requiring protection of different types. Maybe it will work if hazard pointer sections cannot be nested. Or we can wildcard them or something on preemption. Nesting of hazard pointer users (even within the same thread or irq stack) is my main concern here. Thanks for the feedback! Mathieu > > I am going through all the proposals again. > > - Joel > > > >> >> Thoughts? >> >> Thanks. >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-15 19:38 ` Mathieu Desnoyers @ 2025-12-17 7:54 ` Joel Fernandes 0 siblings, 0 replies; 12+ messages in thread From: Joel Fernandes @ 2025-12-17 7:54 UTC (permalink / raw) To: Mathieu Desnoyers, Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev Hi Mathieu, On 12/15/2025 2:38 PM, Mathieu Desnoyers wrote: > On 2025-12-14 18:26, Joel Fernandes wrote: >> >> >>> On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>> >>> >>>> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers >>>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> On 2025-12-14 02:14, Joel Fernandes wrote: >>>> [...] >>>>>> We have had similar issues with kfree_rcu(): running out of >>>>>> preallocated memory there meant we just triggered a slow path >>>>>> (synchronize), but for hazard pointers I am not sure what such a slow path would >>>>>> be, since this problem appears to be on the reader side. >>>>>> >>>>>> Perhaps we can also preallocate overflow nodes in the API itself, and tap >>>>>> into that in case of overflow? The user need not provide their own storage >>>>>> then for overflow purposes, I think. And perhaps this preallocated pool >>>>>> of overflow nodes can also be common to all hazptr users. Ideally it would >>>>>> be good to not make the API user deal with overflow at all, so that it >>>>>> transparently works behind the scenes. >>>> >>>> With the preallocation approach, how can we guarantee that we >>>> preallocate enough, and if we cannot provide that guarantee, >>>> how should readers behave when they reach that limit ? >>> >>> Sure, see below. >>> >>>>>> >>>>>> I think we need not worry too much about the preallocated overflow >>>>>> nodes themselves running out, because that is no different from reserved memory >>>>>> needed in atomic context, which needs a minimum reserve anyway, right? 
And we >>>>>> have a bounded number of CPUs and a bounded number of contexts, so the number >>>>>> of *active* nodes required at any given time should also be bounded? >>>> >>>> So the "simple" implementation presented by Boqun at LPC2025 was limited >>>> to preemption-disabled context, but in general there is really no need >>>> to require hazptr users to disable preemption for the entire duration of >>>> their hazptr-protected critical section. This means that tasks holding a >>>> hazptr can be preempted or blocked, which significantly increases the >>>> number of concurrently active hazptrs we need to preallocate. >>>> >>>> So I'd rather remove the preempt-disable constraint across the entire >>>> hazptr critical section if it's not too much trouble. And it appears >>>> that backup-slot preallocation is the only fundamental reason that >>>> would motivate this constraint so far. >>> >>> We can remove the preempt-disable constraint and still not require stack >>> allocation by simply embedding overflow nodes, and any other context, in >>> task_struct, right? This is similar to how preemptible RCU tracks readers. > > This misses the nested hazard pointer use case, e.g. one function keeps > a hazard pointer active while calling another function which itself > uses a hazard pointer. > > Keeping the backup slot on the caller stack frame takes care of nesting, > which gets tricky if the backup slot is in the task struct. Could you share any example code for this part? AFAICS, all the functions have to do is pass the hazard pointer slot pointer between them, which they have to do anyway to know which slot is in use. You have to do that in all cases, no matter what kind of memory the slots are in, right? > >>> >>>>> ... By the way, I wanted to emphasize my PoV that requiring storage >>>>> provided by the user seems to negate one of the big benefits of hazard >>>>> pointers, I think. 
Unless there is a solid use case for it, we should >>>>> probably not require the user to provide separate storage IMHO (or make >>>>> allocation internal to the API, as I mentioned). >>>> I think the solution I have in mind takes care of the negative aspects >>>> of the preallocation requirement. Here is how it would be used (see >>>> [1] for the current prototype). The "backup slot" is within the >>>> struct hazptr_ctx, so from a user perspective, it's under the hood. >>>> >>>> The user just has to provide a context variable on its stack. >>>> >>>> (untested example below) >>>> >>>> void *ptr; >>>> >>>> void fct(void) >>>> { >>>> struct hazptr_ctx ctx; >>>> void *addr; >>>> >>>> addr = hazptr_load_try_protect(&ctx, &ptr); >>>> if (!addr) >>>> return; >>>> [critical section] >>>> hazptr_release(&ctx, addr); >>> >>> Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx >>> and any space required can be placed in the task struct (and hidden from the >>> user). For interrupts, we can just use per-CPU storage split by context like >>> qspinlock does, I think (unless we are going with the wildcard proposals). >> >> Oh, I missed the case of different threads requiring protection of >> different types. Maybe it will work if hazard pointer sections cannot be >> nested. Or we can wildcard them or something on preemption. > > Nesting of hazard pointer users (even within the same thread or irq > stack) is my main concern here. Sure, as we chatted briefly on IRC, if we have the slots in task_struct itself, then that takes care of nesting as well. But I could be missing your use case. Either way, I will implement something like that with per-task-struct slots and let you and everyone know how it goes. 
FWIW, I am currently testing out the context-switch benchmark on the following machine in our lab: Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 I had no luck reproducing this on a single-socket machine, so I am trying a dual-socket one. I could not get the exact AMD machine on which the reported improvements were collected. - Joel ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-12-17 13:29 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-13 22:31 Handling large numbers of hazard pointers Paul E. McKenney 2025-12-14 2:38 ` Mathieu Desnoyers 2025-12-14 6:37 ` Joel Fernandes 2025-12-14 7:14 ` Joel Fernandes 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-15 21:11 ` Mathieu Desnoyers 2025-12-14 22:36 ` Joel Fernandes 2025-12-14 23:26 ` Joel Fernandes 2025-12-15 19:38 ` Mathieu Desnoyers 2025-12-17 7:54 ` Joel Fernandes