* Handling large numbers of hazard pointers @ 2025-12-13 22:31 Paul E. McKenney 2025-12-14 2:38 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Paul E. McKenney @ 2025-12-13 22:31 UTC (permalink / raw) To: rcu; +Cc: boqun.feng, mark.rutland, mathieu.desnoyers, roman.gushchin Hello! I didn't do a good job of answering the "what about large numbers of hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers Conference yesterday, so please allow me to at least start on the path towards fixing that problem. Also, there were a couple of people participating whose email addresses I don't know, so please feel free to CC them. The trick is that in real workloads to date, although there might be millions of hazard pointers, there will typically only be a few active per CPU at a given time. This of course suggests a per-CPU data structure tracking the active ones. Allocating a hazard pointer grabs an unused one from this array, or, if all entries are in use, takes memory provided by the caller and links it into an overflow list. Either way, it returns a pointer to the hazard pointer that is now visible to updaters. When done, the caller calls a function that marks the array-entry as unused or removes the element from the list, as the case may be. Because hazard pointers can migrate among CPUs, full synchronization is required when operating on the array and the overflow list. And either way, the caller is responsible for allocating and freeing the backup hazard-pointer structure that will be used in case of overflow. And also either way, the updater need only deal with hazard pointers that are currently in use. Cue Mark telling me that his use case can have millions of active hazard pointers? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 12+ messages in thread
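[Editorial note] Paul's scheme above — a per-CPU slot array, a caller-provided backup node, an overflow list, and full synchronization on both — can be made concrete with a small userspace model. Everything below (hp_acquire, hp_release, struct hp_cpu, struct hp_overflow, a pthread mutex standing in for kernel synchronization) is invented for illustration, not an existing kernel API:

```c
/*
 * Toy userspace model of the scheme described above; not kernel code.
 * A fixed per-CPU array of hazard-pointer slots, plus caller-supplied
 * overflow nodes linked into a list when the array is full.
 */
#include <stddef.h>
#include <pthread.h>

#define HP_SLOTS_PER_CPU 8

struct hp_overflow {                /* caller-provided backup storage */
	void *ptr;
	struct hp_overflow *next;
};

struct hp_cpu {
	void *slots[HP_SLOTS_PER_CPU];  /* NULL means unused */
	struct hp_overflow *overflow;   /* list of in-use backup nodes */
	pthread_mutex_t lock;           /* models "full synchronization" */
};

/*
 * Returns the slot now visible to updaters: either an array entry, or
 * the caller's backup node once it is linked into the overflow list.
 */
static void **hp_acquire(struct hp_cpu *cpu, struct hp_overflow *backup,
			 void *obj)
{
	pthread_mutex_lock(&cpu->lock);
	for (int i = 0; i < HP_SLOTS_PER_CPU; i++) {
		if (!cpu->slots[i]) {
			cpu->slots[i] = obj;
			pthread_mutex_unlock(&cpu->lock);
			return &cpu->slots[i];
		}
	}
	/* Array full: fall back to the caller-provided node. */
	backup->ptr = obj;
	backup->next = cpu->overflow;
	cpu->overflow = backup;
	pthread_mutex_unlock(&cpu->lock);
	return &backup->ptr;
}

/* Marks an array entry unused, or unlinks an overflow node. */
static void hp_release(struct hp_cpu *cpu, void **slot)
{
	pthread_mutex_lock(&cpu->lock);
	if (slot >= cpu->slots && slot < cpu->slots + HP_SLOTS_PER_CPU) {
		*slot = NULL;
	} else {
		struct hp_overflow *node = (struct hp_overflow *)
			((char *)slot - offsetof(struct hp_overflow, ptr));
		for (struct hp_overflow **p = &cpu->overflow; *p;
		     p = &(*p)->next) {
			if (*p == node) {
				*p = node->next;
				break;
			}
		}
	}
	pthread_mutex_unlock(&cpu->lock);
}
```

An updater scanning for readers would then inspect only the non-NULL slot entries and the overflow list of each CPU, i.e. only the hazard pointers currently in use.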
* Re: Handling large numbers of hazard pointers 2025-12-13 22:31 Handling large numbers of hazard pointers Paul E. McKenney @ 2025-12-14 2:38 ` Mathieu Desnoyers 2025-12-14 6:37 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-14 2:38 UTC (permalink / raw) To: paulmck, rcu; +Cc: boqun.feng, mark.rutland, roman.gushchin On 2025-12-13 17:31, Paul E. McKenney wrote: > Hello! > > I didn't do a good job of answering the "what about large numbers of > hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers > Conference yesterday, so please allow me to at least start on the path > towards fixing that problem. > > Also, there were a couple of people participating whose email addresses > I don't know, so please feel free to CC them. > > The trick is that in real workloads to date, although there might be > millions of hazard pointers, there will typically only be a few active > per CPU at a given time. This of course suggests a per-CPU data structure > tracking the active ones. Allocating a hazard pointer grabs an unused one > from this array, or, if all entries are in use, takes memory provided by > the caller and links it into an overflow list. Either way, it returns a > pointer to the hazard pointer that is now visible to updaters. When done, > the caller calls a function that marks the array-entry as unused or > removes the element from the list, as the case may be. Because hazard > pointers can migrate among CPUs, full synchronization is required when > operating on the array and the overflow list. > > And either way, the caller is responsible for allocating and freeing the > backup hazard-pointer structure that will be used in case of overflow. > And also either way, the updater need only deal with hazard pointers > that are currently in use. OK, so let me state how I see the fundamental problem you are trying to address, and detail a possible solution. 
* Problem Statement

Assuming we go for an array of hazard pointer slots per CPU to cover the fast path (common case), we still need to handle the overflow scenario, where more hazard pointers are accessed concurrently for a given CPU than the array size, whether due to preemption, nested interrupts, or simply nested calls.

* Possible Solution

Requiring the HP caller to allocate backup space is clearly something that would cover all scenarios. My worry is that tracking this backup space allocation may be cumbersome for the user, especially if it requires heap allocation.

Where the backup space can be allocated will likely depend on how long the HP will be accessed. My working hypothesis here (let me know if I'm wrong) is that most of those HP users will complete their access within the same stack frame where the HP was acquired. This is the primary use-case I would like to make sure is convenient.

For that use-case the users can simply allocate room on their stack frame for the backup HP slot. The requirement here is that they clear the HP slot before the end of the current stack frame. If there is enough room in the per-CPU array, they use that; otherwise, they add the backup slot from their stack into the backup slot list. When they are done, if they used a backup slot, they need to remove it from the list.

There could still be room for more advanced use-cases where the backup slots are allocated on the heap, but I suspect that it would make the API trickier to use and should be reserved for use-cases that really require it.

Thoughts?

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
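[Editorial note] The stack-frame discipline described above might be sketched as a minimal model. All names below (hp_protect, hp_unprotect, struct hp_backup) are invented for illustration; it is single-threaded with one fast-path slot, and synchronization is omitted for brevity:

```c
/*
 * Minimal single-threaded model of the stack-backed backup slot.
 * The backup slot is a local variable, published only when the
 * fast-path slot is taken, and must be withdrawn before its stack
 * frame is popped.
 */
#include <stdbool.h>
#include <stddef.h>

struct hp_backup {
	void *ptr;
	struct hp_backup *next;
	bool on_list;
};

struct hp_percpu {
	void *slot;                 /* one fast-path slot, for brevity */
	struct hp_backup *backups;  /* overflow ("backup slot") list */
};

static void *hp_protect(struct hp_percpu *pc, struct hp_backup *bk, void *obj)
{
	if (!pc->slot) {            /* fast path: per-CPU slot */
		pc->slot = obj;
		bk->on_list = false;
	} else {                    /* slow path: publish the stack backup */
		bk->ptr = obj;
		bk->next = pc->backups;
		pc->backups = bk;
		bk->on_list = true;
	}
	return obj;
}

static void hp_unprotect(struct hp_percpu *pc, struct hp_backup *bk, void *obj)
{
	if (!bk->on_list && pc->slot == obj) {
		pc->slot = NULL;
	} else {
		for (struct hp_backup **p = &pc->backups; *p; p = &(*p)->next) {
			if (*p == bk) {
				*p = bk->next;
				break;
			}
		}
	}
}

/* Caller pattern: backup slot on the stack, cleared before return. */
static int reader(struct hp_percpu *pc, void *obj)
{
	struct hp_backup bk;        /* lives only for this frame */
	void *p = hp_protect(pc, &bk, obj);

	/* ... critical section using p ... */
	hp_unprotect(pc, &bk, p);
	return p != NULL;
}
```

The key invariant is the one stated above: a backup slot that was chained must be unchained before the frame holding it returns.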
* Re: Handling large numbers of hazard pointers 2025-12-14 2:38 ` Mathieu Desnoyers @ 2025-12-14 6:37 ` Joel Fernandes 2025-12-14 7:14 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 6:37 UTC (permalink / raw) To: Mathieu Desnoyers Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 14, 2025, at 11:38 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2025-12-13 17:31, Paul E. McKenney wrote: >> Hello! >> I didn't do a good job of answering the "what about large numbers of >> hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers >> Conference yesterday, so please allow me to at least start on the path >> towards fixing that problem. >> Also, there were a couple of people participating whose email addresses >> I don't know, so please feel free to CC them. >> The trick is that in real workloads to date, although there might be >> millions of hazard pointers, there will typically only be a few active >> per CPU at a given time. This of course suggests a per-CPU data structure >> tracking the active ones. Allocating a hazard pointer grabs an unused one >> from this array, or, if all entries are in use, takes memory provided by >> the caller and links it into an overflow list. Either way, it returns a >> pointer to the hazard pointer that is now visible to updaters. When done, >> the caller calls a function that marks the array-entry as unused or >> removes the element from the list, as the case may be. Because hazard >> pointers can migrate among CPUs, full synchronization is required when >> operating on the array and the overflow list. >> And either way, the caller is responsible for allocating and freeing the >> backup hazard-pointer structure that will be used in case of overflow. >> And also either way, the updater need only deal with hazard pointers >> that are currently in use. 
> OK, so let me state how I see the fundamental problem you are trying > to address, and detail a possible solution. > > * Problem Statement > > Assuming we go for an array of hazard pointer slots per CPU to cover > the fast path (common case), we still need to handle the > overflow scenario, where more hazard pointers are accessed > concurrently for a given CPU than the array size, either due to > preemption, nested interrupts, or simply nested calls. > > * Possible Solution > > Requiring the HP caller to allocate backup space is clearly something > that would cover all scenarios. My worry is that tracking this backup > space allocation may be cumbersome for the user, especially if this > requires heap allocation. > > Where the backup space can be allocated will likely depend on how long > the HP will be accessed. My working hypothesis here (let me know if > I'm wrong) is that a most of those HP users will complete their access > within the same stack frame where the HP was acquired. This is the > primary use-case I would like to make sure is convenient. > > For that use-case the users can simply allocate room on their > stack frame for the backup HP slot. The requirement here is that they > clear the HP slot before the end of the current stack frame. > If there is enough room in the per-CPU array, they use that, else > they add the backup slot from their stack into the backup slot > list. When they are done, if they used a backup slot, they need > to remove it from the list. > > There could still be room for more advanced use-cases where the > backup slots are allocated on the heap, but I suspect that it would > make the API trickier to use and should be reserved for use-cases > that really require it. > > Thoughts ? This sounds fine to me. 
We have had similar issues with kfree_rcu(): running out of preallocated memory there meant we just triggered a slow path (synchronize), but for hazard pointers I am not sure what such a slow path would be, since this problem appears to be on the reader side.

Perhaps we can also preallocate overflow nodes in the API itself, and tap into those in case of overflow? The user need not provide their own storage for overflow purposes then, I think. And perhaps this preallocated pool of overflow nodes could also be common to all hazptr users. Ideally, the API user would not need to deal with overflow at all; it would work transparently behind the scenes.

I think we need not worry too much about the preallocated overflow nodes themselves running out, because that is no different from reserved memory needed in atomic context, which needs a minimum anyway, right? And we have a bounded number of CPUs and a bounded number of contexts, so the number of *active* nodes required at any given time should also be bounded?

Thoughts?

Thanks.

> > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com > ^ permalink raw reply [flat|nested] 12+ messages in thread
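[Editorial note] The bounded, shared preallocated pool suggested above could be modeled with a simple freelist. Names and sizes below are invented; a real implementation would need locking or per-CPU partitioning, and would size the pool from the actual CPU and context counts:

```c
/*
 * Toy model of a shared, preallocated pool of overflow nodes, sized by
 * CPUs times nesting contexts so it is bounded by construction.
 * Single-threaded for brevity; a real pool would need synchronization.
 */
#include <stddef.h>

#define NR_CPUS_MODEL 4
#define NR_CTX_MODEL  3   /* e.g. task, softirq, hardirq */
#define POOL_SIZE (NR_CPUS_MODEL * NR_CTX_MODEL)

struct hp_node {
	void *ptr;
	struct hp_node *next;
};

static struct hp_node pool_mem[POOL_SIZE];
static struct hp_node *freelist;

static void pool_init(void)
{
	freelist = NULL;
	for (int i = 0; i < POOL_SIZE; i++) {
		pool_mem[i].next = freelist;
		freelist = &pool_mem[i];
	}
}

/* Returns NULL only if the bounded pool is exhausted. */
static struct hp_node *pool_get(void)
{
	struct hp_node *n = freelist;

	if (n)
		freelist = n->next;
	return n;
}

static void pool_put(struct hp_node *n)
{
	n->next = freelist;
	freelist = n;
}
```

Mathieu's question in the follow-up — what readers should do when pool_get() returns NULL — is exactly the case this sketch leaves open.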
* Re: Handling large numbers of hazard pointers 2025-12-14 6:37 ` Joel Fernandes @ 2025-12-14 7:14 ` Joel Fernandes 2025-12-14 13:43 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 7:14 UTC (permalink / raw) To: Mathieu Desnoyers Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 14, 2025, at 3:37 PM, Joel Fernandes <joelagnelf@nvidia.com> wrote: > > > >>> On Dec 14, 2025, at 11:38 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2025-12-13 17:31, Paul E. McKenney wrote: >>> Hello! >>> I didn't do a good job of answering the "what about large numbers of >>> hazard pointers" at Boqun's and my hazard-pointers talk at Linux Plumbers >>> Conference yesterday, so please allow me to at least start on the path >>> towards fixing that problem. >>> Also, there were a couple of people participating whose email addresses >>> I don't know, so please feel free to CC them. >>> The trick is that in real workloads to date, although there might be >>> millions of hazard pointers, there will typically only be a few active >>> per CPU at a given time. This of course suggests a per-CPU data structure >>> tracking the active ones. Allocating a hazard pointer grabs an unused one >>> from this array, or, if all entries are in use, takes memory provided by >>> the caller and links it into an overflow list. Either way, it returns a >>> pointer to the hazard pointer that is now visible to updaters. When done, >>> the caller calls a function that marks the array-entry as unused or >>> removes the element from the list, as the case may be. Because hazard >>> pointers can migrate among CPUs, full synchronization is required when >>> operating on the array and the overflow list. >>> And either way, the caller is responsible for allocating and freeing the >>> backup hazard-pointer structure that will be used in case of overflow. 
>>> And also either way, the updater need only deal with hazard pointers >>> that are currently in use. >> OK, so let me state how I see the fundamental problem you are trying >> to address, and detail a possible solution. >> >> * Problem Statement >> >> Assuming we go for an array of hazard pointer slots per CPU to cover >> the fast path (common case), we still need to handle the >> overflow scenario, where more hazard pointers are accessed >> concurrently for a given CPU than the array size, either due to >> preemption, nested interrupts, or simply nested calls. >> >> * Possible Solution >> >> Requiring the HP caller to allocate backup space is clearly something >> that would cover all scenarios. My worry is that tracking this backup >> space allocation may be cumbersome for the user, especially if this >> requires heap allocation. >> >> Where the backup space can be allocated will likely depend on how long >> the HP will be accessed. My working hypothesis here (let me know if >> I'm wrong) is that a most of those HP users will complete their access >> within the same stack frame where the HP was acquired. This is the >> primary use-case I would like to make sure is convenient. >> >> For that use-case the users can simply allocate room on their >> stack frame for the backup HP slot. The requirement here is that they >> clear the HP slot before the end of the current stack frame. >> If there is enough room in the per-CPU array, they use that, else >> they add the backup slot from their stack into the backup slot >> list. When they are done, if they used a backup slot, they need >> to remove it from the list. >> >> There could still be room for more advanced use-cases where the >> backup slots are allocated on the heap, but I suspect that it would >> make the API trickier to use and should be reserved for use-cases >> that really require it. >> >> Thoughts ? > > This sounds fine to me. 
> > We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. > > Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. > > I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? > > Thoughts? > > Thanks.

By the way, I wanted to emphasize my PoV that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers, I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage, IMHO (or we should make allocation internal to the API, as I mentioned).

Thanks.

> > > > > >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 7:14 ` Joel Fernandes @ 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng 2025-12-14 22:36 ` Joel Fernandes 0 siblings, 2 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-14 13:43 UTC (permalink / raw) To: Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 02:14, Joel Fernandes wrote: > [...] >> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >> >> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. With the preallocation approach, how can we guarantee that we preallocate enough, and if we cannot provide that guarantee, how should readers behave when they reach that limit ? >> >> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? So the "simple" implementation presented by Boqun at LPC2025 was limited to preemption disabled context, but in general there is really no need to require hazptr users to disable preemption for the entire duration of their hazptr protected critical section. 
This means that tasks holding a hazptr can be preempted or blocked, which significantly increases the number of concurrently active hazptrs we need to preallocate.

So I'd rather remove the preempt-disable constraint across the entire hazptr critical section if it's not too much trouble. And it appears that backup slot preallocation is the only fundamental reason that would motivate this constraint so far.

>... > By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned).

I think the solution I have in mind takes care of the negative aspects of the preallocation requirement. Here is how it would be used (see [1] for the current prototype). The "backup slot" is within the struct hazptr_ctx, so from a user perspective, it's under the hood. The user just has to provide a context variable on its stack.

(untested example below)

void *ptr;

void fct(void)
{
	struct hazptr_ctx ctx;
	void *addr;

	addr = hazptr_load_try_protect(&ctx, &ptr);
	if (!addr)
		return;
	[critical section]
	hazptr_release(&ctx, addr);
}

Thanks,

Mathieu

[1] https://github.com/compudj/linux-dev/commit/5e33fc13f950929e630a60ef44249e7b0978e80f

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 13:43 ` Mathieu Desnoyers @ 2025-12-14 20:49 ` Boqun Feng 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-14 22:36 ` Joel Fernandes 1 sibling, 1 reply; 12+ messages in thread From: Boqun Feng @ 2025-12-14 20:49 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On Sun, Dec 14, 2025 at 08:43:18AM -0500, Mathieu Desnoyers wrote: [...] > > ... By the way I wanted to emphasize my PoV, that requiring storage > > provided by the user seems to negate one of the big benefits of hazard

TBH, if users don't want to do the extra allocation, they probably want to just use RCU ;-) And even if we can hide the allocation behind the API, the performance penalty is still there when the "allocation" happens. For example, in Mathieu's proposal below, when this CPU's slot is in use (which should be normal if we allow preemptible hazptr readers), every hazptr_try_protect() will become a mutex acquisition, which means we are back to locking in the common case.

In general, I don't think allocating or filling in the backup slot during _try_protect() is a good idea, unless we can avoid the cost in the common case.

> > pointers I think. Unless there is a solid use case for it, we should > > probably not require the user to provide separate storage IMHO (or make > > allocation internal to the api as I mentioned). > I think the solution I have in mind takes care of the negative aspects > of preallocation requirement. Here is how it would be used (see > [1] for the current prototype). The "backup slot" is within the

A few issues in the prototype:

* hazptr_load_try_protect_percpu() needs to either be called in a preemption-disabled context or disable preemption itself.

* In __hazptr_load_try_protect(), it seems to me that we shouldn't return after hazptr_chain_backup_slot() has been called, but rather continue; otherwise we would miss the updater changes.
* As I mentioned above, since we allow hazptr readers to be preempted, when we have multiple users it's likely that the per-CPU slot is in use most of the time, and other try_protect()s would always go into the slow path, which downgrades us to normal locking.

Regards,

Boqun

> struct hazptr_ctx, so from a user perspective, it's under the hood. > > The user just has to provide a context variable on its stack. > > (untested example below) > > void *ptr; > > void fct(void) > { > struct hazptr_ctx ctx; > void *addr; > > addr = hazptr_load_try_protect(&ctx, &ptr); > if (!addr) > return; > [critical section] > hazptr_release(&ctx, addr); > } > > Thanks, > > Mathieu > > [1] https://github.com/compudj/linux-dev/commit/5e33fc13f950929e630a60ef44249e7b0978e80f > > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 20:49 ` Boqun Feng @ 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-15 21:11 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 19:34 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 15:49, Boqun Feng wrote: [...] > A few issues in the prototype: > See [1] for the updated commit. > * hazptr_load_try_protect_percpu() needs to either be called in > preemption disabling context or disable preemption itself. Good point, fixed. Added a "guard(preempt)()" within the scope of a for loop so preemption is disabled for each attempt, but not across attempts. > > * In __hazptr_load_try_protect(), seems to me that we shouldn't return > after hazptr_chain_backup_slot() being called, but rather continue, > otherwise we would miss the updater changes. Good point, fixed. > > * As I mentioned above, since we allow preemption during hazptr readers, > then when we have multiple users, it's likely that the per-CPU slot is > used at most of the time, and other try_protect()s would always go > into the slow path, which downgrade us to normal locking.

Fair point. I've modified my implementation to have 8 slots per CPU (rather than one). The array fits within a single cache line. That should allow a few preempted tasks to keep hold of a hazard pointer without requiring use of the backup slot.

As for the overflow list, I've made the following changes:

- It now has per-CPU overflow lists, so the fallback relies on per-CPU locking rather than a global lock.
- List add/del is protected with a raw spinlock rather than a mutex.
- A generation counter (64-bit integer) is introduced for each overflow list, allowing piecewise iteration of the list by synchronize. The generation counter is read when beginning list iteration.
Each time the raw spinlock is released and taken again to continue list iteration, the generation counter is checked, and the entire list traversal is retried if it does not match the counter value read at the beginning of traversal.

Strictly speaking, the main downside of the generation counter approach is that a steady stream of list modifications could cause a synchronize to retry endlessly. I'm open to ideas on how to improve this.

Feedback is welcome!

[1] https://github.com/compudj/linux-dev/commit/730616cd2989b710b585144ac10fc34b4fd641ea

Thanks,

Mathieu

-- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
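[Editorial note] The piecewise, generation-checked traversal described above might be modeled like this in userspace. A pthread mutex stands in for the raw spinlock, all names are invented, and 'chunk' bounds how many nodes are scanned per lock hold:

```c
/*
 * Userspace model of the generation-counter scheme: the list is
 * scanned in bounded chunks, the lock is dropped between chunks, and
 * the scan restarts from the head if the list was modified meanwhile.
 */
#include <stdint.h>
#include <stdbool.h>
#include <pthread.h>

struct node {
	void *ptr;
	struct node *next;
};

struct ovf_list {
	pthread_mutex_t lock;  /* stands in for a raw spinlock */
	uint64_t gen;          /* bumped on every add/del */
	struct node *head;
};

static void ovf_add(struct ovf_list *l, struct node *n)
{
	pthread_mutex_lock(&l->lock);
	n->next = l->head;
	l->head = n;
	l->gen++;
	pthread_mutex_unlock(&l->lock);
}

/*
 * Returns true iff some node currently holds 'target'. Scans at most
 * 'chunk' nodes per lock hold; retries from scratch on gen change.
 */
static bool ovf_scan_for(struct ovf_list *l, void *target, int chunk)
{
retry:
	pthread_mutex_lock(&l->lock);
	uint64_t start_gen = l->gen;
	struct node *n = l->head;
	int i = 0;

	while (n) {
		if (n->ptr == target) {
			pthread_mutex_unlock(&l->lock);
			return true;
		}
		n = n->next;
		if (++i == chunk && n) {
			/* Drop the lock to bound hold time... */
			pthread_mutex_unlock(&l->lock);
			pthread_mutex_lock(&l->lock);
			/* ...and restart if the list changed, since 'n'
			 * may have been removed meanwhile. */
			if (l->gen != start_gen) {
				pthread_mutex_unlock(&l->lock);
				goto retry;
			}
			i = 0;
		}
	}
	pthread_mutex_unlock(&l->lock);
	return false;
}
```

As noted above, a steady stream of adds/removals keeps bumping gen, so a scan with a small chunk size can in principle retry forever; that is the open starvation question.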
* Re: Handling large numbers of hazard pointers 2025-12-15 19:34 ` Mathieu Desnoyers @ 2025-12-15 21:11 ` Mathieu Desnoyers 0 siblings, 0 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 21:11 UTC (permalink / raw) To: Boqun Feng Cc: Joel Fernandes, paulmck@kernel.org, rcu@vger.kernel.org, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-15 14:34, Mathieu Desnoyers wrote: [...] > Feedback is welcome! Note: here is an updated commit which renames the "hazptr_load_try_protect" to "hazptr_acquire". The change of name makes sense now that the memory for the backup slot is already allocated on the caller stack. The hazard pointer acquire operation cannot fail anymore. The loaded pointer value change scenario is handled with a retry. https://github.com/compudj/linux-dev/commit/cd9e8721a2409d7ab651c9a191f35a48fcbc4e9b Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
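[Editorial note] The cannot-fail acquire with retry-on-pointer-change described above can be modeled with C11 atomics. hazptr_acquire_model is an invented name, and finding or chaining the actual slot (the part the prototype does) is omitted here:

```c
/*
 * Model of the retry-until-stable publication step behind an acquire
 * that cannot fail: publish the loaded pointer, re-load the source,
 * and retry until both agree, so a concurrent updater either sees the
 * published hazard pointer or forces another iteration.
 */
#include <stdatomic.h>

static void *hazptr_acquire_model(_Atomic(void *) *src,
				  _Atomic(void *) *slot)
{
	void *p = atomic_load(src);

	for (;;) {
		atomic_store(slot, p);       /* publish to updaters */
		void *q = atomic_load(src);  /* re-check the source */
		if (q == p)
			return p;            /* stable: protected */
		p = q;                       /* changed: retry */
	}
}
```

The return value is whatever pointer was stable at publication time (possibly NULL if the source was cleared), so the operation itself never fails.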
* Re: Handling large numbers of hazard pointers 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng @ 2025-12-14 22:36 ` Joel Fernandes 2025-12-14 23:26 ` Joel Fernandes 1 sibling, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 22:36 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Joel Fernandes, paulmck, rcu, boqun.feng, mark.rutland, roman.gushchin > On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2025-12-14 02:14, Joel Fernandes wrote: > [...] >>> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >>> >>> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. > > With the preallocation approach, how can we guarantee that we > preallocate enough, and if we cannot provide that guarantee, > how should readers behave when they reach that limit ? Sure, see below. >>> >>> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? 
> > So the "simple" implementation presented by Boqun at LPC2025 was limited > to preemption disabled context, but in general there is really no need > to require hazptr users to disable preemption for the entire duration of > their hazptr protected critical section. This means that tasks holding a > hazptr can be preempted or blocked, which increases significantly the > number of concurrently active hazptr we need to preallocate. > > So I'd rather remove the preempt disable constraint across the entire > hazptr crtitical section if it's not too much trouble. And it appears > that backup slots preallocation is the only fundamental reason that > would motivate this constraint so far.

We can remove the preempt-disable constraint and still not require stack allocation by simply embedding overflow nodes and any other context in task_struct, right? This is similar to how preemptible RCU tracks readers.

>> ... By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned). > I think the solution I have in mind takes care of the negative aspects > of preallocation requirement. Here is how it would be used (see > [1] for the current prototype). The "backup slot" is within the > struct hazptr_ctx, so from a user perspective, it's under the hood. > > The user just has to provide a context variable on its stack. > > (untested example below) > > void *ptr; > > void fct(void) > { > struct hazptr_ctx ctx; > void *addr; > > addr = hazptr_load_try_protect(&ctx, &ptr); > if (!addr) > return; > [critical section] > hazptr_release(&ctx, addr);

Correct me if I am wrong, but I am not seeing why a stack ctx is required if the ctx and any space required can be placed in the task_struct (and hidden from the user).
For interrupts, we can just use per-CPU storage split by context, like qspinlock does, I think (unless we are going with the wildcard proposals). Thoughts? Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
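[Editorial note] The task_struct-embedding idea suggested above might be modeled like this. The struct names, the HP_PER_TASK bound, and the helpers are purely hypothetical; the point is only that per-task slots make reader storage invisible to the caller while keeping the total bounded by tasks times nesting depth:

```c
/*
 * Toy model: a small, fixed set of hazptr slots embedded in each
 * task's control block, so readers never need caller storage.
 */
#include <stddef.h>

#define HP_PER_TASK 2  /* bounded nesting per task, by assumption */

struct task_hazptr {
	void *slot[HP_PER_TASK];
};

struct task_model {    /* stands in for struct task_struct */
	struct task_hazptr hp;
};

/*
 * Returns a slot index, or -1 if the per-task bound is exceeded
 * (i.e. nesting deeper than the design allows).
 */
static int task_hp_set(struct task_model *t, void *obj)
{
	for (int i = 0; i < HP_PER_TASK; i++) {
		if (!t->hp.slot[i]) {
			t->hp.slot[i] = obj;
			return i;
		}
	}
	return -1;
}

static void task_hp_clear(struct task_model *t, int idx)
{
	t->hp.slot[idx] = NULL;
}
```

As the follow-up message notes, this trades the per-CPU scan for a per-task scan on the updater side, and still needs a separate answer for interrupt contexts.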
* Re: Handling large numbers of hazard pointers 2025-12-14 22:36 ` Joel Fernandes @ 2025-12-14 23:26 ` Joel Fernandes 2025-12-15 19:38 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Joel Fernandes @ 2025-12-14 23:26 UTC (permalink / raw) To: Joel Fernandes Cc: Mathieu Desnoyers, paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev > On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: > > > >> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >> >> On 2025-12-14 02:14, Joel Fernandes wrote: >> [...] >>>> We have had similar issues with kfree_rcu() and running out of preallocated memory there meant we just triggered a slow path (synchronize) but for hazard ptr I am not sure what such a slow path would be since, since this problem appears to be on the reader side. >>>> >>>> Perhaps we can also pre allocate overflow nodes in the api itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this pre allocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not all the api user deal with overflow at all and it transparently works behind the scenes. >> >> With the preallocation approach, how can we guarantee that we >> preallocate enough, and if we cannot provide that guarantee, >> how should readers behave when they reach that limit ? > > Sure, see below. > >>>> >>>> I think we need not worry too much about cases of preallocated overflow nodes itself running out because that is no different than reserved memory needed in atomic context which need a minimum off anyway right? And we have a bounded number of CPU and bounded number of context, so the number of *active* nodes required at any given time should also be bounded? 
>> >> So the "simple" implementation presented by Boqun at LPC2025 was limited >> to preemption disabled context, but in general there is really no need >> to require hazptr users to disable preemption for the entire duration of >> their hazptr protected critical section. This means that tasks holding a >> hazptr can be preempted or blocked, which increases significantly the >> number of concurrently active hazptr we need to preallocate. >> >> So I'd rather remove the preempt disable constraint across the entire >> hazptr crtitical section if it's not too much trouble. And it appears >> that backup slots preallocation is the only fundamental reason that >> would motivate this constraint so far. > > We can remove the preempt disable constraint and still not require stack allocation by simply embedding overflow nodes and any other context, in task_struct right? This is similiar to how preemptible-rcu tracks readers. > >>> ... By the way I wanted to emphasize my PoV, that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the api as I mentioned). >> I think the solution I have in mind takes care of the negative aspects >> of preallocation requirement. Here is how it would be used (see >> [1] for the current prototype). The "backup slot" is within the >> struct hazptr_ctx, so from a user perspective, it's under the hood. >> >> The user just has to provide a context variable on its stack. 
>> >> (untested example below) >> >> void *ptr; >> >> void fct(void) >> { >> struct hazptr_ctx ctx; >> void *addr; >> >> addr = hazptr_load_try_protect(&ctx, &ptr); >> if (!addr) >> return; >> [critical section] >> hazptr_release(&ctx, addr); > > Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx and any space required can be placed in the task struct (and hidden from the user). For interrupts, we can just use per-CPU storage split by context like qspinlock does, I think (unless we are going with the wildcard proposals). Oh, I missed the case of different threads requiring protection of different types. Maybe it will work if hazard pointer sections cannot be nested. Or we can wildcard them or something on preemption. I am going through all the proposals again. - Joel > > Thoughts? > > Thanks. > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-14 23:26 ` Joel Fernandes @ 2025-12-15 19:38 ` Mathieu Desnoyers 2025-12-17 7:54 ` Joel Fernandes 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2025-12-15 19:38 UTC (permalink / raw) To: Joel Fernandes, Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev On 2025-12-14 18:26, Joel Fernandes wrote: > > >> On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: >> >> >> >>> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: >>> >>> On 2025-12-14 02:14, Joel Fernandes wrote: >>> [...] >>>>> We have had similar issues with kfree_rcu(): running out of preallocated memory there meant we just triggered a slow path (synchronize), but for hazard pointers I am not sure what such a slow path would be, since this problem appears to be on the reader side. >>>>> >>>>> Perhaps we can also preallocate overflow nodes in the API itself, and tap into that in case of overflow? The user need not provide their own storage then for overflow purposes, I think. And perhaps this preallocated pool of overflow nodes can also be common to all hazptr users. Ideally it would be good to not make the API user deal with overflow at all, so that it transparently works behind the scenes. >>> >>> With the preallocation approach, how can we guarantee that we >>> preallocate enough, and if we cannot provide that guarantee, >>> how should readers behave when they reach that limit ? >> >> Sure, see below. >> >>>>> >>>>> I think we need not worry too much about the preallocated overflow nodes themselves running out, because that is no different from reserved memory needed in atomic context, which needs a minimum reserve anyway, right? And we have a bounded number of CPUs and a bounded number of contexts, so the number of *active* nodes required at any given time should also be bounded?
>>> >>> So the "simple" implementation presented by Boqun at LPC2025 was limited >>> to preemption-disabled context, but in general there is really no need >>> to require hazptr users to disable preemption for the entire duration of >>> their hazptr-protected critical section. This means that tasks holding a >>> hazptr can be preempted or blocked, which significantly increases the >>> number of concurrently active hazptrs we need to preallocate. >>> >>> So I'd rather remove the preempt-disable constraint across the entire >>> hazptr critical section if it's not too much trouble. And it appears >>> that backup-slot preallocation is the only fundamental reason that >>> would motivate this constraint so far. >> >> We can remove the preempt-disable constraint and still not require stack allocation by simply embedding overflow nodes, and any other context, in task_struct, right? This is similar to how preemptible RCU tracks readers. This misses the nested hazard pointer use case, e.g. one function keeps a hazard pointer active while calling another function which itself uses a hazard pointer. Keeping the backup slot on the caller stack frame takes care of nesting, which gets tricky if the backup slot is in the task struct. >> >>>> ... By the way, I wanted to emphasize my PoV that requiring storage provided by the user seems to negate one of the big benefits of hazard pointers, I think. Unless there is a solid use case for it, we should probably not require the user to provide separate storage IMHO (or make allocation internal to the API, as I mentioned). >>> I think the solution I have in mind takes care of the negative aspects >>> of the preallocation requirement. Here is how it would be used (see >>> [1] for the current prototype). The "backup slot" is within the >>> struct hazptr_ctx, so from a user perspective, it's under the hood. >>> >>> The user just has to provide a context variable on its stack.
>>> >>> (untested example below) >>> >>> void *ptr; >>> >>> void fct(void) >>> { >>> struct hazptr_ctx ctx; >>> void *addr; >>> >>> addr = hazptr_load_try_protect(&ctx, &ptr); >>> if (!addr) >>> return; >>> [critical section] >>> hazptr_release(&ctx, addr); >> >> Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx and any space required can be placed in the task struct (and hidden from the user). For interrupts, we can just use per-CPU storage split by context like qspinlock does, I think (unless we are going with the wildcard proposals). > > Oh, I missed the case of different threads requiring protection of different types. Maybe it will work if hazard pointer sections cannot be nested. Or we can wildcard them or something on preemption. Nesting of hazard pointer users (even within the same thread or irq stack) is my main concern here. Thanks for the feedback! Mathieu > > I am going through all the proposals again. > > - Joel > > > >> >> Thoughts? >> >> Thanks. >> -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Handling large numbers of hazard pointers 2025-12-15 19:38 ` Mathieu Desnoyers @ 2025-12-17 7:54 ` Joel Fernandes 0 siblings, 0 replies; 12+ messages in thread From: Joel Fernandes @ 2025-12-17 7:54 UTC (permalink / raw) To: Mathieu Desnoyers, Joel Fernandes Cc: paulmck@kernel.org, rcu@vger.kernel.org, boqun.feng@gmail.com, mark.rutland@arm.com, roman.gushchin@linux.dev Hi Mathieu, On 12/15/2025 2:38 PM, Mathieu Desnoyers wrote: > On 2025-12-14 18:26, Joel Fernandes wrote: >> >> >>> On Dec 15, 2025, at 7:36 AM, Joel Fernandes <joel@joelfernandes.org> wrote: >>> >>> >>> >>>> On Dec 14, 2025, at 10:43 PM, Mathieu Desnoyers >>>> <mathieu.desnoyers@efficios.com> wrote: >>>> >>>> On 2025-12-14 02:14, Joel Fernandes wrote: >>>> [...] >>>>>> We have had similar issues with kfree_rcu(): running out of >>>>>> preallocated memory there meant we just triggered a slow path >>>>>> (synchronize), but for hazard pointers I am not sure what such a slow path would >>>>>> be, since this problem appears to be on the reader side. >>>>>> >>>>>> Perhaps we can also preallocate overflow nodes in the API itself, and tap >>>>>> into that in case of overflow? The user need not provide their own storage >>>>>> then for overflow purposes, I think. And perhaps this preallocated pool >>>>>> of overflow nodes can also be common to all hazptr users. Ideally it would >>>>>> be good to not make the API user deal with overflow at all, so that it >>>>>> transparently works behind the scenes. >>>> >>>> With the preallocation approach, how can we guarantee that we >>>> preallocate enough, and if we cannot provide that guarantee, >>>> how should readers behave when they reach that limit ? >>> >>> Sure, see below. >>> >>>>>> >>>>>> I think we need not worry too much about the preallocated overflow >>>>>> nodes themselves running out, because that is no different from reserved memory >>>>>> needed in atomic context, which needs a minimum reserve anyway, right? 
And we >>>>>> have a bounded number of CPUs and a bounded number of contexts, so the number >>>>>> of *active* nodes required at any given time should also be bounded? >>>> >>>> So the "simple" implementation presented by Boqun at LPC2025 was limited >>>> to preemption-disabled context, but in general there is really no need >>>> to require hazptr users to disable preemption for the entire duration of >>>> their hazptr-protected critical section. This means that tasks holding a >>>> hazptr can be preempted or blocked, which significantly increases the >>>> number of concurrently active hazptrs we need to preallocate. >>>> >>>> So I'd rather remove the preempt-disable constraint across the entire >>>> hazptr critical section if it's not too much trouble. And it appears >>>> that backup-slot preallocation is the only fundamental reason that >>>> would motivate this constraint so far. >>> >>> We can remove the preempt-disable constraint and still not require stack >>> allocation by simply embedding overflow nodes, and any other context, in >>> task_struct, right? This is similar to how preemptible RCU tracks readers. > > This misses the nested hazard pointer use case, e.g. one function keeps > a hazard pointer active while calling another function which itself > uses a hazard pointer. > > Keeping the backup slot on the caller stack frame takes care of nesting, > which gets tricky if the backup slot is in the task struct. Could you share any example code for this part? AFAICS, all the functions have to do is pass the hazard pointer slot pointer between them, which they have to do anyway to know which slot is in use. You have to do that in all cases, no matter what kind of memory the slots are in, right? > >>> >>>>> ... By the way, I wanted to emphasize my PoV that requiring storage >>>>> provided by the user seems to negate one of the big benefits of hazard >>>>> pointers, I think. 
Unless there is a solid use case for it, we should >>>>> probably not require the user to provide separate storage IMHO (or make >>>>> allocation internal to the API, as I mentioned). >>>> I think the solution I have in mind takes care of the negative aspects >>>> of the preallocation requirement. Here is how it would be used (see >>>> [1] for the current prototype). The "backup slot" is within the >>>> struct hazptr_ctx, so from a user perspective, it's under the hood. >>>> >>>> The user just has to provide a context variable on its stack. >>>> >>>> (untested example below) >>>> >>>> void *ptr; >>>> >>>> void fct(void) >>>> { >>>> struct hazptr_ctx ctx; >>>> void *addr; >>>> >>>> addr = hazptr_load_try_protect(&ctx, &ptr); >>>> if (!addr) >>>> return; >>>> [critical section] >>>> hazptr_release(&ctx, addr); >>> >>> Correct me if I am wrong, but I am not seeing why a stack ctx is required if ctx >>> and any space required can be placed in the task struct (and hidden from the >>> user). For interrupts, we can just use per-CPU storage split by context like >>> qspinlock does, I think (unless we are going with the wildcard proposals). >> >> Oh, I missed the case of different threads requiring protection of >> different types. Maybe it will work if hazard pointer sections cannot be >> nested. Or we can wildcard them or something on preemption. > > Nesting of hazard pointer users (even within the same thread or irq > stack) is my main concern here. Sure, as we chatted briefly on IRC, if we have the slots in task_struct itself, then that takes care of nesting as well. But I could be missing your use case. Either way, I will implement something like that with per-task-struct slots and let you and everyone know how it goes. 
FWIW, I am currently testing out the context-switch benchmark on the following machine in our lab: Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz CPU family: 6 Model: 106 Thread(s) per core: 2 Core(s) per socket: 32 Socket(s): 2 I had no luck reproducing this on a single-socket machine, so I am trying a dual-socket one. I could not get the exact AMD machine on which the reported improvements were collected. - Joel ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-12-17 13:29 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-13 22:31 Handling large numbers of hazard pointers Paul E. McKenney 2025-12-14 2:38 ` Mathieu Desnoyers 2025-12-14 6:37 ` Joel Fernandes 2025-12-14 7:14 ` Joel Fernandes 2025-12-14 13:43 ` Mathieu Desnoyers 2025-12-14 20:49 ` Boqun Feng 2025-12-15 19:34 ` Mathieu Desnoyers 2025-12-15 21:11 ` Mathieu Desnoyers 2025-12-14 22:36 ` Joel Fernandes 2025-12-14 23:26 ` Joel Fernandes 2025-12-15 19:38 ` Mathieu Desnoyers 2025-12-17 7:54 ` Joel Fernandes