* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere [not found] ` <c5331cd6-76c8-430d-978e-fcad164e48f6@huawei.com> @ 2026-04-23 5:53 ` Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner ` (2 more replies) 0 siblings, 3 replies; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-23 5:53 UTC (permalink / raw) To: Jinjie Ruan, linux-man Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote: > > On 4/23/2026 3:47 AM, Thomas Gleixner wrote: > > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: > >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > >> Conceptually we just need to use syscall_enter_from_user_mode() and > >> irqentry_enter_from_user_mode() appropriately. > > > > Right. I figured that out. > > > >> In practice, I can't use those as-is without introducing the exception > >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(), > >> so I'll need to do some similar refactoring first. > > > > See below. > > > >> I haven't paged everything in yet, so just to cehck, is there anything > >> that would behave incorrectly if current->rseq.event.user_irq were set > >> for syscall entry? IIUC it means we'll effectively do the slow path, and > >> I was wondering if that might be acceptable as a one-line bodge for > >> stable. > > > > It might work, but it's trivial enough to avoid that. See below. That on > > top of 6.19.y makes the selftests pass too. > > This aligns with my thoughts when convert arm64 to generic syscall > entry. Currently, the arm64 entry code does not distinguish between IRQ > and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ > entries as the generic entry framework does, because arm64 uses > enter_from_user_mode() exclusively instead of > irqentry_enter_from_user_mode(). > > https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/ > > > > > Thanks, > > > > tglx > > --- > > arch/arm64/kernel/entry-common.c | 14 ++++++++++---- > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > --- a/arch/arm64/kernel/entry-common.c > > +++ b/arch/arm64/kernel/entry-common.c > > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( > > irqentry_exit(regs, state); > > } > > > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) > > +{ > > + enter_from_user_mode(regs); > > + mte_disable_tco_entry(current); > > +} > > + > > /* > > * Handle IRQ/context state management when entering from user mode. 
> > * Before this function is called it is not safe to call regular kernel code, > > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( > > */ > > static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) > > { > > - enter_from_user_mode(regs); > > - mte_disable_tco_entry(current); > > + arm64_enter_from_user_mode_syscall(regs); > > + rseq_note_user_irq_entry(); > > } > > > > /* > > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ > > > > static void noinstr el0_svc(struct pt_regs *regs) > > { > > - arm64_enter_from_user_mode(regs); > > + arm64_enter_from_user_mode_syscall(regs); > > cortex_a76_erratum_1463225_svc_handler(); > > fpsimd_syscall_enter(); > > local_daif_restore(DAIF_PROCCTX); > > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r > > > > static void noinstr el0_svc_compat(struct pt_regs *regs) > > { > > - arm64_enter_from_user_mode(regs); > > + arm64_enter_from_user_mode_syscall(regs); > > cortex_a76_erratum_1463225_svc_handler(); > > local_daif_restore(DAIF_PROCCTX); > > do_el0_svc_compat(regs); +linux-man This part of the rseq man page needs to be fixed as well I think. The kernel no longer reliably provides clearing of rseq_cs on preemption, right? https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 "and set to NULL by the kernel when it restarts an assembly instruction sequence block, as well as when the kernel detects that it is preempting or delivering a signal outside of the range targeted by the rseq_cs." ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23  5:53 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Dmitry Vyukov
@ 2026-04-23 10:39   ` Thomas Gleixner
  2026-04-23 10:51     ` Mathias Stearn
  2026-04-23 12:11     ` Alejandro Colomar
  2026-04-23 12:29     ` Mathieu Desnoyers
  2 siblings, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2026-04-23 10:39 UTC (permalink / raw)
  To: Dmitry Vyukov, Jinjie Ruan, linux-man
  Cc: Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas,
	Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Blake Oler

On Thu, Apr 23 2026 at 07:53, Dmitry Vyukov wrote:
> On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote:
>
> This part of the rseq man page needs to be fixed as well I think. The
> kernel no longer reliably provides clearing of rseq_cs on preemption,
> right?
>
> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
>
> "and set to NULL by the kernel when it restarts an assembly
> instruction sequence block,
> as well as when the kernel detects that it is preempting or delivering
> a signal outside of the range targeted by the rseq_cs."

The kernel clears rseq_cs reliably when user space was interrupted and:

      the task was preempted
or
      the return from interrupt delivers a signal

If the task invoked a syscall then there is absolutely no reason to do
either of these because syscalls from within a critical section are a
bug and caught when enabling rseq debugging.

The original code did this along with unconditionally updating CPU/MMCID
which resulted in ~15% performance regression on a syscall heavy
database benchmark once glibc started to register rseq.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread
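[For readers following along, the contract Thomas describes can be sketched
in pseudo-C as below. This is a hedged illustration of the exit-to-user
handling, not the actual kernel code; the user_*() helpers are made up, and
struct rseq_cs is the real UAPI descriptor (start_ip, post_commit_offset,
abort_ip).]

  /* Sketch only: runs on return to user mode. interrupted_user is true
   * when the kernel was entered via an interrupt from user space, i.e.
   * for preemption or signal delivery, and false for syscalls. */
  static void rseq_exit_sketch(struct task_struct *t, struct pt_regs *regs,
                               bool interrupted_user)
  {
          u64 csaddr;
          struct rseq_cs cs;

          /* Syscalls never abort: a syscall inside a critical section
           * is a usage bug, caught with rseq debugging enabled. */
          if (!interrupted_user)
                  return;

          csaddr = user_get_rseq_cs(t);                /* hypothetical */
          if (!csaddr)
                  return;

          user_get_cs_descriptor(csaddr, &cs);         /* hypothetical */
          /* Abort if the interrupted IP was inside the section */
          if (instruction_pointer(regs) - cs.start_ip < cs.post_commit_offset)
                  instruction_pointer_set(regs, cs.abort_ip);

          /* rseq_cs is reliably cleared in this path, aborted or not */
          user_put_rseq_cs(t, 0);                      /* hypothetical */
  }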
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 10:39 ` Thomas Gleixner
@ 2026-04-23 10:51   ` Mathias Stearn
  2026-04-23 12:24     ` David Laight
  2026-04-23 19:31     ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Mathias Stearn @ 2026-04-23 10:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> The kernel clears rseq_cs reliably when user space was interrupted and:
>
>       the task was preempted
> or
>       the return from interrupt delivers a signal
>
> If the task invoked a syscall then there is absolutely no reason to do
> either of this because syscalls from within a critical section are a
> bug and catched when enabling rseq debugging.
>
> The original code did this along with unconditionally updating CPU/MMCID
> which resulted in ~15% performance regression on a syscall heavy
> database benchmark once glibc started to register rseq.

Just to be clear, TCMalloc does not need either rseq_cs to be cleared
or cpu_id_start to be written to on syscalls because it doesn't do
syscalls from critical sections. It will actually benefit (slightly)
from not updating cpu_id_start on syscalls.

It is specifically in the cases where an rseq would need to be aborted
(preemption, signals, migration, and membarrier IPI with the rseq
flag) that TCMalloc relies on cpu_id_start being written. It does rely
on that write even when not inside the critical section, because it
effectively uses that to detect if there were any would-cause-abort
events in between two critical sections. But since it leaves the
rseq_cs pointer non-null between critical sections, you don't need
to add _any_ overhead for programs that never make use of rseq after
registration, or add any overhead to syscalls even for those who do.

^ permalink raw reply	[flat|nested] 33+ messages in thread
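[To make the reliance concrete: because the kernel's write of cpu_id_start
was historically unconditional on abort-worthy events, user space can park a
sentinel value there and detect such events even between two critical
sections, including a reschedule back onto the same CPU. A hedged sketch of
that detection idea, assuming the librseq-style __rseq_abi TLS symbol; it is
not TCMalloc's actual code, which parks pointer bytes there instead of a
plain sentinel:]

  #include <linux/rseq.h>
  #include <stdbool.h>
  #include <stdint.h>

  extern __thread struct rseq __rseq_abi;  /* registered rseq area */

  /* Bit 31 set: never a valid CPU number written by the kernel. */
  #define CPU_ID_SENTINEL  0x80000000u

  static void critical_section_done(void)
  {
          /* Note: this overwrites a kernel-owned, read-only field,
           * which is exactly the ABI abuse discussed later in this
           * thread. */
          __rseq_abi.cpu_id_start = CPU_ID_SENTINEL;
  }

  static bool abort_event_since_last_cs(void)
  {
          /* Any kernel update replaced the sentinel with a CPU id,
           * even if the task came back on the same CPU. */
          return __rseq_abi.cpu_id_start != CPU_ID_SENTINEL;
  }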
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 10:51 ` Mathias Stearn @ 2026-04-23 12:24 ` David Laight 2026-04-23 19:31 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: David Laight @ 2026-04-23 12:24 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 12:51:22 +0200 Mathias Stearn <mathias@mongodb.com> wrote: > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > The kernel clears rseq_cs reliably when user space was interrupted and: > > > > the task was preempted > > or > > the return from interrupt delivers a signal > > > > If the task invoked a syscall then there is absolutely no reason to do > > either of this because syscalls from within a critical section are a > > bug and catched when enabling rseq debugging. > > > > The original code did this along with unconditionally updating CPU/MMCID > > which resulted in ~15% performance regression on a syscall heavy > > database benchmark once glibc started to register rseq. > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > or cpu_id_start to be written to on syscalls because it doesn't do > syscalls from critical sections. It will actually benefit (slightly) > from not updating cpu_id_start on syscalls. > > It is specifically in the cases where an rseq would need to be aborted > (preemption, signals, migration, and membarrier IPI with the rseq > flag) that TCMalloc relies on cpu_id_start being written. It does rely > on that write even when not inside the critical section, because it > effectively uses that to detect if there were any would-cause-abort > events in between two critical sections. But since it leaves the > rseq_cs pointer non-null between critical sections, so you dont need > to add _any_ overhead for programs that never make use of rseq after > registration, or add any overhead to syscalls even for those who do. > That sounds like one long rseq sequence where the 'restart' path detects that some of the operations have already been done. David ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 10:51 ` Mathias Stearn 2026-04-23 12:24 ` David Laight @ 2026-04-23 19:31 ` Thomas Gleixner 2026-04-24 7:56 ` Dmitry Vyukov 1 sibling, 1 reply; 33+ messages in thread From: Thomas Gleixner @ 2026-04-23 19:31 UTC (permalink / raw) To: Mathias Stearn Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote: > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> The kernel clears rseq_cs reliably when user space was interrupted and: >> >> the task was preempted >> or >> the return from interrupt delivers a signal >> >> If the task invoked a syscall then there is absolutely no reason to do >> either of this because syscalls from within a critical section are a >> bug and catched when enabling rseq debugging. >> >> The original code did this along with unconditionally updating CPU/MMCID >> which resulted in ~15% performance regression on a syscall heavy >> database benchmark once glibc started to register rseq. > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > or cpu_id_start to be written to on syscalls because it doesn't do > syscalls from critical sections. It will actually benefit (slightly) > from not updating cpu_id_start on syscalls. I know that it does not do syscalls from within critical sections, but it relies on cpu_id_start being unconditionally updated in one way or the other. > It is specifically in the cases where an rseq would need to be aborted > (preemption, signals, migration, and membarrier IPI with the rseq > flag) that TCMalloc relies on cpu_id_start being written. It does rely > on that write even when not inside the critical section, because it > effectively uses that to detect if there were any would-cause-abort > events in between two critical sections. But since it leaves the > rseq_cs pointer non-null between critical sections, so you dont need > to add _any_ overhead for programs that never make use of rseq after > registration, or add any overhead to syscalls even for those who do. Well. According to the comment in the tcmalloc code: // Calculation of the address of the current CPU slabs region is needed for // allocation/deallocation fast paths, but is quite expensive. Due to variable // shift and experimental support for "virtual CPUs", the calculation involves // several additional loads and dependent calculations. Pseudo-code for the // address calculation is as follows: // // cpu_offset = TcmallocSlab.virtual_cpu_id_offset_; // cpu = *(&__rseq_abi + virtual_cpu_id_offset_); // slabs_and_shift = TcmallocSlab.slabs_and_shift_; // shift = slabs_and_shift & kShiftMask; // shifted_cpu = cpu << shift; // slabs = slabs_and_shift & kSlabsMask; // slabs += shifted_cpu; // // To remove this calculation from fast paths, we cache the slabs address // for the current CPU in thread local storage. However, when a thread is // rescheduled to another CPU, we somehow need to understand that the cached ^^^^^^^^^^^ // address is not valid anymore. To achieve this, we overlap the top 4 bytes // of the cached address with __rseq_abi.cpu_id_start. 
When a thread is
// rescheduled the kernel overwrites cpu_id_start with the current CPU number,
// which gives us the signal that the cached address is not valid anymore.

The kernel still as of today (the arm64 bug aside) updates the
cpu_id_start and cpu_id fields in rseq when a task is rescheduled to
another CPU.

So if the code only needs to know when it got rescheduled to another
CPU then it should still work, no?

But it does not, which makes it clear that it relies on this
undocumented behaviour of the kernel to rewrite rseq::cpu_id_start
unconditionally. I'm not yet convinced that it relies on it only when
interrupted between two subsequent critical sections. We'll see.

....

Now we come to the best part of this comment:

// Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.

So any code sequence which ends up in:

      x = tcmalloc();
      dostuff(x)
      evaluate(rseq::cpu_id_start, rseq::cpu_id)

is doomed. This might be acceptable for Google internal usage where they
control the full stack and can prevent anyone else from utilizing rseq,
but in an open ecosystem that's obviously a non-starter.

And they definitely forgot to add this to the comment:

// Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as
// it will expose the blatant ABI abuse and therefore will kill your
// application.

If your assumption that the rewrite is only required when rseq::rseq_cs
is non NULL and user space was interrupted is correct, then the obvious
no-brainer would have been to add:

      __u64	rseq_usr_data;

to struct rseq and clear that unconditionally when rseq::rseq_cs is
cleared.

But that would have been too simple, would work independent of endianness
and would not get in the way of anybody else.

But I know that's incompatible with the features first, correctness
later and we own the world anyway mindset.

Just for giggles I asked Google Gemini about the implications of
tcmalloc's rseq abuse. The answer is pretty clear:

  "In short, TCMalloc treats RSEQ as a private optimization rather than
   a shared system resource, which compromises the stability and
   extensibility of any application that needs RSEQ for anything other
   than memory allocation."

It's also very clear about the wilful ignorance of the tcmalloc people:

  "In summary, the developers have known for at least 6 years that the
   implementation was non-standard and conflicting with other rseq
   usage. The github issue which requested glibc compatibility was
   opened in 2022 and has been unresolved since then."

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread
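[For concreteness, the invalidation check implied by the quoted comment can
be sketched as below, following the byte layout Thomas spells out later in
the thread: the rseq area is embedded at offset 32 of a cache line aligned
TLS structure, so on little endian the 8 bytes at offset 28 are four
user-owned bytes plus cpu_id_start. Illustrative only, not TCMalloc's
actual code:]

  #include <stdint.h>
  #include <string.h>

  /*
   * User space stores the cached slabs pointer across container+28
   * (low half) and cpu_id_start at container+32 (high half), with
   * bit 63 set. The kernel writing a CPU number into cpu_id_start
   * clears bit 63 and thereby invalidates the cached value.
   */
  static void *cached_slabs_or_null(const char *container)
  {
          uint64_t v;

          memcpy(&v, container + 28, sizeof(v));
          if (!(v & (1ULL << 63)))
                  return NULL;    /* kernel overwrote cpu_id_start */
          return (void *)(v & ~(1ULL << 63));
  }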
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 19:31 ` Thomas Gleixner @ 2026-04-24 7:56 ` Dmitry Vyukov 2026-04-24 8:32 ` Mathias Stearn 0 siblings, 1 reply; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-24 7:56 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 21:31, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote: > > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> The kernel clears rseq_cs reliably when user space was interrupted and: > >> > >> the task was preempted > >> or > >> the return from interrupt delivers a signal > >> > >> If the task invoked a syscall then there is absolutely no reason to do > >> either of this because syscalls from within a critical section are a > >> bug and catched when enabling rseq debugging. > >> > >> The original code did this along with unconditionally updating CPU/MMCID > >> which resulted in ~15% performance regression on a syscall heavy > >> database benchmark once glibc started to register rseq. > > > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > > or cpu_id_start to be written to on syscalls because it doesn't do > > syscalls from critical sections. It will actually benefit (slightly) > > from not updating cpu_id_start on syscalls. > > I know that it does not do syscalls from within critical sections, but > it relies on cpu_id_start being unconditionally updated in one way or > the other. > > > It is specifically in the cases where an rseq would need to be aborted > > (preemption, signals, migration, and membarrier IPI with the rseq > > flag) that TCMalloc relies on cpu_id_start being written. It does rely > > on that write even when not inside the critical section, because it > > effectively uses that to detect if there were any would-cause-abort > > events in between two critical sections. But since it leaves the > > rseq_cs pointer non-null between critical sections, so you dont need > > to add _any_ overhead for programs that never make use of rseq after > > registration, or add any overhead to syscalls even for those who do. > > Well. According to the comment in the tcmalloc code: > > // Calculation of the address of the current CPU slabs region is needed for > // allocation/deallocation fast paths, but is quite expensive. Due to variable > // shift and experimental support for "virtual CPUs", the calculation involves > // several additional loads and dependent calculations. Pseudo-code for the > // address calculation is as follows: > // > // cpu_offset = TcmallocSlab.virtual_cpu_id_offset_; > // cpu = *(&__rseq_abi + virtual_cpu_id_offset_); > // slabs_and_shift = TcmallocSlab.slabs_and_shift_; > // shift = slabs_and_shift & kShiftMask; > // shifted_cpu = cpu << shift; > // slabs = slabs_and_shift & kSlabsMask; > // slabs += shifted_cpu; > // > // To remove this calculation from fast paths, we cache the slabs address > // for the current CPU in thread local storage. However, when a thread is > // rescheduled to another CPU, we somehow need to understand that the cached > > ^^^^^^^^^^^ > > // address is not valid anymore. 
To achieve this, we overlap the top 4 bytes > // of the cached address with __rseq_abi.cpu_id_start. When a thread is > // rescheduled the kernel overwrites cpu_id_start with the current CPU number, > // which gives us the signal that the cached address is not valid anymore. > > The kernel still as of today (the arm64 bug aside) updates the > cpu_id_start and cpu_id fields in rseq when a task is rescheduled to > another CPU. > > So if the code only requires to know when it got rescheduled to another > CPU then it still should work, no? This was my first thought too: https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/ The only problem is with membarrier (it used to force write to __rseq_abi.cpu_id_start for all threads, but now it does not). Otherwise the caching scheme works. I have a tentative fix for tcmalloc: https://github.com/dvyukov/tcmalloc/commit/58d0eca91503f539b26d20b6f55fb2f6f8bc0c37 The crux is as follows. Tcmalloc needs to make all threads stop using old cached slab pointers. The stopping procedure is now: slab->stopped = true; membarrier(); and all rseq critical sections now check the stopped flag in the cached slab pointer. If it's set, the thread does not proceed to use the slab. > But it does not, which makes it clear that it relies on this > undocumented behaviour of the kernel to rewrite rseq::cpu_id_start > unconditionally. I'm not yet convinced that it relies on it only when > interrupted between two subsequent critical sections. We'll see. > > .... > > Now we come to the best part of this comment: > > // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. > > So any code sequence which ends up in: > > x = tcmalloc(); > dostuff(x) > evaluate(rseq::cpu_id_start, rseq::cpu_id) > > is doomed. This might be acceptable for Google internal usage where they > control the full stack and can prevent anyone else to utilize rseq, but > in an open ecosystem that's obviously a non-starter. > > And they definitely forgot to add this to the comment: > > // Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as > // it will expose the blatant ABI abuse and therefore will kill your > // application. > > If your assumption that the rewrite is only required when rseq::rseq_cs > is non NULL and user space was interrupted is correct, then the obvious > no-brainer would have been to add: > > __u64 rseq_usr_data; > > to struct rseq and clear that unconditionally when rseq::rseq_cs is > cleared. > > But that would have been too simple, would work independent of endianess > and not in the way of anybody else. > > But I know that's incompatible with the features first, correctness > later and we own the world anyway mindset. > > Just for giggles I asked Google Gemini about the implications of > tmalloc's rseq abuse. The answer is pretty clear: > > "In short, TCMalloc treats RSEQ as a private optimization rather than > a shared system resource, which compromises the stability and > extensibility of any application that needs RSEQ for anything other > than memory allocation." > > It's also very clear about the wilful ignorance of the tcmalloc people: > > "In summary, the developers have known for at least 6 years that the > implementation was non-standard and conflicting with other rseq > usage. The github issue which requested glibc compatibility was > opened in 2022 and has been unresolved since then." > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
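[For reference, the stop protocol Dmitry describes can be sketched as
follows. The slabs layout and the stopped flag follow his description and
are simplified; the sketch assumes the process registered for the expedited
RSEQ membarrier command at startup. The membarrier commands themselves are
real membarrier(2) API.]

  #include <linux/membarrier.h>
  #include <stdatomic.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct slabs_hdr {
          atomic_bool stopped;    /* checked inside every rseq section */
          /* ... per-CPU slab data ... */
  };

  static void stop_slab_users(struct slabs_hdr *slabs)
  {
          atomic_store(&slabs->stopped, true);
          /*
           * IPI all threads of the process: any rseq critical section
           * running right now is aborted, so every thread re-checks
           * the stopped flag before it can touch the slabs again.
           */
          syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
                  0, 0);
  }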
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24  7:56 ` Dmitry Vyukov
@ 2026-04-24  8:32   ` Mathias Stearn
  2026-04-24  9:30     ` Dmitry Vyukov
  2026-04-24 14:16     ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Mathias Stearn @ 2026-04-24 8:32 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > So if the code only requires to know when it got rescheduled to another
> > CPU then it still should work, no?
>
> This was my first thought too:
> https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/
> The only problem is with membarrier (it used to force write to
> __rseq_abi.cpu_id_start for all threads, but now it does not).
> Otherwise the caching scheme works.

I almost wrote a message last night saying that we didn't need
cpu_id_start invalidation on preemption. However, I remembered that
the Grow() function[1] does a load outside of a critical section then
stores a derived value inside the critical section, guarded only by
the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
should be doing a compare against the original value inside the
critical section (or just do the whole thing inside), but it doesn't.
I haven't reasoned end-to-end through this fully to prove corruption
is possible, but I suspect that it is if another thread on the same
CPU preempts between the loads and the store and updates the header
before the original thread resumes and writes its original intended
header value. Ditto for signals, which sometimes allocate even though
they shouldn't.

I was really hoping that the "redundant" cpu_id_start writes would
only be needed on membarrier_rseq IPIs, where it really is a
pay-for-what-you-use functionality, but I think existing binaries
depend on invalidation on preemption. Luckily that should be cheap
enough to be ~free.

[1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980
[2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605

^ permalink raw reply	[flat|nested] 33+ messages in thread
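[To spell out the window Mathias points at, a hedged pseudo-C rendering of
the Grow() pattern; names are illustrative, see the links above for the
real code:]

  hdr = load_header(slabs, cpu);          /* load OUTSIDE the section */
  new_hdr = grow(hdr);

  /* rseq critical section: */
  if (__rseq_abi.cpu_id_start != expected)
          goto abort;                     /* only guard: invalidation */
  store_header(slabs, cpu, new_hdr);      /* commits a derived value  */

[If another thread runs on the same CPU between the load and the store and
updates the header, the commit writes a value derived from stale data,
unless that preemption rewrote cpu_id_start. That is why invalidation on
preemption, not just on migration, is load-bearing for existing binaries.]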
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 8:32 ` Mathias Stearn @ 2026-04-24 9:30 ` Dmitry Vyukov 2026-04-24 14:16 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-24 9:30 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Fri, 24 Apr 2026 at 10:32, Mathias Stearn <mathias@mongodb.com> wrote: > > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > > So if the code only requires to know when it got rescheduled to another > > > CPU then it still should work, no? > > > > This was my first thought too: > > https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/ > > The only problem is with membarrier (it used to force write to > > __rseq_abi.cpu_id_start for all threads, but now it does not). > > Otherwise the caching scheme works. > > I almost wrote a message last night saying that we didn't need > cpu_id_start invalidation on preemption. However, I remembered that > the Grow() function[1] does a load outside of a critical section then > stores a derived value inside the critical section, guarded only by > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really > should be doing a compare against the original value inside the > critical section (or just do the whole thing inside), but it doesn't. > I haven't reasoned end-to-end through this fully to prove corruption > is possible, but I suspect that it is if another thread same-cpu > preempts between the loads and the store and updates the header before > the original thread resumes and writes its original intended header > value. Ditto for signals, which sometimes allocate even though they > shouldn't. > > I was really hoping that we would only need to do the "redundant" > cpu_id_start writes would only be needed on membarrier_rseq IPIs where > it really is a pay-for-what-you-use functionality, I think existing > binaries depend on invalidation on preemption. Luckily that should be > cheap enough to be ~free. I've prototyped this idea too: https://github.com/dvyukov/linux/commit/1284e3723047cb5afd247f75c53de43efc18db82 > [1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980 > [2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24  8:32 ` Mathias Stearn
  2026-04-24  9:30 ` Dmitry Vyukov
@ 2026-04-24 14:16   ` Thomas Gleixner
  2026-04-24 15:03     ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2026-04-24 14:16 UTC (permalink / raw)
  To: Mathias Stearn, Dmitry Vyukov
  Cc: Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, regressions, linux-kernel, linux-arm-kernel,
	Peter Zijlstra, Ingo Molnar, Blake Oler

On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote:
> On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>> The only problem is with membarrier (it used to force write to
>> __rseq_abi.cpu_id_start for all threads, but now it does not).
>> Otherwise the caching scheme works.
>
> I almost wrote a message last night saying that we didn't need
> cpu_id_start invalidation on preemption. However, I remembered that
> the Grow() function[1] does a load outside of a critical section then
> stores a derived value inside the critical section, guarded only by
> the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
> should be doing a compare against the original value inside the
> critical section (or just do the whole thing inside), but it doesn't.
> I haven't reasoned end-to-end through this fully to prove corruption
> is possible, but I suspect that it is if another thread same-cpu
> preempts between the loads and the store and updates the header before
> the original thread resumes and writes its original intended header
> value. Ditto for signals, which sometimes allocate even though they
> shouldn't.
>
> I was really hoping that we would only need to do the "redundant"
> cpu_id_start writes would only be needed on membarrier_rseq IPIs where
> it really is a pay-for-what-you-use functionality,

That's fine and can be solved without adding this sequence overhead into
the scheduler hotpath.

> I think existing binaries depend on invalidation on
> preemption. Luckily that should be cheap enough to be ~free.

That's only free when it can be buried in the rseq_cs update, which
means the ID update would not happen when rseq_cs is NULL.

If those two changes fix it w/o requiring additional tcmalloc changes,
I'm happy to hack that up tomorrow.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 14:16 ` Thomas Gleixner @ 2026-04-24 15:03 ` Peter Zijlstra 2026-04-24 19:44 ` Thomas Gleixner 0 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2026-04-24 15:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote: > On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote: > > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: > >> The only problem is with membarrier (it used to force write to > >> __rseq_abi.cpu_id_start for all threads, but now it does not). > >> Otherwise the caching scheme works. > > > > I almost wrote a message last night saying that we didn't need > > cpu_id_start invalidation on preemption. However, I remembered that > > the Grow() function[1] does a load outside of a critical section then > > stores a derived value inside the critical section, guarded only by > > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really > > should be doing a compare against the original value inside the > > critical section (or just do the whole thing inside), but it doesn't. > > I haven't reasoned end-to-end through this fully to prove corruption > > is possible, but I suspect that it is if another thread same-cpu > > preempts between the loads and the store and updates the header before > > the original thread resumes and writes its original intended header > > value. Ditto for signals, which sometimes allocate even though they > > shouldn't. > > > > I was really hoping that we would only need to do the "redundant" > > cpu_id_start writes would only be needed on membarrier_rseq IPIs where > > it really is a pay-for-what-you-use functionality, > > That's fine and can be solved without adding this sequence overhead into > the scheduler hotpath. Something like so? 
(probably needs help for !GENERIC bits) --- diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h index 528e6fc7efe9..1d786003e42a 100644 --- a/include/asm-generic/thread_info_tif.h +++ b/include/asm-generic/thread_info_tif.h @@ -48,7 +48,10 @@ #define TIF_RSEQ 11 // Run RSEQ fast path #define _TIF_RSEQ BIT(TIF_RSEQ) -#define TIF_HRTIMER_REARM 12 // re-arm the timer +#define TIF_RSEQ_FORCE_RESTART 12 // Reset RSEQ-CS from membarrier +#define _TIF_RSEQ_FORCE_RESTART BIT(TIF_RSEQ_FORCE_RESTART) + +#define TIF_HRTIMER_REARM 13 // re-arm the timer #define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM) #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */ diff --git a/include/linux/rseq.h b/include/linux/rseq.h index b9d62fc2140d..2cbee6d41198 100644 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -158,6 +158,8 @@ static inline unsigned int rseq_alloc_align(void) return 1U << get_count_order(offsetof(struct rseq, end)); } +extern void rseq_prepare_membarrier(struct mm_struct *mm); + #else /* CONFIG_RSEQ */ static inline void rseq_handle_slowpath(struct pt_regs *regs) { } static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { } @@ -167,6 +169,7 @@ static inline void rseq_force_update(void) { } static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { } static inline void rseq_execve(struct task_struct *t) { } +static inline void rseq_prepare_membarrier(struct mm_struct *mm) { } #endif /* !CONFIG_RSEQ */ #ifdef CONFIG_DEBUG_RSEQ diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h index f11ebd34f8b9..3dfaca776971 100644 --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -686,7 +686,12 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg #ifdef CONFIG_HAVE_GENERIC_TIF_BITS static __always_inline bool test_tif_rseq(unsigned long ti_work) { - return ti_work & _TIF_RSEQ; + return ti_work & (_TIF_RSEQ | _TIF_RSEQ_FORCE_RESTART); +} + +static __always_inline void clear_tif_rseq_force_restart(void) +{ + clear_thread_flag(TIF_RSEQ_FORCE_RESTART); } static __always_inline void clear_tif_rseq(void) @@ -696,6 +701,7 @@ static __always_inline void clear_tif_rseq(void) } #else static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; } +static __always_inline void clear_tif_rseq_force_restart(void) { } static __always_inline void clear_tif_rseq(void) { } #endif @@ -703,6 +709,11 @@ static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work) { if (unlikely(test_tif_rseq(ti_work))) { + if (unlikely(ti_work & _TIF_RSEQ_FORCE_RESTART)) { + current->rseq.event.sched_switch = true; + current->rseq.event.ids_changed = true; + clear_tif_rseq_force_restart(); + } if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { current->rseq.event.slowpath = true; set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); diff --git a/kernel/rseq.c b/kernel/rseq.c index 38d3ef540760..9adc7f63adf5 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -255,6 +255,19 @@ static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) return false; } +void rseq_prepare_membarrier(struct mm_struct *mm) +{ + struct task_struct *t; + + guard(mutex)(&mm->mm_cid.mutex); + + hlist_for_each_entry(t, &mm->mm_cid.user_list, mm_cid.node) { + if (t == current) + continue; + set_tsk_thread_flag(t, TIF_RSEQ_FORCE_RESTART); + } +} + static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 623445603725..696988bb991b 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -334,6 +334,7 @@ static int membarrier_private_expedited(int flags, int cpu_id) MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY)) return -EPERM; ipi_func = ipi_rseq; + rseq_prepare_membarrier(mm); } else { WARN_ON_ONCE(flags); if (!(atomic_read(&mm->membarrier_state) & ^ permalink raw reply related [flat|nested] 33+ messages in thread
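[For context, the userspace trigger for ipi_rseq() looks roughly like this;
the membarrier(2) commands are real, and registration happens once per
process before the expedited RSEQ command may be used:]

  #include <linux/membarrier.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Once per process, at startup: */
  syscall(__NR_membarrier,
          MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);

  /* Later: IPI every thread of the process. With the patch above this
   * also sets TIF_RSEQ_FORCE_RESTART on the other threads, forcing an
   * ID update and critical section check on their next return to user
   * space, without touching the scheduler hotpath. */
  syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);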
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 15:03 ` Peter Zijlstra @ 2026-04-24 19:44 ` Thomas Gleixner 2026-04-26 22:04 ` Thomas Gleixner 0 siblings, 1 reply; 33+ messages in thread From: Thomas Gleixner @ 2026-04-24 19:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote: > On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote: >> > I was really hoping that we would only need to do the "redundant" >> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where >> > it really is a pay-for-what-you-use functionality, >> >> That's fine and can be solved without adding this sequence overhead into >> the scheduler hotpath. > > Something like so? (probably needs help for !GENERIC bits) Yes and yes :) Let me stare at that !generic tif bits case. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24 19:44 ` Thomas Gleixner
@ 2026-04-26 22:04   ` Thomas Gleixner
  2026-04-27  7:40     ` Florian Weimer
                       ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Thomas Gleixner @ 2026-04-26 22:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man,
	Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler,
	Dmitry Vyukov, Florian Weimer, Rich Felker, Matthew Wilcox,
	Greg Kroah-Hartman, Linus Torvalds

On Fri, Apr 24 2026 at 21:44, Thomas Gleixner wrote:
> On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote:
>> On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote:
>>> > I was really hoping that we would only need to do the "redundant"
>>> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where
>>> > it really is a pay-for-what-you-use functionality,
>>>
>>> That's fine and can be solved without adding this sequence overhead into
>>> the scheduler hotpath.
>>
>> Something like so? (probably needs help for !GENERIC bits)
>
> Yes and yes :)
>
> Let me stare at that !generic tif bits case.

I stared at it and finally gave up because all of this is in a
completely FUBAR'ed state and ends up in a horrible pile of hacks and
duct tape with a way larger than zero probability that we chase the
nasty corner cases for quite some time just to add more duct tape and
hacks.

Contrary to that it's rather trivial to cleanly separate the behavioral
cases and guarantees without a massive runtime overhead and without a
pile of hard to maintain TCMalloc specific hacks.

All required code is already available to support the architectures
which do not utilize the generic entry code and therefore can use
neither the optimized mode nor time slice extensions. So instead of
letting the compiler optimize that code out for the generic entry code
users, we can keep it around and utilize one or the other depending on
the requested mode.

I managed to get the required run-time conditionals down to a minimum
so that they are in the noise when analysing it with perf.

The real question is how to differentiate between the legacy and the
optimized mode. I have two working variants to achieve that:

 1) The fully safe option requires a new flag for RSEQ registration. It
    obviously requires a glibc update. (Suggested by PeterZ)

 2) Determine the requirements of the registering task via the size of
    the registered RSEQ area.

    The original implementation, which TCMalloc depends on, registers a
    32 byte region (ORIG_RSEQ_SIZE). This region has a 32 byte
    alignment requirement.

    The extension-safe newer variant exposes the kernel RSEQ feature
    size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment
    requirement via getauxval(AT_RSEQ_ALIGN). The alignment requirement
    is that the registered rseq region is aligned to the next power of
    two of the feature size. The kernel currently has a feature size of
    33 bytes, which means the alignment requirement is 64 bytes.

    The TCMalloc RSEQ region is embedded into a cache line aligned data
    structure starting at offset 32 bytes so that bytes 28-31 and the
    cpu_id_start field at bytes 32-35 form a 64-bit little endian
    pointer with the top-most bit (bit 63) set, to check whether the
    kernel has overwritten cpu_id_start with an actual CPU id value,
    which is guaranteed to not have the top-most bit set.

    As this is part of their performance tuned magic, it's a pretty
    safe assumption that TCMalloc won't use a larger RSEQ size, which
    allows selecting optimized mode for registrations with a size
    greater than 32 bytes.

    That does not require any changes to glibc and works out of the
    box. (Suggested by Mathieu)

In both cases the legacy non-optimized mode exposes the original
behaviour up to the mm_cid field and does not provide support for time
slice extensions. Optimized mode restores the performance gains and
enables support for time slice extensions.

I have no strong preference either way and have working code for both
variants. Though obviously avoiding an update of the libc world has its
charm. If that unexpectedly turns out to be insufficient, then
disabling it would be a trivial one-liner, with the consequence of
having to add the flag and update the libc world.

Combo patch for the auto-detection based on the registered size below
as that allows to immediately test without glibc dependencies. It
applies cleanly on Linus' tree and 7.0. 6.19 would need some fixups,
but I learned today that it's already EOL.

In the final version that's three separate patches plus a set of
selftest changes which validate legacy behaviour and run the full param
test suite in both legacy and optimized mode.

Thoughts, preferences?

Thanks,

        tglx
---
 Documentation/userspace-api/rseq.rst |   77 ++++++++++++++
 include/linux/rseq.h                 |   20 +++
 include/linux/rseq_entry.h           |  110 ++++++++++-----------
 include/linux/rseq_types.h           |    3 
 kernel/rseq.c                        |  183 ++++++++++++++++++++++-------------
 kernel/sched/membarrier.c            |   11 +-
 6 files changed, 280 insertions(+), 124 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,6 +9,11 @@
 
 void __rseq_handle_slowpath(struct pt_regs *regs);
 
+static __always_inline bool rseq_optimized(struct task_struct *t)
+{
+	return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized);
+}
+
 /* Invoked from resume_user_mode_work() */
 static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
@@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru
  */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) {
 		/* '&' is intentional to spare one conditional branch */
 		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
 			__rseq_signal_deliver(ksig->sig, regs);
@@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s
 {
 	struct rseq_event *ev = &t->rseq.event;
 
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	/*
+	 * Only apply the user_irq optimization for RSEQ ABI V2
+	 * registrations. Legacy users like TCMalloc rely on the historical ABI
+	 * V1 behaviour which updates IDs on every context switch.
+	 */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) {
		/*
		 * Avoid a boat load of conditionals by using simple logic
		 * to determine whether NOTIFY_RESUME needs to be raised.
		 *
		 * It's required when the CPU or MM CID has changed or
-		 * the entry was from user space.
+		 * the entry was from user space. ev->has_rseq does not
+		 * have to be evaluated because optimized implies has_rseq.
 		 */
-		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+		bool raise = ev->user_irq | ev->ids_changed;
 
 		if (raise) {
 			ev->sched_switch = true;
@@ -66,6 +77,7 @@ static __always_inline void rseq_sched_s
 		}
 	} else {
 		if (ev->has_rseq) {
+			t->rseq.event.ids_changed = true;
 			t->rseq.event.sched_switch = true;
 			rseq_raise_notify_resume(t);
 		}
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -111,6 +111,20 @@ static __always_inline void rseq_slice_c
 	t->rseq.slice.state.granted = false;
 }
 
+/*
+ * Open coded, so it can be invoked within a user access region.
+ *
+ * This clears the user space state of the time slice extensions field only when
+ * the task has registered the optimized RSEQ ABI V2. Some legacy registrations,
+ * e.g. TCMalloc, have conflicting non-ABI fields in struct rseq, which would be
+ * overwritten by an unconditional write.
+ */
+#define rseq_slice_clear_user(rseq, efault)				\
+do {									\
+	if (rseq_slice_extension_enabled())				\
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);	\
+} while (0)
+
 static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
@@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_e
 static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
 static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
+#define rseq_slice_clear_user(rseq, efault) do { } while (0)
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct ta
 	return false;
 }
 
-/*
- * On debug kernels validate that user space did not mess with it if the
- * debug branch is enabled.
- */
-bool rseq_debug_validate_ids(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq.usrptr;
-	u32 cpu_id, uval, node_id;
-
-	/*
-	 * On the first exit after registering the rseq region CPU ID is
-	 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
-	 */
-	node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
-		  cpu_to_node(t->rseq.ids.cpu_id) : 0;
-
-	scoped_user_read_access(rseq, efault) {
-		unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
-		if (cpu_id != t->rseq.ids.cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->cpu_id, efault);
-		if (uval != cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->node_id, efault);
-		if (uval != node_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->mm_cid, efault);
-		if (uval != t->rseq.ids.mm_cid)
-			goto die;
-	}
-	return true;
-die:
-	t->rseq.event.fatal = true;
-efault:
-	return false;
-}
-
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
@@ -504,12 +481,32 @@ bool rseq_set_ids_get_csaddr(struct task
 {
 	struct rseq __user *rseq = t->rseq.usrptr;
 
-	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_ids(t))
-			return false;
-	}
-
 	scoped_user_rw_access(rseq, efault) {
+		/* Validate the R/O fields for debug and optimized mode */
+		if (static_branch_unlikely(&rseq_debug_enabled) || rseq_optimized(t)) {
+			u32 cpu_id, uval, node_id;
+
+			/*
+			 * On the first exit after registering the rseq region CPU ID is
+			 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+			 */
+			node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+ cpu_to_node(t->rseq.ids.cpu_id) : 0; + + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); + if (cpu_id != t->rseq.ids.cpu_id) + goto die; + unsafe_get_user(uval, &rseq->cpu_id, efault); + if (uval != cpu_id) + goto die; + unsafe_get_user(uval, &rseq->node_id, efault); + if (uval != node_id) + goto die; + unsafe_get_user(uval, &rseq->mm_cid, efault); + if (uval != t->rseq.ids.mm_cid) + goto die; + } + unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault); unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault); unsafe_put_user(node_id, &rseq->node_id, efault); @@ -517,11 +514,9 @@ bool rseq_set_ids_get_csaddr(struct task if (csaddr) unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); @@ -530,6 +525,9 @@ bool rseq_set_ids_get_csaddr(struct task rseq_stat_inc(rseq_stats.ids); rseq_trace_update(t, ids); return true; + +die: + t->rseq.event.fatal = true; efault: return false; } @@ -612,6 +610,14 @@ static __always_inline bool rseq_exit_us * interrupts disabled */ guard(pagefault)(); + /* + * This optimization is only valid when the task registered for the + * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original + * RSEQ implementation behaviour which unconditionally updated the IDs. + * rseq_sched_switch_event() ensures that legacy registrations always + * have both sched_switch and ids_changed set, which is compatible with + * the historical TIF_NOTIFY_RESUME behaviour. + */ if (likely(!t->rseq.event.ids_changed)) { struct rseq __user *rseq = t->rseq.usrptr; /* @@ -623,11 +629,9 @@ static __always_inline bool rseq_exit_us scoped_user_rw_access(rseq, efault) { unsafe_get_user(csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -18,6 +18,7 @@ struct rseq; * @ids_changed: Indicator that IDs need to be updated * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed + * @optimized: RSEQ ABI V2 optimized mode * @error: Compound error code for the slow path to analyze * @fatal: User space data corrupted or invalid * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME @@ -41,7 +42,7 @@ struct rseq_event { }; u8 has_rseq; - u8 __pad; + u8 optimized; union { u16 error; struct { --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -258,11 +258,15 @@ static bool rseq_handle_cs(struct task_s static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* - * Preserve rseq state and user_irq state. The generic entry code - * clears user_irq on the way out, the non-generic entry - * architectures are not having user_irq. - */ - const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, }; + * Preserve has_rseq, optimized and user_irq state. The generic entry + * code clears user_irq on the way out, the non-generic entry + * architectures are not setting user_irq. 
+	 */
+	const struct rseq_event evt_mask = {
+		.has_rseq	= true,
+		.user_irq	= true,
+		.optimized	= true,
+	};
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
@@ -335,8 +339,9 @@ void __rseq_handle_slowpath(struct pt_re
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)
 {
 	rseq_stat_inc(rseq_stats.signal);
+
 	/*
-	 * Don't update IDs, they are handled on exit to user if
+	 * Don't update IDs yet, they are handled on exit to user if
 	 * necessary. The important thing is to abort a critical section of
 	 * the interrupted context as after this point the instruction
 	 * pointer in @regs points to the signal handler.
@@ -349,6 +354,13 @@ void __rseq_signal_deliver(int sig, stru
 		current->rseq.event.error = 0;
 		force_sigsegv(sig);
 	}
+
+	/*
+	 * In legacy mode, force the update of IDs before returning to user
+	 * space to stay compatible.
+	 */
+	if (!rseq_optimized(current))
+		rseq_force_update();
 }
 
 /*
@@ -404,66 +416,19 @@ static bool rseq_reset_ids(void)
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
-/*
- * sys_rseq - setup restartable sequences for caller thread.
- */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
+static long rseq_register(struct rseq __user *rseq, u32 rseq_len, int flags, u32 sig)
 {
+	bool optimized = IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE;
 	u32 rseqfl = 0;
 
-	if (flags & RSEQ_FLAG_UNREGISTER) {
-		if (flags & ~RSEQ_FLAG_UNREGISTER)
-			return -EINVAL;
-		/* Unregister rseq for current thread. */
-		if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
-			return -EINVAL;
-		if (rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		if (!rseq_reset_ids())
-			return -EFAULT;
-		rseq_reset(current);
-		return 0;
-	}
-
-	if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
-		return -EINVAL;
-
-	if (current->rseq.usrptr) {
-		/*
-		 * If rseq is already registered, check whether
-		 * the provided address differs from the prior
-		 * one.
-		 */
-		if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		/* Already registered. */
-		return -EBUSY;
-	}
-
-	/*
-	 * If there was no rseq previously registered, ensure the provided rseq
-	 * is properly aligned, as communcated to user-space through the ELF
-	 * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq
-	 * size, the required alignment is the original struct rseq alignment.
-	 *
-	 * The rseq_len is required to be greater or equal to the original rseq
-	 * size. In order to be valid, rseq_len is either the original rseq size,
-	 * or large enough to contain all supported fields, as communicated to
-	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
-	 */
-	if (rseq_len < ORIG_RSEQ_SIZE ||
-	    (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) ||
-	    (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) ||
-					    rseq_len < offsetof(struct rseq, end))))
-		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
-	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+	/*
+	 * The optimized check disables time slice extensions for legacy
+	 * registrations.
+	 */
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && optimized) {
 		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
 		if (rseq_slice_extension_enabled() &&
 		    (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))
@@ -485,7 +450,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
-		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+
+		/*
+		 * All fields past mm_cid are only valid for non-legacy registrations
+		 * which register with rseq_len > ORIG_RSEQ_SIZE.
+		 */
+		if (optimized) {
+			if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+		}
 	}
 
 	/*
@@ -501,11 +474,11 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #endif
 
 	/*
-	 * If rseq was previously inactive, and has just been
-	 * registered, ensure the cpu_id_start and cpu_id fields
-	 * are updated before returning to user-space.
+	 * Ensure the cpu_id_start and cpu_id fields are updated before
+	 * returning to user-space.
 	 */
 	current->rseq.event.has_rseq = true;
+	current->rseq.event.optimized = optimized;
 	rseq_force_update();
 	return 0;
 
@@ -513,6 +486,83 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	return -EFAULT;
 }
 
+static long rseq_unregister(struct rseq __user *rseq, u32 rseq_len, int flags, u32 sig)
+{
+	if (flags & ~RSEQ_FLAG_UNREGISTER)
+		return -EINVAL;
+	if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
+		return -EINVAL;
+	if (rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	if (!rseq_reset_ids())
+		return -EFAULT;
+	rseq_reset(current);
+	return 0;
+}
+
+static long rseq_reregister(struct rseq __user *rseq, u32 rseq_len, u32 sig)
+{
+	/*
+	 * If rseq is already registered, check whether the provided address
+	 * differs from the prior one.
+	 */
+	if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	/* Already registered. */
+	return -EBUSY;
+}
+
+static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len)
+{
+	if (rseq_len < ORIG_RSEQ_SIZE)
+		return false;
+
+	/*
+	 * Ensure the provided rseq is properly aligned, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If
+	 * rseq_len is the original rseq size, the required alignment is the
+	 * original struct rseq alignment.
+	 *
+	 * The rseq_len is required to be greater than or equal to the
+	 * original rseq size.
+	 *
+	 * In order to be valid, rseq_len is either the original rseq size, or
+	 * large enough to contain all supported fields, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
+	 */
+	if (rseq_len == ORIG_RSEQ_SIZE)
+		return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE);
+
+	return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) &&
+	       rseq_len >= offsetof(struct rseq, end);
+}
+
+#define RSEQ_FLAGS_SUPPORTED	(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+
+/*
+ * sys_rseq - Register or unregister restartable sequences for the caller thread.
+ */ +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) +{ + if (flags & RSEQ_FLAG_UNREGISTER) + return rseq_unregister(rseq, rseq_len, flags, sig); + + if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED)) + return -EINVAL; + + if (current->rseq.usrptr) + return rseq_reregister(rseq, rseq_len, sig); + + if (!rseq_length_valid(rseq, rseq_len)) + return -EINVAL; + + return rseq_register(rseq, rseq_len, flags, sig); +} + #ifdef CONFIG_RSEQ_SLICE_EXTENSION struct slice_timer { struct hrtimer timer; @@ -713,6 +766,8 @@ int rseq_slice_extension_prctl(unsigned return -ENOTSUPP; if (!current->rseq.usrptr) return -ENXIO; + if (!current->rseq.event.optimized) + return -ENOTSUPP; /* No change? */ if (enable == !!current->rseq.slice.state.enabled) --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -199,7 +199,16 @@ static void ipi_rseq(void *info) * is negligible. */ smp_mb(); - rseq_sched_switch_event(current); + /* + * Legacy mode requires that IDs are written and the critical section is + * evaluated. Optimized mode handles the critical section and IDs are + * only updated if they change as a consequence of preemption after + * return from this IPI. + */ + if (rseq_optimized(current)) + rseq_sched_switch_event(current); + else + rseq_force_update(); } static void ipi_sync_rq_state(void *info) --- a/Documentation/userspace-api/rseq.rst +++ b/Documentation/userspace-api/rseq.rst @@ -24,6 +24,80 @@ Quick access to CPU number, node ID Allows to implement per CPU data efficiently. Documentation is in code and selftests. :( +Optimized RSEQ V2 +----------------- + +On architectures which utilize the generic entry code and generic TIF bits +the kernel supports runtime optimizations for RSEQ, which also enable +enhanced features like scheduler time slice extensions. + +To enable them a task has to register the RSEQ region with at least the +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). + +If existing binaries register with ORIG_RSEQ_SIZE (32 bytes), the kernel +keeps the legacy low performance mode enabled to fulfil the expectations of +existing users regarding the original RSEQ implementation behaviour. + +The following table documents the ABI and behavioral guarantees of the +legacy and the optimized V2 mode. + +.. list-table:: RSEQ modes + :header-rows: 1 + + * - Nr + - What + - Legacy + - Optimized V2 + * - 1 + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read + only) + - Updated by the kernel unconditionally after each context switch and + before signal delivery + - Updated by the kernel if and only if they change, i.e. if the task + is migrated or mm_cid changes + * - 2 + - The rseq_cs critical section field + - Evaluated and handled unconditionally after each context switch and + before signal delivery + - Evaluated and handled conditionally only when user space was + interrupted. Either after being preempted or before signal delivery + in the interrupted context. + * - 3 + - Read only fields + - No strict enforcement except in debug mode + - Strict enforcement + * - 4 + - membarrier(...RSEQ) + - All running threads of the process are interrupted and the ID fields + are rewritten and eventually active critical sections are aborted + before they return to user space. All threads which are scheduled + out whether voluntary or not are covered by #1/#2 above. + - All running threads of the process are interrupted and eventually + active critical sections are aborted before these threads return to + user space.
The ID fields are only updated if changed as a + consequence of the interrupt. All threads which are scheduled out + whether voluntary or not are covered by #1/#2 above. + * - 5 + - Time slice extensions + - Not supported + - Supported + +The legacy mode is obviously less performant as it does unconditional +updates and critical section checks even if not strictly required by the +ABI contract. That can't be changed anymore as some users depend on that +observed behavior, which in turn enables them to violate the ABI and +overwrite the cpu_id_start field for their own purposes. This is obviously +discouraged as it renders RSEQ incompatible with the intended usage and +breaks the expectation of other libraries in the same application. + +The ABI compliant optimized mode, which respects the read only fields, does +not require unconditional updates and therefore is way more performant. The +kernel validates the read only fields for compliance. If user space +modifies them, the process is killed. Compliant usage allows multiple +libraries in the same application to benefit from the RSEQ functionality +without disturbing each other. + + Scheduler time slice extensions ------------------------------- @@ -37,7 +111,8 @@ scheduled out inside of the critical sec * Enabled at boot time (default is enabled) - * A rseq userspace pointer has been registered for the thread + * A rseq userspace pointer has been registered for the thread in + optimized V2 mode The thread has to enable the functionality via prctl(2):: ^ permalink raw reply [flat|nested] 33+ messages in thread
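To make the size-based mode selection concrete, the following is a minimal userspace sketch of a registration that lands in optimized V2 mode under the scheme above. It assumes only what the patch states -- the AT_RSEQ_FEATURE_SIZE/AT_RSEQ_ALIGN auxiliary vector entries and the raw rseq() syscall; the signature value is an arbitrary example and error handling is elided.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/auxv.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>

#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE  27   /* values from <linux/auxvec.h> */
#define AT_RSEQ_ALIGN         28
#endif

#define RSEQ_SIG 0x53053053        /* arbitrary example signature */

static __thread struct rseq *rseq_area;

/* Returns 0 on success, -1 when only a legacy registration is possible. */
static int rseq_register_v2(void)
{
    unsigned long feat  = getauxval(AT_RSEQ_FEATURE_SIZE);
    unsigned long align = getauxval(AT_RSEQ_ALIGN);

    /* Kernels without extensible rseq advertise nothing here. */
    if (feat <= 32 || !align)
        return -1;

    /*
     * rseq_len > 32 is what selects optimized V2 mode. The area must be
     * aligned as advertised: 64 bytes for the current 33 byte feature
     * size.
     */
    if (posix_memalign((void **)&rseq_area, align, feat))
        return -1;
    memset(rseq_area, 0, feat);

    return syscall(SYS_rseq, rseq_area, (uint32_t)feat, 0, RSEQ_SIG);
}

Note that on a libc which already registered rseq for the thread this fails with EBUSY, so a real implementation has to coordinate with (or disable) the libc registration.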
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-26 22:04 ` Thomas Gleixner @ 2026-04-27 7:40 ` Florian Weimer 2026-04-27 11:03 ` Thomas Gleixner 2026-04-27 18:35 ` Mathieu Desnoyers 2026-04-28 6:11 ` Dmitry Vyukov ` (2 subsequent siblings) 3 siblings, 2 replies; 33+ messages in thread From: Florian Weimer @ 2026-04-27 7:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds, criu * Thomas Gleixner: > The real question is how to differentiate between the legacy and the > optimized mode. I have two working variants to achieve that: > > 1) The fully safe option requires a new flag for RSEQ > registration. It obviously requires a glibc update. (Suggested by > PeterZ) Without glibc changes, RSEQ would keep working, but with the old, problematic performance, right? If we don't have a notification in the auxiliary vector, we'd have to do two system calls at process start, which isn't ideal, but is probably not a significant issue, either. I haven't verified this, but it looks like introducing the flag breaks CRIU? In dump_thread_rseq, we have this: if (rseqc.flags != 0) { pr_err("something wrong with ptrace(PTRACE_GET_RSEQ_CONFIGURATION, %d) flags = 0x%x\n", tid, rseqc.flags); return -1; } I suppose a workaround could make this behavior flag a prctl flag. CRIU wouldn't dump and restore that until taught about it. If the new behavior is switched on explicitly by the flag, it would be backwards-compatible, except that restoring with unpatched CRIU would lead to a performance loss. > 2) Determine the requirements of the registering task via the size of > the registered RSEQ area. > > The original implementation, which TCMalloc depends on, registers > a 32 byte region (ORIG_RSEG_SIZE). This region has 32 byte > alignment requirement. > > The extension safe newer variant exposes the kernel RSEQ feature > size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment > requirement via getauxval(AT_RSEQ_ALIGN). The alignment > requirement is that the registered rseq region is aligned to the > next power of two of the feature size. The kernel currently has a > feature size of 33 bytes, which means the alignment requirement is > 64 bytes. There are still glibc builds in use that do not use AT_RSEQ_ALIGN, and instead unconditionally reserve a size of 32. In some builds, the RSEQ area is not aligned to a multiple of 64, which makes glibc indistinguishable from tcmalloc. You could look at the location of the thread pointer relative to the RSEQ area at registration to tell them apart, but that is perhaps too nasty. Switching to the new extensible RSEQ allocation code in older glibc builds is not entirely trivial, and I would prefer not doing that. Registering with a new flag is comparatively simple, and we could backport it, except that it might not be compatible with CRIU. Thanks, Florian ^ permalink raw reply [flat|nested] 33+ messages in thread
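Florian's "two system calls at process start" probe could look roughly like the sketch below. The opt-in flag does not exist in any UAPI header, so RSEQ_FLAG_V2 and its value here are purely illustrative placeholders for whatever flag would be introduced:

#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>

#define RSEQ_FLAG_V2 (1U << 1)     /* hypothetical, illustration only */

/*
 * Probe for the (hypothetical) new registration flag. Returns 1 when the
 * new behavior is active, 0 after falling back to a legacy registration
 * on an older kernel, -1 on hard failure.
 */
static int rseq_register_probing(struct rseq *area, uint32_t len, uint32_t sig)
{
    if (!syscall(SYS_rseq, area, len, RSEQ_FLAG_V2, sig))
        return 1;
    if (errno != EINVAL)
        return -1;  /* not a flag validation failure */

    /* Second syscall: an older kernel rejects unknown flags with EINVAL. */
    return syscall(SYS_rseq, area, len, 0, sig) ? -1 : 0;
}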
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-27 7:40 ` Florian Weimer @ 2026-04-27 11:03 ` Thomas Gleixner 2026-04-27 18:35 ` Mathieu Desnoyers 1 sibling, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2026-04-27 11:03 UTC (permalink / raw) To: Florian Weimer Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds, criu On Mon, Apr 27 2026 at 09:40, Florian Weimer wrote: > * Thomas Gleixner: >> The real question is how to differentiate between the legacy and the >> optimized mode. I have two working variants to achieve that: >> >> 1) The fully safe option requires a new flag for RSEQ >> registration. It obviously requires a glibc update. (Suggested by >> PeterZ) > > Without glibc changes, RSEQ would keep working, but with the old, > problematic performance, right? Correct. > If we don't have a notification in the auxiliary vector, we'd have to do > two system calls at process start, which isn't ideal, but is probably > not a significant issue, either. > > I haven't verified this, but it looks like introducing the flag breaks > CRIU? In dump_thread_rseq, we have this: > > if (rseqc.flags != 0) { > pr_err("something wrong with ptrace(PTRACE_GET_RSEQ_CONFIGURATION, %d) flags = 0x%x\n", tid, > rseqc.flags); > return -1; > } Yeah. That'd need to be fixed or worked around. > I suppose a workaround could make this behavior flag a prctl flag. CRIU > wouldn't dump and restore that until taught about it. If the new > behavior is switched on explicitly by the flag, it would be > backwards-compatible, except that restoring with unpatched CRIU would > lead to a performance loss. It's worse. The flag will also enable extended RSEQ features beyond mmcid and requires that the registered rseq size is >= offsetof(struct rseq, end). >> 2) Determine the requirements of the registering task via the size of >> the registered RSEQ area. >> >> The original implementation, which TCMalloc depends on, registers >> a 32 byte region (ORIG_RSEG_SIZE). This region has 32 byte >> alignment requirement. >> >> The extension safe newer variant exposes the kernel RSEQ feature >> size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment >> requirement via getauxval(AT_RSEQ_ALIGN). The alignment >> requirement is that the registered rseq region is aligned to the >> next power of two of the feature size. The kernel currently has a >> feature size of 33 bytes, which means the alignment requirement is >> 64 bytes. > There are still glibc builds in use that do not use AT_RSEQ_ALIGN, and > instead unconditionally reserve a size of 32. In some builds, the RSEQ > area is not aligned to a multiple of 64, which makes glibc > indistinguishable from tcmalloc. That's how it is. So with a size of 32 this will fall back to legacy mode and not unlock the extended features independent of the alignment. The alignment requirements are: Size 32: 32 bytes Size >32: 64 bytes > You could look at the location of the thread pointer relative to the > RSEQ area at registration to tell them apart, but that is perhaps too > nasty. *Blink* > Switching to the new extensible RSEQ allocation code in older glibc > builds is not entirely trivial, and I would prefer not doing that.
> Registering with a new flag is comparatively simple, and we could > backport it, except that it might not be compatible with CRIU. Neither with CRIU nor with the requirement to support additional features which require the registered rseq memory size to be at least as large as the kernel requires. That's why we have AT_RSEQ_FEATURE_SIZE. Otherwise we'd end up with runtime conditionals for every single feature, which just adds more gunk into the hotpaths and ends up in an ever-growing compatibility nightmare. So if a process runs on a newer kernel with let's say 40 bytes rseq size, then it can't be safely migrated with CRIU to an older kernel with 32 bytes rseq size as you don't know whether the process uses some of the extended features in the newer kernel already. But that's no different from extended syscall features etc. So with the size-based detection we end up with the following: Size 32: legacy mode no matter whether that's TCMalloc or glibc. Does not support extended features Size >= kernel size: optimized mode with support for extended features Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
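Restated as code, the detection rules spelled out above come down to the following userspace-side sketch (assuming the currently advertised feature size of 33 bytes and hence 64 byte alignment; the authoritative kernel-side check is rseq_length_valid() in the combo patch):

#include <stdbool.h>
#include <stdint.h>

/* Mirrors the size/alignment rules above; not the kernel implementation. */
static bool rseq_registration_optimized(uintptr_t addr, uint32_t len,
                                        bool *valid)
{
    if (len == 32) {
        /* Legacy: TCMalloc and old glibc, 32 byte alignment. */
        *valid = (addr & 31) == 0;
        return false;
    }
    /* Optimized: at least the kernel feature size, 64 byte aligned. */
    *valid = len >= 33 && (addr & 63) == 0;
    return *valid;
}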
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-27 7:40 ` Florian Weimer 2026-04-27 11:03 ` Thomas Gleixner @ 2026-04-27 18:35 ` Mathieu Desnoyers 2026-04-27 21:06 ` Thomas Gleixner 1 sibling, 1 reply; 33+ messages in thread From: Mathieu Desnoyers @ 2026-04-27 18:35 UTC (permalink / raw) To: Florian Weimer, Thomas Gleixner Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds, criu, Michael Jeanson On 2026-04-27 03:40, Florian Weimer wrote: > * Thomas Gleixner: > >> The real question is how to differentiate between the legacy and the >> optimized mode. I have two working variants to achieve that: [...] > > Switching to the new extensible RSEQ allocation code in older glibc > builds is not entirely trivial, and I would prefer not doing that. > Registering with a new flag is comparatively simple, and we could > backport it, except that it might not be compatible with CRIU. A third option would allow the entire range of older libc versions to benefit from rseq optimizations, gating the "v2" behavior on: rseq_len > 32 || (flags & RSEQ_FLAG_V2) As a result: - restore compatibility with existing tcmalloc binaries. - glibc 2.41+ would benefit from optimization without changes. - glibc 2.35-2.40 would be able to easily backport minimal changes [*] to benefit from kernel optimizations (flags & RSEQ_FLAG_V2). Likewise for RHEL glibc 2.34 with backported rseq support. [*] Minimal changes to allow older libc to use the optimized mode involve implementing a new query for getauxval(AT_RSEQ_V2), which would return nonzero when the kernel supports the v2 flag, and when supported pass a new RSEQ_FLAG_V2 flag to rseq on registration. That v2 behavior would: A) Enforce the ABI contract: - RO fields corruption -> kill process, - System call within rseq critical section -> kill process, B) Allow optimization of the rseq field updates (only update relevant fields on migration), This entirely decouples the feature enablement concern (rseq_len) from the strictness/optimization mode (v2). This keeps compatibility with current tcmalloc binaries because tcmalloc always registers a 32 bytes rseq_len without the v2 flag set. tcmalloc already has its own internal fields at fixed offsets from the rseq structure which conflict with extended rseq fields, so limiting the tcmalloc work-around behavior to rseq_len == 32 seem to align well with the tcmalloc project approach towards extensibility and ecosystem inter-compatibility. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 33+ messages in thread
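Kernel-side, the proposed gate would be a small helper on top of the combo patch, somewhere near rseq_register() in kernel/rseq.c. A sketch, with RSEQ_FLAG_V2 being the proposal from this mail rather than existing UAPI:

/* RSEQ_FLAG_V2 is a placeholder; no such flag is defined anywhere yet. */
#define RSEQ_FLAG_V2 (1U << 1)

static bool rseq_wants_v2(u32 rseq_len, int flags)
{
    /*
     * Either a feature-sized area (glibc 2.41+) or the explicit opt-in
     * flag (backportable to glibc 2.35-2.40) selects the v2 behavior.
     * tcmalloc always registers rseq_len == 32 without the flag and
     * therefore stays in the legacy work-around mode.
     */
    return rseq_len > ORIG_RSEQ_SIZE || (flags & RSEQ_FLAG_V2);
}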
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-27 18:35 ` Mathieu Desnoyers @ 2026-04-27 21:06 ` Thomas Gleixner 0 siblings, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2026-04-27 21:06 UTC (permalink / raw) To: Mathieu Desnoyers, Florian Weimer Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds, criu, Michael Jeanson On Mon, Apr 27 2026 at 14:35, Mathieu Desnoyers wrote: > On 2026-04-27 03:40, Florian Weimer wrote: >> Switching to the new extensible RSEQ allocation code in older glibc >> builds is not entirely trivial, and I would prefer not doing that. >> Registering with a new flag is comparatively simple, and we could >> backport it, except that it might not be compatible with CRIU. > A third option would allow the entire range of older libc versions to > benefit from rseq optimizations, gating the "v2" behavior on: > > rseq_len > 32 || (flags & RSEQ_FLAG_V2) No. Features beyond mm_cid require optimized mode and a larger rseq area. That's not negotiable. See below. > That v2 behavior would: > > A) Enforce the ABI contract: > > - RO fields corruption -> kill process, My patch does that already and the time slice extension muck does so too from day one. > - System call within rseq critical section -> kill process, No. That's overkill for syscall heavy workloads. Also it's not a functional correctness problem which affects multiple RSEQ users in an application. User space can do even worse things. cs_start call foo // foo uses rseq too .... cs_end Invoking a syscall from within the critical section is stupid, but at least harmless vs. other usage in the same thread as the syscall needs to return before anything else can go and use RSEQ in that thread, no? People who develop RSEQ critical sections can enable debug mode via the sysfs knob if they want to prove that their code is correct. That's a debug aid, not more. > B) Allow optimization of the rseq field updates (only update relevant > fields on migration), That's part of the whole combo. Optimized behaviour and new features. > This entirely decouples the feature enablement concern (rseq_len) from > the strictness/optimization mode (v2). Which causes us to sprinkle more conditionals into the hot paths for individual features instead of simply doing unconditional stores and be done with it. It's bad enough that we have one, we don't need more. User space knows the size the kernel expects and if it insists on using the original size, so be it. Keep it simple. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
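The cs_start/call-foo fragment above, spelled out as a schematic (real critical sections are written in asm with an rseq_cs descriptor; this only shows the control flow and why a syscall check would not catch it):

extern void foo(void);  /* another library that uses rseq internally */

/*
 * Schematic only: the thread has a single rseq_cs slot, so the inner
 * rseq user in foo() silently evicts the outer descriptor and the outer
 * section finishes without abort coverage -- broken whether or not
 * foo() ever enters the kernel.
 */
static void outer_critical_section(void)
{
    /* cs_start: outer rseq_cs descriptor installed */
    foo();
    /* cs_end: the final commit runs without its descriptor armed */
}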
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-26 22:04 ` Thomas Gleixner 2026-04-27 7:40 ` Florian Weimer @ 2026-04-28 6:11 ` Dmitry Vyukov 2026-04-28 8:07 ` Thomas Gleixner 2026-04-28 7:39 ` Peter Zijlstra 2026-04-28 8:03 ` Peter Zijlstra 3 siblings, 1 reply; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-28 6:11 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Mon, 27 Apr 2026 at 00:04, Thomas Gleixner <tglx@kernel.org> wrote: > > On Fri, Apr 24 2026 at 21:44, Thomas Gleixner wrote: > > On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote: > >> On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote: > >>> > I was really hoping that we would only need to do the "redundant" > >>> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where > >>> > it really is a pay-for-what-you-use functionality, > >>> > >>> That's fine and can be solved without adding this sequence overhead into > >>> the scheduler hotpath. > >> > >> Something like so? (probably needs help for !GENERIC bits) > > > > Yes and yes :) > > > > Let me stare at that !generic tif bits case. > > I stared at it and finally gave up because all of this is in a > completely FUBAR'ed state and ends up in a horrible pile of hacks and > duct tape with a way larger than zero probability that we chase the > nasty corner cases for quite some time just to add more duct tape and > hacks. > > Contrary to that it's rather trivial to cleanly separate the behavioral > cases and guarantees without a masssive runtime overhead and without a > pile of hard to maintain TCMalloc specific hacks. > > All required code is already available to support the architectures > which do not utilize the generic entry code and therefore can't neither > use the optimized mode nor time slice extensions. So instead of letting > the compiler optimize that code out for the generic entry code users, we > can keep it around and utilize one or the other depending on the > requested mode. I managed to get the required run-time conditionals down > to a minimum so that they are in the noise when analysing it with perf. > > The real question is how to differentiate between the legacy and the > optimized mode. I have two working variants to achieve that: > > 1) The fully safe option requires a new flag for RSEQ > registration. It obviously requires a glibc update. (Suggested by > PeterZ) > > 2) Determine the requirements of the registering task via the size of > the registered RSEQ area. > > The original implementation, which TCMalloc depends on, registers > a 32 byte region (ORIG_RSEG_SIZE). This region has 32 byte > alignment requirement. > > The extension safe newer variant exposes the kernel RSEQ feature > size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment > requirement via getauxval(AT_RSEQ_ALIGN). The alignment > requirement is that the registered rseq region is aligned to the > next power of two of the feature size. The kernel currently has a > feature size of 33 bytes, which means the alignment requirement is > 64 bytes. 
> > The TCMalloc RSEQ region is embedded into a cache line aligned > data structure starting at offset 32 bytes so that bytes 28-31 and > the cpu_id_start field at bytes 32-35 form a 64-bit little endian > pointer with the top-most bit (63 set) to check whether the kernel > has overwritten cpu_id_start with an actual CPU id value, which is > guaranteed to not have the top most bit set. > > As this is part of their performance tuned magic, it's a pretty > safe assumption, that TCMalloc won't use a larger RSEQ size, which > allows to select optimized mode for registrations with a size > greater than 32 bytes. > > That does not require any changes to glibc and works out of the > box. (Suggested by Mathieu) > > In both cases the legacy non-optimized mode exposes the original > behaviour up to the mm_cid field and does not provide support for time > slice extensions. Optimized mode restores the performance gains and > enables support for time slice extensions. > > I have no strong preference either way and have working code for both > variants. Though obviously avoiding to update the libc world has a > charme. If that unexpectedly would turn out to be not sufficient, then > disabling that would be a trivial one-liner and as a consequence require > to add the flag and update the libc world. > > Combo patch for the auto-detection based on the registered size below as > that allows to immediately test without glibc dependencies. It applies > cleanly on Linus tree and 7.0. 6.19 would need some fixups, but I > learned today that it's already EOL. > > In the final version that's three separate patches plus a set of > selftest changes which validate legacy behaviour and run the full param > test suite in both legacy and optimized mode. > > Thoughts, preferences? I like this! This does not create fires, and allows incremental transition. I've tested the patch with membarrier doing TIF_RSEQ_FORCE_RESTART, and it did not work with unmodified tcmalloc for subtle reasons anyway. Tcmalloc can be made to work with that approach with minor changes, but if the goal is to keep old binaries working, that won't work. I've tested this patch with unmodified tcmalloc tests, and it almost works. I think the recent rseq changes introduced 2 more regressions/bugs. They are caught by tcmalloc tests, but I think they equally affect optimized v2 mode. 1. This seems to be broken after rseq unregistration: "cpu_id_start Optimistic cache of the CPU number on which the registered thread is running. Its value is guaranteed to always be a possible CPU number, even when rseq is not registered." Old kernels used to put cpu_id_start = 0: https://elixir.bootlin.com/linux/v6.7/source/kernel/rseq.c#L119 But now the kernel uses this function which puts cpu_id_start = RSEQ_CPU_ID_UNINITIALIZED: https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/rseq_entry.h#L503 2. There are spurious SIGSEGV kills on rseq unregistration (thread exit). Stress tests that unregister rseq and use membarrier are sometimes killed with SIGSEGV inside of the rseq(UNREGISTER) syscall. I suspect there is some race between ipi_rseq and rseq(UNREGISTER). I saw the same when I tested the TIF_RSEQ_FORCE_RESTART patch -- the kernel tried to update user-space rseq when the pointer was NULL. rseq_reset uses memset to clear t->rseq: memset(&t->rseq, 0, sizeof(t->rseq)); If that uses byte writes, then it will clear sched_switch/ids_changed; then at this point ipi_rseq still sees has_rseq=1 and sets sched_switch/ids_changed, which causes an update on return from rseq(UNREGISTER).
Not sure if that's it, or there is something else. This did not happen on older kernels which used rseq_preempt in ipi_rseq: https://elixir.bootlin.com/linux/v6.7/source/kernel/sched/membarrier.c#L197 > Thanks, > > tglx > --- > Documentation/userspace-api/rseq.rst | 77 ++++++++++++++ > include/linux/rseq.h | 20 +++ > include/linux/rseq_entry.h | 110 ++++++++++----------- > include/linux/rseq_types.h | 3 > kernel/rseq.c | 183 ++++++++++++++++++++++------------- > kernel/sched/membarrier.c | 11 +- > 6 files changed, 280 insertions(+), 124 deletions(-) > --- > --- a/include/linux/rseq.h > +++ b/include/linux/rseq.h > @@ -9,6 +9,11 @@ > > void __rseq_handle_slowpath(struct pt_regs *regs); > > +static __always_inline bool rseq_optimized(struct task_struct *t) > +{ > + return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized); > +} > + > /* Invoked from resume_user_mode_work() */ > static inline void rseq_handle_slowpath(struct pt_regs *regs) > { > @@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru > */ > static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) > { > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) { Just rseq_optimized(t), it already includes the other check. > /* '&' is intentional to spare one conditional branch */ > if (current->rseq.event.has_rseq & current->rseq.event.user_irq) > __rseq_signal_deliver(ksig->sig, regs); > @@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s > { > struct rseq_event *ev = &t->rseq.event; > > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + /* > + * Only apply the user_irq optimization for RSEQ ABI V2 > + * registrations. Legacy users like TCMalloc rely on the historical ABI > + * V1 behaviour which updates IDs on every context swtich. > + */ > + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) { Just rseq_optimized(t), it already includes the other check. > /* > * Avoid a boat load of conditionals by using simple logic > * to determine whether NOTIFY_RESUME needs to be raised. > * > * It's required when the CPU or MM CID has changed or > - * the entry was from user space. > + * the entry was from user space. ev->has_rseq does not > + * have to be evaluated because optimized implies has_rseq. > */ > - bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; > + bool raise = ev->user_irq | ev->ids_changed; > > if (raise) { > ev->sched_switch = true; > @@ -66,6 +77,7 @@ static __always_inline void rseq_sched_s > } > } else { > if (ev->has_rseq) { > + t->rseq.event.ids_changed = true; > t->rseq.event.sched_switch = true; > rseq_raise_notify_resume(t); > } > --- a/include/linux/rseq_entry.h > +++ b/include/linux/rseq_entry.h > @@ -111,6 +111,20 @@ static __always_inline void rseq_slice_c > t->rseq.slice.state.granted = false; > } > > +/* > + * Open coded, so it can be invoked within a user access region. > + * > + * This clears the user space state of the time slice extensions field only when > + * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations, > + * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be > + * overwritten by an unconditional write. 
> + */ > +#define rseq_slice_clear_user(rseq, efault) \ > +do { \ > + if (rseq_slice_extension_enabled()) \ > + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); \ > +} while (0) > + > static __always_inline bool __rseq_grant_slice_extension(bool work_pending) > { > struct task_struct *curr = current; > @@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_e > static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; } > static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { } > static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } > +#define rseq_slice_clear_user(rseq, efault) do { } while (0) > #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ > > bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); > -bool rseq_debug_validate_ids(struct task_struct *t); > > static __always_inline void rseq_note_user_irq_entry(void) > { > @@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct ta > return false; > } > > -/* > - * On debug kernels validate that user space did not mess with it if the > - * debug branch is enabled. > - */ > -bool rseq_debug_validate_ids(struct task_struct *t) > -{ > - struct rseq __user *rseq = t->rseq.usrptr; > - u32 cpu_id, uval, node_id; > - > - /* > - * On the first exit after registering the rseq region CPU ID is > - * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! > - */ > - node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? > - cpu_to_node(t->rseq.ids.cpu_id) : 0; > - > - scoped_user_read_access(rseq, efault) { > - unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); > - if (cpu_id != t->rseq.ids.cpu_id) > - goto die; > - unsafe_get_user(uval, &rseq->cpu_id, efault); > - if (uval != cpu_id) > - goto die; > - unsafe_get_user(uval, &rseq->node_id, efault); > - if (uval != node_id) > - goto die; > - unsafe_get_user(uval, &rseq->mm_cid, efault); > - if (uval != t->rseq.ids.mm_cid) > - goto die; > - } > - return true; > -die: > - t->rseq.event.fatal = true; > -efault: > - return false; > -} > - > #endif /* RSEQ_BUILD_SLOW_PATH */ > > /* > @@ -504,12 +481,32 @@ bool rseq_set_ids_get_csaddr(struct task > { > struct rseq __user *rseq = t->rseq.usrptr; > > - if (static_branch_unlikely(&rseq_debug_enabled)) { > - if (!rseq_debug_validate_ids(t)) > - return false; > - } > - > scoped_user_rw_access(rseq, efault) { > + /* Validate the R/O fields for debug and optimized mode */ > + if (static_branch_unlikely(&rseq_debug_enabled) || rseq_optimized(t)) { > + u32 cpu_id, uval, node_id; > + > + /* > + * On the first exit after registering the rseq region CPU ID is > + * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! > + */ > + node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? 
> + cpu_to_node(t->rseq.ids.cpu_id) : 0; > + > + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); > + if (cpu_id != t->rseq.ids.cpu_id) > + goto die; > + unsafe_get_user(uval, &rseq->cpu_id, efault); > + if (uval != cpu_id) > + goto die; > + unsafe_get_user(uval, &rseq->node_id, efault); > + if (uval != node_id) > + goto die; > + unsafe_get_user(uval, &rseq->mm_cid, efault); > + if (uval != t->rseq.ids.mm_cid) > + goto die; > + } > + > unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault); > unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault); > unsafe_put_user(node_id, &rseq->node_id, efault); > @@ -517,11 +514,9 @@ bool rseq_set_ids_get_csaddr(struct task > if (csaddr) > unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); > > - /* Open coded, so it's in the same user access region */ > - if (rseq_slice_extension_enabled()) { > - /* Unconditionally clear it, no point in conditionals */ > - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); > - } > + /* RSEQ ABI V2 only operations */ > + if (rseq_optimized(t)) > + rseq_slice_clear_user(rseq, efault); > } > > rseq_slice_clear_grant(t); > @@ -530,6 +525,9 @@ bool rseq_set_ids_get_csaddr(struct task > rseq_stat_inc(rseq_stats.ids); > rseq_trace_update(t, ids); > return true; > + > +die: > + t->rseq.event.fatal = true; > efault: > return false; > } > @@ -612,6 +610,14 @@ static __always_inline bool rseq_exit_us > * interrupts disabled > */ > guard(pagefault)(); > + /* > + * This optimization is only valid when the task registered for the > + * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original > + * RSEQ implementation behaviour which unconditionally updated the IDs. > + * rseq_sched_switch_event() ensures that legacy registrations always > + * have both sched_switch and ids_changed set, which is compatible with > + * the historical TIF_NOTIFY_RESUME behaviour. > + */ > if (likely(!t->rseq.event.ids_changed)) { > struct rseq __user *rseq = t->rseq.usrptr; > /* > @@ -623,11 +629,9 @@ static __always_inline bool rseq_exit_us > scoped_user_rw_access(rseq, efault) { > unsafe_get_user(csaddr, &rseq->rseq_cs, efault); > > - /* Open coded, so it's in the same user access region */ > - if (rseq_slice_extension_enabled()) { > - /* Unconditionally clear it, no point in conditionals */ > - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); > - } > + /* RSEQ ABI V2 only operations */ > + if (rseq_optimized(t)) > + rseq_slice_clear_user(rseq, efault); > } > > rseq_slice_clear_grant(t); > --- a/include/linux/rseq_types.h > +++ b/include/linux/rseq_types.h > @@ -18,6 +18,7 @@ struct rseq; > * @ids_changed: Indicator that IDs need to be updated > * @user_irq: True on interrupt entry from user mode > * @has_rseq: True if the task has a rseq pointer installed > + * @optimized: RSEQ ABI V2 optimized mode > * @error: Compound error code for the slow path to analyze > * @fatal: User space data corrupted or invalid > * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME > @@ -41,7 +42,7 @@ struct rseq_event { > }; > > u8 has_rseq; > - u8 __pad; > + u8 optimized; > union { > u16 error; > struct { > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -258,11 +258,15 @@ static bool rseq_handle_cs(struct task_s > static void rseq_slowpath_update_usr(struct pt_regs *regs) > { > /* > - * Preserve rseq state and user_irq state. The generic entry code > - * clears user_irq on the way out, the non-generic entry > - * architectures are not having user_irq. 
> - */ > - const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, }; > + * Preserve has_rseq, optimized and user_irq state. The generic entry > + * code clears user_irq on the way out, the non-generic entry > + * architectures are not setting user_irq. > + */ > + const struct rseq_event evt_mask = { > + .has_rseq = true, > + .user_irq = true, > + .optimized = true, > + }; > struct task_struct *t = current; > struct rseq_ids ids; > u32 node_id; > @@ -335,8 +339,9 @@ void __rseq_handle_slowpath(struct pt_re > void __rseq_signal_deliver(int sig, struct pt_regs *regs) > { > rseq_stat_inc(rseq_stats.signal); > + > /* > - * Don't update IDs, they are handled on exit to user if > + * Don't update IDs yet, they are handled on exit to user if > * necessary. The important thing is to abort a critical section of > * the interrupted context as after this point the instruction > * pointer in @regs points to the signal handler. > @@ -349,6 +354,13 @@ void __rseq_signal_deliver(int sig, stru > current->rseq.event.error = 0; > force_sigsegv(sig); > } > + > + /* > + * In legacy mode, force the update of IDs before returning to user > + * space to stay compatible. > + */ > + if (!rseq_optimized(current)) > + rseq_force_update(); > } > > /* > @@ -404,66 +416,19 @@ static bool rseq_reset_ids(void) > /* The original rseq structure size (including padding) is 32 bytes. */ > #define ORIG_RSEQ_SIZE 32 > > -/* > - * sys_rseq - setup restartable sequences for caller thread. > - */ > -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) > +static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) > { > + bool optimized = IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE; > u32 rseqfl = 0; > > - if (flags & RSEQ_FLAG_UNREGISTER) { > - if (flags & ~RSEQ_FLAG_UNREGISTER) > - return -EINVAL; > - /* Unregister rseq for current thread. */ > - if (current->rseq.usrptr != rseq || !current->rseq.usrptr) > - return -EINVAL; > - if (rseq_len != current->rseq.len) > - return -EINVAL; > - if (current->rseq.sig != sig) > - return -EPERM; > - if (!rseq_reset_ids()) > - return -EFAULT; > - rseq_reset(current); > - return 0; > - } > - > - if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))) > - return -EINVAL; > - > - if (current->rseq.usrptr) { > - /* > - * If rseq is already registered, check whether > - * the provided address differs from the prior > - * one. > - */ > - if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) > - return -EINVAL; > - if (current->rseq.sig != sig) > - return -EPERM; > - /* Already registered. */ > - return -EBUSY; > - } > - > - /* > - * If there was no rseq previously registered, ensure the provided rseq > - * is properly aligned, as communcated to user-space through the ELF > - * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq > - * size, the required alignment is the original struct rseq alignment. > - * > - * The rseq_len is required to be greater or equal to the original rseq > - * size. In order to be valid, rseq_len is either the original rseq size, > - * or large enough to contain all supported fields, as communicated to > - * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. 
> - */ > - if (rseq_len < ORIG_RSEQ_SIZE || > - (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) || > - (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) || > - rseq_len < offsetof(struct rseq, end)))) > - return -EINVAL; > if (!access_ok(rseq, rseq_len)) > return -EFAULT; > > - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { > + /* > + * The optimized check disables time slice extensions for legacy > + * registrations. > + */ > + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && optimized) { > rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; > if (rseq_slice_extension_enabled() && > (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) > @@ -485,7 +450,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user > unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); > unsafe_put_user(0U, &rseq->node_id, efault); > unsafe_put_user(0U, &rseq->mm_cid, efault); > - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); > + > + /* > + * All fields past mm_cid are only valid for non-legacy registrations > + * which register with rseq_len > ORIG_RSEQ_SIZE. > + */ > + if (optimized) { > + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) > + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); > + } > } > > /* > @@ -501,11 +474,11 @@ SYSCALL_DEFINE4(rseq, struct rseq __user > #endif > > /* > - * If rseq was previously inactive, and has just been > - * registered, ensure the cpu_id_start and cpu_id fields > - * are updated before returning to user-space. > + * Ensure the cpu_id_start and cpu_id fields are updated before > + * returning to user-space. > */ > current->rseq.event.has_rseq = true; > + current->rseq.event.optimized = optimized; > rseq_force_update(); > return 0; > > @@ -513,6 +486,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user > return -EFAULT; > } > > +static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) > +{ > + if (flags & ~RSEQ_FLAG_UNREGISTER) > + return -EINVAL; > + if (current->rseq.usrptr != rseq || !current->rseq.usrptr) > + return -EINVAL; > + if (rseq_len != current->rseq.len) > + return -EINVAL; > + if (current->rseq.sig != sig) > + return -EPERM; > + if (!rseq_reset_ids()) > + return -EFAULT; > + rseq_reset(current); > + return 0; > +} > + > +static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig) > +{ > + /* > + * If rseq is already registered, check whether the provided address > + * differs from the prior one. > + */ > + if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) > + return -EINVAL; > + if (current->rseq.sig != sig) > + return -EPERM; > + /* Already registered. */ > + return -EBUSY; > +} > + > +static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len) > +{ > + if (rseq_len < ORIG_RSEQ_SIZE) > + return false; > + > + /* > + * Ensure the provided rseq is properly aligned, as communicated to > + * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If > + * rseq_len is the original rseq size, the required alignment is the > + * original struct rseq alignment. > + * > + * The rseq_len is required to be greater or equal than the original > + * rseq size. > + * > + * In order to be valid, rseq_len is either the original rseq size, or > + * large enough to contain all supported fields, as communicated to > + * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. 
> + */ > + if (rseq_len < ORIG_RSEQ_SIZE) > + return false; > + > + if (rseq_len == ORIG_RSEQ_SIZE) > + return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE); > + > + return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) && > + rseq_len >= offsetof(struct rseq, end); > +} > + > +#define RSEQ_FLAGS_SUPPORTED (RSEQ_FLAG_SLICE_EXT_DEFAULT_ON) > + > +/* > + * sys_rseq - Register or unregister restartable sequences for the caller thread. > + */ > +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) > +{ > + if (flags & RSEQ_FLAG_UNREGISTER) > + return rseq_unregister(rseq, rseq_len, flags, sig); > + > + if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED)) > + return -EINVAL; > + > + if (current->rseq.usrptr) > + return rseq_reregister(rseq, rseq_len, sig); > + > + if (!rseq_length_valid(rseq, rseq_len)) > + return -EINVAL; > + > + return rseq_register(rseq, rseq_len, flags, sig); > +} > + > #ifdef CONFIG_RSEQ_SLICE_EXTENSION > struct slice_timer { > struct hrtimer timer; > @@ -713,6 +766,8 @@ int rseq_slice_extension_prctl(unsigned > return -ENOTSUPP; > if (!current->rseq.usrptr) > return -ENXIO; > + if (!current->rseq.event.optimized) > + return -ENOTSUPP; > > /* No change? */ > if (enable == !!current->rseq.slice.state.enabled) > --- a/kernel/sched/membarrier.c > +++ b/kernel/sched/membarrier.c > @@ -199,7 +199,16 @@ static void ipi_rseq(void *info) > * is negligible. > */ > smp_mb(); > - rseq_sched_switch_event(current); > + /* > + * Legacy mode requires that IDs are written and the critical section is > + * evaluated. Optimized mode handles the critical section and IDs are > + * only updated if they change as a consequence of preemption after > + * return from this IPI. > + */ > + if (rseq_optimized(current)) > + rseq_sched_switch_event(current); > + else > + rseq_force_update(); > } > > static void ipi_sync_rq_state(void *info) > --- a/Documentation/userspace-api/rseq.rst > +++ b/Documentation/userspace-api/rseq.rst > @@ -24,6 +24,80 @@ Quick access to CPU number, node ID > Allows to implement per CPU data efficiently. Documentation is in code and > selftests. :( > > +Optimized RSEQ V2 > +----------------- > + > +On architectures which utilize the generic entry code and generic TIF bits > +the kernel supports runtime optimizations for RSEQ, which also enable > +enhanced features like scheduler time slice extensions. > + > +To enable them a task has to register the RSEQ region with at least the > +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). > + > +If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel > +keeps the legacy low performance mode enabled to fulfil the expectations > +existing users regarding the original RSEQ implementation behaviour. > + > +The following table documents the ABI and behavioral guarantees of the > +legacy and the optimized V2 mode. > + > +.. list-table:: RSEQ modes > + :header-rows: 1 > + > + * - Nr > + - What > + - Legacy > + - Optimized V2 > + * - 1 > + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read > + only) > + - Updated by the kernel unconditionally after each context switch and > + before signal delivery > + - Updated by the kernel if and only if they change, i.e. if the task > + is migrated or mm_cid changes > + * - 2 > + - The rseq_cs critical section field > + - Evaluated and handled unconditionally after each context switch and > + before signal delivery > + - Evaluated and handled conditionally only when user space was > + interrupted. 
Either after being preempted or before signal delivery > + in the interrupted context. > + * - 3 > + - Read only fields > + - No strict enforcement except in debug mode > + - Strict enforcement > + * - 4 > + - membarrier(...RSEQ) > + - All running threads of the process are interrupted and the ID fields > + are rewritten and eventually active critical sections are aborted > + before they return to user space. All threads which are scheduled > + out whether voluntary or not are covered by #1/#2 above. > + - All running threads of the process are interrupted and eventually > + active critical sections are aborted before these threads return to > + user space. The ID fields are only updated if changed as a > + consequence of the interrupt. All threads which are scheduled out > + whether voluntary not are covered by #1/#2 above. > + * - 5 > + - Time slice extensions > + - Not supported > + - Supported > + > +The legacy mode is obviously less performant as it does unconditional > +updates and critical section checks even if not strictly required by the > +ABI contract. That can't be changed anymore as some users depend on that > +observed behavior, which in turn enables them to violate the ABI and > +overwrite the cpu_id_start field for their own purposes. This is obviously > +discouraged as it renders RSEQ incompatible with the intended usage and > +breaks the expectation of other libraries in the same application. > + > +The ABI compliant optimized mode, which respects the read only fields, does > +not require unconditional updates and therefore is way more performant. The > +kernel validates the read only fields for compliance. If user space > +modifies them, the process is killed. Compliant usage allows multiple > +libraries in the same application to benefit from the RSEQ functionality > +without disturbing each other. > + > + > Scheduler time slice extensions > ------------------------------- > > @@ -37,7 +111,8 @@ scheduled out inside of the critical sec > > * Enabled at boot time (default is enabled) > > - * A rseq userspace pointer has been registered for the thread > + * A rseq userspace pointer has been registered for the thread in > + optimized V2 mode > > The thread has to enable the functionality via prctl(2):: > ^ permalink raw reply [flat|nested] 33+ messages in thread
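The first regression is easy to probe from userspace. A minimal sketch along the lines of the quoted man page guarantee (run with GLIBC_TUNABLES=glibc.pthread.rseq=0 so glibc's own registration does not return EBUSY; the signature is an arbitrary example):

#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/rseq.h>

#define RSEQ_SIG 0x53053053        /* arbitrary example signature */

int main(void)
{
    static struct rseq area __attribute__((aligned(32)));
    uint32_t ncpus = (uint32_t)sysconf(_SC_NPROCESSORS_CONF);

    assert(!syscall(SYS_rseq, &area, 32, 0, RSEQ_SIG));
    assert(!syscall(SYS_rseq, &area, 32, RSEQ_FLAG_UNREGISTER, RSEQ_SIG));

    /*
     * "guaranteed to always be a possible CPU number, even when rseq is
     * not registered": v6.7 wrote 0 back on unregistration, the
     * regressed kernels leave RSEQ_CPU_ID_UNINITIALIZED (-1 as u32)
     * behind, which is not a possible CPU number.
     */
    assert(area.cpu_id_start < ncpus);
    return 0;
}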
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 6:11 ` Dmitry Vyukov @ 2026-04-28 8:07 ` Thomas Gleixner 2026-04-28 8:18 ` Thomas Gleixner 0 siblings, 1 reply; 33+ messages in thread From: Thomas Gleixner @ 2026-04-28 8:07 UTC (permalink / raw) To: Dmitry Vyukov Cc: Peter Zijlstra, Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28 2026 at 08:11, Dmitry Vyukov wrote: > On Mon, 27 Apr 2026 at 00:04, Thomas Gleixner <tglx@kernel.org> wrote: > 1. This seems to be broken after rseq unregistration: > "cpu_id_start Optimistic cache of the CPU number on which the > registered thread is running. Its value is guaranteed to always be a > possible CPU number, even when rseq is not registered." > > Old kernels used to put cpu_id_start = 0: > https://elixir.bootlin.com/linux/v6.7/source/kernel/rseq.c#L119 > > But now kernel uses this function which puts cpu_id_start = > RSEQ_CPU_ID_UNINITIALIZED: > https://elixir.bootlin.com/linux/v7.0.1/source/include/linux/rseq_entry.h#L503 Ooops. Missed that detail. > 2. There are spurious SIGSEGV kills on rseq unregistration (thread exit). > Stress tests that unregister rseq and use membarrier sometimes killed > with SIGSEGV inside of rseq(UNREGISTER) syscall. I suspect there is > some race between ipi_rseq and rseq(UNREGISTER). I saw the same when > tested the > TIF_RSEQ_FORCE_RESTART patch -- kernel tried to update user-space rseq > when the pointer was NULL. > > rseq_reset uses memset to clear t->rseq: > > memset(&t->rseq, 0, sizeof(t->rseq)); > > If that uses byte writes, then it will clear sched_switch/ids_changed, > then at this point ipi_rseq still sees has_rseq=1 and sets > sched_switch/ids_changed, which cause update on return from > rseq(UNREGISTER). > Not sure if that's it, or there is something else. Can you try the updated version below? Thanks, tglx --- --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -9,6 +9,11 @@ void __rseq_handle_slowpath(struct pt_regs *regs); +static __always_inline bool rseq_optimized(struct task_struct *t) +{ + return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized); +} + /* Invoked from resume_user_mode_work() */ static inline void rseq_handle_slowpath(struct pt_regs *regs) { @@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru */ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) { /* '&' is intentional to spare one conditional branch */ if (current->rseq.event.has_rseq & current->rseq.event.user_irq) __rseq_signal_deliver(ksig->sig, regs); @@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s { struct rseq_event *ev = &t->rseq.event; - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* + * Only apply the user_irq optimization for RSEQ ABI V2 + * registrations. Legacy users like TCMalloc rely on the historical ABI + * V1 behaviour which updates IDs on every context swtich. + */ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) { /* * Avoid a boat load of conditionals by using simple logic * to determine whether NOTIFY_RESUME needs to be raised. 
* * It's required when the CPU or MM CID has changed or - * the entry was from user space. + * the entry was from user space. ev->has_rseq does not + * have to be evaluated because optimized implies has_rseq. */ - bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; + bool raise = ev->user_irq | ev->ids_changed; if (raise) { ev->sched_switch = true; @@ -66,6 +77,7 @@ static __always_inline void rseq_sched_s } } else { if (ev->has_rseq) { + t->rseq.event.ids_changed = true; t->rseq.event.sched_switch = true; rseq_raise_notify_resume(t); } @@ -119,6 +131,7 @@ static inline void rseq_virt_userspace_e static inline void rseq_reset(struct task_struct *t) { + guard(irqsave)(); memset(&t->rseq, 0, sizeof(t->rseq)); t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED; } --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -111,6 +111,20 @@ static __always_inline void rseq_slice_c t->rseq.slice.state.granted = false; } +/* + * Open coded, so it can be invoked within a user access region. + * + * This clears the user space state of the time slice extensions field only when + * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations, + * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be + * overwritten by an unconditional write. + */ +#define rseq_slice_clear_user(rseq, efault) \ +do { \ + if (rseq_slice_extension_enabled()) \ + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); \ +} while (0) + static __always_inline bool __rseq_grant_slice_extension(bool work_pending) { struct task_struct *curr = current; @@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_e static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; } static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { } static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } +#define rseq_slice_clear_user(rseq, efault) do { } while (0) #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); -bool rseq_debug_validate_ids(struct task_struct *t); static __always_inline void rseq_note_user_irq_entry(void) { @@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct ta return false; } -/* - * On debug kernels validate that user space did not mess with it if the - * debug branch is enabled. - */ -bool rseq_debug_validate_ids(struct task_struct *t) -{ - struct rseq __user *rseq = t->rseq.usrptr; - u32 cpu_id, uval, node_id; - - /* - * On the first exit after registering the rseq region CPU ID is - * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! - */ - node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? 
- cpu_to_node(t->rseq.ids.cpu_id) : 0; - - scoped_user_read_access(rseq, efault) { - unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); - if (cpu_id != t->rseq.ids.cpu_id) - goto die; - unsafe_get_user(uval, &rseq->cpu_id, efault); - if (uval != cpu_id) - goto die; - unsafe_get_user(uval, &rseq->node_id, efault); - if (uval != node_id) - goto die; - unsafe_get_user(uval, &rseq->mm_cid, efault); - if (uval != t->rseq.ids.mm_cid) - goto die; - } - return true; -die: - t->rseq.event.fatal = true; -efault: - return false; -} - #endif /* RSEQ_BUILD_SLOW_PATH */ /* @@ -504,12 +481,32 @@ bool rseq_set_ids_get_csaddr(struct task { struct rseq __user *rseq = t->rseq.usrptr; - if (static_branch_unlikely(&rseq_debug_enabled)) { - if (!rseq_debug_validate_ids(t)) - return false; - } - scoped_user_rw_access(rseq, efault) { + /* Validate the R/O fields for debug and optimized mode */ + if (static_branch_unlikely(&rseq_debug_enabled) || rseq_optimized(t)) { + u32 cpu_id, uval, node_id; + + /* + * On the first exit after registering the rseq region CPU ID is + * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! + */ + node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? + cpu_to_node(t->rseq.ids.cpu_id) : 0; + + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); + if (cpu_id != t->rseq.ids.cpu_id) + goto die; + unsafe_get_user(uval, &rseq->cpu_id, efault); + if (uval != cpu_id) + goto die; + unsafe_get_user(uval, &rseq->node_id, efault); + if (uval != node_id) + goto die; + unsafe_get_user(uval, &rseq->mm_cid, efault); + if (uval != t->rseq.ids.mm_cid) + goto die; + } + unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault); unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault); unsafe_put_user(node_id, &rseq->node_id, efault); @@ -517,11 +514,9 @@ bool rseq_set_ids_get_csaddr(struct task if (csaddr) unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); @@ -530,6 +525,9 @@ bool rseq_set_ids_get_csaddr(struct task rseq_stat_inc(rseq_stats.ids); rseq_trace_update(t, ids); return true; + +die: + t->rseq.event.fatal = true; efault: return false; } @@ -612,6 +610,14 @@ static __always_inline bool rseq_exit_us * interrupts disabled */ guard(pagefault)(); + /* + * This optimization is only valid when the task registered for the + * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original + * RSEQ implementation behaviour which unconditionally updated the IDs. + * rseq_sched_switch_event() ensures that legacy registrations always + * have both sched_switch and ids_changed set, which is compatible with + * the historical TIF_NOTIFY_RESUME behaviour. 
+ */ if (likely(!t->rseq.event.ids_changed)) { struct rseq __user *rseq = t->rseq.usrptr; /* @@ -623,11 +629,9 @@ static __always_inline bool rseq_exit_us scoped_user_rw_access(rseq, efault) { unsafe_get_user(csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -18,6 +18,7 @@ struct rseq; * @ids_changed: Indicator that IDs need to be updated * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed + * @optimized: RSEQ ABI V2 optimized mode * @error: Compound error code for the slow path to analyze * @fatal: User space data corrupted or invalid * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME @@ -41,7 +42,7 @@ struct rseq_event { }; u8 has_rseq; - u8 __pad; + u8 optimized; union { u16 error; struct { --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -236,11 +236,6 @@ static int __init rseq_debugfs_init(void } __initcall(rseq_debugfs_init); -static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id) -{ - return rseq_set_ids_get_csaddr(t, ids, node_id, NULL); -} - static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) { struct rseq __user *urseq = t->rseq.usrptr; @@ -258,11 +253,15 @@ static bool rseq_handle_cs(struct task_s static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* - * Preserve rseq state and user_irq state. The generic entry code - * clears user_irq on the way out, the non-generic entry - * architectures are not having user_irq. - */ - const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, }; + * Preserve has_rseq, optimized and user_irq state. The generic entry + * code clears user_irq on the way out, the non-generic entry + * architectures are not setting user_irq. + */ + const struct rseq_event evt_mask = { + .has_rseq = true, + .user_irq = true, + .optimized = true, + }; struct task_struct *t = current; struct rseq_ids ids; u32 node_id; @@ -335,8 +334,9 @@ void __rseq_handle_slowpath(struct pt_re void __rseq_signal_deliver(int sig, struct pt_regs *regs) { rseq_stat_inc(rseq_stats.signal); + /* - * Don't update IDs, they are handled on exit to user if + * Don't update IDs yet, they are handled on exit to user if * necessary. The important thing is to abort a critical section of * the interrupted context as after this point the instruction * pointer in @regs points to the signal handler. @@ -349,6 +349,13 @@ void __rseq_signal_deliver(int sig, stru current->rseq.event.error = 0; force_sigsegv(sig); } + + /* + * In legacy mode, force the update of IDs before returning to user + * space to stay compatible. + */ + if (!rseq_optimized(current)) + rseq_force_update(); } /* @@ -384,19 +391,22 @@ void rseq_syscall(struct pt_regs *regs) static bool rseq_reset_ids(void) { - struct rseq_ids ids = { - .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, - .mm_cid = 0, - }; + struct rseq __user *rseq = current->rseq.usrptr; /* * If this fails, terminate it because this leaves the kernel in * stupid state as exit to user space will try to fixup the ids * again. 
*/ - if (rseq_set_ids(current, &ids, 0)) - return true; + scoped_user_rw_access(rseq, efault) { + unsafe_put_user(0, &rseq->cpu_id_start, efault); + unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); + unsafe_put_user(0, &rseq->node_id, efault); + unsafe_put_user(0, &rseq->mm_cid, efault); + } + return true; +efault: force_sig(SIGSEGV); return false; } @@ -404,70 +414,24 @@ static bool rseq_reset_ids(void) /* The original rseq structure size (including padding) is 32 bytes. */ #define ORIG_RSEQ_SIZE 32 -/* - * sys_rseq - setup restartable sequences for caller thread. - */ -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) +static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) { + bool optimized = IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE; u32 rseqfl = 0; - if (flags & RSEQ_FLAG_UNREGISTER) { - if (flags & ~RSEQ_FLAG_UNREGISTER) - return -EINVAL; - /* Unregister rseq for current thread. */ - if (current->rseq.usrptr != rseq || !current->rseq.usrptr) - return -EINVAL; - if (rseq_len != current->rseq.len) - return -EINVAL; - if (current->rseq.sig != sig) - return -EPERM; - if (!rseq_reset_ids()) - return -EFAULT; - rseq_reset(current); - return 0; - } - - if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))) - return -EINVAL; - - if (current->rseq.usrptr) { - /* - * If rseq is already registered, check whether - * the provided address differs from the prior - * one. - */ - if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) - return -EINVAL; - if (current->rseq.sig != sig) - return -EPERM; - /* Already registered. */ - return -EBUSY; - } - - /* - * If there was no rseq previously registered, ensure the provided rseq - * is properly aligned, as communcated to user-space through the ELF - * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq - * size, the required alignment is the original struct rseq alignment. - * - * The rseq_len is required to be greater or equal to the original rseq - * size. In order to be valid, rseq_len is either the original rseq size, - * or large enough to contain all supported fields, as communicated to - * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. - */ - if (rseq_len < ORIG_RSEQ_SIZE || - (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) || - (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) || - rseq_len < offsetof(struct rseq, end)))) - return -EINVAL; if (!access_ok(rseq, rseq_len)) return -EFAULT; - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { - rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; - if (rseq_slice_extension_enabled() && - (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) - rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; + /* + * The optimized check disables time slice extensions for legacy + * registrations. 
+ */ + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && optimized) { + if (rseq_slice_extension_enabled()) { + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; + if (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON) + rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_ENABLED; + } } scoped_user_write_access(rseq, efault) { @@ -485,7 +449,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); unsafe_put_user(0U, &rseq->node_id, efault); unsafe_put_user(0U, &rseq->mm_cid, efault); - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); + + /* + * All fields past mm_cid are only valid for non-legacy registrations + * which register with rseq_len > ORIG_RSEQ_SIZE. + */ + if (optimized) { + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); + } } /* @@ -501,11 +473,11 @@ SYSCALL_DEFINE4(rseq, struct rseq __user #endif /* - * If rseq was previously inactive, and has just been - * registered, ensure the cpu_id_start and cpu_id fields - * are updated before returning to user-space. + * Ensure the cpu_id_start and cpu_id fields are updated before + * returning to user-space. */ current->rseq.event.has_rseq = true; + current->rseq.event.optimized = optimized; rseq_force_update(); return 0; @@ -513,6 +485,80 @@ SYSCALL_DEFINE4(rseq, struct rseq __user return -EFAULT; } +static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) +{ + if (flags & ~RSEQ_FLAG_UNREGISTER) + return -EINVAL; + if (current->rseq.usrptr != rseq || !current->rseq.usrptr) + return -EINVAL; + if (rseq_len != current->rseq.len) + return -EINVAL; + if (current->rseq.sig != sig) + return -EPERM; + if (!rseq_reset_ids()) + return -EFAULT; + rseq_reset(current); + return 0; +} + +static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig) +{ + /* + * If rseq is already registered, check whether the provided address + * differs from the prior one. + */ + if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) + return -EINVAL; + if (current->rseq.sig != sig) + return -EPERM; + /* Already registered. */ + return -EBUSY; +} + +static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len) +{ + /* + * Ensure the provided rseq is properly aligned, as communicated to + * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If + * rseq_len is the original rseq size, the required alignment is the + * original struct rseq alignment. + * + * In order to be valid, rseq_len is either the original rseq size, or + * large enough to contain all supported fields, as communicated to + * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. + */ + if (rseq_len < ORIG_RSEQ_SIZE) + return false; + + if (rseq_len == ORIG_RSEQ_SIZE) + return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE); + + return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) && + rseq_len >= offsetof(struct rseq, end); +} + +#define RSEQ_FLAGS_SUPPORTED (RSEQ_FLAG_SLICE_EXT_DEFAULT_ON) + +/* + * sys_rseq - Register or unregister restartable sequences for the caller thread. 
+ */ +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) +{ + if (flags & RSEQ_FLAG_UNREGISTER) + return rseq_unregister(rseq, rseq_len, flags, sig); + + if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED)) + return -EINVAL; + + if (current->rseq.usrptr) + return rseq_reregister(rseq, rseq_len, sig); + + if (!rseq_length_valid(rseq, rseq_len)) + return -EINVAL; + + return rseq_register(rseq, rseq_len, flags, sig); +} + #ifdef CONFIG_RSEQ_SLICE_EXTENSION struct slice_timer { struct hrtimer timer; @@ -713,6 +759,8 @@ int rseq_slice_extension_prctl(unsigned return -ENOTSUPP; if (!current->rseq.usrptr) return -ENXIO; + if (!current->rseq.event.optimized) + return -ENOTSUPP; /* No change? */ if (enable == !!current->rseq.slice.state.enabled) --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -199,7 +199,16 @@ static void ipi_rseq(void *info) * is negligible. */ smp_mb(); - rseq_sched_switch_event(current); + /* + * Legacy mode requires that IDs are written and the critical section is + * evaluated. Optimized mode handles the critical section and IDs are + * only updated if they change as a consequence of preemption after + * return from this IPI. + */ + if (rseq_optimized(current)) + rseq_sched_switch_event(current); + else + rseq_force_update(); } static void ipi_sync_rq_state(void *info) --- a/Documentation/userspace-api/rseq.rst +++ b/Documentation/userspace-api/rseq.rst @@ -24,6 +24,81 @@ Quick access to CPU number, node ID Allows to implement per CPU data efficiently. Documentation is in code and selftests. :( +Optimized RSEQ V2 +----------------- + +On architectures which utilize the generic entry code and generic TIF bits +the kernel supports runtime optimizations for RSEQ, which also enable +enhanced features like scheduler time slice extensions. + +To enable them a task has to register the RSEQ region with at least the +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). + +If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel +keeps the legacy low performance mode enabled to fulfil the expectations +of existing users regarding the original RSEQ implementation behaviour. + +The following table documents the ABI and behavioral guarantees of the +legacy and the optimized V2 mode. + +.. list-table:: RSEQ modes + :header-rows: 1 + + * - Nr + - What + - Legacy + - Optimized V2 + * - 1 + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read + only) + - Updated by the kernel unconditionally after each context switch and + before signal delivery + - Updated by the kernel if and only if they change, i.e. if the task + is migrated or mm_cid changes + * - 2 + - The rseq_cs critical section field + - Evaluated and handled unconditionally after each context switch and + before signal delivery + - Evaluated and handled conditionally only when user space was + interrupted. Either after being preempted or before signal delivery + in the interrupted context. + * - 3 + - Read only fields + - No strict enforcement except in debug mode + - Strict enforcement + * - 4 + - membarrier(...RSEQ) + - All running threads of the process are interrupted and the ID fields + are rewritten and eventually active critical sections are aborted + before they return to user space. All threads which are scheduled + out whether voluntary or not are covered by #1/#2 above. + - All running threads of the process are interrupted and eventually + active critical sections are aborted before these threads return to + user space. 
The ID fields are only updated if changed as a + consequence of the interrupt. All threads which are scheduled out + whether voluntary or not are covered by #1/#2 above. + * - 5 + - Time slice extensions + - Not supported + - Supported + +The legacy mode is obviously less performant as it does unconditional +updates and critical section checks even if not strictly required by the +ABI contract. That can't be changed anymore as some users depend on that +observed behavior, which in turn enables them to violate the ABI and +overwrite the cpu_id_start field for their own purposes. This is obviously +discouraged as it renders RSEQ incompatible with the intended usage and +breaks the expectation of other libraries in the same application. + +The ABI compliant optimized mode, which respects the read only fields, does +not require unconditional updates and therefore is way more performant. The +kernel validates the read only fields for compliance. If user space +modifies them, the process is killed. Compliant usage allows multiple +libraries in the same application to benefit from the RSEQ functionality +without disturbing each other. The ABI compliant optimized mode also +enables extended RSEQ features like time slice extensions. + + Scheduler time slice extensions ------------------------------- @@ -37,7 +112,8 @@ scheduled out inside of the critical sec * Enabled at boot time (default is enabled) - * A rseq userspace pointer has been registered for the thread + * A rseq userspace pointer has been registered for the thread in + optimized V2 mode The thread has to enable the functionality via prctl(2):: ^ permalink raw reply [flat|nested] 33+ messages in thread
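For readers following along, here is a minimal user-space sketch of the registration contract the table above describes. It is illustrative only and not part of the patch: RSEQ_SIG is the conventional signature value used by glibc and librseq rather than something the kernel mandates, the fallback auxv constants come from linux/auxvec.h, and the sketch assumes glibc's own rseq registration is disabled (GLIBC_TUNABLES=glibc.pthread.rseq=0), as in the reproducer instructions elsewhere in this thread.

/*
 * Sketch: register rseq with the feature size advertised via auxv. On a
 * kernel carrying this series, AT_RSEQ_FEATURE_SIZE is expected to be
 * larger than 32, so rseq_len > 32 selects the optimized V2 mode; on a
 * mainline kernel the same call is a plain extensible-rseq registration
 * and behaves as legacy.
 */
#include <linux/rseq.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/auxv.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef AT_RSEQ_FEATURE_SIZE
# define AT_RSEQ_FEATURE_SIZE	27	/* from linux/auxvec.h */
# define AT_RSEQ_ALIGN		28
#endif

#define RSEQ_SIG	0x53053053U	/* conventional, must match the signature before abort handlers */

int main(void)
{
	unsigned long feat = getauxval(AT_RSEQ_FEATURE_SIZE);
	unsigned long align = getauxval(AT_RSEQ_ALIGN);
	struct rseq *rs;
	unsigned int len;

	/* Kernels without the auxv entries only know the original 32-byte ABI. */
	if (!feat)
		feat = 32;
	if (!align)
		align = 32;

	/* A legacy registration would pass len == 32 here instead. */
	len = feat > 32 ? feat : 32;

	/* The area must stay allocated and registered for the thread's lifetime. */
	rs = aligned_alloc(align, (len + align - 1) & ~(align - 1));
	if (!rs)
		return 1;
	memset(rs, 0, len);

	if (syscall(__NR_rseq, rs, len, 0, RSEQ_SIG)) {
		perror("rseq");	/* fails e.g. if glibc already registered an area */
		return 1;
	}
	printf("registered len=%u (V2 eligible: %s) cpu_id=%u\n",
	       len, len > 32 ? "yes" : "no", rs->cpu_id);
	return 0;
}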
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 8:07 ` Thomas Gleixner @ 2026-04-28 8:18 ` Thomas Gleixner 2026-04-28 10:53 ` Dmitry Vyukov 0 siblings, 1 reply; 33+ messages in thread From: Thomas Gleixner @ 2026-04-28 8:18 UTC (permalink / raw) To: Dmitry Vyukov Cc: Peter Zijlstra, Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28 2026 at 10:07, Thomas Gleixner wrote: >> Not sure if that's it, or there is something else. > > Can you try the updated version below? Is there a pre-compiled version of those tcmalloc tests somewhere? I tried to build it from source, but I really have better things to do than wasting my time on debugging this bazel nonsense. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 8:18 ` Thomas Gleixner @ 2026-04-28 10:53 ` Dmitry Vyukov 2026-04-28 13:31 ` Mathias Stearn 0 siblings, 1 reply; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-28 10:53 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 1511 bytes --] On Tue, 28 Apr 2026 at 10:18, Thomas Gleixner <tglx@kernel.org> wrote: > > On Tue, Apr 28 2026 at 10:07, Thomas Gleixner wrote: > >> Not sure if that's it, or there is something else. > > > > Can you try the updated version below? > > Is there a pre-compiled version of those tcmalloc tests somewhere? > > I tried to build it from source, but I really have better things to do > than wasting my time on debugging this bazel nonsense. I feel your pain. I've attached an archive with 2 tests that I used. I've managed to build them using: USE_BAZEL_VERSION=8.6.0 bazelisk build --repo_env=CC=clang --repo_env=CXX=clang++ --dynamic_mode=off --fission=no --strip=never --copt=-gmlt //tcmalloc/internal:percpu_tcmalloc_test //tcmalloc/testing:background_test_no_glibc_rseq on tcmalloc commit 8e98046ec5639bffbe70a53770a2699dd355b26d I unfortunately did not manage to build fully static binaries, so these still depend on a relatively recent libc. I had to build a Debian Forky image for qemu to run these binaries... I also used the stress utility to run these because they crash non-deterministically: https://pkg.go.dev/golang.org/x/tools/cmd/stress It allows running a binary in a parallel loop, looking for different failure modes and collecting logs. Hopefully this should work: $ go install golang.org/x/tools/cmd/stress@latest $ GLIBC_TUNABLES=glibc.pthread.rseq=0 $HOME/go/bin/stress ./percpu_tcmalloc_test But otherwise you can try just: $ GLIBC_TUNABLES=glibc.pthread.rseq=0 ./percpu_tcmalloc_test [-- Attachment #2: tcmalloc.tar.gz --] [-- Type: application/gzip, Size: 5373002 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 10:53 ` Dmitry Vyukov @ 2026-04-28 13:31 ` Mathias Stearn 2026-04-28 15:46 ` Thomas Gleixner 0 siblings, 1 reply; 33+ messages in thread From: Mathias Stearn @ 2026-04-28 13:31 UTC (permalink / raw) To: Dmitry Vyukov Cc: Thomas Gleixner, Peter Zijlstra, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds [-- Attachment #1: Type: text/plain, Size: 702 bytes --] On Tue, Apr 28, 2026 at 11:54 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > On Tue, 28 Apr 2026 at 10:18, Thomas Gleixner <tglx@kernel.org> wrote: > > > > Is there a pre-compiled version of those tcmalloc tests somewhere? > > I've attached an archive with 2 tests that I used. Here is an additional test. It is the stress test I used to show that it could result in two live allocations getting the same address. It will run for up to a minute or until the first double allocation gets detected (usually within 30ms on 6.19). Arm binary is linked against glibc-2.35, x86 is 2.39. Should have no other runtime deps. GLIBC_TUNABLES=glibc.pthread.rseq=0 ./double_alloc_test.ARCH [-- Attachment #2: double_alloc_test.x86.gz --] [-- Type: application/gzip, Size: 989995 bytes --] [-- Attachment #3: double_alloc_test.arm.gz --] [-- Type: application/gzip, Size: 940006 bytes --] [-- Attachment #4: double_alloc_test.cc --] [-- Type: text/x-c-code, Size: 6012 bytes --] // Stress test to detect double-allocation caused by the Linux 6.19 rseq bug. // // On Linux 6.19, membarrier RSEQ IPI no longer writes cpu_id_start. // This breaks tcmalloc's StopCpu protocol: ShrinkOtherCache/DrainCpu can // read slab objects concurrently with a Pop on the same CPU, giving two // callers the same pointer (silent heap corruption). // // Detection: each allocation is stamped with a per-thread canary. If another // thread receives the same pointer, it overwrites the canary. The original // owner detects this on its next verification pass. 
#include <sched.h> #include <stdint.h> #include <string.h> #include <unistd.h> #include <atomic> #include <cstdio> #include <thread> #include <vector> #include "absl/time/clock.h" #include "absl/time/time.h" #include "tcmalloc/malloc_extension.h" namespace tcmalloc { namespace { constexpr int kNumThreads = 64; constexpr int kMaxLivePerThread = 800; constexpr absl::Duration kTestDuration = absl::Minutes(1); constexpr size_t kAllocSizes[] = {16, 32, 48, 64, 80, 128, 256}; constexpr int kNumSizes = sizeof(kAllocSizes) / sizeof(kAllocSizes[0]); struct Alloc { void* ptr; size_t size; uint64_t canary; }; static uint64_t MakeCanary(int tid, uint64_t counter) { return (static_cast<uint64_t>(tid + 1) << 48) | (counter & 0xFFFFFFFFFFFFULL); } static int CanaryTid(uint64_t canary) { return static_cast<int>(canary >> 48) - 1; } static void StampAlloc(void* ptr, size_t size, uint64_t canary) { auto* p = static_cast<volatile uint64_t*>(ptr); size_t n = size / sizeof(uint64_t); for (size_t i = 0; i < n; ++i) { p[i] = canary; } } static bool VerifyAlloc(const Alloc& a) { auto* p = static_cast<volatile uint64_t*>(a.ptr); return p[0] == a.canary; } void test() { MallocExtension::SetBackgroundProcessSleepInterval(absl::Milliseconds(1)); std::thread background([] { MallocExtension::ProcessBackgroundActions(); }); std::atomic<bool> stop{false}; std::atomic<int> canary_corruptions{0}; std::atomic<uint64_t> total_allocs{0}; const auto start_time = absl::Now(); std::vector<std::thread> threads; for (int tid = 0; tid < kNumThreads; ++tid) { threads.emplace_back([&, tid] { std::vector<Alloc> live; live.reserve(kMaxLivePerThread + 128); uint64_t counter = 0; uint32_t rng = tid * 2654435761u + 1; while (!stop.load(std::memory_order_relaxed)) { rng = rng * 1103515245 + 12345; size_t alloc_size = kAllocSizes[rng % kNumSizes]; for (int i = 0; i < 64 && static_cast<int>(live.size()) < kMaxLivePerThread; ++i) { void* p = ::operator new(alloc_size); uint64_t canary = MakeCanary(tid, ++counter); StampAlloc(p, alloc_size, canary); live.push_back({p, alloc_size, canary}); total_allocs.fetch_add(1, std::memory_order_relaxed); } for (size_t i = 0; i < live.size(); ++i) { if (!VerifyAlloc(live[i])) { auto* p = static_cast<volatile uint64_t*>(live[i].ptr); uint64_t found = p[0]; int found_tid = CanaryTid(found); int expected_tid = CanaryTid(live[i].canary); int corruptions = canary_corruptions.fetch_add(1, std::memory_order_relaxed) + 1; fprintf(stderr, "*** DOUBLE ALLOCATION DETECTED (#%d) ***\n" " ptr=%p size=%zu\n" " expected canary=0x%016lx (tid=%d)\n" " found canary=0x%016lx (tid=%d)\n", corruptions, live[i].ptr, live[i].size, (unsigned long)live[i].canary, expected_tid, (unsigned long)found, found_tid); live[i].ptr = nullptr; stop.store(true, std::memory_order_relaxed); stop.notify_all(); } } size_t w = 0; for (size_t r = 0; r < live.size(); ++r) { if (live[r].ptr != nullptr) { if (w != r) live[w] = live[r]; ++w; } } live.resize(w); rng = rng * 1103515245 + 12345; int to_free = live.size() / 2; for (int i = 0; i < to_free; ++i) { auto& a = live.back(); if (a.ptr) { if (!VerifyAlloc(a)) { auto* p = static_cast<volatile uint64_t*>(a.ptr); uint64_t found = p[0]; canary_corruptions.fetch_add(1, std::memory_order_relaxed); fprintf(stderr, "*** DOUBLE ALLOCATION DETECTED (at free) ***\n" " ptr=%p expected=0x%016lx found=0x%016lx\n", a.ptr, (unsigned long)a.canary, (unsigned long)found); a.ptr = nullptr; stop.store(true, std::memory_order_relaxed); stop.notify_all(); } else { ::operator delete(a.ptr, a.size); } } live.pop_back(); 
} } if (canary_corruptions.load(std::memory_order_relaxed) == 0) { for (auto& a : live) { if (a.ptr) ::operator delete(a.ptr, a.size); } } }); } std::thread([&]{ absl::SleepFor(kTestDuration); stop.store(true, std::memory_order_relaxed); stop.notify_all(); }).detach(); stop.wait(false); const auto elapsed = absl::ToDoubleSeconds(absl::Now() - start_time); for (auto& t : threads) t.join(); MallocExtension::SetBackgroundProcessActionsEnabled(false); background.join(); uint64_t ops = total_allocs.load(); int corruptions = canary_corruptions.load(); bool pass = (corruptions == 0); fprintf(stderr, "\n=== Results ===\n" "Time: %fs\n" "Total allocations: %lu\n" "Canary corruptions (double allocations): %d\n %s\n", elapsed, (unsigned long)ops, corruptions, pass ? "PASS" : "FAIL"); _exit(!pass); } } // namespace } // namespace tcmalloc int main() { tcmalloc::test(); return 0; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 13:31 ` Mathias Stearn @ 2026-04-28 15:46 ` Thomas Gleixner 0 siblings, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2026-04-28 15:46 UTC (permalink / raw) To: Mathias Stearn, Dmitry Vyukov Cc: Peter Zijlstra, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28 2026 at 15:31, Mathias Stearn wrote: > On Tue, Apr 28, 2026 at 11:54 AM Dmitry Vyukov <dvyukov@google.com> wrote: >> >> On Tue, 28 Apr 2026 at 10:18, Thomas Gleixner <tglx@kernel.org> wrote: >> > >> > Is there a pre-compiled version of those tcmalloc tests somewhere? >> >> I've attached an archive with 2 tests that I used. > > Here is an additional test. It is the stress test I used to show that > it could result in two live allocations getting the same address. It > will run for up to a minute or until the first double allocation gets > detected (usually within 30ms on 6.19). Thanks to both of you for providing those binaries. I've run all three binaries now on my latest version in parallel for quite some time and it seems to hold up. Mark just told me privately that these plus the arm64 fix he's working on survive that double allocation test. Let me go and write a cover letter and post the pile. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-26 22:04 ` Thomas Gleixner 2026-04-27 7:40 ` Florian Weimer 2026-04-28 6:11 ` Dmitry Vyukov @ 2026-04-28 7:39 ` Peter Zijlstra 2026-04-28 8:13 ` Peter Zijlstra 2026-04-28 8:51 ` Thomas Gleixner 2026-04-28 8:03 ` Peter Zijlstra 3 siblings, 2 replies; 33+ messages in thread From: Peter Zijlstra @ 2026-04-28 7:39 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Mon, Apr 27, 2026 at 12:04:48AM +0200, Thomas Gleixner wrote: > --- a/include/linux/rseq.h > +++ b/include/linux/rseq.h > @@ -9,6 +9,11 @@ > > void __rseq_handle_slowpath(struct pt_regs *regs); > > +static __always_inline bool rseq_optimized(struct task_struct *t) > +{ > + return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized); > +} > + > /* Invoked from resume_user_mode_work() */ > static inline void rseq_handle_slowpath(struct pt_regs *regs) > { > @@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru > */ > static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) > { > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) { rseq_optimized() already implies GENERIC_IRQ_ENTRY > /* '&' is intentional to spare one conditional branch */ > if (current->rseq.event.has_rseq & current->rseq.event.user_irq) > __rseq_signal_deliver(ksig->sig, regs); > @@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s > { > struct rseq_event *ev = &t->rseq.event; > > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + /* > + * Only apply the user_irq optimization for RSEQ ABI V2 > + * registrations. Legacy users like TCMalloc rely on the historical ABI > + * V1 behaviour which updates IDs on every context switch. > + */ > + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) { idem. > --- a/include/linux/rseq_types.h > +++ b/include/linux/rseq_types.h > @@ -18,6 +18,7 @@ struct rseq; > * @ids_changed: Indicator that IDs need to be updated > * @user_irq: True on interrupt entry from user mode > * @has_rseq: True if the task has a rseq pointer installed > + * @optimized: RSEQ ABI V2 optimized mode > * @error: Compound error code for the slow path to analyze > * @fatal: User space data corrupted or invalid > * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME > @@ -41,7 +42,7 @@ struct rseq_event { > }; > > u8 has_rseq; > - u8 __pad; > + u8 optimized; > union { > u16 error; > struct { I know you like the 'optimized' name, it is faster etc. However, the description there suggests has_rseq_v2 would not be a bad name. And while I write this, I figured we could have the value of has_rseq be 2, rather than 1, but this might end up generating worse code, dunno, haven't tried yet. > +static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len) > +{ > + if (rseq_len < ORIG_RSEQ_SIZE) > + return false; > + > + /* > + * Ensure the provided rseq is properly aligned, as communicated to > + * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If > + * rseq_len is the original rseq size, the required alignment is the > + * original struct rseq alignment.
> + * > + * The rseq_len is required to be greater than or equal to the original > + * rseq size. > + * > + * In order to be valid, rseq_len is either the original rseq size, or > + * large enough to contain all supported fields, as communicated to > + * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. > + */ > + if (rseq_len < ORIG_RSEQ_SIZE) > + return false; You just did that check, I doubt it'll have changed since the comment ;-) > + if (rseq_len == ORIG_RSEQ_SIZE) > + return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE); > + > + return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) && > + rseq_len >= offsetof(struct rseq, end); > +} Given we really only differentiate between ORIG_RSEQ_SIZE (32) and sizeof(struct rseq), perhaps we should also add something like: if (rseq_len != sizeof(struct rseq)) return false; ? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 7:39 ` Peter Zijlstra @ 2026-04-28 8:13 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2026-04-28 8:13 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28, 2026 at 09:39:38AM +0200, Peter Zijlstra wrote: > > + return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) && > > + rseq_len >= offsetof(struct rseq, end); > > +} > > Given we really only differentiate between ORIG_RSEQ_SIZE (32) and > sizeof(struct rseq), perhaps we should also add something like: > > if (rseq_len != sizeof(struct rseq)) > return false; > Wakeup juice, I need more wakeup juice :-) It's there, except written weirdly with that offsetof thing. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 7:39 ` Peter Zijlstra 2026-04-28 8:13 ` Peter Zijlstra @ 2026-04-28 8:51 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2026-04-28 8:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28 2026 at 09:39, Peter Zijlstra wrote: > On Mon, Apr 27, 2026 at 12:04:48AM +0200, Thomas Gleixner wrote: >> static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) >> { >> - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { >> + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) { > > rseq_optimized() already implies GENERIC_IRQ_ENTRY Indeed. >> + u8 optimized; >> union { >> u16 error; >> struct { > > I know you like the 'optimized' name, it is faster etc. However, the > description there suggests has_rseq_v2 would not be a bad name. > > And while I write this, I figured we could have the value of has_rseq be > 2, rather than 1, but this might end up generating worse code, dunno, > haven't tried yet. Tried that early on and it was worse, but the approach has changed since then and I just validated that it's actually fine to do so. Fixed. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-26 22:04 ` Thomas Gleixner ` (2 preceding siblings ...) 2026-04-28 7:39 ` Peter Zijlstra @ 2026-04-28 8:03 ` Peter Zijlstra 2026-04-28 8:36 ` Thomas Gleixner 1 sibling, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2026-04-28 8:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Mon, Apr 27, 2026 at 12:04:48AM +0200, Thomas Gleixner wrote: > +Optimized RSEQ V2 > +----------------- > + > +On architectures which utilize the generic entry code and generic TIF bits > +the kernel supports runtime optimizations for RSEQ, which also enable > +enhanced features like scheduler time slice extensions. > + > +To enable them a task has to register the RSEQ region with at least the > +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). > + > +If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel > +keeps the legacy low performance mode enabled to fulfil the expectations of > +existing users regarding the original RSEQ implementation behaviour. > + > +The following table documents the ABI and behavioral guarantees of the > +legacy and the optimized V2 mode. > + > +.. list-table:: RSEQ modes > + :header-rows: 1 > + > + * - Nr > + - What > + - Legacy > + - Optimized V2 > + * - 1 > + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read > + only) > + - Updated by the kernel unconditionally after each context switch and > + before signal delivery > + - Updated by the kernel if and only if they change, i.e. if the task > + is migrated or mm_cid changes > + * - 2 > + - The rseq_cs critical section field > + - Evaluated and handled unconditionally after each context switch and > + before signal delivery > + - Evaluated and handled conditionally only when user space was > + interrupted. Either after being preempted or before signal delivery > + in the interrupted context. > + * - 3 > + - Read only fields > + - No strict enforcement except in debug mode > + - Strict enforcement > + * - 4 > + - membarrier(...RSEQ) > + - All running threads of the process are interrupted and the ID fields > + are rewritten and eventually active critical sections are aborted > + before they return to user space. All threads which are scheduled > + out whether voluntary or not are covered by #1/#2 above. > + - All running threads of the process are interrupted and eventually > + active critical sections are aborted before these threads return to > + user space. The ID fields are only updated if changed as a > + consequence of the interrupt. All threads which are scheduled out > + whether voluntary or not are covered by #1/#2 above. > + * - 5 > + - Time slice extensions > + - Not supported > + - Supported I'm sure it's cute when rendered, but when read as text this is nigh on unreadable. > +The legacy mode is obviously less performant as it does unconditional > +updates and critical section checks even if not strictly required by the > +ABI contract. That can't be changed anymore as some users depend on that > +observed behavior, which in turn enables them to violate the ABI and > +overwrite the cpu_id_start field for their own purposes.
This is obviously > +discouraged as it renders RSEQ incompatible with the intended usage and > +breaks the expectation of other libraries in the same application. > + > +The ABI compliant optimized mode, which respects the read only fields, does > +not require unconditional updates and therefore is way more performant. The > +kernel validates the read only fields for compliance. If user space > +modifies them, the process is killed. Compliant usage allows multiple > +libraries in the same application to benefit from the RSEQ functionality > +without disturbing each other. > + ^ permalink raw reply [flat|nested] 33+ messages in thread
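To make the "observed behavior" dependence concrete, the schematic below shows the kind of pattern such a legacy user relies on. It is a hypothetical illustration, not actual tcmalloc code; the tag value and helper names are invented for the example.

/*
 * Legacy-dependent pattern: user space stores a private value into the
 * nominally read-only cpu_id_start field and infers "this thread was
 * scheduled out" from the kernel restoring a real CPU number. Legacy
 * mode rewrites the IDs after every context switch, so the tag reliably
 * disappears across preemption. Optimized V2 mode writes the IDs only
 * when they actually change, and additionally enforces the read-only
 * contract, so the store below would get the process killed.
 */
#include <linux/rseq.h>
#include <stdint.h>

static struct rseq *rs;		/* set to this thread's registered rseq area */

#define PRIVATE_TAG	0x80000000U	/* assumed never to collide with a real CPU id */

static inline void stash_private_state(uint32_t token)
{
	/* ABI violation: cpu_id_start is owned by the kernel. */
	__atomic_store_n(&rs->cpu_id_start, PRIVATE_TAG | token, __ATOMIC_RELAXED);
}

static inline int lost_cpu_since_stash(uint32_t token)
{
	uint32_t v = __atomic_load_n(&rs->cpu_id_start, __ATOMIC_RELAXED);

	/* Tag gone: the kernel rewrote the field, i.e. a switch happened. */
	return v != (PRIVATE_TAG | token);
}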
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-28 8:03 ` Peter Zijlstra @ 2026-04-28 8:36 ` Thomas Gleixner 0 siblings, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2026-04-28 8:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds On Tue, Apr 28 2026 at 10:03, Peter Zijlstra wrote: > On Mon, Apr 27, 2026 at 12:04:48AM +0200, Thomas Gleixner wrote: >> + * - 5 >> + - Time slice extensions >> + - Not supported >> + - Supported > > I'm sure its cute when rendered, but when read as text this is nigh on > unreadable. I tried the other table variants. They are even worse :) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 5:53 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner @ 2026-04-23 12:11 ` Alejandro Colomar 2026-04-23 12:54 ` Mathieu Desnoyers 2026-04-23 12:29 ` Mathieu Desnoyers 2 siblings, 1 reply; 33+ messages in thread From: Alejandro Colomar @ 2026-04-23 12:11 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson [-- Attachment #1: Type: text/plain, Size: 4335 bytes --] Hello Dmitry, On 2026-04-23T07:53:55+0200, Dmitry Vyukov wrote: > On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote: > > > > On 4/23/2026 3:47 AM, Thomas Gleixner wrote: > > > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: > > >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > > >> Conceptually we just need to use syscall_enter_from_user_mode() and > > >> irqentry_enter_from_user_mode() appropriately. > > > > > > Right. I figured that out. > > > > > >> In practice, I can't use those as-is without introducing the exception > > >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(), > > >> so I'll need to do some similar refactoring first. > > > > > > See below. > > > > > >> I haven't paged everything in yet, so just to cehck, is there anything > > >> that would behave incorrectly if current->rseq.event.user_irq were set > > >> for syscall entry? IIUC it means we'll effectively do the slow path, and > > >> I was wondering if that might be acceptable as a one-line bodge for > > >> stable. > > > > > > It might work, but it's trivial enough to avoid that. See below. That on > > > top of 6.19.y makes the selftests pass too. > > > > This aligns with my thoughts when convert arm64 to generic syscall > > entry. Currently, the arm64 entry code does not distinguish between IRQ > > and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ > > entries as the generic entry framework does, because arm64 uses > > enter_from_user_mode() exclusively instead of > > irqentry_enter_from_user_mode(). > > > > https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/ > > > > > > > > Thanks, > > > > > > tglx > > > --- > > > arch/arm64/kernel/entry-common.c | 14 ++++++++++---- > > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > > > --- a/arch/arm64/kernel/entry-common.c > > > +++ b/arch/arm64/kernel/entry-common.c > > > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( > > > irqentry_exit(regs, state); > > > } > > > > > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) > > > +{ > > > + enter_from_user_mode(regs); > > > + mte_disable_tco_entry(current); > > > +} > > > + > > > /* > > > * Handle IRQ/context state management when entering from user mode. 
> > > * Before this function is called it is not safe to call regular kernel code, > > > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( > > > */ > > > static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) > > > { > > > - enter_from_user_mode(regs); > > > - mte_disable_tco_entry(current); > > > + arm64_enter_from_user_mode_syscall(regs); > > > + rseq_note_user_irq_entry(); > > > } > > > > > > /* > > > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ > > > > > > static void noinstr el0_svc(struct pt_regs *regs) > > > { > > > - arm64_enter_from_user_mode(regs); > > > + arm64_enter_from_user_mode_syscall(regs); > > > cortex_a76_erratum_1463225_svc_handler(); > > > fpsimd_syscall_enter(); > > > local_daif_restore(DAIF_PROCCTX); > > > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r > > > > > > static void noinstr el0_svc_compat(struct pt_regs *regs) > > > { > > > - arm64_enter_from_user_mode(regs); > > > + arm64_enter_from_user_mode_syscall(regs); > > > cortex_a76_erratum_1463225_svc_handler(); > > > local_daif_restore(DAIF_PROCCTX); > > > do_el0_svc_compat(regs); > > > +linux-man > > This part of the rseq man page needs to be fixed as well I think. The > kernel no longer reliably provides clearing of rseq_cs on preemption, > right? > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 +Michael Jeanson That page seems to be maintained separately, as part of the librseq project. Have a lovely day! Alex > > "and set to NULL by the kernel when it restarts an assembly > instruction sequence block, > as well as when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs." > -- <https://www.alejandro-colomar.es> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:11 ` Alejandro Colomar @ 2026-04-23 12:54 ` Mathieu Desnoyers 0 siblings, 0 replies; 33+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:54 UTC (permalink / raw) To: Alejandro Colomar, Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On 2026-04-23 08:11, Alejandro Colomar wrote: [...] >> >> +linux-man >> >> This part of the rseq man page needs to be fixed as well I think. The >> kernel no longer reliably provides clearing of rseq_cs on preemption, >> right? >> >> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > > +Michael Jeanson > > That page seems to be maintained separately, as part of the librseq > project. Yes, I maintain the librseq project, thanks Alejandro! Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 5:53 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner 2026-04-23 12:11 ` Alejandro Colomar @ 2026-04-23 12:29 ` Mathieu Desnoyers 2026-04-23 12:36 ` Dmitry Vyukov 2 siblings, 1 reply; 33+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:29 UTC (permalink / raw) To: Dmitry Vyukov, Jinjie Ruan, linux-man Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On 2026-04-23 01:53, Dmitry Vyukov wrote: [...] > +linux-man > > This part of the rseq man page needs to be fixed as well I think. The > kernel no longer reliably provides clearing of rseq_cs on preemption, > right? > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 I'm maintaining this manual page in librseq. > > "and set to NULL by the kernel when it restarts an assembly > instruction sequence block, > as well as when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs." I think you got two things confused here. 1) There is currently a bug on arm64 where it fails to honor the rseq ABI contract wrt critical section abort. AFAIU there is a fix proposed for this. 2) Thomas relaxed the implementation of cpu_id_start field updates so it only stores to the rseq area when the current cpu actually changes (migration). So AFAIU the statement in the man page is still fine. It's just arm64 that needs fixing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:29 ` Mathieu Desnoyers @ 2026-04-23 12:36 ` Dmitry Vyukov 2026-04-23 12:53 ` Mathieu Desnoyers 0 siblings, 1 reply; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-23 12:36 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2026-04-23 01:53, Dmitry Vyukov wrote: > [...] > > +linux-man > > > > This part of the rseq man page needs to be fixed as well I think. The > > kernel no longer reliably provides clearing of rseq_cs on preemption, > > right? > > > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > > I'm maintaining this manual page in librseq. > > > > > "and set to NULL by the kernel when it restarts an assembly > > instruction sequence block, > > as well as when the kernel detects that it is preempting or delivering > > a signal outside of the range targeted by the rseq_cs." > > I think you got two things confused here. > > 1) There is currently a bug on arm64 where it fails to honor the > rseq ABI contract wrt critical section abort. AFAIU there is a > fix proposed for this. > > 2) Thomas relaxed the implementation of cpu_id_start field updates > so it only stores to the rseq area when the current cpu actually > changes (migration). > > So AFAIU the statement in the man page is still fine. It's just arm64 > that needs fixing. My understanding was that due to the ev->user_irq check here: +static __always_inline void rseq_sched_switch_event(struct task_struct *t) ... + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; + + if (raise) { + ev->sched_switch = true; + rseq_raise_notify_resume(t); + } There won't be any rseq-related processing for threads preempted in syscalls, which means that rseq_cs won't be NULLed for threads preempted inside of syscalls. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:36 ` Dmitry Vyukov @ 2026-04-23 12:53 ` Mathieu Desnoyers 2026-04-23 12:58 ` Dmitry Vyukov 0 siblings, 1 reply; 33+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:53 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On 2026-04-23 08:36, Dmitry Vyukov wrote: > On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2026-04-23 01:53, Dmitry Vyukov wrote: >> [...] >>> +linux-man >>> >>> This part of the rseq man page needs to be fixed as well I think. The >>> kernel no longer reliably provides clearing of rseq_cs on preemption, >>> right? >>> >>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 >> >> I'm maintaining this manual page in librseq. >> >>> >>> "and set to NULL by the kernel when it restarts an assembly >>> instruction sequence block, >>> as well as when the kernel detects that it is preempting or delivering >>> a signal outside of the range targeted by the rseq_cs." >> >> I think you got two things confused here. >> >> 1) There is currently a bug on arm64 where it fails to honor the >> rseq ABI contract wrt critical section abort. AFAIU there is a >> fix proposed for this. >> >> 2) Thomas relaxed the implementation of cpu_id_start field updates >> so it only stores to the rseq area when the current cpu actually >> changes (migration). >> >> So AFAIU the statement in the man page is still fine. It's just arm64 >> that needs fixing. > > > My understanding was that due to the ev->user_irq check here: > > +static __always_inline void rseq_sched_switch_event(struct task_struct *t) > ... > + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; > + > + if (raise) { > + ev->sched_switch = true; > + rseq_raise_notify_resume(t); > + } > > There won't be any rseq-related processing for threads preempted in > syscalls, which means that rseq_cs won't be NULLed for threads > preempted inside of syscalls. Let's see if I understand your concern correctly. Scenario: A thread is within a rseq critical section. It exits the critical section without clearing the rseq_cs pointer, expecting the kernel to lazily clear the rseq_cs pointer eventually when it detects that it's not nested on top of the userspace critical section anymore. It then calls a system call _outside_ of the rseq critical section, but with rseq_cs pointer set. Based on the rseq man page wording, it would then expect the preemption within the system call to guarantee clearing that pointer. Here is the relevant comment block in the man page: Updated by user-space, which sets the address of the cur‐ rently active rseq_cs at the beginning of assembly instruc‐ tion sequence block, and set to NULL by the kernel when it restarts an assembly instruction sequence block, as well as >>>>>>>>> when the kernel detects that it is preempting or delivering a signal outside of the range targeted by the rseq_cs. >>>>>>>>> ^^^ this The whole point about lazy-clearing of rseq_cs is that it _may_ happen when the kernel preempts or delivers a signal (or at any point really), but it's just an optimization.
Updating the manual page with this wording would match the intent: Updated by user-space, which sets the address of the cur‐ rently active rseq_cs at the beginning of assembly instruc‐ tion sequence block, and set to NULL by the kernel when it restarts an assembly instruction sequence block. May be set to NULL by the kernel when it detects that the current instruction pointer is outside of the range targeted by the rseq_cs. Also needs to be set to NULL by user-space before reclaim‐ ing memory that contains the targeted struct rseq_cs. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 33+ messages in thread
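A short sketch of what that wording means for user code follows; the helper names are hypothetical and this is not librseq API.

/*
 * User-space side of the lazy rseq_cs contract: leaving rseq_cs set after
 * a completed critical section is fine, but clearing it before the
 * descriptor memory is reclaimed is the application's job, because the
 * kernel's NULLing is only opportunistic.
 */
#include <linux/rseq.h>
#include <stdint.h>

static struct rseq *rs;		/* set to this thread's registered rseq area */

static struct rseq_cs my_cs;	/* start_ip, post_commit_offset and abort_ip
				 * would be filled in by the real assembly
				 * critical section */

static inline void cs_enter(void)
{
	__atomic_store_n(&rs->rseq_cs, (uint64_t)(uintptr_t)&my_cs, __ATOMIC_RELAXED);
}

static inline void cs_exit(void)
{
	/*
	 * Intentionally empty: rs->rseq_cs may keep pointing at my_cs. The
	 * kernel MAY lazily clear it when it observes the IP outside
	 * [start_ip, start_ip + post_commit_offset), but per the wording
	 * above that is an optimization, not a guarantee.
	 */
}

static inline void cs_descriptor_going_away(void)
{
	/*
	 * Mandatory: clear the pointer before the memory backing my_cs is
	 * reclaimed. This covers the case raised above, where a thread that
	 * is only ever preempted inside syscalls never gets its rseq_cs
	 * cleared by the kernel.
	 */
	__atomic_store_n(&rs->rseq_cs, 0, __ATOMIC_RELAXED);
}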
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:53 ` Mathieu Desnoyers @ 2026-04-23 12:58 ` Dmitry Vyukov 0 siblings, 0 replies; 33+ messages in thread From: Dmitry Vyukov @ 2026-04-23 12:58 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On Thu, 23 Apr 2026 at 14:53, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2026-04-23 08:36, Dmitry Vyukov wrote: > > On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2026-04-23 01:53, Dmitry Vyukov wrote: > >> [...] > >>> +linux-man > >>> > >>> This part of the rseq man page needs to be fixed as well I think. The > >>> kernel no longer reliably provides clearing of rseq_cs on preemption, > >>> right? > >>> > >>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > >> > >> I'm maintaining this manual page in librseq. > >> > >>> > >>> "and set to NULL by the kernel when it restarts an assembly > >>> instruction sequence block, > >>> as well as when the kernel detects that it is preempting or delivering > >>> a signal outside of the range targeted by the rseq_cs." > >> > >> I think you got two things confused here. > >> > >> 1) There is currently a bug on arm64 where it fails to honor the > >> rseq ABI contract wrt critical section abort. AFAIU there is a > >> fix proposed for this. > >> > >> 2) Thomas relaxed the implementation of cpu_id_start field updates > >> so it only stores to the rseq area when the current cpu actually > >> changes (migration). > >> > >> So AFAIU the statement in the man page is still fine. It's just arm64 > >> that needs fixing. > > > > > > My understanding was that due to the ev->user_irq check here: > > > > +static __always_inline void rseq_sched_switch_event(struct task_struct *t) > > ... > > + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; > > + > > + if (raise) { > > + ev->sched_switch = true; > > + rseq_raise_notify_resume(t); > > + } > > > > There won't be any rseq-related processing for threads preempted in > > syscalls, which means that rseq_cs won't be NULLed for threads > > preempted inside of syscalls. > > Let's see if I understand your concern correctly. Scenario: > > A thread is within a rseq critical section. It exits the critical > section without clearing the rseq_cs pointer, expecting the kernel > to lazily clear the rseq_cs pointer eventually when it detects that > it's not nested on top of the userspace critical section anymore. > It then calls a system call _outside_ of the rseq critical section, > but with rseq_cs pointer set. Based on the rseq man page wording, > it would then expect the preemption within the system call to guarantee > clearing that that pointer. Yes, this is the scenario I had in mind. > Here is the relevant comment block in the man page: > > Updated by user-space, which sets the address of the cur‐ > rently active rseq_cs at the beginning of assembly instruc‐ > tion sequence block, and set to NULL by the kernel when it > restarts an assembly instruction sequence block, as well as > >>>>>>>>> > when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs. 
> >>>>>>>>> > ^^^ this > > The whole point about lazy-clearing of rseq_cs is that it _may_ happen when > the kernel preempts or delivers a signal (or at any point really), but it's > just an optimization. > > Updating the manual page with this wording would match the intent: > > Updated by user-space, which sets the address of the cur‐ > rently active rseq_cs at the beginning of assembly instruc‐ > tion sequence block, and set to NULL by the kernel when it > restarts an assembly instruction sequence block. May be set > to NULL by the kernel when it detects that the current > instruction pointer is outside of the range targeted by > the rseq_cs. > Also needs to be set to NULL by user-space before reclaim‐ > ing memory that contains the targeted struct rseq_cs. > > Thoughts ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com ^ permalink raw reply [flat|nested] 33+ messages in thread
Thread overview: 33+ messages
[not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com>
[not found] ` <aejCaG6n9s7ak5TO@J2N7QTR9R3.cambridge.arm.com>
[not found] ` <87zf2u28d1.ffs@tglx>
[not found] ` <aekPXvvuKHKlETjm@J2N7QTR9R3.cambridge.arm.com>
[not found] ` <87wlxy22x7.ffs@tglx>
[not found] ` <c5331cd6-76c8-430d-978e-fcad164e48f6@huawei.com>
2026-04-23 5:53 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Dmitry Vyukov
2026-04-23 10:39 ` Thomas Gleixner
2026-04-23 10:51 ` Mathias Stearn
2026-04-23 12:24 ` David Laight
2026-04-23 19:31 ` Thomas Gleixner
2026-04-24 7:56 ` Dmitry Vyukov
2026-04-24 8:32 ` Mathias Stearn
2026-04-24 9:30 ` Dmitry Vyukov
2026-04-24 14:16 ` Thomas Gleixner
2026-04-24 15:03 ` Peter Zijlstra
2026-04-24 19:44 ` Thomas Gleixner
2026-04-26 22:04 ` Thomas Gleixner
2026-04-27 7:40 ` Florian Weimer
2026-04-27 11:03 ` Thomas Gleixner
2026-04-27 18:35 ` Mathieu Desnoyers
2026-04-27 21:06 ` Thomas Gleixner
2026-04-28 6:11 ` Dmitry Vyukov
2026-04-28 8:07 ` Thomas Gleixner
2026-04-28 8:18 ` Thomas Gleixner
2026-04-28 10:53 ` Dmitry Vyukov
2026-04-28 13:31 ` Mathias Stearn
2026-04-28 15:46 ` Thomas Gleixner
2026-04-28 7:39 ` Peter Zijlstra
2026-04-28 8:13 ` Peter Zijlstra
2026-04-28 8:51 ` Thomas Gleixner
2026-04-28 8:03 ` Peter Zijlstra
2026-04-28 8:36 ` Thomas Gleixner
2026-04-23 12:11 ` Alejandro Colomar
2026-04-23 12:54 ` Mathieu Desnoyers
2026-04-23 12:29 ` Mathieu Desnoyers
2026-04-23 12:36 ` Dmitry Vyukov
2026-04-23 12:53 ` Mathieu Desnoyers
2026-04-23 12:58 ` Dmitry Vyukov