* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com> @ 2026-04-22 12:56 ` Peter Zijlstra 2026-04-22 13:13 ` Peter Zijlstra 2026-04-22 13:09 ` Mark Rutland 2026-04-24 16:45 ` [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) " Mark Rutland 2 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2026-04-22 12:56 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: > Additionally, it breaks tcmalloc specifically by failing to overwrite > the cpu_id_start field at points where it was relied on for > correctness. This specific behaviour was documented as being wrong and running with DEBUG_RSEQ would have flagged it. The tcmalloc issue has been contentious for a long time. The tcmalloc folks relied on something that was documented to be wrong. It has been reported to the tcmalloc people many years ago and if you were to run tcmalloc on most any kernel (very much including 6.19) with DEBUG_RSEQ=y, it would have yelled. The tcmalloc people didn't care. There was a proposal for an RSEQ extension for what they need, and they didn't care. All this should be in their bugzilla or whatever. The RSEQ rework improved performance significantly for everyone, and kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed over because they relied on implementation behaviour that was specifically documented to be broken. And they didn't care. Google was very much aware of this. And hasn't lifted a finger to remedy it. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra @ 2026-04-22 13:13 ` Peter Zijlstra 2026-04-23 10:38 ` Mathias Stearn [not found] ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com> 0 siblings, 2 replies; 41+ messages in thread From: Peter Zijlstra @ 2026-04-22 13:13 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Wed, Apr 22, 2026 at 02:56:47PM +0200, Peter Zijlstra wrote: > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: > > > Additionally, it breaks tcmalloc specifically by failing to overwrite > > the cpu_id_start field at points where it was relied on for > > correctness. > > This specific behaviour was documented as being wrong and running with > DEBUG_RSEQ would have flagged it. > > The tcmalloc issue has been contentious for a long time. The tcmalloc > folks relied on something that was documented to be wrong. It has been > reported to the tcmalloc people many years ago and if you were to run > tcmalloc on most any kernel (very much including 6.19) with > DEBUG_RSEQ=y, it would have yelled. > > The tcmalloc people didn't care. There was a proposal for an RSEQ > extension for what they need, and they didn't care. All this should be > in their bugzilla or whatever. > > The RSEQ rework improved performance significantly for everyone, and > kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed > over because they relied on implementation behaviour that was > specifically documented to be broken. And they didn't care. Google was > very much aware of this. And hasn't lifted a finger to remedy it. 
Also: https://lore.kernel.org/all/874io5andc.ffs@tglx/ ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 13:13 ` Peter Zijlstra @ 2026-04-23 10:38 ` Mathias Stearn [not found] ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com> 1 sibling, 0 replies; 41+ messages in thread From: Mathias Stearn @ 2026-04-23 10:38 UTC (permalink / raw) To: Peter Zijlstra Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Wed, Apr 22, 2026 at 3:13 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Apr 22, 2026 at 02:56:47PM +0200, Peter Zijlstra wrote: > > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: > > > > > Additionally, it breaks tcmalloc specifically by failing to overwrite > > > the cpu_id_start field at points where it was relied on for > > > correctness. > > > > This specific behaviour was documented as being wrong and running with > > DEBUG_RSEQ would have flagged it. > > > > The tcmalloc issue has been contentious for a long time. The tcmalloc > > folks relied on something that was documented to be wrong. It has been > > reported to the tcmalloc people many years ago and if you were to run > > tcmalloc on most any kernel (very much including 6.19) with > > DEBUG_RSEQ=y, it would have yelled. > > > > The tcmalloc people didn't care. There was a proposal for an RSEQ > > extension for what they need, and they didn't care. All this should be > > in their bugzilla or whatever. > > > > The RSEQ rework improved performance significantly for everyone, and > > kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed > > over because they relied on implementation behaviour that was > > specifically documented to be broken. And they didn't care. Google was > > very much aware of this. And hasn't lifted a finger to remedy it. 
> > Also: https://lore.kernel.org/all/874io5andc.ffs@tglx/

(Sorry for the resend to folks who got this already. I got an alert that
it was rejected by the mailing lists because it contained HTML, so I am
resending it as plain text.)

I won't claim that tcmalloc _should_ be abusing cpu_id_start as it is. I
agree that it seems questionable at best. However, I strongly disagree
with the following comment in that message:

> What it not longer does is updating the
> CPU number for the preemption case on the same CPU
> because that's just a massive waste of CPU cycles.

I don't think it will cost _any_ cycles to implement what I proposed. In
particular, it should have no impact when rseq is merely enabled on a
thread, as glibc now does. It should only result in different instructions
being executed when the program actually _uses_ rseq by setting the rseq_cs
variable to a non-null pointer. I will repeat the proposal with a bit more
commentary in case you missed some of the details that make it free:

Any time a critical section might be aborted (migration, preemption, signal
delivery, and membarrier IPI), the kernel _already_ must check the rseq_cs
field to see if the thread is in a critical section [and if it is null
because the program isn't using rseq critical sections, no further action
is taken]. The kernel is documented as nulling the pointer afterwards (I
assume to make later checks cheaper) [if this changed, then it *is* a
change in _documented behavior_, not just an implementation detail]. It
would be sufficient for tcmalloc's internal usage if every time the kernel
nulled out rseq_cs, it also wrote the cpu id to cpu_id_start. [This is one
additional store to a cacheline you are already writing to, so it should be
~free on modern OoO CPUs and cheap on others. There might be a small cost
to loading the current cpu, but since nothing depends on that other than
the store, I still expect it to be ~free.]

To make this more concrete, I am proposing adding

    unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);

after each place where you currently do

    unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);

in rseq_update_user_cs. Is that something that you would expect to cause a
performance issue?

Again, I'm not claiming that it is "good" that this needs to be done. But
it does seem like a small price to pay to keep existing binaries working on
new kernels. Quoting the first paragraph of
https://docs.kernel.org/admin-guide/reporting-regressions.html:

> “We don’t cause regressions” is the first rule of Linux kernel development; Linux founder and lead developer Linus Torvalds established it himself and ensures it’s obeyed.

I don't see anything on that page that says it doesn't count as a
regression if the userspace program "relied on implementation behaviour
that was specifically documented to be broken".

^ permalink raw reply [flat|nested] 41+ messages in thread
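Editorial sketch of the proposal above, as a plain userspace model rather than kernel code. The struct and function names (`rseq_model`, `model_abort_cs`) are hypothetical and exist only for illustration; the real change would be the `unsafe_put_user()` line quoted in the message. The sketch assumes only what the message states: the extra store happens exactly where rseq_cs is cleared, and nothing happens when rseq_cs is already null.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the three struct rseq fields discussed. */
struct rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* non-zero while a critical section is armed */
};

/*
 * Models the abort path described in the message (not the kernel's code).
 * Returns 1 if a critical section was torn down, 0 if there was none.
 */
static int model_abort_cs(struct rseq_model *rs, uint32_t current_cpu)
{
	if (rs->rseq_cs == 0)
		return 0;	/* rseq registered but unused: no stores at all */

	rs->rseq_cs = 0;		/* the store the message says already happens */
	rs->cpu_id_start = current_cpu;	/* the one additional store being proposed */
	return 1;
}
```

In this model a thread that never arms a critical section sees no extra work, which is the property the message relies on when it calls the change "~free". Whether that holds in the real kernel path is what the thread goes on to test.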
[parent not found: <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>]
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere [not found] ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com> @ 2026-04-23 11:48 ` Thomas Gleixner 2026-04-23 12:11 ` Mathias Stearn 0 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 11:48 UTC (permalink / raw) To: Mathias Stearn, Peter Zijlstra Cc: Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler

On Thu, Apr 23 2026 at 11:24, Mathias Stearn wrote:
> On Wed, Apr 22, 2026 at 3:13 PM Peter Zijlstra <peterz@infradead.org> wrote:
> To make this more concrete, I am proposing adding
>
>     unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);
>
> after each place where you currently do
>
>     unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
>
> in rseq_update_user_cs. Is that something that you would expect to cause a
> performance issue?

That would work and not bring the performance issues back, but:

  1) Did you validate that adding the reset into rseq_update_user_cs() is
     actually sufficient?

     If adding it to rseq_update_user_cs() is not sufficient, then we have
     a really serious problem. Because we'd need to go back and do it
     unconditionally, which then makes the 15% performance regression,
     which happened when glibc enabled rseq, come back instantaneously.
     And in that case the damage for tcmalloc() is the lesser of two evils.

  2) The tcmalloc abuse breaks the documented and guaranteed user space
     ABI and therefore it makes it impossible for any other library in an
     application which uses tcmalloc to rely on the documented and
     guaranteed rseq::cpu_id_start/rseq::cpu_id semantics.

     Which means, that tcmalloc is holding everybody else hostage.
     That's just not acceptable. Not even under the no regression rule.

  3) The fact that tcmalloc prevents a user from enabling rseq debugging
     is equally unacceptable as it does not allow me to validate my own
     rseq magic code in my mongodb client because enabling it will make
     the DB I want to test against go away.

     Again tcmalloc holds everybody else hostage for no reason at all.

The most amazing part is that tcmalloc uses this to spare two
instruction cycles, but nobody noticed in 8 years how much performance
the unconditional rseq nonsense in the kernel left on the table.

Thanks,

        tglx

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 11:48 ` Thomas Gleixner @ 2026-04-23 12:11 ` Mathias Stearn 2026-04-23 17:19 ` Thomas Gleixner 0 siblings, 1 reply; 41+ messages in thread From: Mathias Stearn @ 2026-04-23 12:11 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Thu, Apr 23, 2026 at 1:48 PM Thomas Gleixner <tglx@linutronix.de> wrote: > That would work and not bring the performance issues back, but: > > 1) Did you validate that adding the reset into rseq_update_user_cs() is > actually sufficient? Not yet, although I confirmed with the tcmalloc maintainers that they thought it would be sufficient before suggesting it. I'm currently building your patch from upthread to test that out. I can try this after, although I don't think I'll be able to get to that today. I'll try to get a coworker to test it though. > Which means, that tcmalloc is holding everybody else hostage. > That's just not acceptable. Not even under the no regression rule. Agree. I don't love the situation either. Or that we need to advise setting the environment variable to tell glibc not to use rseq. But I also want our users to be able to use existing mongo binaries on new kernels. > 3) The fact that tcmalloc prevents a user from enabling rseq debugging > is equally unacceptable as it does not allow me to validate my own > rseq magic code in my mongodb client because enabling it will make > the DB I want to test against go away. Glad to hear you use mongodb :) > The most amazing part is that tcmalloc uses this to spare two > instruction cycles, but nobody noticed in 8 years how much performance > the unconditional rseq nonsense in the kernel left on the table. 
I am looking into a change to our copy of tcmalloc to have it stop squatting on cpu_id_start, and will run that through our correctness and performance tests. I can't promise anything (and I certainly can't speak for what Google may choose to do), but I share your expectation that it should be possible with minimal impact. It _is_ more than 2 cycles though, since it extends the load dependency chain by one or two pointer chases and a bit of ALU ops. I'd guesstimate it will likely cost on the order of 5-10 cycles per call to malloc or free. I think we can absorb that, but will need to test. Of course, even if we make that change, it will only apply to _future_ binaries. That's why we prefer a kernel fix so that users will be able to run our existing releases (or any containers that use them) on a modern kernel. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:11 ` Mathias Stearn @ 2026-04-23 17:19 ` Thomas Gleixner 2026-04-23 17:38 ` Chris Kennelly 2026-04-23 17:41 ` Linus Torvalds 0 siblings, 2 replies; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 17:19 UTC (permalink / raw) To: Mathias Stearn Cc: Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler, Linus Torvalds

On Thu, Apr 23 2026 at 14:11, Mathias Stearn wrote:

Cc+ Linus

> Of course, even if we make that change, it will only apply to _future_
> binaries. That's why we prefer a kernel fix so that users will be able
> to run our existing releases (or any containers that use them) on a
> modern kernel.

I understand that and, like everyone else, I would be happy to do that, but
the price everyone pays for proliferating the tcmalloc insanity is not
cheap either.

So let me recap the whole situation and how we got there:

  1) The original RSEQ implementation updates the rseq::cpu_id_start
     field in user space more or less unconditionally on every exit to
     user, whether the CPU/MMCID have been changed or not.

     That went unnoticed for years because nothing used rseq aside from
     google and tcmalloc. Once glibc registered rseq, this resulted in an
     up to 15% performance penalty for syscall heavy workloads.

  2) The rseq::cpu_id_start field is documented as read only for user
     space in the ABI contract and guaranteed to be updated by the
     kernel when a task is migrated to a different CPU.

  3) The RO for userspace property has been enforced by RSEQ debugging
     mode since day one. If such a debug enabled kernel detects user
     space changing the field it kills the task/application.

  4) tcmalloc abused the suboptimal implementation (see #1) and
     scribbled over rseq::cpu_id_start for their own nefarious purposes.

  5) As a consequence of #4 tcmalloc cannot be used on an RSEQ debug
     enabled kernel. Which means a developer cannot validate his RSEQ
     code against a debug kernel when tcmalloc is in use on the system
     as that would crash the tcmalloc dependent applications due to #3.

  6) As a consequence of #4 tcmalloc cannot be used together with any
     other facility/library which wants to utilize the ABI guaranteed
     properties of rseq::cpu_id_start in the same application.

  7) tcmalloc violates the ABI from day one and has since refused to
     address the problem despite being offered a kernel side rseq
     extension to solve it many years ago.

  8) When addressing the performance issues of RSEQ the unconditional
     update ceased to exist under the valid assumption that the kernel
     has only to satisfy the guaranteed ABI properties, especially when
     they are enforceable by RSEQ debug.

     As a consequence this exposed the tcmalloc ABI violation because
     the unconditional pointless overwriting of something which did not
     change stopped happening.

Due to #4 everyone is in a hard place and up a creek without a paddle.

Here are the possible solutions:

  A) Mathias suggested to force overwrite rseq::cpu_id_start every time
     the rseq::rseq_cs field is cleared by the kernel under the not yet
     validated theoretical assumption that this cures the problem for
     tcmalloc.

     If that's sufficient that would be harmless performance wise
     because the write would be inside the already existing STAC/CLAC
     section and just add some more noise to the rseq critical section
     operations.

     That would allow existing tcmalloc usage to continue, but
     obviously would solve neither #5 nor #6 above, nor provide an
     incentive for tcmalloc to actually fix their crap.

  B) If that's not sufficient then keeping tcmalloc alive would require
     to go back to the previous state and let everyone else pay the
     price in terms of performance overhead.

  C) Declare that this is not a regression because the ABI guarantee is
     not violated and the RO property has been enforceable by RSEQ
     debugging since day one.

In my opinion #C is the right thing to do, but I can see a case being
made for the lightweight fix Mathias suggested (#A) _if_ and only _if_
that is sufficient. Picking #A would also mean that user space people
have to take up the fight against tcmalloc when they want to use the
RSEQ guaranteed ABI along with tcmalloc in the same application or use an
RSEQ debug kernel to validate their own code.

Going back to the full unconditional nightmare (#B) is not an option at
all as everybody else would have to take the massive performance hit.

Oh well...

Thanks,

        tglx

^ permalink raw reply [flat|nested] 41+ messages in thread
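Editorial sketch contrasting points 1 and 8 of the recap above, again as a userspace model and not kernel code. All names (`rseq_fields`, `exit_to_user_unconditional`, `exit_to_user_on_change`) are hypothetical, and the model tracks only the CPU number, whereas the recap also mentions the MMCID. The marker value in the usage note is an arbitrary stand-in for "something userspace scribbled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two CPU fields of struct rseq. */
struct rseq_fields {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Point 1: rewrite both fields on every exit to user, changed or not. */
static void exit_to_user_unconditional(struct rseq_fields *rs, uint32_t cpu)
{
	rs->cpu_id_start = cpu;
	rs->cpu_id = cpu;
}

/*
 * Point 8: rewrite only when the task actually moved to another CPU.
 * Returns true if the stores happened.
 */
static bool exit_to_user_on_change(struct rseq_fields *rs,
				   uint32_t prev_cpu, uint32_t cpu)
{
	if (prev_cpu == cpu)
		return false;	/* same CPU: no stores, so a scribble survives */

	rs->cpu_id_start = cpu;
	rs->cpu_id = cpu;
	return true;
}
```

If userspace writes a marker such as 0x80000000 into cpu_id_start and the task is preempted and resumed on the same CPU, the first function erases the marker and the second leaves it in place. That difference is the behaviour change the recap attributes the tcmalloc breakage to.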
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:19 ` Thomas Gleixner @ 2026-04-23 17:38 ` Chris Kennelly 2026-04-23 17:47 ` Mathieu Desnoyers 2026-04-23 19:39 ` Thomas Gleixner 2026-04-23 17:41 ` Linus Torvalds 1 sibling, 2 replies; 41+ messages in thread From: Chris Kennelly @ 2026-04-23 17:38 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler, Linus Torvalds On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote: > > On Thu, Apr 23 2026 at 14:11, Mathias Stearn wrote: > > Cc+ Linus > > > Of course, even if we make that change, it will only apply to _future_ > > binaries. That's why we prefer a kernel fix so that users will be able > > to run our existing releases (or any containers that use them) on a > > modern kernel. > > I understand that and as everyone else I would be happy to do that, but > the price everyone pays for proliferating the tcmalloc insanity is not > cheap either. > > So let me recap the whole situation and how we got there: > > 1) The original RSEQ implementation updates the rseq::cpu_id_start > field in user space more or less unconditionally on every exit to > user, whether the CPU/MMCID have been changed or not. > > That went unnoticed for years because nothing used rseq aside of > google and tcmalloc. Once glibc registered rseq, this resulted in a > up to 15% performance penalty for syscall heavy workloads. > > 2) The rseq::cpu_id_start field is documented as read only for user > space in the ABI contract and guaranteed to be updated by the > kernel when a task is migrated to a different CPU. > > 3) The RO for userspace property has been enforced by RSEQ debugging > mode since day one. 
If such a debug enabled kernel detects user > space changing the field it kills the task/application. The optimization in TCMalloc that you're describing has been available since September 2023: https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9 I thought the RSEQ debug checks were added in December 2024: https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd, but perhaps I misidentified the ones in question. > > 4) tcmalloc abused the suboptimal implementation (see #1) and > scribbled over rseq::cpu_id_start for their own nefarious purposes. > > 5) As a consequence of #4 tcmalloc cannot be used on a RSEQ debug > enabled kernel. Which means a developer cannot validate his RSEQ > code against a debug kernel when tcmalloc is in use on the system > as that would crash the tcmalloc dependent applications due to #3. > > 6) As a consequence of #4 tcmalloc cannot be used together with any > other facility/library which wants to utilize the ABI guaranteed > properties of rseq::cpu_id_start in the same application. > > 7) tcmalloc violates the ABI from day one and has since refused to > address the problem despite being offered a kernel side rseq > extension to solve it many years ago. I know there was some discussion around a preemption notification scheme, rseq_sched_state; but I thought the discussion moved in favor of the timeslice extension interface that recently landed. Timeslice extension solves some use cases, but I'm not sure it addresses this one. > > 8) When addressing the performance issues of RSEQ the unconditional > update stopped to exist under the valid assumption that the kernel > has only to satisfy the guaranteed ABI properties, especially when > they are enforcable by RSEQ debug. > > As a consequence this exposed the tcmalloc ABI violation because > the unconditional pointless overwriting of something which did not > change stopped to happen. 
> > Due to #4 everyone is in a hard place and up a creek without a paddle. > > Here are the possible solutions: > > A) Mathias suggested to force overwrite rseq:cpu_id_start everytime > the rseq::rseq_cs field is cleared by the kernel under the not yet > validated theoretical assumption that this cures the problem for > tcmalloc. > > If that's sufficient that would be harmless performance wise > because the write would be inside the already existing STAC/CLAC > section and just add some more noise to the rseq critical section > operations. > > That would allow existing tcmalloc usage to continue, but > obviously would neither solve #5 and #6 above nor provide an > incentive for tcmalloc to actually fix their crap. > > B) If that's not sufficient then keeping tcmalloc alive would require > to go back to the previous state and let everyone else pay the > price in terms of performance overhead. > > C) Declare that this is not a regression because the ABI guarantee is > not violated and the RO property has been enforcable by RSEQ > debugging since day one. > > In my opinion #C is the right thing to do, but I can see a case being > made for the lightweight fix Mathias suggested (#A) _if_ and only _if_ > that is sufficient. Picking #A would also mean that user space people > have to take up the fight against tcmalloc when they want to use the > RSEQ guaranteed ABI along with tcmalloc in the same application or use a > RSEQ debug kernel to validate their own code. > > Going back to the full unconditional nightmare (#B) is not an option at > all as anybody else has to take the massive performance hit. > > Oh well... > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:38 ` Chris Kennelly @ 2026-04-23 17:47 ` Mathieu Desnoyers 2026-04-23 19:39 ` Thomas Gleixner 1 sibling, 0 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 17:47 UTC (permalink / raw) To: Chris Kennelly, Thomas Gleixner Cc: Mathias Stearn, Peter Zijlstra, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler, Linus Torvalds On 2026-04-23 13:38, Chris Kennelly wrote: > On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote: [...] >> >> 3) The RO for userspace property has been enforced by RSEQ debugging >> mode since day one. If such a debug enabled kernel detects user >> space changing the field it kills the task/application. > > The optimization in TCMalloc that you're describing has been available > since September 2023: > https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9 > > I thought the RSEQ debug checks were added in December 2024: > https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd, > but perhaps I misidentified the ones in question. You are correct, I added the RSEQ field corruption validation under debug config in Nov. 2024 when I noticed the world of pain we were heading towards with incompatible tcmalloc vs glibc (and general) use due to tcmalloc not respecting the ABI contract. RSEQ has been upstreamed in 2018. So that's not exactly a day one enforcement. The ABI contract was clear about this being an invalid use from day one though. [...] >> 7) tcmalloc violates the ABI from day one and has since refused to >> address the problem despite being offered a kernel side rseq >> extension to solve it many years ago. 
> > I know there was some discussion around a preemption notification > scheme, rseq_sched_state; but I thought the discussion moved in favor > of the timeslice extension interface that recently landed. Timeslice > extension solves some use cases, but I'm not sure it addresses this > one. I have actively engaged with the tcmalloc developers to understand their needs and figure out a proper solution for the past ~3-4 years, without success. I have done a POC branch extending rseq with a "reset a linked list of userspace areas on preemption" back in 2024 which would have solved tcmalloc's issues cleanly. I never posted it publicly because the tcmalloc devs told me they could not justify spending time even trying this out to their managers. I still have that feature branch gathering dust somewhere. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:38 ` Chris Kennelly 2026-04-23 17:47 ` Mathieu Desnoyers @ 2026-04-23 19:39 ` Thomas Gleixner 1 sibling, 0 replies; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 19:39 UTC (permalink / raw) To: Chris Kennelly Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler, Linus Torvalds

On Thu, Apr 23 2026 at 13:38, Chris Kennelly wrote:
> On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote:
>> 3) The RO for userspace property has been enforced by RSEQ debugging
>> mode since day one. If such a debug enabled kernel detects user
>> space changing the field it kills the task/application.
>
> The optimization in TCMalloc that you're describing has been available
> since September 2023:
> https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9

And the github issue which requested glibc compatibility was opened in
Sept. 2022:

    https://github.com/google/tcmalloc/issues/144

> I thought the RSEQ debug checks were added in December 2024:
> https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd,
> but perhaps I misidentified the ones in question.

I might have misread the git log. But that still does not justify the
violation of a documented ABI for the price that nobody else can use it
once tcmalloc is in play:

    x = tcmalloc();
    dostuff(x)
    evaluate(rseq::cpu_id_start, rseq::cpu_id)  <- FAIL

>> 7) tcmalloc violates the ABI from day one and has since refused to
>> address the problem despite being offered a kernel side rseq
>> extension to solve it many years ago.
>
> I know there was some discussion around a preemption notification
> scheme, rseq_sched_state; but I thought the discussion moved in favor
> of the timeslice extension interface that recently landed. Timeslice
> extension solves some use cases, but I'm not sure it addresses this
> one.

No it does not. That's an orthogonal optimization.

Thanks,

        tglx

^ permalink raw reply [flat|nested] 41+ messages in thread
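Editorial sketch of the read-only check discussed in this exchange, as a userspace model. It assumes the check works by comparing the user-visible fields against a copy of what the kernel last wrote; the thread does not show the kernel's implementation, so the names here (`rseq_user_view`, `rseq_kernel_shadow`, `shadow_publish`, `ro_fields_intact`) and the comparison itself are illustrative assumptions only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What userspace sees (hypothetical stand-in for the struct rseq CPU fields). */
struct rseq_user_view {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Hypothetical kernel-side record of the values it last published. */
struct rseq_kernel_shadow {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Kernel publishes a CPU number and remembers what it wrote. */
static void shadow_publish(struct rseq_user_view *u,
			   struct rseq_kernel_shadow *k, uint32_t cpu)
{
	k->cpu_id_start = cpu;
	k->cpu_id = cpu;
	u->cpu_id_start = cpu;
	u->cpu_id = cpu;
}

/* True while the read-only fields still hold what the kernel wrote. */
static bool ro_fields_intact(const struct rseq_user_view *u,
			     const struct rseq_kernel_shadow *k)
{
	return u->cpu_id_start == k->cpu_id_start && u->cpu_id == k->cpu_id;
}
```

The same comparison also stands in for the `evaluate(rseq::cpu_id_start, rseq::cpu_id)` step of the example in the message: once one component of a process has overwritten cpu_id_start, any other reader of the field sees a value the kernel never wrote.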
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:19 ` Thomas Gleixner 2026-04-23 17:38 ` Chris Kennelly @ 2026-04-23 17:41 ` Linus Torvalds 2026-04-23 18:35 ` Mathias Stearn ` (2 more replies) 1 sibling, 3 replies; 41+ messages in thread From: Linus Torvalds @ 2026-04-23 17:41 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler

On Thu, 23 Apr 2026 at 10:19, Thomas Gleixner <tglx@kernel.org> wrote:
>
>  C) Declare that this is not a regression because the ABI guarantee is
>     not violated and the RO property has been enforcable by RSEQ
>     debugging since day one.

No, if this actually hits real users, that is not an option.

If real users never used RSEQ debugging options, those options are
simply irrelevant.

Regression rules have never been about "it wouldn't have worked in
some other configuration". That's like saying "that code would never
have worked on another architecture". It may be true, but it's
irrelevant for the people whose binaries no longer work.

We will have to fix this. This is not some kind of gray area. It
clearly violates our regression rules.

The only "ABI guarantee" is what people actually see and use, not some
debug option that wasn't enabled.

And I just checked - it's not enabled in at least the Fedora distro
kernels. Presumably other distros don't enable it either. So no actual
non-kernel developer would *ever* have hit it, and claiming it is
relevant is just garbage.

IOW, that debug option was always a complete no-op except for kernel
developers.

In fact, that debug option is actively *hidden* - you have to enable
EXPERT to even see it. So it really is not a real option for normal
people AT ALL.

Christ, even *I* don't enable EXPERT except for build testing. It's
literally something that only embedded people doing odd things should
do. If that rule was actually an important part of the ABI, it
shouldn't have been a debug thing.

So:

 (a) the debug code in question needs to just be removed, since it's
     now actively detrimental, and means that any kernel developer who
     *does* enable it can't actually test this case any more. It's
     checking for something that has been shown to not be true.

 (b) we need to fix this (revert if it can't be fixed otherwise)

I see some patches flying around, but am not clear on whether there
was an actual patch that makes this work again?

              Linus

^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:41 ` Linus Torvalds @ 2026-04-23 18:35 ` Mathias Stearn 2026-04-23 18:53 ` Mark Rutland 2026-04-23 21:03 ` Thomas Gleixner 2 siblings, 0 replies; 41+ messages in thread From: Mathias Stearn @ 2026-04-23 18:35 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Thu, Apr 23, 2026 at 7:48 PM Linus Torvalds <torvalds@linux-foundation.org> wrote: > I see some patches flying around, but am not clear on whether there > was an actual patch that make this work again? Thomas's patch from upthread appears in initial testing to address the arm64 preemption breakage. Thanks! I'm currently building with the following patch on top of that and will test it once it is ready. --- diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h index a36b472627de..e26bf249bbd8 100644 --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -300,12 +300,15 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c /* Invalidate the critical section */ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + /* TCMalloc kludge - it relies on cpu_id_start being overwritten */ + unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault); /* Update the instruction pointer */ instruction_pointer_set(regs, (unsigned long)abort_ip); rseq_stat_inc(rseq_stats.fixup); break; clear: unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault); + unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault); rseq_stat_inc(rseq_stats.clear); abort_ip = 0ULL; } ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:41 ` Linus Torvalds 2026-04-23 18:35 ` Mathias Stearn @ 2026-04-23 18:53 ` Mark Rutland 2026-04-23 21:03 ` Thomas Gleixner 2 siblings, 0 replies; 41+ messages in thread From: Mark Rutland @ 2026-04-23 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Jinjie Ruan, Blake Oler On Thu, Apr 23, 2026 at 10:41:02AM -0700, Linus Torvalds wrote: > On Thu, 23 Apr 2026 at 10:19, Thomas Gleixner <tglx@kernel.org> wrote: > I see some patches flying around, but am not clear on whether there > was an actual patch that make this work again? There's not a patch yet. The diffs sent so far were options for fixing the arm64-specific issue (missing aborts on preemption), NOT the generic issue (missing clobbering of cpu_id_start that tcmalloc was depending upon). For the arm64 issue, I think we can have a fix tomorrow (as it's end of day here in the UK). Now that I've pored over the entry code and the rseq code, I think a variant of one of Thomas's proposed fixes will work, but I'd like to make the naming/layering crystal clear so that it's harder to break this by accident in future. For the generic issue, hopefully the option Mathias proposed (clearing cpu_id_start when rseq_cs is cleared) is sufficient. I'll work with Mathias and Thomas for that. I've also poked folk to make sure that CI systems run the rseq selftests (which they evidently weren't), so that we catch this sort of thing earlier. Mark. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 17:41 ` Linus Torvalds 2026-04-23 18:35 ` Mathias Stearn 2026-04-23 18:53 ` Mark Rutland @ 2026-04-23 21:03 ` Thomas Gleixner 2026-04-23 21:28 ` Linus Torvalds 2 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 21:03 UTC (permalink / raw) To: Linus Torvalds Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Thu, Apr 23 2026 at 10:41, Linus Torvalds wrote: > If that rule was actually an important part of the ABI, it shouldn't > have been a debug thing. It's a debug thing because it's too expensive to be enabled by default. And it's actually valuable for validating RSEQ critical section ABI correctness as they can't be single stepped with a debugger as the break point interruption would immediately canceled. > So: > > (a) the debug code in question needs to just be removed, since it's > now actively detrimental, and means that any kernel developer who > *does* enable it can't actually test this case any more. It's checking > for something that has been shown to not be true. > > (b) we need to fix this (revert if it can't be fixed otherwise) > > I see some patches flying around, but am not clear on whether there > was an actual patch that make this work again? There are two issues: 1) ARM64 On ARM64 RSEQ got broken completely with the partial move to the generic entry code. There are patches flying around which "fix" it and Mark is working on a more complete solution as there are other subtle issues with that aside of the obvious RSEQ wreckage. The latter could have been detected with the existing RSEQ selftests if any CI would actually run them on -next. That's uninteresting and unrelated to the tcmalloc issue. 
It's just a boring bug which will be fixed in the next couple of days. 2) The tcmalloc problem That has been a known problem for at least 6 years. tcmalloc assumes that it "owns" rseq and can do whatever it wants with it. In 2022 the glibc people requested that tcmalloc become interoperable with the reasonable expectation of glibc to utilize rseq as well: https://github.com/google/tcmalloc/issues/144 Status unresolved. That means that using tcmalloc requires telling glibc to _NOT_ use rseq and at the same time precludes any other library which wants to use it for the documented purposes. So this code sequence blows up in your face: x = tcmalloc(); dostuff(x) evaluate(rseq::cpu_id_start, rseq::cpu_id) because tcmalloc overwrites rseq::cpu_id_start and thereby breaks the ABI which evaluate() is rightfully depending on. That has absolutely nothing to do with the kernel as there is no kernel interaction between tcmalloc's abuse and the subsequent evaluation of rseq::cpu_id_start. The kernel has no way to fix that problem at all. Now back to your generally correct and agreed on "observed behaviour" rule. Feel free to enforce it, but be aware that you thereby set a precedent that a single abuser can then rightfully own a general shared interface of the kernel forever and force everybody else to give up. The tcmalloc developers actually documented that they own the world: // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. Do you seriously want to proliferate that? Thanks, tglx ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 21:03 ` Thomas Gleixner @ 2026-04-23 21:28 ` Linus Torvalds 2026-04-23 23:08 ` Linus Torvalds 2026-04-27 7:06 ` Florian Weimer 0 siblings, 2 replies; 41+ messages in thread From: Linus Torvalds @ 2026-04-23 21:28 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Thu, 23 Apr 2026 at 14:03, Thomas Gleixner <tglx@kernel.org> wrote: > > Feel free to enforce it, but be aware that you thereby set a > precedent that a single abuser can then rightfully own a general > shared interface of the kernel forever and force everybody else to > give up. That's not a new precedent. That is *literally* the rule we have always had. This is why system calls and ABI's need to have hard rules that they actually check, because if they don't, they are stuck with the semantics that people assume. And no, "documented behavior" is BS. It has absolutely no relevance. All that matters is hard harsh reality. Yes, this has led to issues before. Most new system calls have learnt their lesson, and they check for unused bits in flags etc, and error out on bits that the kernel doesn't really care about being randomly set - so that one day we *can* extend on things and start caring about them. But they do it because we've been burnt so many times before because we haven't checked those bits, and then we were forced to just live with the fact that people passed in random values. > // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. > > Do you seriously want to proliferate that? Absolutely. That's how clever hacks work - they take advantage of things past their design parameters. "If it works, it's not stupid".
We don't then turn around and say "you were clever, and we did something stupid, so now we'll hurt you". This is all 100% on the RSEQ kernel code, not on users who took advantage of it. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
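The point above about new system calls rejecting unused flag bits can be sketched in a few lines of C. This is only an illustration of the general pattern, not code from any real syscall: the DEMO_* names and demo_check_flags() are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits that the interface currently understands. */
#define DEMO_FLAG_A		(UINT32_C(1) << 0)
#define DEMO_FLAG_B		(UINT32_C(1) << 1)
#define DEMO_VALID_FLAGS	(DEMO_FLAG_A | DEMO_FLAG_B)

/*
 * Reject any bit the interface does not know about. Because callers
 * can never have passed garbage in the unused bits successfully, those
 * bits stay available for future extensions.
 */
static int demo_check_flags(uint32_t flags)
{
	if (flags & ~DEMO_VALID_FLAGS)
		return -EINVAL;
	return 0;
}
```

An interface that skips this check ends up with whatever callers happened to pass as part of its observable behaviour, which is the situation described for cpu_id_start in this thread.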
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 21:28 ` Linus Torvalds @ 2026-04-23 23:08 ` Linus Torvalds 2026-04-27 7:06 ` Florian Weimer 1 sibling, 0 replies; 41+ messages in thread From: Linus Torvalds @ 2026-04-23 23:08 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler On Thu, 23 Apr 2026 at 14:28, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > This is all 100% on the RSEQ kernel code, not on users who took advantage of it. Side note: when RSEQ was merged, the *primary* documented use case was literally user space allocators with percpu caches. That's what I was told at the time. Now I think it was jemalloc(), not tcmalloc, but it's not like tcmalloc is some odd minor use-case. We are pretty much talking about the raison d'être of the whole rseq feature, not some odd small corner case. Linus ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 21:28 ` Linus Torvalds 2026-04-23 23:08 ` Linus Torvalds @ 2026-04-27 7:06 ` Florian Weimer 1 sibling, 0 replies; 41+ messages in thread From: Florian Weimer @ 2026-04-27 7:06 UTC (permalink / raw) To: Linus Torvalds Cc: Thomas Gleixner, Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler * Linus Torvalds: >> // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. >> >> Do you seriously want to proliferate that? > > Absolutely. > > That's how clever hacks work - they take advantage of things past > their design parameters. "If it works, it's not stupid". > > We don't then turn around and say "you were clever, and we did > something stupid, so now we'll hurt you". > > This is all 100% on the RSEQ kernel code, not on users who took > advantage of it. RSEQ was intended to be modular, with more than one library using it within a process, without coordination (beyond sticking to the RSEQ protocol). The tcmalloc approach is incompatible with that. Once tcmalloc starts using RSEQ in its peculiar way, nothing else in the process can, and vice versa. This is far from ideal because the particular descheduling notification that tcmalloc uses could be implemented in a much simpler way than full RSEQ, given its non-modular nature. Thanks, Florian ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com> 2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra @ 2026-04-22 13:09 ` Mark Rutland 2026-04-22 17:49 ` Thomas Gleixner 2026-04-24 16:45 ` [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) " Mark Rutland 2 siblings, 1 reply; 41+ messages in thread From: Mark Rutland @ 2026-04-22 13:09 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Jinjie Ruan, Blake Oler Hi Mathias, On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: > TL;DR: As of 6.19, rseq no longer provides the documented atomicity > guarantees on arm64 by failing to abort the critical section on same-core > preemption/resumption. Additionally, it breaks tcmalloc specifically by > failing to overwrite the cpu_id_start field at points where it was relied > on for correctness. Thanks for the report, and the test case. As a holding reply, I'm looking into this now from the arm64 side. I'll leave it to Thomas/Peter/Mathieu to comment w.r.t. the issue you raise with cpu_id_start. For some reason, this mail didn't make it to my inbox, and I had to grab it from lore using b4. That might be a problem with my local mail server; I'm just noting that in case others also didn't receive this. Mark. > This is a SEVERE breakage for MongoDB. We received several user reports of > crashes on 6.19. I made a stress test that showed that 6.19 can cause > malloc to return the same pointer twice without it being freed. 
Because > that can cause arbitrary corruption, our latest releases have all been > patched to refuse to start at all on 6.19+. > > TCMalloc uses rseq in a "creative" way described at > https://github.com/google/tcmalloc/blob/master/docs/rseq.md. In particular, > the "Current CPU Slabs Pointer Caching" section describes an optimization > that relies on an undocumented fact that the kernel was always overwriting > cpu_id_start (even when it wouldn't change) to invalidate a user-space > cache. Since the change to stop writing cpu_id_start seemed to be > intentional as part of a refactoring merged in 2b09f480f0a1, I started > working on a userspace patch to stop relying on that. Unfortunately when > that was complete I ran into a wall that is impossible to work around from > userspace. > > On arm64, the kernel no longer meets the documented guarantee that rseq > critical sections are atomic with respect to preemption. It seems to only > abort the critical section when the thread is migrated to a different core. > The attached test proves it and passes on x86 both before and after 6.19, > and on arm before 6.19, but fails on arm with 6.19. It pins the process to > a single core and then has an rseq critical section that observes a change > made by another thread which is supposed to be impossible. I think this > will break basically any real usage of rseq, other than just reading the > current cpu_id. > > An LLM pointed to these two specific commits in the refactor as causing > this (oldest first): > - 39a167560a61 rseq: Optimize event setting > This assumed that user_irq would be set on preemption but it wasn't on > arm64, so TIF_NOTIFY_RESUME isn't raised on same cpu preemption. 
> - 566d8015f7ee rseq: Avoid CPU/MM CID updates when no event pending > This broke TCMalloc slab caching trick by not overwriting cpu_id_start on > every return to userspace > > (I have a lot more analysis and suggested fixes from LLMs since I used them > heavily in this testing and analysis, but I won't spam you with the slop > unless requested) > > The arm64 change is a clear breakage and I'm sure it will be > uncontroversial to fix. I can imagine more resistance to reverting to the > old behavior of always overwriting the cpu_id_start field since that seems > to have been an intentional optimization choice. I have reached out to the > TCMalloc maintainers (CC'd) and believe there is a solution that gets the > vast majority of the optimization while still preserving the behavior that > TCMalloc currently relies on[1]. > > Any time a critical section might be aborted (migration, preemption, signal > delivery, and membarrier IPI), the kernel already must (but doesn't on > arm64 at the moment) check the rseq_cs field to see if the thread is in a > critical section, and is documented as nulling the pointer after (I assume > to make later checks cheaper). It would be sufficient for tcmalloc's > internal usage if every time the kernel nulled out rseq_cs, it also wrote > the cpu id to cpu_id_start. That should be essentially free since you are > already writing to the same cache line. It was pointed out that that could > be an issue if another rseq user in the same thread nulled rseq_cs after > its critical section, which would require the kernel to update cpu_id_start > each time it checks rseq_cs, regardless of whether it nulls it. We aren't > aware of any processes that mix tcmalloc with other rseq usages that null > out the field from userspace, but we can't rule them out since it is open > source. 
Either way, this preserves the property of not updating > cpu_id_start on every syscall return and non-membarrier interrupts, which I > assume is where the majority of the optimization win was from. > > All testing of problematic versions was performed on x86_64 and > aarch64 Ubuntu 24.04.4 with the kernel manually upgraded to > 6.19.8-061908-generic. Source analysis was performed on the v6.19 tag. I > had a few AI agents confirm that nothing in the relevant changes to master > should have solved this, but I have not yet tested there. > > $ cat /proc/version > Linux version 6.19.8-061908-generic (kernel@balboa) > (aarch64-linux-gnu-gcc-15 (Ubuntu 15.2.0-15ubuntu1) 15.2.0, GNU ld (GNU > Binutils for Ubuntu) 2.46) #202603131837 SMP PREEMPT_DYNAMIC Sat Mar 14 > 00:00:07 UTC 2026 > > [1] There is also an exploration of some options to make tcmalloc not rely > on the cpu_id_start overwriting. However we would strongly prefer that > existing binaries continue to work on 6.19 kernels, even if newer binaries > don't need that. At least for a good while. ^ permalink raw reply [flat|nested] 41+ messages in thread
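The proposal in the report (write the CPU number to cpu_id_start whenever the kernel nulls rseq_cs) can be modelled in plain userspace C. This is a deliberately simplified sketch and not tcmalloc or kernel code: struct model_rseq is not the real UAPI layout, CACHE_VALID_BIT is a hypothetical marker, and the rewrite_cpu_id_start parameter only exists to contrast the two behaviours.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker bit; assumed never to be set in a real CPU number. */
#define CACHE_VALID_BIT	(UINT32_C(1) << 31)

/* Simplified stand-in for the user-visible rseq area (not the real layout). */
struct model_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
};

/* Userspace: clobber cpu_id_start to record "my per-CPU cache is valid". */
static void user_mark_cache_valid(struct model_rseq *rs)
{
	rs->cpu_id_start |= CACHE_VALID_BIT;
}

/* Userspace: the cache is trusted only while the marker is still there. */
static bool user_cache_is_valid(const struct model_rseq *rs)
{
	return (rs->cpu_id_start & CACHE_VALID_BIT) != 0;
}

/*
 * Modelled kernel side: invalidate the critical section. With
 * rewrite_cpu_id_start set, the CPU number is stored as well, which
 * wipes the userspace marker and so invalidates the cache.
 */
static void model_kernel_clear_cs(struct model_rseq *rs, uint32_t cpu,
				  bool rewrite_cpu_id_start)
{
	rs->rseq_cs = 0;
	if (rewrite_cpu_id_start)
		rs->cpu_id_start = cpu;
}
```

In this model, clearing rseq_cs without the rewrite leaves the marker in place, so userspace keeps trusting a cache that may be stale; with the rewrite, the marker disappears at exactly the points where a critical section could have been aborted.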
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 13:09 ` Mark Rutland @ 2026-04-22 17:49 ` Thomas Gleixner 2026-04-22 18:11 ` Mark Rutland 0 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-22 17:49 UTC (permalink / raw) To: Mark Rutland, Mathias Stearn Cc: Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Jinjie Ruan, Blake Oler On Wed, Apr 22 2026 at 14:09, Mark Rutland wrote: > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: >> TL;DR: As of 6.19, rseq no longer provides the documented atomicity >> guarantees on arm64 by failing to abort the critical section on same-core >> preemption/resumption. Additionally, it breaks tcmalloc specifically by >> failing to overwrite the cpu_id_start field at points where it was relied >> on for correctness. > > Thanks for the report, and the test case. > > As a holding reply, I'm looking into this now from the arm64 side. I assume it's the partial conversion to the generic entry code which screws that up. The problem reproduces with rseq selftests nicely. The patch below fixes it as it puts ARM64 back to the non-optimized code for now. Once ARM64 is fully converted it gets all the nice improvements. 
Thanks, tglx --- diff --git a/include/linux/rseq.h b/include/linux/rseq.h index 2266f4dc77b6..d55476e2a336 100644 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -30,7 +30,7 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs); */ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { /* '&' is intentional to spare one conditional branch */ if (current->rseq.event.has_rseq & current->rseq.event.user_irq) __rseq_signal_deliver(ksig->sig, regs); @@ -50,7 +50,7 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t) { struct rseq_event *ev = &t->rseq.event; - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { /* * Avoid a boat load of conditionals by using simple logic * to determine whether NOTIFY_RESUME needs to be raised. diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h index a36b472627de..8ccd464a108d 100644 --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -80,7 +80,7 @@ bool rseq_debug_validate_ids(struct task_struct *t); static __always_inline void rseq_note_user_irq_entry(void) { - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) current->rseq.event.user_irq = true; } @@ -171,8 +171,8 @@ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, if (unlikely(usig != t->rseq.sig)) goto die; - /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */ - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* rseq_event.user_irq is only valid if CONFIG_GENERIC_ENTRY=y */ + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { /* If not in interrupt from user context, let it die */ if (unlikely(!t->rseq.event.user_irq)) goto die; @@ -387,7 +387,7 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r * allows to skip the critical section when the entry was not from * a 
user space interrupt, unless debug mode is enabled. */ - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { if (!static_branch_unlikely(&rseq_debug_enabled)) { if (likely(!t->rseq.event.user_irq)) return true; ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 17:49 ` Thomas Gleixner @ 2026-04-22 18:11 ` Mark Rutland 2026-04-22 19:47 ` Thomas Gleixner 0 siblings, 1 reply; 41+ messages in thread From: Mark Rutland @ 2026-04-22 18:11 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Jinjie Ruan, Blake Oler On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > On Wed, Apr 22 2026 at 14:09, Mark Rutland wrote: > > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote: > >> TL;DR: As of 6.19, rseq no longer provides the documented atomicity > >> guarantees on arm64 by failing to abort the critical section on same-core > >> preemption/resumption. Additionally, it breaks tcmalloc specifically by > >> failing to overwrite the cpu_id_start field at points where it was relied > >> on for correctness. > > > > Thanks for the report, and the test case. > > > > As a holding reply, I'm looking into this now from the arm64 side. > > I assume it's the partial conversion to the generic entry code which > screws that up. It's slightly more than that, but in a sense, yes. ;) The fix is conceptually simple, but I'll need to do some refactoring. Conceptually we just need to use syscall_enter_from_user_mode() and irqentry_enter_from_user_mode() appropriately. In practice, I can't use those as-is without introducing the exception masking problems I just fixed up for irqentry_enter_from_kernel_mode(), so I'll need to do some similar refactoring first. 
That and I *think* a couple of the current checks for CONFIG_GENERIC_ENTRY should be checking CONFIG_GENERIC_IRQ_ENTRY, since all of the relevant bits are in the generic irqentry code rather than the GENERIC_SYSCALL code (and GENERIC_ENTRY is just GENERIC_IRQ_ENTRY + GENERIC_SYSCALL). > The problem reproduces with rseq selftests nicely. Ah; that's both good to know, and worrying that we've never had a report from all the automated testing people are supposedly running. :/ > The patch below fixes it as it puts ARM64 back to the non-optimized code > for now. Once ARM64 is fully converted it gets all the nice improvements. Thanks; I'll give that a test tomorrow. I haven't paged everything in yet, so just to check, is there anything that would behave incorrectly if current->rseq.event.user_irq were set for syscall entry? IIUC it means we'll effectively do the slow path, and I was wondering if that might be acceptable as a one-line bodge for stable. As above, I'd like if the actual fix could make this work for GENERIC_IRQ_ENTRY rather than GENERIC_ENTRY, since that way we can make this work as it was supposed to *before* moving to GENERIC_SYSCALL (which has a whole lot more ABI impact to worry about). I think that just needs a small amount of refactoring that arm64 will need regardless. Mark. 
> > Thanks, > > tglx > --- > diff --git a/include/linux/rseq.h b/include/linux/rseq.h > index 2266f4dc77b6..d55476e2a336 100644 > --- a/include/linux/rseq.h > +++ b/include/linux/rseq.h > @@ -30,7 +30,7 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs); > */ > static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) > { > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { > /* '&' is intentional to spare one conditional branch */ > if (current->rseq.event.has_rseq & current->rseq.event.user_irq) > __rseq_signal_deliver(ksig->sig, regs); > @@ -50,7 +50,7 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t) > { > struct rseq_event *ev = &t->rseq.event; > > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { > /* > * Avoid a boat load of conditionals by using simple logic > * to determine whether NOTIFY_RESUME needs to be raised. > diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h > index a36b472627de..8ccd464a108d 100644 > --- a/include/linux/rseq_entry.h > +++ b/include/linux/rseq_entry.h > @@ -80,7 +80,7 @@ bool rseq_debug_validate_ids(struct task_struct *t); > > static __always_inline void rseq_note_user_irq_entry(void) > { > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) > + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) > current->rseq.event.user_irq = true; > } > > @@ -171,8 +171,8 @@ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, > if (unlikely(usig != t->rseq.sig)) > goto die; > > - /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */ > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + /* rseq_event.user_irq is only valid if CONFIG_GENERIC_ENTRY=y */ > + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { > /* If not in interrupt from user context, let it die */ > if (unlikely(!t->rseq.event.user_irq)) > goto die; > @@ -387,7 +387,7 @@ static rseq_inline bool rseq_update_usr(struct 
task_struct *t, struct pt_regs *r > * allows to skip the critical section when the entry was not from > * a user space interrupt, unless debug mode is enabled. > */ > - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { > + if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) { > if (!static_branch_unlikely(&rseq_debug_enabled)) { > if (likely(!t->rseq.event.user_irq)) > return true; ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 18:11 ` Mark Rutland @ 2026-04-22 19:47 ` Thomas Gleixner 2026-04-23 1:48 ` Jinjie Ruan 0 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-22 19:47 UTC (permalink / raw) To: Mark Rutland Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Jinjie Ruan, Blake Oler On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: > On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > Conceptually we just need to use syscall_enter_from_user_mode() and > irqentry_enter_from_user_mode() appropriately. Right. I figured that out. > In practice, I can't use those as-is without introducing the exception > masking problems I just fixed up for irqentry_enter_from_kernel_mode(), > so I'll need to do some similar refactoring first. See below. > I haven't paged everything in yet, so just to cehck, is there anything > that would behave incorrectly if current->rseq.event.user_irq were set > for syscall entry? IIUC it means we'll effectively do the slow path, and > I was wondering if that might be acceptable as a one-line bodge for > stable. It might work, but it's trivial enough to avoid that. See below. That on top of 6.19.y makes the selftests pass too. Thanks, tglx --- arch/arm64/kernel/entry-common.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) --- a/arch/arm64/kernel/entry-common.c +++ b/arch/arm64/kernel/entry-common.c @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( irqentry_exit(regs, state); } +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) +{ + enter_from_user_mode(regs); + mte_disable_tco_entry(current); +} + /* * Handle IRQ/context state management when entering from user mode. 
* Before this function is called it is not safe to call regular kernel code, @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( */ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) { - enter_from_user_mode(regs); - mte_disable_tco_entry(current); + arm64_enter_from_user_mode_syscall(regs); + rseq_note_user_irq_entry(); } /* @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ static void noinstr el0_svc(struct pt_regs *regs) { - arm64_enter_from_user_mode(regs); + arm64_enter_from_user_mode_syscall(regs); cortex_a76_erratum_1463225_svc_handler(); fpsimd_syscall_enter(); local_daif_restore(DAIF_PROCCTX); @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r static void noinstr el0_svc_compat(struct pt_regs *regs) { - arm64_enter_from_user_mode(regs); + arm64_enter_from_user_mode_syscall(regs); cortex_a76_erratum_1463225_svc_handler(); local_daif_restore(DAIF_PROCCTX); do_el0_svc_compat(regs); ^ permalink raw reply [flat|nested] 41+ messages in thread
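The effect of the entry-path split in the patch above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not kernel code: struct model_event and the model_* helpers are hypothetical stand-ins that only mirror the idea that syscall entry leaves user_irq untouched, interrupt entry sets it, and the critical-section fixup is considered only when both has_rseq and user_irq are set.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task rseq event state. */
struct model_event {
	bool has_rseq;	/* task registered an rseq area */
	bool user_irq;	/* last entry from user mode was an interrupt */
};

/* Syscall entry: a syscall cannot interrupt a critical section midway. */
static void model_enter_syscall(struct model_event *ev)
{
	(void)ev;	/* intentionally leaves user_irq untouched */
}

/* Interrupt entry: user code may have been inside a critical section. */
static void model_enter_irq(struct model_event *ev)
{
	ev->user_irq = true;
}

/* Exit path: examine the critical section only if both flags are set. */
static bool model_needs_cs_fixup(const struct model_event *ev)
{
	/* '&' mirrors the branch-sparing style used in the quoted diff */
	return ev->has_rseq & ev->user_irq;
}
```

If interrupt entry never sets user_irq, which is how the arm64 problem is described in this thread, model_needs_cs_fixup() stays false after a preempting interrupt and the critical section is never aborted.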
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-22 19:47 ` Thomas Gleixner @ 2026-04-23 1:48 ` Jinjie Ruan 2026-04-23 5:53 ` Dmitry Vyukov 0 siblings, 1 reply; 41+ messages in thread From: Jinjie Ruan @ 2026-04-23 1:48 UTC (permalink / raw) To: Thomas Gleixner, Mark Rutland Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On 4/23/2026 3:47 AM, Thomas Gleixner wrote: > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: >> Conceptually we just need to use syscall_enter_from_user_mode() and >> irqentry_enter_from_user_mode() appropriately. > > Right. I figured that out. > >> In practice, I can't use those as-is without introducing the exception >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(), >> so I'll need to do some similar refactoring first. > > See below. > >> I haven't paged everything in yet, so just to cehck, is there anything >> that would behave incorrectly if current->rseq.event.user_irq were set >> for syscall entry? IIUC it means we'll effectively do the slow path, and >> I was wondering if that might be acceptable as a one-line bodge for >> stable. > > It might work, but it's trivial enough to avoid that. See below. That on > top of 6.19.y makes the selftests pass too. This aligns with my thoughts when convert arm64 to generic syscall entry. Currently, the arm64 entry code does not distinguish between IRQ and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ entries as the generic entry framework does, because arm64 uses enter_from_user_mode() exclusively instead of irqentry_enter_from_user_mode(). 
https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/ > > Thanks, > > tglx > --- > arch/arm64/kernel/entry-common.c | 14 ++++++++++---- > 1 file changed, 10 insertions(+), 4 deletions(-) > > --- a/arch/arm64/kernel/entry-common.c > +++ b/arch/arm64/kernel/entry-common.c > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( > irqentry_exit(regs, state); > } > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) > +{ > + enter_from_user_mode(regs); > + mte_disable_tco_entry(current); > +} > + > /* > * Handle IRQ/context state management when entering from user mode. > * Before this function is called it is not safe to call regular kernel code, > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( > */ > static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) > { > - enter_from_user_mode(regs); > - mte_disable_tco_entry(current); > + arm64_enter_from_user_mode_syscall(regs); > + rseq_note_user_irq_entry(); > } > > /* > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ > > static void noinstr el0_svc(struct pt_regs *regs) > { > - arm64_enter_from_user_mode(regs); > + arm64_enter_from_user_mode_syscall(regs); > cortex_a76_erratum_1463225_svc_handler(); > fpsimd_syscall_enter(); > local_daif_restore(DAIF_PROCCTX); > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r > > static void noinstr el0_svc_compat(struct pt_regs *regs) > { > - arm64_enter_from_user_mode(regs); > + arm64_enter_from_user_mode_syscall(regs); > cortex_a76_erratum_1463225_svc_handler(); > local_daif_restore(DAIF_PROCCTX); > do_el0_svc_compat(regs); ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 1:48 ` Jinjie Ruan @ 2026-04-23 5:53 ` Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner ` (2 more replies) 0 siblings, 3 replies; 41+ messages in thread From: Dmitry Vyukov @ 2026-04-23 5:53 UTC (permalink / raw) To: Jinjie Ruan, linux-man Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote: > > On 4/23/2026 3:47 AM, Thomas Gleixner wrote: > > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: > >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > >> Conceptually we just need to use syscall_enter_from_user_mode() and > >> irqentry_enter_from_user_mode() appropriately. > > > > Right. I figured that out. > > > >> In practice, I can't use those as-is without introducing the exception > >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(), > >> so I'll need to do some similar refactoring first. > > > > See below. > > > >> I haven't paged everything in yet, so just to cehck, is there anything > >> that would behave incorrectly if current->rseq.event.user_irq were set > >> for syscall entry? IIUC it means we'll effectively do the slow path, and > >> I was wondering if that might be acceptable as a one-line bodge for > >> stable. > > > > It might work, but it's trivial enough to avoid that. See below. That on > > top of 6.19.y makes the selftests pass too. > > This aligns with my thoughts when convert arm64 to generic syscall > entry. Currently, the arm64 entry code does not distinguish between IRQ > and syscall entries. 
It fails to call rseq_note_user_irq_entry() for IRQ > entries as the generic entry framework does, because arm64 uses > enter_from_user_mode() exclusively instead of > irqentry_enter_from_user_mode(). > > https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/ > > > > > Thanks, > > > > tglx > > --- > > arch/arm64/kernel/entry-common.c | 14 ++++++++++---- > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > --- a/arch/arm64/kernel/entry-common.c > > +++ b/arch/arm64/kernel/entry-common.c > > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( > > irqentry_exit(regs, state); > > } > > > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) > > +{ > > + enter_from_user_mode(regs); > > + mte_disable_tco_entry(current); > > +} > > + > > /* > > * Handle IRQ/context state management when entering from user mode. > > * Before this function is called it is not safe to call regular kernel code, > > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( > > */ > > static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) > > { > > - enter_from_user_mode(regs); > > - mte_disable_tco_entry(current); > > + arm64_enter_from_user_mode_syscall(regs); > > + rseq_note_user_irq_entry(); > > } > > > > /* > > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ > > > > static void noinstr el0_svc(struct pt_regs *regs) > > { > > - arm64_enter_from_user_mode(regs); > > + arm64_enter_from_user_mode_syscall(regs); > > cortex_a76_erratum_1463225_svc_handler(); > > fpsimd_syscall_enter(); > > local_daif_restore(DAIF_PROCCTX); > > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r > > > > static void noinstr el0_svc_compat(struct pt_regs *regs) > > { > > - arm64_enter_from_user_mode(regs); > > + arm64_enter_from_user_mode_syscall(regs); > > cortex_a76_erratum_1463225_svc_handler(); > > local_daif_restore(DAIF_PROCCTX); > > do_el0_svc_compat(regs); +linux-man This part of the rseq man 
page needs to be fixed as well I think. The kernel no longer reliably provides clearing of rseq_cs on preemption, right? https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 "and set to NULL by the kernel when it restarts an assembly instruction sequence block, as well as when the kernel detects that it is preempting or delivering a signal outside of the range targeted by the rseq_cs." ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 5:53 ` Dmitry Vyukov @ 2026-04-23 10:39 ` Thomas Gleixner 2026-04-23 10:51 ` Mathias Stearn 2026-04-23 12:11 ` Alejandro Colomar 2026-04-23 12:29 ` Mathieu Desnoyers 2 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 10:39 UTC (permalink / raw) To: Dmitry Vyukov, Jinjie Ruan, linux-man Cc: Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, Apr 23 2026 at 07:53, Dmitry Vyukov wrote: > On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote: > > This part of the rseq man page needs to be fixed as well I think. The > kernel no longer reliably provides clearing of rseq_cs on preemption, > right? > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > > "and set to NULL by the kernel when it restarts an assembly > instruction sequence block, > as well as when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs." The kernel clears rseq_cs reliably when user space was interrupted and: the task was preempted or the return from interrupt delivers a signal If the task invoked a syscall then there is absolutely no reason to do either of these, because syscalls from within a critical section are a bug and are caught when rseq debugging is enabled. The original code did this along with unconditionally updating CPU/MMCID, which resulted in a ~15% performance regression on a syscall heavy database benchmark once glibc started to register rseq. Thanks, tglx ^ permalink raw reply [flat|nested] 41+ messages in thread
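The rule stated in the message above (rseq_cs is cleared only when user space was interrupted and the task was then preempted or a signal is delivered; never on a plain syscall) can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: `model_task`, `model_enter_syscall()`, `model_enter_irq()` and `model_exit_to_user()` are hypothetical names invented for illustration, not kernel API, and the single `user_irq` flag merely stands in for what the thread calls current->rseq.event.user_irq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct model_task {
    bool user_irq;        /* set only on interrupt entry from user mode */
    bool preempted;       /* task was scheduled out before returning */
    bool signal_pending;  /* return to user space will deliver a signal */
    uint64_t rseq_cs;     /* stands in for the user-space rseq::rseq_cs */
};

/* Syscall entry: deliberately does not note an interrupt. */
static inline void model_enter_syscall(struct model_task *t)
{
    t->user_irq = false;
}

/* Interrupt entry from user mode: note it, as the arm64 fix does. */
static inline void model_enter_irq(struct model_task *t)
{
    t->user_irq = true;
}

/*
 * Return to user space. rseq_cs is cleared only if user space was
 * interrupted AND (the task was preempted OR a signal is delivered).
 * Returns true if rseq_cs was cleared.
 */
static inline bool model_exit_to_user(struct model_task *t)
{
    bool clear = t->user_irq && (t->preempted || t->signal_pending);

    if (clear)
        t->rseq_cs = 0;

    /* Event state is consumed on every return to user space. */
    t->user_irq = false;
    t->preempted = false;
    t->signal_pending = false;
    return clear;
}
```

In this model a syscall that gets preempted leaves rseq_cs untouched, while an interrupt followed by preemption clears it. If the interrupt entry path never sets the flag, which is how the arm64 bug is described in this thread, the second case silently behaves like the first.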
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 10:39 ` Thomas Gleixner @ 2026-04-23 10:51 ` Mathias Stearn 2026-04-23 12:24 ` David Laight 2026-04-23 19:31 ` Thomas Gleixner 0 siblings, 2 replies; 41+ messages in thread From: Mathias Stearn @ 2026-04-23 10:51 UTC (permalink / raw) To: Thomas Gleixner Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > The kernel clears rseq_cs reliably when user space was interrupted and: > > the task was preempted > or > the return from interrupt delivers a signal > > If the task invoked a syscall then there is absolutely no reason to do > either of this because syscalls from within a critical section are a > bug and catched when enabling rseq debugging. > > The original code did this along with unconditionally updating CPU/MMCID > which resulted in ~15% performance regression on a syscall heavy > database benchmark once glibc started to register rseq. Just to be clear TCMalloc does not need either rseq_cs to be cleared or cpu_id_start to be written to on syscalls because it doesn't do syscalls from critical sections. It will actually benefit (slightly) from not updating cpu_id_start on syscalls. It is specifically in the cases where an rseq would need to be aborted (preemption, signals, migration, and membarrier IPI with the rseq flag) that TCMalloc relies on cpu_id_start being written. It does rely on that write even when not inside the critical section, because it effectively uses that to detect if there were any would-cause-abort events in between two critical sections. 
But since it leaves the rseq_cs pointer non-null between critical sections, you don't need to add _any_ overhead for programs that never make use of rseq after registration, or add any overhead to syscalls even for those who do. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 10:51 ` Mathias Stearn @ 2026-04-23 12:24 ` David Laight 2026-04-23 19:31 ` Thomas Gleixner 1 sibling, 0 replies; 41+ messages in thread From: David Laight @ 2026-04-23 12:24 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 12:51:22 +0200 Mathias Stearn <mathias@mongodb.com> wrote: > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > The kernel clears rseq_cs reliably when user space was interrupted and: > > > > the task was preempted > > or > > the return from interrupt delivers a signal > > > > If the task invoked a syscall then there is absolutely no reason to do > > either of this because syscalls from within a critical section are a > > bug and catched when enabling rseq debugging. > > > > The original code did this along with unconditionally updating CPU/MMCID > > which resulted in ~15% performance regression on a syscall heavy > > database benchmark once glibc started to register rseq. > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > or cpu_id_start to be written to on syscalls because it doesn't do > syscalls from critical sections. It will actually benefit (slightly) > from not updating cpu_id_start on syscalls. > > It is specifically in the cases where an rseq would need to be aborted > (preemption, signals, migration, and membarrier IPI with the rseq > flag) that TCMalloc relies on cpu_id_start being written. It does rely > on that write even when not inside the critical section, because it > effectively uses that to detect if there were any would-cause-abort > events in between two critical sections. 
But since it leaves the > rseq_cs pointer non-null between critical sections, so you dont need > to add _any_ overhead for programs that never make use of rseq after > registration, or add any overhead to syscalls even for those who do. > That sounds like one long rseq sequence where the 'restart' path detects that some of the operations have already been done. David ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 10:51 ` Mathias Stearn 2026-04-23 12:24 ` David Laight @ 2026-04-23 19:31 ` Thomas Gleixner 2026-04-24 7:56 ` Dmitry Vyukov 1 sibling, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-23 19:31 UTC (permalink / raw) To: Mathias Stearn Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote: > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> The kernel clears rseq_cs reliably when user space was interrupted and: >> >> the task was preempted >> or >> the return from interrupt delivers a signal >> >> If the task invoked a syscall then there is absolutely no reason to do >> either of this because syscalls from within a critical section are a >> bug and catched when enabling rseq debugging. >> >> The original code did this along with unconditionally updating CPU/MMCID >> which resulted in ~15% performance regression on a syscall heavy >> database benchmark once glibc started to register rseq. > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > or cpu_id_start to be written to on syscalls because it doesn't do > syscalls from critical sections. It will actually benefit (slightly) > from not updating cpu_id_start on syscalls. I know that it does not do syscalls from within critical sections, but it relies on cpu_id_start being unconditionally updated in one way or the other. > It is specifically in the cases where an rseq would need to be aborted > (preemption, signals, migration, and membarrier IPI with the rseq > flag) that TCMalloc relies on cpu_id_start being written. 
It does rely > on that write even when not inside the critical section, because it > effectively uses that to detect if there were any would-cause-abort > events in between two critical sections. But since it leaves the > rseq_cs pointer non-null between critical sections, so you dont need > to add _any_ overhead for programs that never make use of rseq after > registration, or add any overhead to syscalls even for those who do. Well. According to the comment in the tcmalloc code: // Calculation of the address of the current CPU slabs region is needed for // allocation/deallocation fast paths, but is quite expensive. Due to variable // shift and experimental support for "virtual CPUs", the calculation involves // several additional loads and dependent calculations. Pseudo-code for the // address calculation is as follows: // // cpu_offset = TcmallocSlab.virtual_cpu_id_offset_; // cpu = *(&__rseq_abi + virtual_cpu_id_offset_); // slabs_and_shift = TcmallocSlab.slabs_and_shift_; // shift = slabs_and_shift & kShiftMask; // shifted_cpu = cpu << shift; // slabs = slabs_and_shift & kSlabsMask; // slabs += shifted_cpu; // // To remove this calculation from fast paths, we cache the slabs address // for the current CPU in thread local storage. However, when a thread is // rescheduled to another CPU, we somehow need to understand that the cached ^^^^^^^^^^^ // address is not valid anymore. To achieve this, we overlap the top 4 bytes // of the cached address with __rseq_abi.cpu_id_start. When a thread is // rescheduled the kernel overwrites cpu_id_start with the current CPU number, // which gives us the signal that the cached address is not valid anymore. The kernel still as of today (the arm64 bug aside) updates the cpu_id_start and cpu_id fields in rseq when a task is rescheduled to another CPU. So if the code only requires to know when it got rescheduled to another CPU then it still should work, no? 
But it does not, which makes it clear that it relies on this undocumented behaviour of the kernel to rewrite rseq::cpu_id_start unconditionally. I'm not yet convinced that it relies on it only when interrupted between two subsequent critical sections. We'll see. .... Now we come to the best part of this comment: // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. So any code sequence which ends up in: x = tcmalloc(); dostuff(x) evaluate(rseq::cpu_id_start, rseq::cpu_id) is doomed. This might be acceptable for Google internal usage where they control the full stack and can prevent anyone else from utilizing rseq, but in an open ecosystem that's obviously a non-starter. And they definitely forgot to add this to the comment: // Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as // it will expose the blatant ABI abuse and therefore will kill your // application. If your assumption that the rewrite is only required when rseq::rseq_cs is non NULL and user space was interrupted is correct, then the obvious no-brainer would have been to add: __u64 rseq_usr_data; to struct rseq and clear that unconditionally when rseq::rseq_cs is cleared. But that would have been too simple, would work independent of endianness and would not be in the way of anybody else. But I know that's incompatible with the features first, correctness later and we own the world anyway mindset. Just for giggles I asked Google Gemini about the implications of tcmalloc's rseq abuse. The answer is pretty clear: "In short, TCMalloc treats RSEQ as a private optimization rather than a shared system resource, which compromises the stability and extensibility of any application that needs RSEQ for anything other than memory allocation." It's also very clear about the wilful ignorance of the tcmalloc people: "In summary, the developers have known for at least 6 years that the implementation was non-standard and conflicting with other rseq usage. 
The github issue which requested glibc compatibility was opened in 2022 and has been unresolved since then." Thanks, tglx ^ permalink raw reply [flat|nested] 41+ messages in thread
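The caching trick described in the tcmalloc comment quoted above (the top 4 bytes of a cached 64-bit slabs address overlap cpu_id_start, so a kernel write of the CPU number invalidates the cache) can be illustrated with a small model. This is a sketch under assumptions, not tcmalloc's code: the "valid" marker in bit 63 and all `model_*` names are hypothetical, and the overlap is expressed with shifts on a single word so that the model itself does not depend on host endianness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: bit 63 set means "this word holds a valid cached address". */
#define MODEL_CACHED_BIT (UINT64_C(1) << 63)

/* One 64-bit word whose upper 32 bits model the bytes shared with cpu_id_start. */
struct model_slot {
    uint64_t word;
};

/* User space caches the slabs address and tags it as valid. */
static inline void model_cache_store(struct model_slot *s, uint64_t slabs_addr)
{
    s->word = slabs_addr | MODEL_CACHED_BIT;
}

/* Models a kernel write of cpu_id_start: only the upper 32 bits change. */
static inline void model_kernel_writes_cpu_id_start(struct model_slot *s,
                                                    uint32_t cpu)
{
    s->word = (s->word & UINT64_C(0xffffffff)) | ((uint64_t)cpu << 32);
}

/* The cache is trusted only while the marker bit has survived. */
static inline bool model_cache_valid(const struct model_slot *s)
{
    return (s->word & MODEL_CACHED_BIT) != 0;
}
```

A CPU number is small, so any write of it clears the marker and forces the slow path to recompute the address. The dependency discussed in this thread follows directly: if the kernel has no reason to rewrite cpu_id_start, the marker survives and the stale cached address keeps being used. In a real memory layout, which bytes count as the "top" ones depends on endianness, which is one of the objections raised above.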
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 19:31 ` Thomas Gleixner @ 2026-04-24 7:56 ` Dmitry Vyukov 2026-04-24 8:32 ` Mathias Stearn 0 siblings, 1 reply; 41+ messages in thread From: Dmitry Vyukov @ 2026-04-24 7:56 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 21:31, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote: > > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > >> The kernel clears rseq_cs reliably when user space was interrupted and: > >> > >> the task was preempted > >> or > >> the return from interrupt delivers a signal > >> > >> If the task invoked a syscall then there is absolutely no reason to do > >> either of this because syscalls from within a critical section are a > >> bug and catched when enabling rseq debugging. > >> > >> The original code did this along with unconditionally updating CPU/MMCID > >> which resulted in ~15% performance regression on a syscall heavy > >> database benchmark once glibc started to register rseq. > > > > Just to be clear TCMalloc does not need either rseq_cs to be cleared > > or cpu_id_start to be written to on syscalls because it doesn't do > > syscalls from critical sections. It will actually benefit (slightly) > > from not updating cpu_id_start on syscalls. > > I know that it does not do syscalls from within critical sections, but > it relies on cpu_id_start being unconditionally updated in one way or > the other. > > > It is specifically in the cases where an rseq would need to be aborted > > (preemption, signals, migration, and membarrier IPI with the rseq > > flag) that TCMalloc relies on cpu_id_start being written. 
It does rely > > on that write even when not inside the critical section, because it > > effectively uses that to detect if there were any would-cause-abort > > events in between two critical sections. But since it leaves the > > rseq_cs pointer non-null between critical sections, so you dont need > > to add _any_ overhead for programs that never make use of rseq after > > registration, or add any overhead to syscalls even for those who do. > > Well. According to the comment in the tcmalloc code: > > // Calculation of the address of the current CPU slabs region is needed for > // allocation/deallocation fast paths, but is quite expensive. Due to variable > // shift and experimental support for "virtual CPUs", the calculation involves > // several additional loads and dependent calculations. Pseudo-code for the > // address calculation is as follows: > // > // cpu_offset = TcmallocSlab.virtual_cpu_id_offset_; > // cpu = *(&__rseq_abi + virtual_cpu_id_offset_); > // slabs_and_shift = TcmallocSlab.slabs_and_shift_; > // shift = slabs_and_shift & kShiftMask; > // shifted_cpu = cpu << shift; > // slabs = slabs_and_shift & kSlabsMask; > // slabs += shifted_cpu; > // > // To remove this calculation from fast paths, we cache the slabs address > // for the current CPU in thread local storage. However, when a thread is > // rescheduled to another CPU, we somehow need to understand that the cached > > ^^^^^^^^^^^ > > // address is not valid anymore. To achieve this, we overlap the top 4 bytes > // of the cached address with __rseq_abi.cpu_id_start. When a thread is > // rescheduled the kernel overwrites cpu_id_start with the current CPU number, > // which gives us the signal that the cached address is not valid anymore. > > The kernel still as of today (the arm64 bug aside) updates the > cpu_id_start and cpu_id fields in rseq when a task is rescheduled to > another CPU. 
> > So if the code only requires to know when it got rescheduled to another > CPU then it still should work, no? This was my first thought too: https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/ The only problem is with membarrier (it used to force write to __rseq_abi.cpu_id_start for all threads, but now it does not). Otherwise the caching scheme works. I have a tentative fix for tcmalloc: https://github.com/dvyukov/tcmalloc/commit/58d0eca91503f539b26d20b6f55fb2f6f8bc0c37 The crux is as follows. Tcmalloc needs to make all threads stop using old cached slab pointers. The stopping procedure is now: slab->stopped = true; membarrier(); and all rseq critical sections now check the stopped flag in the cached slab pointer. If it's set, the thread does not proceed to use the slab. > But it does not, which makes it clear that it relies on this > undocumented behaviour of the kernel to rewrite rseq::cpu_id_start > unconditionally. I'm not yet convinced that it relies on it only when > interrupted between two subsequent critical sections. We'll see. > > .... > > Now we come to the best part of this comment: > > // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. > > So any code sequence which ends up in: > > x = tcmalloc(); > dostuff(x) > evaluate(rseq::cpu_id_start, rseq::cpu_id) > > is doomed. This might be acceptable for Google internal usage where they > control the full stack and can prevent anyone else to utilize rseq, but > in an open ecosystem that's obviously a non-starter. > > And they definitely forgot to add this to the comment: > > // Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as > // it will expose the blatant ABI abuse and therefore will kill your > // application. 
> > If your assumption that the rewrite is only required when rseq::rseq_cs > is non NULL and user space was interrupted is correct, then the obvious > no-brainer would have been to add: > > __u64 rseq_usr_data; > > to struct rseq and clear that unconditionally when rseq::rseq_cs is > cleared. > > But that would have been too simple, would work independent of endianess > and not in the way of anybody else. > > But I know that's incompatible with the features first, correctness > later and we own the world anyway mindset. > > Just for giggles I asked Google Gemini about the implications of > tmalloc's rseq abuse. The answer is pretty clear: > > "In short, TCMalloc treats RSEQ as a private optimization rather than > a shared system resource, which compromises the stability and > extensibility of any application that needs RSEQ for anything other > than memory allocation." > > It's also very clear about the wilful ignorance of the tcmalloc people: > > "In summary, the developers have known for at least 6 years that the > implementation was non-standard and conflicting with other rseq > usage. The github issue which requested glibc compatibility was > opened in 2022 and has been unresolved since then." > > Thanks, > > tglx ^ permalink raw reply [flat|nested] 41+ messages in thread
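The stopping procedure outlined in the message above (set a stopped flag on the slab, issue a membarrier, and have every critical section check the flag through its cached slab pointer) might look roughly like the following. It is a minimal sketch, not the code from the linked tcmalloc commit: `model_slab`, `model_stop()` and `model_fast_push()` are hypothetical names, and `model_barrier()` is a plain fence standing in for the membarrier() system call so that the example stays portable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU slab with a stop flag; not tcmalloc's data structure. */
struct model_slab {
    atomic_bool stopped;
    int items;
};

/*
 * Placeholder for the membarrier() system call. Real code would presumably
 * use an rseq-aware membarrier command so that running critical sections
 * on other CPUs are restarted; here it is only a full fence.
 */
static inline void model_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Stop side: publish the flag first, then force every thread to notice. */
static inline void model_stop(struct model_slab *s)
{
    atomic_store_explicit(&s->stopped, true, memory_order_relaxed);
    model_barrier();
}

/*
 * Fast path, standing in for the body of an rseq critical section:
 * refuse to touch a slab that has been stopped.
 */
static inline bool model_fast_push(struct model_slab *cached)
{
    if (atomic_load_explicit(&cached->stopped, memory_order_relaxed))
        return false;   /* caller must take the slow path */
    cached->items++;
    return true;
}
```

The point of this design is that correctness no longer depends on the kernel rewriting cpu_id_start: a thread holding an old cached pointer still reaches the slab, but the flag check inside the critical section turns it away.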
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 7:56 ` Dmitry Vyukov @ 2026-04-24 8:32 ` Mathias Stearn 2026-04-24 9:30 ` Dmitry Vyukov 2026-04-24 14:16 ` Thomas Gleixner 0 siblings, 2 replies; 41+ messages in thread From: Mathias Stearn @ 2026-04-24 8:32 UTC (permalink / raw) To: Dmitry Vyukov Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > So if the code only requires to know when it got rescheduled to another > > CPU then it still should work, no? > > This was my first thought too: > https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/ > The only problem is with membarrier (it used to force write to > __rseq_abi.cpu_id_start for all threads, but now it does not). > Otherwise the caching scheme works. I almost wrote a message last night saying that we didn't need cpu_id_start invalidation on preemption. However, I remembered that the Grow() function[1] does a load outside of a critical section then stores a derived value inside the critical section, guarded only by the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really should be doing a compare against the original value inside the critical section (or just do the whole thing inside), but it doesn't. I haven't reasoned end-to-end through this fully to prove corruption is possible, but I suspect that it is if another thread same-cpu preempts between the loads and the store and updates the header before the original thread resumes and writes its original intended header value. Ditto for signals, which sometimes allocate even though they shouldn't. 
I was really hoping that the "redundant" cpu_id_start writes would only be needed on membarrier_rseq IPIs, where it really is a pay-for-what-you-use functionality, but I think existing binaries depend on invalidation on preemption. Luckily that should be cheap enough to be ~free. [1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980 [2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605 ^ permalink raw reply [flat|nested] 41+ messages in thread
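The concern raised above about Grow() is a load outside the critical section followed by a store of a derived value inside it, with no re-check of the loaded value. The difference between that pattern and the suggested compare inside the critical section can be shown with a single-threaded model of the interleaving. This is a sketch only: `model_grow_unchecked()` and `model_grow_checked()` are hypothetical helpers, the header is reduced to one integer, and no claim is made that real corruption has been demonstrated in tcmalloc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Unchecked variant: stores a value derived from an earlier load without
 * looking at the header again.
 */
static inline void model_grow_unchecked(uint32_t *hdr, uint32_t loaded,
                                        uint32_t by)
{
    *hdr = loaded + by;
}

/*
 * Checked variant: models doing the compare inside the critical section.
 * Returns false if the header changed since it was loaded, in which case
 * the caller has to reload and retry.
 */
static inline bool model_grow_checked(uint32_t *hdr, uint32_t loaded,
                                      uint32_t by)
{
    if (*hdr != loaded)
        return false;
    *hdr = loaded + by;
    return true;
}
```

If another thread on the same CPU updates the header between the load and the store, the unchecked variant overwrites that update with a value computed from stale data, while the checked variant notices the mismatch and backs off.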
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 8:32 ` Mathias Stearn @ 2026-04-24 9:30 ` Dmitry Vyukov 2026-04-24 14:16 ` Thomas Gleixner 1 sibling, 0 replies; 41+ messages in thread From: Dmitry Vyukov @ 2026-04-24 9:30 UTC (permalink / raw) To: Mathias Stearn Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Fri, 24 Apr 2026 at 10:32, Mathias Stearn <mathias@mongodb.com> wrote: > > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: > > > So if the code only requires to know when it got rescheduled to another > > > CPU then it still should work, no? > > > > This was my first thought too: > > https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/ > > The only problem is with membarrier (it used to force write to > > __rseq_abi.cpu_id_start for all threads, but now it does not). > > Otherwise the caching scheme works. > > I almost wrote a message last night saying that we didn't need > cpu_id_start invalidation on preemption. However, I remembered that > the Grow() function[1] does a load outside of a critical section then > stores a derived value inside the critical section, guarded only by > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really > should be doing a compare against the original value inside the > critical section (or just do the whole thing inside), but it doesn't. > I haven't reasoned end-to-end through this fully to prove corruption > is possible, but I suspect that it is if another thread same-cpu > preempts between the loads and the store and updates the header before > the original thread resumes and writes its original intended header > value. Ditto for signals, which sometimes allocate even though they > shouldn't. 
> > I was really hoping that we would only need to do the "redundant" > cpu_id_start writes would only be needed on membarrier_rseq IPIs where > it really is a pay-for-what-you-use functionality, I think existing > binaries depend on invalidation on preemption. Luckily that should be > cheap enough to be ~free. I've prototyped this idea too: https://github.com/dvyukov/linux/commit/1284e3723047cb5afd247f75c53de43efc18db82 > [1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980 > [2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605 ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 8:32 ` Mathias Stearn 2026-04-24 9:30 ` Dmitry Vyukov @ 2026-04-24 14:16 ` Thomas Gleixner 2026-04-24 15:03 ` Peter Zijlstra 1 sibling, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-24 14:16 UTC (permalink / raw) To: Mathias Stearn, Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote: > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: >> The only problem is with membarrier (it used to force write to >> __rseq_abi.cpu_id_start for all threads, but now it does not). >> Otherwise the caching scheme works. > > I almost wrote a message last night saying that we didn't need > cpu_id_start invalidation on preemption. However, I remembered that > the Grow() function[1] does a load outside of a critical section then > stores a derived value inside the critical section, guarded only by > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really > should be doing a compare against the original value inside the > critical section (or just do the whole thing inside), but it doesn't. > I haven't reasoned end-to-end through this fully to prove corruption > is possible, but I suspect that it is if another thread same-cpu > preempts between the loads and the store and updates the header before > the original thread resumes and writes its original intended header > value. Ditto for signals, which sometimes allocate even though they > shouldn't. 
>
> I was really hoping that we would only need to do the "redundant"
> cpu_id_start writes would only be needed on membarrier_rseq IPIs where
> it really is a pay-for-what-you-use functionality,

That's fine and can be solved without adding this sequence overhead into
the scheduler hotpath.

> I think existing binaries depend on invalidation on
> preemption. Luckily that should be cheap enough to be ~free.

That's only free when it can be buried in the rseq_cs update, which
means the ID update would not happen when rseq_cs is NULL.

If those two changes fix it w/o requiring additional tcmalloc changes,
I'm happy to hack that up tomorrow.

Thanks,

        tglx

^ permalink raw reply [flat|nested] 41+ messages in thread
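The hazard described in the quoted text (a value loaded outside the critical section, a derived value stored inside it) and the suggested remedy (compare against the originally loaded value before committing) can be sketched in portable C. This is an illustration only and not TCMalloc code: a C11 compare-and-swap stands in for the rseq critical section, and `header_t`, `grow_unchecked()` and `grow_checked()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-CPU slab header word. */
typedef _Atomic(uint64_t) header_t;

/*
 * Racy shape: the derived value is stored without re-checking the
 * header. If another thread on the same CPU updated the header after
 * 'observed' was loaded, that update is silently overwritten.
 */
static void grow_unchecked(header_t *hdr, uint64_t observed, uint64_t delta)
{
	atomic_store(hdr, observed + delta);
}

/*
 * Safer shape: commit only if the header still holds the value that
 * was observed earlier. Returns false when the caller has to reload
 * and retry.
 */
static bool grow_checked(header_t *hdr, uint64_t observed, uint64_t delta)
{
	uint64_t desired = observed + delta;

	return atomic_compare_exchange_strong(hdr, &observed, desired);
}
```

With the checked variant a stale `observed` value makes the commit fail instead of overwriting the intervening update, so correctness no longer depends on the kernel rewriting cpu_id_start on preemption.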
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 14:16 ` Thomas Gleixner @ 2026-04-24 15:03 ` Peter Zijlstra 2026-04-24 19:44 ` Thomas Gleixner 0 siblings, 1 reply; 41+ messages in thread From: Peter Zijlstra @ 2026-04-24 15:03 UTC (permalink / raw) To: Thomas Gleixner Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote: > On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote: > > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote: > >> The only problem is with membarrier (it used to force write to > >> __rseq_abi.cpu_id_start for all threads, but now it does not). > >> Otherwise the caching scheme works. > > > > I almost wrote a message last night saying that we didn't need > > cpu_id_start invalidation on preemption. However, I remembered that > > the Grow() function[1] does a load outside of a critical section then > > stores a derived value inside the critical section, guarded only by > > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really > > should be doing a compare against the original value inside the > > critical section (or just do the whole thing inside), but it doesn't. > > I haven't reasoned end-to-end through this fully to prove corruption > > is possible, but I suspect that it is if another thread same-cpu > > preempts between the loads and the store and updates the header before > > the original thread resumes and writes its original intended header > > value. Ditto for signals, which sometimes allocate even though they > > shouldn't. 
> > > > I was really hoping that we would only need to do the "redundant" > > cpu_id_start writes would only be needed on membarrier_rseq IPIs where > > it really is a pay-for-what-you-use functionality, > > That's fine and can be solved without adding this sequence overhead into > the scheduler hotpath. Something like so? (probably needs help for !GENERIC bits) --- diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h index 528e6fc7efe9..1d786003e42a 100644 --- a/include/asm-generic/thread_info_tif.h +++ b/include/asm-generic/thread_info_tif.h @@ -48,7 +48,10 @@ #define TIF_RSEQ 11 // Run RSEQ fast path #define _TIF_RSEQ BIT(TIF_RSEQ) -#define TIF_HRTIMER_REARM 12 // re-arm the timer +#define TIF_RSEQ_FORCE_RESTART 12 // Reset RSEQ-CS from membarrier +#define _TIF_RSEQ_FORCE_RESTART BIT(TIF_RSEQ_FORCE_RESTART) + +#define TIF_HRTIMER_REARM 13 // re-arm the timer #define _TIF_HRTIMER_REARM BIT(TIF_HRTIMER_REARM) #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */ diff --git a/include/linux/rseq.h b/include/linux/rseq.h index b9d62fc2140d..2cbee6d41198 100644 --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -158,6 +158,8 @@ static inline unsigned int rseq_alloc_align(void) return 1U << get_count_order(offsetof(struct rseq, end)); } +extern void rseq_prepare_membarrier(struct mm_struct *mm); + #else /* CONFIG_RSEQ */ static inline void rseq_handle_slowpath(struct pt_regs *regs) { } static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { } @@ -167,6 +169,7 @@ static inline void rseq_force_update(void) { } static inline void rseq_virt_userspace_exit(void) { } static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { } static inline void rseq_execve(struct task_struct *t) { } +static inline void rseq_prepare_membarrier(struct mm_struct *mm) { } #endif /* !CONFIG_RSEQ */ #ifdef CONFIG_DEBUG_RSEQ diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h index 
f11ebd34f8b9..3dfaca776971 100644 --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -686,7 +686,12 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg #ifdef CONFIG_HAVE_GENERIC_TIF_BITS static __always_inline bool test_tif_rseq(unsigned long ti_work) { - return ti_work & _TIF_RSEQ; + return ti_work & (_TIF_RSEQ | _TIF_RSEQ_FORCE_RESTART); +} + +static __always_inline void clear_tif_rseq_force_restart(void) +{ + clear_thread_flag(TIF_RSEQ_FORCE_RESTART); } static __always_inline void clear_tif_rseq(void) @@ -696,6 +701,7 @@ static __always_inline void clear_tif_rseq(void) } #else static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; } +static __always_inline void clear_tif_rseq_force_restart(void) { } static __always_inline void clear_tif_rseq(void) { } #endif @@ -703,6 +709,11 @@ static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work) { if (unlikely(test_tif_rseq(ti_work))) { + if (unlikely(ti_work & _TIF_RSEQ_FORCE_RESTART)) { + current->rseq.event.sched_switch = true; + current->rseq.event.ids_changed = true; + clear_tif_rseq_force_restart(); + } if (unlikely(__rseq_exit_to_user_mode_restart(regs))) { current->rseq.event.slowpath = true; set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); diff --git a/kernel/rseq.c b/kernel/rseq.c index 38d3ef540760..9adc7f63adf5 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -255,6 +255,19 @@ static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs) return false; } +void rseq_prepare_membarrier(struct mm_struct *mm) +{ + struct task_struct *t; + + guard(mutex)(&mm->mm_cid.mutex); + + hlist_for_each_entry(t, &mm->mm_cid.user_list, mm_cid.node) { + if (t == current) + continue; + set_tsk_thread_flag(t, TIF_RSEQ_FORCE_RESTART); + } +} + static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 
623445603725..696988bb991b 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -334,6 +334,7 @@ static int membarrier_private_expedited(int flags, int cpu_id) MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY)) return -EPERM; ipi_func = ipi_rseq; + rseq_prepare_membarrier(mm); } else { WARN_ON_ONCE(flags); if (!(atomic_read(&mm->membarrier_state) & ^ permalink raw reply related [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 15:03 ` Peter Zijlstra @ 2026-04-24 19:44 ` Thomas Gleixner 2026-04-26 22:04 ` Thomas Gleixner 0 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-24 19:44 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote: > On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote: >> > I was really hoping that we would only need to do the "redundant" >> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where >> > it really is a pay-for-what-you-use functionality, >> >> That's fine and can be solved without adding this sequence overhead into >> the scheduler hotpath. > > Something like so? (probably needs help for !GENERIC bits) Yes and yes :) Let me stare at that !generic tif bits case. ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-24 19:44 ` Thomas Gleixner @ 2026-04-26 22:04 ` Thomas Gleixner 2026-04-27 7:40 ` Florian Weimer 0 siblings, 1 reply; 41+ messages in thread From: Thomas Gleixner @ 2026-04-26 22:04 UTC (permalink / raw) To: Peter Zijlstra Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Dmitry Vyukov, Florian Weimer, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds

On Fri, Apr 24 2026 at 21:44, Thomas Gleixner wrote:
> On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote:
>> On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote:
>>> > I was really hoping that we would only need to do the "redundant"
>>> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where
>>> > it really is a pay-for-what-you-use functionality,
>>>
>>> That's fine and can be solved without adding this sequence overhead into
>>> the scheduler hotpath.
>>
>> Something like so? (probably needs help for !GENERIC bits)
>
> Yes and yes :)
>
> Let me stare at that !generic tif bits case.

I stared at it and finally gave up because all of this is in a
completely FUBAR'ed state and ends up in a horrible pile of hacks and
duct tape with a way larger than zero probability that we chase the
nasty corner cases for quite some time just to add more duct tape and
hacks.

Contrary to that it's rather trivial to cleanly separate the behavioral
cases and guarantees without a massive runtime overhead and without a
pile of hard to maintain TCMalloc specific hacks.

All required code is already available to support the architectures
which do not utilize the generic entry code and therefore can neither
use the optimized mode nor time slice extensions.

So instead of letting the compiler optimize that code out for the
generic entry code users, we can keep it around and utilize one or the
other depending on the requested mode. I managed to get the required
run-time conditionals down to a minimum so that they are in the noise
when analysing it with perf.

The real question is how to differentiate between the legacy and the
optimized mode. I have two working variants to achieve that:

  1) The fully safe option requires a new flag for RSEQ
     registration. It obviously requires a glibc update. (Suggested by
     PeterZ)

  2) Determine the requirements of the registering task via the size of
     the registered RSEQ area.

     The original implementation, which TCMalloc depends on, registers
     a 32 byte region (ORIG_RSEQ_SIZE). This region has a 32 byte
     alignment requirement.

     The extension safe newer variant exposes the kernel RSEQ feature
     size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment
     requirement via getauxval(AT_RSEQ_ALIGN). The alignment
     requirement is that the registered rseq region is aligned to the
     next power of two of the feature size. The kernel currently has a
     feature size of 33 bytes, which means the alignment requirement is
     64 bytes.

     The TCMalloc RSEQ region is embedded into a cache line aligned
     data structure starting at offset 32 bytes so that bytes 28-31 and
     the cpu_id_start field at bytes 32-35 form a 64-bit little endian
     pointer with the top-most bit (bit 63) set. That bit is used to
     check whether the kernel has overwritten cpu_id_start with an
     actual CPU id value, which is guaranteed to not have the top-most
     bit set.

     As this is part of their performance tuned magic, it's a pretty
     safe assumption that TCMalloc won't use a larger RSEQ size, which
     allows selecting optimized mode for registrations with a size
     greater than 32 bytes. That does not require any changes to glibc
     and works out of the box. (Suggested by Mathieu)

In both cases the legacy non-optimized mode exposes the original
behaviour up to the mm_cid field and does not provide support for time
slice extensions.

Optimized mode restores the performance gains and enables support for
time slice extensions.

I have no strong preference either way and have working code for both
variants. Though obviously not having to update the libc world has its
charm. If that unexpectedly turns out not to be sufficient, then
disabling it is a trivial one-liner, which as a consequence requires
adding the flag and updating the libc world.

Combo patch for the auto-detection based on the registered size below
as that allows to immediately test without glibc dependencies. It
applies cleanly on Linus' tree and 7.0. 6.19 would need some fixups, but
I learned today that it's already EOL.

In the final version that's three separate patches plus a set of
selftest changes which validate legacy behaviour and run the full param
test suite in both legacy and optimized mode.

Thoughts, preferences?

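The size based auto-detection of variant 2 can be modelled in a few lines of user space C. This is a minimal sketch assuming the rules stated above (32 byte legacy size, alignment to the next power of two of the feature size, optimized mode for lengths greater than 32 bytes). `rseq_align_for()` and `rseq_pick_mode()` are hypothetical helpers, not kernel functions, and the sketch ignores that the kernel additionally requires the generic entry code for optimized mode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Size of the original rseq area as stated in the thread. */
#define ORIG_RSEQ_SIZE 32u

/* Hypothetical helper: smallest power of two >= size (size > 0). */
static uint32_t rseq_align_for(uint32_t size)
{
	uint32_t align = 1;

	while (align < size)
		align <<= 1;
	return align;
}

/*
 * Hypothetical model of the length check plus mode selection.
 * Returns false for an invalid registration. On success *optimized
 * tells whether the registration would get the optimized mode.
 */
static bool rseq_pick_mode(uintptr_t addr, uint32_t len,
			   uint32_t feature_size, bool *optimized)
{
	if (len < ORIG_RSEQ_SIZE)
		return false;

	/* Legacy registration: original size and original alignment */
	if (len == ORIG_RSEQ_SIZE) {
		if (addr % ORIG_RSEQ_SIZE)
			return false;
		*optimized = false;
		return true;
	}

	/* Larger area: must cover all fields and honor the alignment */
	if (len < feature_size || addr % rseq_align_for(feature_size))
		return false;

	*optimized = true;
	return true;
}
```

With a feature size of 33 bytes, `rseq_align_for(33)` yields 64, which matches the 64 byte alignment requirement mentioned above. A 32 byte registration at a 32 byte aligned address, which is how the TCMalloc layout is described, stays in legacy mode.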
Thanks, tglx --- Documentation/userspace-api/rseq.rst | 77 ++++++++++++++ include/linux/rseq.h | 20 +++ include/linux/rseq_entry.h | 110 ++++++++++----------- include/linux/rseq_types.h | 3 kernel/rseq.c | 183 ++++++++++++++++++++++------------- kernel/sched/membarrier.c | 11 +- 6 files changed, 280 insertions(+), 124 deletions(-) --- --- a/include/linux/rseq.h +++ b/include/linux/rseq.h @@ -9,6 +9,11 @@ void __rseq_handle_slowpath(struct pt_regs *regs); +static __always_inline bool rseq_optimized(struct task_struct *t) +{ + return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized); +} + /* Invoked from resume_user_mode_work() */ static inline void rseq_handle_slowpath(struct pt_regs *regs) { @@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru */ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) { /* '&' is intentional to spare one conditional branch */ if (current->rseq.event.has_rseq & current->rseq.event.user_irq) __rseq_signal_deliver(ksig->sig, regs); @@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s { struct rseq_event *ev = &t->rseq.event; - if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) { + /* + * Only apply the user_irq optimization for RSEQ ABI V2 + * registrations. Legacy users like TCMalloc rely on the historical ABI + * V1 behaviour which updates IDs on every context swtich. + */ + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) { /* * Avoid a boat load of conditionals by using simple logic * to determine whether NOTIFY_RESUME needs to be raised. * * It's required when the CPU or MM CID has changed or - * the entry was from user space. + * the entry was from user space. ev->has_rseq does not + * have to be evaluated because optimized implies has_rseq. 
*/ - bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; + bool raise = ev->user_irq | ev->ids_changed; if (raise) { ev->sched_switch = true; @@ -66,6 +77,7 @@ static __always_inline void rseq_sched_s } } else { if (ev->has_rseq) { + t->rseq.event.ids_changed = true; t->rseq.event.sched_switch = true; rseq_raise_notify_resume(t); } --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -111,6 +111,20 @@ static __always_inline void rseq_slice_c t->rseq.slice.state.granted = false; } +/* + * Open coded, so it can be invoked within a user access region. + * + * This clears the user space state of the time slice extensions field only when + * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations, + * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be + * overwritten by an unconditional write. + */ +#define rseq_slice_clear_user(rseq, efault) \ +do { \ + if (rseq_slice_extension_enabled()) \ + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); \ +} while (0) + static __always_inline bool __rseq_grant_slice_extension(bool work_pending) { struct task_struct *curr = current; @@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_e static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; } static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { } static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } +#define rseq_slice_clear_user(rseq, efault) do { } while (0) #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr); -bool rseq_debug_validate_ids(struct task_struct *t); static __always_inline void rseq_note_user_irq_entry(void) { @@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct ta return false; } -/* - * On debug kernels validate that user space did not mess with it if the - * debug branch is 
enabled. - */ -bool rseq_debug_validate_ids(struct task_struct *t) -{ - struct rseq __user *rseq = t->rseq.usrptr; - u32 cpu_id, uval, node_id; - - /* - * On the first exit after registering the rseq region CPU ID is - * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! - */ - node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? - cpu_to_node(t->rseq.ids.cpu_id) : 0; - - scoped_user_read_access(rseq, efault) { - unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); - if (cpu_id != t->rseq.ids.cpu_id) - goto die; - unsafe_get_user(uval, &rseq->cpu_id, efault); - if (uval != cpu_id) - goto die; - unsafe_get_user(uval, &rseq->node_id, efault); - if (uval != node_id) - goto die; - unsafe_get_user(uval, &rseq->mm_cid, efault); - if (uval != t->rseq.ids.mm_cid) - goto die; - } - return true; -die: - t->rseq.event.fatal = true; -efault: - return false; -} - #endif /* RSEQ_BUILD_SLOW_PATH */ /* @@ -504,12 +481,32 @@ bool rseq_set_ids_get_csaddr(struct task { struct rseq __user *rseq = t->rseq.usrptr; - if (static_branch_unlikely(&rseq_debug_enabled)) { - if (!rseq_debug_validate_ids(t)) - return false; - } - scoped_user_rw_access(rseq, efault) { + /* Validate the R/O fields for debug and optimized mode */ + if (static_branch_unlikely(&rseq_debug_enabled) || rseq_optimized(t)) { + u32 cpu_id, uval, node_id; + + /* + * On the first exit after registering the rseq region CPU ID is + * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0! + */ + node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ? 
+ cpu_to_node(t->rseq.ids.cpu_id) : 0; + + unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault); + if (cpu_id != t->rseq.ids.cpu_id) + goto die; + unsafe_get_user(uval, &rseq->cpu_id, efault); + if (uval != cpu_id) + goto die; + unsafe_get_user(uval, &rseq->node_id, efault); + if (uval != node_id) + goto die; + unsafe_get_user(uval, &rseq->mm_cid, efault); + if (uval != t->rseq.ids.mm_cid) + goto die; + } + unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault); unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault); unsafe_put_user(node_id, &rseq->node_id, efault); @@ -517,11 +514,9 @@ bool rseq_set_ids_get_csaddr(struct task if (csaddr) unsafe_get_user(*csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); @@ -530,6 +525,9 @@ bool rseq_set_ids_get_csaddr(struct task rseq_stat_inc(rseq_stats.ids); rseq_trace_update(t, ids); return true; + +die: + t->rseq.event.fatal = true; efault: return false; } @@ -612,6 +610,14 @@ static __always_inline bool rseq_exit_us * interrupts disabled */ guard(pagefault)(); + /* + * This optimization is only valid when the task registered for the + * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original + * RSEQ implementation behaviour which unconditionally updated the IDs. + * rseq_sched_switch_event() ensures that legacy registrations always + * have both sched_switch and ids_changed set, which is compatible with + * the historical TIF_NOTIFY_RESUME behaviour. 
+ */ if (likely(!t->rseq.event.ids_changed)) { struct rseq __user *rseq = t->rseq.usrptr; /* @@ -623,11 +629,9 @@ static __always_inline bool rseq_exit_us scoped_user_rw_access(rseq, efault) { unsafe_get_user(csaddr, &rseq->rseq_cs, efault); - /* Open coded, so it's in the same user access region */ - if (rseq_slice_extension_enabled()) { - /* Unconditionally clear it, no point in conditionals */ - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); - } + /* RSEQ ABI V2 only operations */ + if (rseq_optimized(t)) + rseq_slice_clear_user(rseq, efault); } rseq_slice_clear_grant(t); --- a/include/linux/rseq_types.h +++ b/include/linux/rseq_types.h @@ -18,6 +18,7 @@ struct rseq; * @ids_changed: Indicator that IDs need to be updated * @user_irq: True on interrupt entry from user mode * @has_rseq: True if the task has a rseq pointer installed + * @optimized: RSEQ ABI V2 optimized mode * @error: Compound error code for the slow path to analyze * @fatal: User space data corrupted or invalid * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME @@ -41,7 +42,7 @@ struct rseq_event { }; u8 has_rseq; - u8 __pad; + u8 optimized; union { u16 error; struct { --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -258,11 +258,15 @@ static bool rseq_handle_cs(struct task_s static void rseq_slowpath_update_usr(struct pt_regs *regs) { /* - * Preserve rseq state and user_irq state. The generic entry code - * clears user_irq on the way out, the non-generic entry - * architectures are not having user_irq. - */ - const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, }; + * Preserve has_rseq, optimized and user_irq state. The generic entry + * code clears user_irq on the way out, the non-generic entry + * architectures are not setting user_irq. 
+ */ + const struct rseq_event evt_mask = { + .has_rseq = true, + .user_irq = true, + .optimized = true, + }; struct task_struct *t = current; struct rseq_ids ids; u32 node_id; @@ -335,8 +339,9 @@ void __rseq_handle_slowpath(struct pt_re void __rseq_signal_deliver(int sig, struct pt_regs *regs) { rseq_stat_inc(rseq_stats.signal); + /* - * Don't update IDs, they are handled on exit to user if + * Don't update IDs yet, they are handled on exit to user if * necessary. The important thing is to abort a critical section of * the interrupted context as after this point the instruction * pointer in @regs points to the signal handler. @@ -349,6 +354,13 @@ void __rseq_signal_deliver(int sig, stru current->rseq.event.error = 0; force_sigsegv(sig); } + + /* + * In legacy mode, force the update of IDs before returning to user + * space to stay compatible. + */ + if (!rseq_optimized(current)) + rseq_force_update(); } /* @@ -404,66 +416,19 @@ static bool rseq_reset_ids(void) /* The original rseq structure size (including padding) is 32 bytes. */ #define ORIG_RSEQ_SIZE 32 -/* - * sys_rseq - setup restartable sequences for caller thread. - */ -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) +static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) { + bool optimized = IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE; u32 rseqfl = 0; - if (flags & RSEQ_FLAG_UNREGISTER) { - if (flags & ~RSEQ_FLAG_UNREGISTER) - return -EINVAL; - /* Unregister rseq for current thread. 
*/ - if (current->rseq.usrptr != rseq || !current->rseq.usrptr) - return -EINVAL; - if (rseq_len != current->rseq.len) - return -EINVAL; - if (current->rseq.sig != sig) - return -EPERM; - if (!rseq_reset_ids()) - return -EFAULT; - rseq_reset(current); - return 0; - } - - if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))) - return -EINVAL; - - if (current->rseq.usrptr) { - /* - * If rseq is already registered, check whether - * the provided address differs from the prior - * one. - */ - if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) - return -EINVAL; - if (current->rseq.sig != sig) - return -EPERM; - /* Already registered. */ - return -EBUSY; - } - - /* - * If there was no rseq previously registered, ensure the provided rseq - * is properly aligned, as communcated to user-space through the ELF - * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq - * size, the required alignment is the original struct rseq alignment. - * - * The rseq_len is required to be greater or equal to the original rseq - * size. In order to be valid, rseq_len is either the original rseq size, - * or large enough to contain all supported fields, as communicated to - * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. - */ - if (rseq_len < ORIG_RSEQ_SIZE || - (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) || - (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) || - rseq_len < offsetof(struct rseq, end)))) - return -EINVAL; if (!access_ok(rseq, rseq_len)) return -EFAULT; - if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) { + /* + * The optimized check disables time slice extensions for legacy + * registrations. 
+ */ + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && optimized) { rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE; if (rseq_slice_extension_enabled() && (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)) @@ -485,7 +450,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault); unsafe_put_user(0U, &rseq->node_id, efault); unsafe_put_user(0U, &rseq->mm_cid, efault); - unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); + + /* + * All fields past mm_cid are only valid for non-legacy registrations + * which register with rseq_len > ORIG_RSEQ_SIZE. + */ + if (optimized) { + if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) + unsafe_put_user(0U, &rseq->slice_ctrl.all, efault); + } } /* @@ -501,11 +474,11 @@ SYSCALL_DEFINE4(rseq, struct rseq __user #endif /* - * If rseq was previously inactive, and has just been - * registered, ensure the cpu_id_start and cpu_id fields - * are updated before returning to user-space. + * Ensure the cpu_id_start and cpu_id fields are updated before + * returning to user-space. */ current->rseq.event.has_rseq = true; + current->rseq.event.optimized = optimized; rseq_force_update(); return 0; @@ -513,6 +486,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user return -EFAULT; } +static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig) +{ + if (flags & ~RSEQ_FLAG_UNREGISTER) + return -EINVAL; + if (current->rseq.usrptr != rseq || !current->rseq.usrptr) + return -EINVAL; + if (rseq_len != current->rseq.len) + return -EINVAL; + if (current->rseq.sig != sig) + return -EPERM; + if (!rseq_reset_ids()) + return -EFAULT; + rseq_reset(current); + return 0; +} + +static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig) +{ + /* + * If rseq is already registered, check whether the provided address + * differs from the prior one. 
+ */ + if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len) + return -EINVAL; + if (current->rseq.sig != sig) + return -EPERM; + /* Already registered. */ + return -EBUSY; +} + +static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len) +{ + if (rseq_len < ORIG_RSEQ_SIZE) + return false; + + /* + * Ensure the provided rseq is properly aligned, as communicated to + * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If + * rseq_len is the original rseq size, the required alignment is the + * original struct rseq alignment. + * + * The rseq_len is required to be greater or equal than the original + * rseq size. + * + * In order to be valid, rseq_len is either the original rseq size, or + * large enough to contain all supported fields, as communicated to + * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE. + */ + if (rseq_len < ORIG_RSEQ_SIZE) + return false; + + if (rseq_len == ORIG_RSEQ_SIZE) + return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE); + + return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) && + rseq_len >= offsetof(struct rseq, end); +} + +#define RSEQ_FLAGS_SUPPORTED (RSEQ_FLAG_SLICE_EXT_DEFAULT_ON) + +/* + * sys_rseq - Register or unregister restartable sequences for the caller thread. 
+ */ +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig) +{ + if (flags & RSEQ_FLAG_UNREGISTER) + return rseq_unregister(rseq, rseq_len, flags, sig); + + if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED)) + return -EINVAL; + + if (current->rseq.usrptr) + return rseq_reregister(rseq, rseq_len, sig); + + if (!rseq_length_valid(rseq, rseq_len)) + return -EINVAL; + + return rseq_register(rseq, rseq_len, flags, sig); +} + #ifdef CONFIG_RSEQ_SLICE_EXTENSION struct slice_timer { struct hrtimer timer; @@ -713,6 +766,8 @@ int rseq_slice_extension_prctl(unsigned return -ENOTSUPP; if (!current->rseq.usrptr) return -ENXIO; + if (!current->rseq.event.optimized) + return -ENOTSUPP; /* No change? */ if (enable == !!current->rseq.slice.state.enabled) --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -199,7 +199,16 @@ static void ipi_rseq(void *info) * is negligible. */ smp_mb(); - rseq_sched_switch_event(current); + /* + * Legacy mode requires that IDs are written and the critical section is + * evaluated. Optimized mode handles the critical section and IDs are + * only updated if they change as a consequence of preemption after + * return from this IPI. + */ + if (rseq_optimized(current)) + rseq_sched_switch_event(current); + else + rseq_force_update(); } static void ipi_sync_rq_state(void *info) --- a/Documentation/userspace-api/rseq.rst +++ b/Documentation/userspace-api/rseq.rst @@ -24,6 +24,80 @@ Quick access to CPU number, node ID Allows to implement per CPU data efficiently. Documentation is in code and selftests. :( +Optimized RSEQ V2 +----------------- + +On architectures which utilize the generic entry code and generic TIF bits +the kernel supports runtime optimizations for RSEQ, which also enable +enhanced features like scheduler time slice extensions. + +To enable them a task has to register the RSEQ region with at least the +length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). 
+ +If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel +keeps the legacy low performance mode enabled to fulfil the expectations +existing users regarding the original RSEQ implementation behaviour. + +The following table documents the ABI and behavioral guarantees of the +legacy and the optimized V2 mode. + +.. list-table:: RSEQ modes + :header-rows: 1 + + * - Nr + - What + - Legacy + - Optimized V2 + * - 1 + - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read + only) + - Updated by the kernel unconditionally after each context switch and + before signal delivery + - Updated by the kernel if and only if they change, i.e. if the task + is migrated or mm_cid changes + * - 2 + - The rseq_cs critical section field + - Evaluated and handled unconditionally after each context switch and + before signal delivery + - Evaluated and handled conditionally only when user space was + interrupted. Either after being preempted or before signal delivery + in the interrupted context. + * - 3 + - Read only fields + - No strict enforcement except in debug mode + - Strict enforcement + * - 4 + - membarrier(...RSEQ) + - All running threads of the process are interrupted and the ID fields + are rewritten and eventually active critical sections are aborted + before they return to user space. All threads which are scheduled + out whether voluntary or not are covered by #1/#2 above. + - All running threads of the process are interrupted and eventually + active critical sections are aborted before these threads return to + user space. The ID fields are only updated if changed as a + consequence of the interrupt. All threads which are scheduled out + whether voluntary not are covered by #1/#2 above. + * - 5 + - Time slice extensions + - Not supported + - Supported + +The legacy mode is obviously less performant as it does unconditional +updates and critical section checks even if not strictly required by the +ABI contract. 
That can't be changed anymore as some users depend on that +observed behavior, which in turn enables them to violate the ABI and +overwrite the cpu_id_start field for their own purposes. This is obviously +discouraged as it renders RSEQ incompatible with the intended usage and +breaks the expectation of other libraries in the same application. + +The ABI compliant optimized mode, which respects the read only fields, does +not require unconditional updates and therefore is way more performant. The +kernel validates the read only fields for compliance. If user space +modifies them, the process is killed. Compliant usage allows multiple +libraries in the same application to benefit from the RSEQ functionality +without disturbing each other. + + Scheduler time slice extensions ------------------------------- @@ -37,7 +111,8 @@ scheduled out inside of the critical sec * Enabled at boot time (default is enabled) - * A rseq userspace pointer has been registered for the thread + * A rseq userspace pointer has been registered for the thread in + optimized V2 mode The thread has to enable the functionality via prctl(2):: ^ permalink raw reply [flat|nested] 41+ messages in thread
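Editorial sketch of the mode split described in the documentation hunk above. It is a user-space model, not kernel code, and it assumes the only input is the registered length compared against the advertised feature size. The names `rseq_model_mode` and `rseq_model_slice_ext_allowed` are hypothetical; only `RSEQ_ORIG_SIZE` (32 bytes) comes from the patch text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Original rseq area size, as stated in the documentation hunk. */
#define RSEQ_ORIG_SIZE 32u

/* Hypothetical names for the two modes in the "RSEQ modes" table. */
enum rseq_mode_model {
	RSEQ_MODEL_LEGACY,
	RSEQ_MODEL_OPTIMIZED,
};

/*
 * Model only: a registration covering at least the advertised feature
 * size is treated as optimized V2, anything shorter stays legacy.
 */
static enum rseq_mode_model rseq_model_mode(uint32_t rseq_len,
					    uint32_t feature_size)
{
	assert(rseq_len >= RSEQ_ORIG_SIZE);
	return rseq_len >= feature_size ? RSEQ_MODEL_OPTIMIZED
					: RSEQ_MODEL_LEGACY;
}

/* Row 5 of the table: time slice extensions need the optimized mode. */
static bool rseq_model_slice_ext_allowed(enum rseq_mode_model mode)
{
	return mode == RSEQ_MODEL_OPTIMIZED;
}
```

With the feature size of 33 bytes mentioned later in the thread, a 32 byte registration lands in legacy mode and a 64 byte registration in optimized mode under this model.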
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-26 22:04 ` Thomas Gleixner @ 2026-04-27 7:40 ` Florian Weimer 0 siblings, 0 replies; 41+ messages in thread From: Florian Weimer @ 2026-04-27 7:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman, Linus Torvalds, criu * Thomas Gleixner: > The real question is how to differentiate between the legacy and the > optimized mode. I have two working variants to achieve that: > > 1) The fully safe option requires a new flag for RSEQ > registration. It obviously requires a glibc update. (Suggested by > PeterZ) Without glibc changes, RSEQ would keep working, but with the old, problematic performance, right? If we don't have a notification in the auxiliary vector, we'd have to do two system calls at process start, which isn't ideal, but is probably not a significant issue, either. I haven't verified this, but it looks like introducing the flag breaks CRIU? In dump_thread_rseq, we have this: if (rseqc.flags != 0) { pr_err("something wrong with ptrace(PTRACE_GET_RSEQ_CONFIGURATION, %d) flags = 0x%x\n", tid, rseqc.flags); return -1; } I suppose a workaround could make this behavior flag a prctl flag. CRIU wouldn't dump and restore that until taught about it. If the new behavior is switched on explicitly by the flag, it would be backwards-compatible, except that restoring with unpatched CRIU would lead to a performance loss. > 2) Determine the requirements of the registering task via the size of > the registered RSEQ area. > > The original implementation, which TCMalloc depends on, registers > a 32 byte region (ORIG_RSEG_SIZE). This region has 32 byte > alignment requirement. 
> > The extension safe newer variant exposes the kernel RSEQ feature > size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment > requirement via getauxval(AT_RSEQ_ALIGN). The alignment > requirement is that the registered rseq region is aligned to the > next power of two of the feature size. The kernel currently has a > feature size of 33 bytes, which means the alignment requirement is > 64 bytes. There are still glibc builds in use that do not use AT_RSEQ_ALIGN, and instead unconditionally reserve a size of 32. In some builds, the RSEQ area is not aligned to a multiple of 64, which makes glibc indistinguishable from tcmalloc. You could look at the location of the thread pointer relative to the RSEQ area at registration to tell them apart, but that is perhaps too nasty. Switching to the new extensible RSEQ allocation code in older glibc builds is not entirely trivial, and I would prefer not doing that. Registering with a new flag is comparatively simple, and we could backport it, except that it might not be compatible with CRIU. Thanks, Florian ^ permalink raw reply [flat|nested] 41+ messages in thread
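Editorial sketch of the size and alignment arithmetic discussed above: the alignment is the feature size rounded up to the next power of two, so 33 bytes gives 64. The helper names `next_pow2` and `query_rseq_layout` are hypothetical. The fallback to 32 when the auxiliary vector has no entry is an assumption made for illustration, and the numeric `AT_RSEQ_*` values are supplied only when the libc headers lack them.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/auxv.h>

/* Provided by newer headers; defined here only if they are missing. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

/* Round v up to the next power of two (v itself if already one). */
static uint32_t next_pow2(uint32_t v)
{
	uint32_t p = 1;

	assert(v > 0 && v <= (1u << 31));
	while (p < v)
		p <<= 1;
	return p;
}

struct rseq_layout {
	unsigned long feature_size;
	unsigned long align;
};

/*
 * Ask the kernel what it advertises. getauxval() returns 0 when the
 * entry is absent; falling back to the original 32 byte layout in
 * that case is an assumption of this sketch.
 */
static struct rseq_layout query_rseq_layout(void)
{
	struct rseq_layout l;

	l.feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
	l.align = getauxval(AT_RSEQ_ALIGN);
	if (!l.feature_size)
		l.feature_size = 32;
	if (!l.align)
		l.align = 32;
	return l;
}
```

A caller could compare `query_rseq_layout().align` against `next_pow2()` of the feature size to see whether the running kernel follows the rule quoted in the message.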
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 5:53 ` Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner @ 2026-04-23 12:11 ` Alejandro Colomar 2026-04-23 12:54 ` Mathieu Desnoyers 2026-04-23 12:29 ` Mathieu Desnoyers 2 siblings, 1 reply; 41+ messages in thread From: Alejandro Colomar @ 2026-04-23 12:11 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson [-- Attachment #1: Type: text/plain, Size: 4335 bytes --] Hello Dmitry, On 2026-04-23T07:53:55+0200, Dmitry Vyukov wrote: > On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote: > > > > On 4/23/2026 3:47 AM, Thomas Gleixner wrote: > > > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote: > > >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote: > > >> Conceptually we just need to use syscall_enter_from_user_mode() and > > >> irqentry_enter_from_user_mode() appropriately. > > > > > > Right. I figured that out. > > > > > >> In practice, I can't use those as-is without introducing the exception > > >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(), > > >> so I'll need to do some similar refactoring first. > > > > > > See below. > > > > > >> I haven't paged everything in yet, so just to check, is there anything > > >> that would behave incorrectly if current->rseq.event.user_irq were set > > >> for syscall entry? IIUC it means we'll effectively do the slow path, and > > >> I was wondering if that might be acceptable as a one-line bodge for > > >> stable. > > > > > > It might work, but it's trivial enough to avoid that. See below. That on > > > top of 6.19.y makes the selftests pass too.
> > > > This aligns with my thoughts when convert arm64 to generic syscall > > entry. Currently, the arm64 entry code does not distinguish between IRQ > > and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ > > entries as the generic entry framework does, because arm64 uses > > enter_from_user_mode() exclusively instead of > > irqentry_enter_from_user_mode(). > > > > https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/ > > > > > > > > Thanks, > > > > > > tglx > > > --- > > > arch/arm64/kernel/entry-common.c | 14 ++++++++++---- > > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > > > --- a/arch/arm64/kernel/entry-common.c > > > +++ b/arch/arm64/kernel/entry-common.c > > > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode( > > > irqentry_exit(regs, state); > > > } > > > > > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs) > > > +{ > > > + enter_from_user_mode(regs); > > > + mte_disable_tco_entry(current); > > > +} > > > + > > > /* > > > * Handle IRQ/context state management when entering from user mode. 
> > > * Before this function is called it is not safe to call regular kernel code, > > > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode( > > > */ > > > static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) > > > { > > > - enter_from_user_mode(regs); > > > - mte_disable_tco_entry(current); > > > + arm64_enter_from_user_mode_syscall(regs); > > > + rseq_note_user_irq_entry(); > > > } > > > > > > /* > > > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_ > > > > > > static void noinstr el0_svc(struct pt_regs *regs) > > > { > > > - arm64_enter_from_user_mode(regs); > > > + arm64_enter_from_user_mode_syscall(regs); > > > cortex_a76_erratum_1463225_svc_handler(); > > > fpsimd_syscall_enter(); > > > local_daif_restore(DAIF_PROCCTX); > > > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r > > > > > > static void noinstr el0_svc_compat(struct pt_regs *regs) > > > { > > > - arm64_enter_from_user_mode(regs); > > > + arm64_enter_from_user_mode_syscall(regs); > > > cortex_a76_erratum_1463225_svc_handler(); > > > local_daif_restore(DAIF_PROCCTX); > > > do_el0_svc_compat(regs); > > > +linux-man > > This part of the rseq man page needs to be fixed as well I think. The > kernel no longer reliably provides clearing of rseq_cs on preemption, > right? > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 +Michael Jeanson That page seems to be maintained separately, as part of the librseq project. Have a lovely day! Alex > > "and set to NULL by the kernel when it restarts an assembly > instruction sequence block, > as well as when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs." > -- <https://www.alejandro-colomar.es> [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:11 ` Alejandro Colomar @ 2026-04-23 12:54 ` Mathieu Desnoyers 0 siblings, 0 replies; 41+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:54 UTC (permalink / raw) To: Alejandro Colomar, Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On 2026-04-23 08:11, Alejandro Colomar wrote: [...] >> >> +linux-man >> >> This part of the rseq man page needs to be fixed as well I think. The >> kernel no longer reliably provides clearing of rseq_cs on preemption, >> right? >> >> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > > +Michael Jeanson > > That page seems to be maintained separately, as part of the librseq > project. Yes, I maintain the librseq project, thanks Alejandro! Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 5:53 ` Dmitry Vyukov 2026-04-23 10:39 ` Thomas Gleixner 2026-04-23 12:11 ` Alejandro Colomar @ 2026-04-23 12:29 ` Mathieu Desnoyers 2026-04-23 12:36 ` Dmitry Vyukov 2 siblings, 1 reply; 41+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:29 UTC (permalink / raw) To: Dmitry Vyukov, Jinjie Ruan, linux-man Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On 2026-04-23 01:53, Dmitry Vyukov wrote: [...] > +linux-man > > This part of the rseq man page needs to be fixed as well I think. The > kernel no longer reliably provides clearing of rseq_cs on preemption, > right? > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 I'm maintaining this manual page in librseq. > > "and set to NULL by the kernel when it restarts an assembly > instruction sequence block, > as well as when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs." I think you got two things confused here. 1) There is currently a bug on arm64 where it fails to honor the rseq ABI contract wrt critical section abort. AFAIU there is a fix proposed for this. 2) Thomas relaxed the implementation of cpu_id_start field updates so it only stores to the rseq area when the current cpu actually changes (migration). So AFAIU the statement in the man page is still fine. It's just arm64 that needs fixing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:29 ` Mathieu Desnoyers @ 2026-04-23 12:36 ` Dmitry Vyukov 2026-04-23 12:53 ` Mathieu Desnoyers 0 siblings, 1 reply; 41+ messages in thread From: Dmitry Vyukov @ 2026-04-23 12:36 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2026-04-23 01:53, Dmitry Vyukov wrote: > [...] > > +linux-man > > > > This part of the rseq man page needs to be fixed as well I think. The > > kernel no longer reliably provides clearing of rseq_cs on preemption, > > right? > > > > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > > I'm maintaining this manual page in librseq. > > > > > "and set to NULL by the kernel when it restarts an assembly > > instruction sequence block, > > as well as when the kernel detects that it is preempting or delivering > > a signal outside of the range targeted by the rseq_cs." > > I think you got two things confused here. > > 1) There is currently a bug on arm64 where it fails to honor the > rseq ABI contract wrt critical section abort. AFAIU there is a > fix proposed for this. > > 2) Thomas relaxed the implementation of cpu_id_start field updates > so it only stores to the rseq area when the current cpu actually > changes (migration). > > So AFAIU the statement in the man page is still fine. It's just arm64 > that needs fixing. My understanding was that due to the ev->user_irq check here: +static __always_inline void rseq_sched_switch_event(struct task_struct *t) ... 
+ bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; + + if (raise) { + ev->sched_switch = true; + rseq_raise_notify_resume(t); + } There won't be any rseq-related processing for threads preempted in syscalls, which means that rseq_cs won't be NULLed for threads preempted inside of syscalls. ^ permalink raw reply [flat|nested] 41+ messages in thread
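Editorial sketch of the predicate quoted above, as a stand-alone user-space model. `struct rseq_event_model` and `model_sched_switch_event` are hypothetical stand-ins that mirror only the fields named in the quoted hunk. The call to `rseq_raise_notify_resume()` is not modeled beyond recording the `sched_switch` flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the event bits referenced in the quoted hunk. */
struct rseq_event_model {
	bool user_irq;     /* entered the kernel via interrupt from user mode */
	bool ids_changed;  /* CPU or mm_cid changed, e.g. after migration */
	bool has_rseq;     /* task has a registered rseq area */
	bool sched_switch; /* set when rseq work is queued for return to user */
};

/*
 * Same expression as in the quoted code: rseq work is raised only if
 * the task has rseq registered and either user space was interrupted
 * or the IDs changed. Returns whether work was raised.
 */
static bool model_sched_switch_event(struct rseq_event_model *ev)
{
	bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;

	if (raise)
		ev->sched_switch = true;
	return raise;
}
```

In this model a task preempted inside a syscall without migrating has `user_irq` and `ids_changed` both clear, so nothing is raised. That is the case the message is concerned with.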
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:36 ` Dmitry Vyukov @ 2026-04-23 12:53 ` Mathieu Desnoyers 2026-04-23 12:58 ` Dmitry Vyukov 0 siblings, 1 reply; 41+ messages in thread From: Mathieu Desnoyers @ 2026-04-23 12:53 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On 2026-04-23 08:36, Dmitry Vyukov wrote: > On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2026-04-23 01:53, Dmitry Vyukov wrote: >> [...] >>> +linux-man >>> >>> This part of the rseq man page needs to be fixed as well I think. The >>> kernel no longer reliably provides clearing of rseq_cs on preemption, >>> right? >>> >>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 >> >> I'm maintaining this manual page in librseq. >> >>> >>> "and set to NULL by the kernel when it restarts an assembly >>> instruction sequence block, >>> as well as when the kernel detects that it is preempting or delivering >>> a signal outside of the range targeted by the rseq_cs." >> >> I think you got two things confused here. >> >> 1) There is currently a bug on arm64 where it fails to honor the >> rseq ABI contract wrt critical section abort. AFAIU there is a >> fix proposed for this. >> >> 2) Thomas relaxed the implementation of cpu_id_start field updates >> so it only stores to the rseq area when the current cpu actually >> changes (migration). >> >> So AFAIU the statement in the man page is still fine. It's just arm64 >> that needs fixing. > > > My understanding was that due to the ev->user_irq check here: > > +static __always_inline void rseq_sched_switch_event(struct task_struct *t) > ... 
> + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; > + > + if (raise) { > + ev->sched_switch = true; > + rseq_raise_notify_resume(t); > + } > > There won't be any rseq-related processing for threads preempted in > syscalls, which means that rseq_cs won't be NULLed for threads > preempted inside of syscalls. Let's see if I understand your concern correctly. Scenario: A thread is within a rseq critical section. It exits the critical section without clearing the rseq_cs pointer, expecting the kernel to lazily clear the rseq_cs pointer eventually when it detects that it's not nested on top of the userspace critical section anymore. It then calls a system call _outside_ of the rseq critical section, but with rseq_cs pointer set. Based on the rseq man page wording, it would then expect the preemption within the system call to guarantee clearing that pointer. Here is the relevant comment block in the man page: Updated by user-space, which sets the address of the currently active rseq_cs at the beginning of assembly instruction sequence block, and set to NULL by the kernel when it restarts an assembly instruction sequence block, as well as >>>>>>>>> when the kernel detects that it is preempting or delivering a signal outside of the range targeted by the rseq_cs. >>>>>>>>> ^^^ this The whole point about lazy-clearing of rseq_cs is that it _may_ happen when the kernel preempts or delivers a signal (or at any point really), but it's just an optimization. Updating the manual page with this wording would match the intent: Updated by user-space, which sets the address of the currently active rseq_cs at the beginning of assembly instruction sequence block, and set to NULL by the kernel when it restarts an assembly instruction sequence block. May be set to NULL by the kernel when it detects that the current instruction pointer is outside of the range targeted by the rseq_cs.
Also needs to be set to NULL by user-space before reclaiming memory that contains the targeted struct rseq_cs. Thoughts? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ^ permalink raw reply [flat|nested] 41+ messages in thread
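Editorial sketch of the last sentence of the proposed wording: user space clears the rseq_cs pointer itself before freeing the descriptor it points to, instead of relying on the kernel to clear it lazily. The types `rseq_area_model` and `rseq_cs_model` and the function `release_cs_descriptor` are hypothetical stand-ins. A real implementation would operate on the registered `struct rseq` of the current thread and would have to consider ordering against its own critical sections.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for the per-thread rseq area; only the pointer field. */
struct rseq_area_model {
	volatile uint64_t rseq_cs;
};

/* Stand-in for a critical section descriptor in heap memory. */
struct rseq_cs_model {
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/*
 * Clear the pointer if it still references the descriptor, then
 * reclaim the memory. After this, nothing reading rseq_cs can
 * dereference freed memory.
 */
static void release_cs_descriptor(struct rseq_area_model *area,
				  struct rseq_cs_model *cs)
{
	if (area->rseq_cs == (uint64_t)(uintptr_t)cs)
		area->rseq_cs = 0;
	free(cs);
}
```

The check before the store leaves an unrelated descriptor in place if the thread has since installed a different one.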
* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere 2026-04-23 12:53 ` Mathieu Desnoyers @ 2026-04-23 12:58 ` Dmitry Vyukov 0 siblings, 0 replies; 41+ messages in thread From: Dmitry Vyukov @ 2026-04-23 12:58 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler, Michael Jeanson On Thu, 23 Apr 2026 at 14:53, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2026-04-23 08:36, Dmitry Vyukov wrote: > > On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2026-04-23 01:53, Dmitry Vyukov wrote: > >> [...] > >>> +linux-man > >>> > >>> This part of the rseq man page needs to be fixed as well I think. The > >>> kernel no longer reliably provides clearing of rseq_cs on preemption, > >>> right? > >>> > >>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241 > >> > >> I'm maintaining this manual page in librseq. > >> > >>> > >>> "and set to NULL by the kernel when it restarts an assembly > >>> instruction sequence block, > >>> as well as when the kernel detects that it is preempting or delivering > >>> a signal outside of the range targeted by the rseq_cs." > >> > >> I think you got two things confused here. > >> > >> 1) There is currently a bug on arm64 where it fails to honor the > >> rseq ABI contract wrt critical section abort. AFAIU there is a > >> fix proposed for this. > >> > >> 2) Thomas relaxed the implementation of cpu_id_start field updates > >> so it only stores to the rseq area when the current cpu actually > >> changes (migration). > >> > >> So AFAIU the statement in the man page is still fine. It's just arm64 > >> that needs fixing. 
> > > > > > My understanding was that due to the ev->user_irq check here: > > > > +static __always_inline void rseq_sched_switch_event(struct task_struct *t) > > ... > > + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq; > > + > > + if (raise) { > > + ev->sched_switch = true; > > + rseq_raise_notify_resume(t); > > + } > > > > There won't be any rseq-related processing for threads preempted in > > syscalls, which means that rseq_cs won't be NULLed for threads > > preempted inside of syscalls. > > Let's see if I understand your concern correctly. Scenario: > > A thread is within a rseq critical section. It exits the critical > section without clearing the rseq_cs pointer, expecting the kernel > to lazily clear the rseq_cs pointer eventually when it detects that > it's not nested on top of the userspace critical section anymore. > It then calls a system call _outside_ of the rseq critical section, > but with rseq_cs pointer set. Based on the rseq man page wording, > it would then expect the preemption within the system call to guarantee > clearing that that pointer. Yes, this is the scenario I had in mind. > Here is the relevant comment block in the man page: > > Updated by user-space, which sets the address of the cur‐ > rently active rseq_cs at the beginning of assembly instruc‐ > tion sequence block, and set to NULL by the kernel when it > restarts an assembly instruction sequence block, as well as > >>>>>>>>> > when the kernel detects that it is preempting or delivering > a signal outside of the range targeted by the rseq_cs. > >>>>>>>>> > ^^^ this > > The whole point about lazy-clearing of rseq_cs is that it _may_ happen when > the kernel preempts or delivers a signal (or at any point really), but it's > just an optimization. 
> > Updating the manual page with this wording would match the intent: > > Updated by user-space, which sets the address of the cur‐ > rently active rseq_cs at the beginning of assembly instruc‐ > tion sequence block, and set to NULL by the kernel when it > restarts an assembly instruction sequence block. May be set > to NULL by the kernel when it detects that the current > instruction pointer is outside of the range targeted by > the rseq_cs. > Also needs to be set to NULL by user-space before reclaim‐ > ing memory that contains the targeted struct rseq_cs. > > Thoughts ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com ^ permalink raw reply [flat|nested] 41+ messages in thread
* [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) and tcmalloc everywhere [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com> 2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra 2026-04-22 13:09 ` Mark Rutland @ 2026-04-24 16:45 ` Mark Rutland 2 siblings, 0 replies; 41+ messages in thread From: Mark Rutland @ 2026-04-24 16:45 UTC (permalink / raw) To: Mathias Stearn, Linus Torvalds, Catalin Marinas, Will Deacon, Thomas Gleixner, Mathieu Desnoyers, Peter Zijlstra Cc: Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel, Ingo Molnar, Jinjie Ruan, Blake Oler Patch for the arm64-specific issue below. This doesn't fix the generic cpu_id_start issue, but it brings arm64 into line with everyone else, and it's the shape we'll need going forwards for other stuff anyway. I've given it light testing with Mathias's reproducer and the kselftests, which all pass. I've also pushed it to my arm64/rseq branch: https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/rseq Mark. ---->8---- From 79b65cbbfa20aa2cb0bc248591fab5459cdc101b Mon Sep 17 00:00:00 2001 From: Mark Rutland <mark.rutland@arm.com> Date: Thu, 23 Apr 2026 16:51:12 +0100 Subject: [PATCH] arm64/entry: Fix arm64-specific rseq brokenness Mathias Stearn reports that since v6.19, there are two big issues affecting rseq: (1) On arm64 specifically, rseq critical sections aren't aborted when they should be. (2) The 'cpu_id_start' field is no longer written by the kernel in all cases it used to be, including some cases where TCMalloc depends on the kernel clobbering the field. This patch fixes issue #1. This patch DOES NOT fix issue #2, which will need to be addressed by other patches. 
The arm64-specific brokenness is a result of commits: 2fc0e4b4126c ("rseq: Record interrupt from user space") 39a167560a61 ("rseq: Optimize event setting") The first commit failed to add a call to rseq_note_user_irq_entry() on arm64. Thus arm64 never sets rseq_event::user_irq to record that it may be necessary to abort an active rseq critical section upon return to userspace. On its own, this commit had no functional impact as the value of rseq_event::user_irq was not consumed. The second commit relied upon rseq_event::user_irq to determine whether or not to bother to perform rseq work when returning to userspace. As rseq_event::user_irq wasn't set on arm64, this work would be skipped, and consequently an active rseq critical section would not be aborted. Fix this by giving arm64 syscall-specific entry/exit paths, and performing the relevant logic in syscall and non-syscall paths, including calling rseq_note_user_irq_entry() for non-syscall entry. Currently arm64 cannot use syscall_enter_from_user_mode(), syscall_exit_to_user_mode(), and irqentry_exit_to_user_mode(), due to ordering constraints with exception masking, and risk of ABI breakage for syscall tracing/audit/etc. For the moment the entry/exit logic is left as arm64-specific, but mirroring the generic code. I intend to follow up with refactoring/cleanup, as we did for kernel mode entry paths in commit: 041aa7a85390 ("entry: Split preemption from irqentry_exit_to_kernel_mode()") ... which will allow arm64 to use the GENERIC_IRQ_ENTRY functions directly. 
Fixes: 39a167560a61 ("rseq: Optimize event setting") Reported-by: Mathias Stearn <mathias@mongodb.com> Link: https://lore.kernel.org/regressions/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com/ Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> --- arch/arm64/kernel/entry-common.c | 29 ++++++++++++++++++++++------- include/linux/irq-entry-common.h | 8 -------- include/linux/rseq_entry.h | 19 ------------------- 3 files changed, 22 insertions(+), 34 deletions(-) diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c index cb54335465f66..65ade1f1544f6 100644 --- a/arch/arm64/kernel/entry-common.c +++ b/arch/arm64/kernel/entry-common.c @@ -62,6 +62,12 @@ static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs, irqentry_exit_to_kernel_mode_after_preempt(regs, state); } +static __always_inline void arm64_syscall_enter_from_user_mode(struct pt_regs *regs) +{ + enter_from_user_mode(regs); + mte_disable_tco_entry(current); +} + /* * Handle IRQ/context state management when entering from user mode. 
* Before this function is called it is not safe to call regular kernel code, @@ -70,20 +76,29 @@ static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs, static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs) { enter_from_user_mode(regs); + rseq_note_user_irq_entry(); mte_disable_tco_entry(current); sme_enter_from_user_mode(); } +static __always_inline void arm64_syscall_exit_to_user_mode(struct pt_regs *regs) +{ + local_irq_disable(); + syscall_exit_to_user_mode_prepare(regs); + local_daif_mask(); + mte_check_tfsr_exit(); + exit_to_user_mode(); +} + /* * Handle IRQ/context state management when exiting to user mode. * After this function returns it is not safe to call regular kernel code, * instrumentable code, or any code which may trigger an exception. */ - static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs) { local_irq_disable(); - exit_to_user_mode_prepare_legacy(regs); + irqentry_exit_to_user_mode_prepare(regs); local_daif_mask(); sme_exit_to_user_mode(); mte_check_tfsr_exit(); @@ -92,7 +107,7 @@ static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs) asmlinkage void noinstr asm_exit_to_user_mode(struct pt_regs *regs) { - arm64_exit_to_user_mode(regs); + arm64_syscall_exit_to_user_mode(regs); } /* @@ -716,12 +731,12 @@ static void noinstr el0_brk64(struct pt_regs *regs, unsigned long esr) static void noinstr el0_svc(struct pt_regs *regs) { - arm64_enter_from_user_mode(regs); + arm64_syscall_enter_from_user_mode(regs); cortex_a76_erratum_1463225_svc_handler(); fpsimd_syscall_enter(); local_daif_restore(DAIF_PROCCTX); do_el0_svc(regs); - arm64_exit_to_user_mode(regs); + arm64_syscall_exit_to_user_mode(regs); fpsimd_syscall_exit(); } @@ -868,11 +883,11 @@ static void noinstr el0_cp15(struct pt_regs *regs, unsigned long esr) static void noinstr el0_svc_compat(struct pt_regs *regs) { - arm64_enter_from_user_mode(regs); + arm64_syscall_enter_from_user_mode(regs); 
cortex_a76_erratum_1463225_svc_handler(); local_daif_restore(DAIF_PROCCTX); do_el0_svc_compat(regs); - arm64_exit_to_user_mode(regs); + arm64_syscall_exit_to_user_mode(regs); } static void noinstr el0_bkpt32(struct pt_regs *regs, unsigned long esr) diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h index 167fba7dbf043..1fabf0f5ea8e7 100644 --- a/include/linux/irq-entry-common.h +++ b/include/linux/irq-entry-common.h @@ -218,14 +218,6 @@ static __always_inline void __exit_to_user_mode_validate(void) lockdep_sys_exit(); } -/* Temporary workaround to keep ARM64 alive */ -static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs) -{ - __exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK); - rseq_exit_to_user_mode_legacy(); - __exit_to_user_mode_validate(); -} - /** * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required * @regs: Pointer to pt_regs on entry stack diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h index f11ebd34f8b95..a3762410c4ab6 100644 --- a/include/linux/rseq_entry.h +++ b/include/linux/rseq_entry.h @@ -753,24 +753,6 @@ static __always_inline void rseq_irqentry_exit_to_user_mode(void) ev->events = 0; } -/* Required to keep ARM64 working */ -static __always_inline void rseq_exit_to_user_mode_legacy(void) -{ - struct rseq_event *ev = &current->rseq.event; - - rseq_stat_inc(rseq_stats.exit); - - if (static_branch_unlikely(&rseq_debug_enabled)) - WARN_ON_ONCE(ev->sched_switch); - - /* - * Ensure that event (especially user_irq) is cleared when the - * interrupt did not result in a schedule and therefore the - * rseq processing did not clear it.
- */ - ev->events = 0; -} - void __rseq_debug_syscall_return(struct pt_regs *regs); static __always_inline void rseq_debug_syscall_return(struct pt_regs *regs) @@ -786,7 +768,6 @@ static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned } static inline void rseq_syscall_exit_to_user_mode(void) { } static inline void rseq_irqentry_exit_to_user_mode(void) { } -static inline void rseq_exit_to_user_mode_legacy(void) { } static inline void rseq_debug_syscall_return(struct pt_regs *regs) { } static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; } #endif /* !CONFIG_RSEQ */ -- 2.30.2 ^ permalink raw reply related [flat|nested] 41+ messages in thread
end of thread, other threads:[~2026-04-27 7:41 UTC | newest]
Thread overview: 41+ messages
[not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com>
2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra
2026-04-22 13:13 ` Peter Zijlstra
2026-04-23 10:38 ` Mathias Stearn
[not found] ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>
2026-04-23 11:48 ` Thomas Gleixner
2026-04-23 12:11 ` Mathias Stearn
2026-04-23 17:19 ` Thomas Gleixner
2026-04-23 17:38 ` Chris Kennelly
2026-04-23 17:47 ` Mathieu Desnoyers
2026-04-23 19:39 ` Thomas Gleixner
2026-04-23 17:41 ` Linus Torvalds
2026-04-23 18:35 ` Mathias Stearn
2026-04-23 18:53 ` Mark Rutland
2026-04-23 21:03 ` Thomas Gleixner
2026-04-23 21:28 ` Linus Torvalds
2026-04-23 23:08 ` Linus Torvalds
2026-04-27 7:06 ` Florian Weimer
2026-04-22 13:09 ` Mark Rutland
2026-04-22 17:49 ` Thomas Gleixner
2026-04-22 18:11 ` Mark Rutland
2026-04-22 19:47 ` Thomas Gleixner
2026-04-23 1:48 ` Jinjie Ruan
2026-04-23 5:53 ` Dmitry Vyukov
2026-04-23 10:39 ` Thomas Gleixner
2026-04-23 10:51 ` Mathias Stearn
2026-04-23 12:24 ` David Laight
2026-04-23 19:31 ` Thomas Gleixner
2026-04-24 7:56 ` Dmitry Vyukov
2026-04-24 8:32 ` Mathias Stearn
2026-04-24 9:30 ` Dmitry Vyukov
2026-04-24 14:16 ` Thomas Gleixner
2026-04-24 15:03 ` Peter Zijlstra
2026-04-24 19:44 ` Thomas Gleixner
2026-04-26 22:04 ` Thomas Gleixner
2026-04-27 7:40 ` Florian Weimer
2026-04-23 12:11 ` Alejandro Colomar
2026-04-23 12:54 ` Mathieu Desnoyers
2026-04-23 12:29 ` Mathieu Desnoyers
2026-04-23 12:36 ` Dmitry Vyukov
2026-04-23 12:53 ` Mathieu Desnoyers
2026-04-23 12:58 ` Dmitry Vyukov
2026-04-24 16:45 ` [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) " Mark Rutland