Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere

public inbox for linux-arm-kernel@lists.infradead.org

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
       [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com>
@ 2026-04-22 12:56 ` Peter Zijlstra
  2026-04-22 13:13   ` Peter Zijlstra
  2026-04-22 13:09 ` Mark Rutland
  2026-04-24 16:45 ` [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) " Mark Rutland
  2 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-04-22 12:56 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Mark Rutland, Jinjie Ruan, Blake Oler

On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:

> Additionally, it breaks tcmalloc specifically by failing to overwrite
> the cpu_id_start field at points where it was relied on for
> correctness.

This specific behaviour was documented as being wrong and running with
DEBUG_RSEQ would have flagged it.

The tcmalloc issue has been contentious for a long time. The tcmalloc
folks relied on something that was documented to be wrong. It has been
reported to the tcmalloc people many years ago and if you were to run
tcmalloc on most any kernel (very much including 6.19) with
DEBUG_RSEQ=y, it would have yelled.

The tcmalloc people didn't care. There was a proposal for an RSEQ
extension for what they need, and they didn't care. All this should be
in their bugzilla or whatever.

The RSEQ rework improved performance significantly for everyone, and
kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed
over because they relied on implementation behaviour that was
specifically documented to be broken. And they didn't care. Google was
very much aware of this. And hasn't lifted a finger to remedy it.


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra
@ 2026-04-22 13:13   ` Peter Zijlstra
  2026-04-23 10:38     ` Mathias Stearn
       [not found]     ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Peter Zijlstra @ 2026-04-22 13:13 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Mark Rutland, Jinjie Ruan, Blake Oler

On Wed, Apr 22, 2026 at 02:56:47PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
> 
> > Additionally, it breaks tcmalloc specifically by failing to overwrite
> > the cpu_id_start field at points where it was relied on for
> > correctness.
> 
> This specific behaviour was documented as being wrong and running with
> DEBUG_RSEQ would have flagged it.
> 
> The tcmalloc issue has been contentious for a long time. The tcmalloc
> folks relied on something that was documented to be wrong. It has been
> reported to the tcmalloc people many years ago and if you were to run
> tcmalloc on most any kernel (very much including 6.19) with
> DEBUG_RSEQ=y, it would have yelled.
> 
> The tcmalloc people didn't care. There was a proposal for an RSEQ
> extension for what they need, and they didn't care. All this should be
> in their bugzilla or whatever.
> 
> The RSEQ rework improved performance significantly for everyone, and
> kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed
> over because they relied on implementation behaviour that was
> specifically documented to be broken. And they didn't care. Google was
> very much aware of this. And hasn't lifted a finger to remedy it.

Also: https://lore.kernel.org/all/874io5andc.ffs@tglx/ 



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 13:13   ` Peter Zijlstra
@ 2026-04-23 10:38     ` Mathias Stearn
       [not found]     ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>
  1 sibling, 0 replies; 41+ messages in thread
From: Mathias Stearn @ 2026-04-23 10:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Mark Rutland, Jinjie Ruan, Blake Oler

On Wed, Apr 22, 2026 at 3:13 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Apr 22, 2026 at 02:56:47PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
> >
> > > Additionally, it breaks tcmalloc specifically by failing to overwrite
> > > the cpu_id_start field at points where it was relied on for
> > > correctness.
> >
> > This specific behaviour was documented as being wrong and running with
> > DEBUG_RSEQ would have flagged it.
> >
> > The tcmalloc issue has been contentious for a long time. The tcmalloc
> > folks relied on something that was documented to be wrong. It has been
> > reported to the tcmalloc people many years ago and if you were to run
> > tcmalloc on most any kernel (very much including 6.19) with
> > DEBUG_RSEQ=y, it would have yelled.
> >
> > The tcmalloc people didn't care. There was a proposal for an RSEQ
> > extension for what they need, and they didn't care. All this should be
> > in their bugzilla or whatever.
> >
> > The RSEQ rework improved performance significantly for everyone, and
> > kept all the documented behaviour (+- arm64 bug). Tcmalloc got screwed
> > over because they relied on implementation behaviour that was
> > specifically documented to be broken. And they didn't care. Google was
> > very much aware of this. And hasn't lifted a finger to remedy it.
>
> Also: https://lore.kernel.org/all/874io5andc.ffs@tglx/

(Sorry for the resend to folks who got this already - I got an alert
that it was rejected by the mailinglists because it contained html so
attempting to resend as plain text)

I won't claim that tcmalloc _should_ be abusing cpu_id_start as it is.
I agree that it seems questionable at best. However, I will strongly
disagree with the following comment in that message:

> What it no longer does is updating the
> CPU number for the preemption case on the same CPU
> because that's just a massive waste of CPU cycles.

I don't think it will cost _any_ cycles to implement what I proposed.
And it especially should have no impact from just enabling rseq on a
thread as glibc now does. It should only result in different
instructions being executed when the program actually _uses_ rseq by
setting the rseq_cs variable to a non-null pointer. I will repeat the
proposal with a bit more commentary in case you missed some of the
details that make it free:

Any time a critical section might be aborted (migration, preemption,
signal delivery, and membarrier IPI), the kernel _already_ must check
the rseq_cs field to see if the thread is in a critical section [and
if it is null because the program isn't using rseq critical sections,
no further action is taken]. This is documented as nulling the pointer
after (I assume to make later checks cheaper) [if this changed, then
it *is* a change in _documented behavior_, not just an implementation
detail]. It would be sufficient for tcmalloc's internal usage if every
time the kernel nulled out rseq_cs, it also wrote the cpu id to
cpu_id_start. [This is one additional store to a cacheline you are
already writing to so it should be ~free on modern OoO CPUs and cheap
on others. There might be a small cost to loading the current cpu, but
since nothing depends on that other than the store, I still expect it
to be ~free]

To make this more concrete, I am proposing adding

unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);

after each place where you currently do

unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);

in rseq_update_user_cs. Is that something that you would expect to
cause a performance issue?
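
As a rough user-space model of the idea (a sketch only, not kernel
code): struct rseq_model and model_abort_cs() are hypothetical names,
the struct is a simplified stand-in for the real rseq area, and the
only point is that the extra store happens solely on the path that
already clears rseq_cs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the user-visible rseq area (not the real layout). */
struct rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;	/* non-zero while a critical section is armed */
};

/*
 * Hypothetical model of the abort path with the proposed extra store.
 * Returns 1 if a critical section was armed and has been cleared.
 */
static int model_abort_cs(struct rseq_model *r, uint32_t cur_cpu)
{
	assert(r != NULL);

	if (r->rseq_cs == 0)
		return 0;		/* rseq_cs is null: nothing is written */

	r->rseq_cs = 0;			/* existing behaviour: clear the pointer */
	r->cpu_id_start = cur_cpu;	/* proposed: refresh cpu_id_start too */
	return 1;
}
```

In this model a thread that never arms rseq_cs sees no additional
writes at all, which is the property the proposal relies on.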

Again, I'm not claiming that it is "good" that this needs to be done.
But it does seem like a small price to pay to keep existing binaries
working on new kernels. Quoting the first paragraph of
https://docs.kernel.org/admin-guide/reporting-regressions.html:

> “We don’t cause regressions” is the first rule of Linux kernel development; Linux founder and lead developer Linus Torvalds established it himself and ensures it’s obeyed.

I don't see anything on that page that says it doesn't count as a
regression if the userspace program "relied on implementation
behaviour that was specifically documented to be broken".


[parent not found: <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>]

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
       [not found]     ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>
@ 2026-04-23 11:48       ` Thomas Gleixner
  2026-04-23 12:11         ` Mathias Stearn
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 11:48 UTC (permalink / raw)
  To: Mathias Stearn, Peter Zijlstra
  Cc: Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland,
	Jinjie Ruan, Blake Oler

On Thu, Apr 23 2026 at 11:24, Mathias Stearn wrote:
> On Wed, Apr 22, 2026 at 3:13 PM Peter Zijlstra <peterz@infradead.org> wrote:
> To make this more concrete, I am proposing adding
>
> unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);
>
> after each place where you currently do
>
> unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
>
> in rseq_update_user_cs. Is that something that you would expect to cause a
> performance issue?

That would work and not bring the performance issues back, but:

  1) Did you validate that adding the reset into rseq_update_user_cs() is
     actually sufficient?

     If adding it to rseq_update_user_cs() is not sufficient, then we
     have a really serious problem. Because we'd need to go back and do
     it unconditionally, which then makes the 15% performance
     regression, which happened when glibc enabled rseq, come back
     instantaneously. And in that case the damage for tcmalloc() is the
     lesser of two evils.

  2) The tcmalloc abuse breaks the documented and guaranteed user space
     ABI and therefore it makes it impossible for any other library in
     an application which uses tcmalloc to rely on the documented and
     guaranteed rseq::cpu_id_start/rseq::cpu_id semantics.

     Which means, that tcmalloc is holding everybody else hostage.
     That's just not acceptable. Not even under the no regression rule.

  3) The fact that tcmalloc prevents a user from enabling rseq debugging
     is equally unacceptable as it does not allow me to validate my own
     rseq magic code in my mongodb client because enabling it will make
     the DB I want to test against go away.

     Again tcmalloc holds everybody else hostage for no reason at all.

The most amazing part is that tcmalloc uses this to spare two
instruction cycles, but nobody noticed in 8 years how much performance
the unconditional rseq nonsense in the kernel left on the table.

Thanks,

        tglx



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 11:48       ` Thomas Gleixner
@ 2026-04-23 12:11         ` Mathias Stearn
  2026-04-23 17:19           ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Mathias Stearn @ 2026-04-23 12:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Mark Rutland, Jinjie Ruan, Blake Oler

On Thu, Apr 23, 2026 at 1:48 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> That would work and not bring the performance issues back, but:
>
>   1) Did you validate that adding the reset into rseq_update_user_cs() is
>      actually sufficient?

Not yet, although I confirmed with the tcmalloc maintainers that they
thought it would be sufficient before suggesting it. I'm currently
building your patch from upthread to test that out. I can try this
after, although I don't think I'll be able to get to that today. I'll
try to get a coworker to test it though.

>      Which means, that tcmalloc is holding everybody else hostage.
>      That's just not acceptable. Not even under the no regression rule.

Agree. I don't love the situation either. Or that we need to advise
setting the environment variable to tell glibc not to use rseq. But I
also want our users to be able to use existing mongo binaries on new
kernels.

>   3) The fact that tcmalloc prevents a user from enabling rseq debugging
>      is equally unacceptable as it does not allow me to validate my own
>      rseq magic code in my mongodb client because enabling it will make
>      the DB I want to test against go away.

Glad to hear you use mongodb :)

> The most amazing part is that tcmalloc uses this to spare two
> instruction cycles, but nobody noticed in 8 years how much performance
> the unconditional rseq nonsense in the kernel left on the table.

I am looking into a change to our copy of tcmalloc to have it stop
squatting on cpu_id_start, and will run that through our correctness
and performance tests. I can't promise anything (and I certainly can't
speak for what Google may choose to do), but I share your expectation
that it should be possible with minimal impact. It _is_ more than 2
cycles though, since it extends the load dependency chain by one or
two pointer chases and a bit of ALU ops. I'd guesstimate it will
likely cost on the order of 5-10 cycles per call to malloc or free. I
think we can absorb that, but will need to test.

Of course, even if we make that change, it will only apply to _future_
binaries. That's why we prefer a kernel fix so that users will be able
to run our existing releases (or any containers that use them) on a
modern kernel.


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 12:11         ` Mathias Stearn
@ 2026-04-23 17:19           ` Thomas Gleixner
  2026-04-23 17:38             ` Chris Kennelly
  2026-04-23 17:41             ` Linus Torvalds
  0 siblings, 2 replies; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 17:19 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Peter Zijlstra, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Mark Rutland, Jinjie Ruan, Blake Oler, Linus Torvalds

On Thu, Apr 23 2026 at 14:11, Mathias Stearn wrote:

Cc+ Linus

> Of course, even if we make that change, it will only apply to _future_
> binaries. That's why we prefer a kernel fix so that users will be able
> to run our existing releases (or any containers that use them) on a
> modern kernel.

I understand that and as everyone else I would be happy to do that, but
the price everyone pays for proliferating the tcmalloc insanity is not
cheap either.

So let me recap the whole situation and how we got there:

  1) The original RSEQ implementation updates the rseq::cpu_id_start
     field in user space more or less unconditionally on every exit to
     user, whether the CPU/MMCID have been changed or not.

     That went unnoticed for years because nothing used rseq aside from
     Google and tcmalloc. Once glibc registered rseq, this resulted in a
     performance penalty of up to 15% for syscall-heavy workloads.

  2) The rseq::cpu_id_start field is documented as read only for user
     space in the ABI contract and guaranteed to be updated by the
     kernel when a task is migrated to a different CPU.

  3) The RO for userspace property has been enforced by RSEQ debugging
     mode since day one. If such a debug enabled kernel detects user
     space changing the field it kills the task/application.

  4) tcmalloc abused the suboptimal implementation (see #1) and
     scribbled over rseq::cpu_id_start for their own nefarious purposes.

  5) As a consequence of #4 tcmalloc cannot be used on a RSEQ debug
     enabled kernel. Which means a developer cannot validate his RSEQ
     code against a debug kernel when tcmalloc is in use on the system
     as that would crash the tcmalloc dependent applications due to #3.

  6) As a consequence of #4 tcmalloc cannot be used together with any
     other facility/library which wants to utilize the ABI guaranteed
     properties of rseq::cpu_id_start in the same application.

  7) tcmalloc violates the ABI from day one and has since refused to
     address the problem despite being offered a kernel side rseq
     extension to solve it many years ago.

  8) When addressing the performance issues of RSEQ the unconditional
     update ceased to exist under the valid assumption that the kernel
     only has to satisfy the guaranteed ABI properties, especially when
     they are enforceable by RSEQ debug.

     As a consequence this exposed the tcmalloc ABI violation because
     the unconditional pointless overwriting of something which did not
     change stopped happening.
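
     The difference between #1 and #8 can be modelled in user space
     roughly as follows. This is a sketch under assumptions, not the
     kernel implementation: struct rseq_model and both function names
     are hypothetical, and the real code tracks more state (MMCID etc.)
     than a single CPU number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the two CPU fields of the rseq area. */
struct rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Model of item 1: both fields are stored on every exit to user space. */
static void exit_to_user_unconditional(struct rseq_model *r, uint32_t cpu)
{
	assert(r != NULL);
	r->cpu_id_start = cpu;
	r->cpu_id = cpu;
}

/* Model of item 8: the fields are stored only when the CPU changed. */
static void exit_to_user_on_change(struct rseq_model *r, uint32_t prev_cpu,
				   uint32_t cpu)
{
	assert(r != NULL);
	if (cpu == prev_cpu)
		return;		/* same CPU: no store, user value survives */
	r->cpu_id_start = cpu;
	r->cpu_id = cpu;
}
```

     In the first variant a value scribbled into cpu_id_start by user
     space is repaired on the next exit; in the second it persists
     until the task actually migrates.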

Due to #4 everyone is in a hard place and up a creek without a paddle.

Here are the possible solutions:

  A) Mathias suggested to force overwrite rseq::cpu_id_start every time
     the rseq::rseq_cs field is cleared by the kernel under the not yet
     validated theoretical assumption that this cures the problem for
     tcmalloc.

     If that's sufficient that would be harmless performance wise
     because the write would be inside the already existing STAC/CLAC
     section and just add some more noise to the rseq critical section
     operations.

     That would allow existing tcmalloc usage to continue, but
     obviously would neither solve #5 and #6 above nor provide an
     incentive for tcmalloc to actually fix their crap.

  B) If that's not sufficient then keeping tcmalloc alive would require
     to go back to the previous state and let everyone else pay the
     price in terms of performance overhead.

  C) Declare that this is not a regression because the ABI guarantee is
     not violated and the RO property has been enforceable by RSEQ
     debugging since day one.

In my opinion #C is the right thing to do, but I can see a case being
made for the lightweight fix Mathias suggested (#A) _if_ and only _if_
that is sufficient. Picking #A would also mean that user space people
have to take up the fight against tcmalloc when they want to use the
RSEQ guaranteed ABI along with tcmalloc in the same application or use a
RSEQ debug kernel to validate their own code.

Going back to the full unconditional nightmare (#B) is not an option at
all as everybody else would have to take the massive performance hit.

Oh well...

Thanks,

        tglx



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:19           ` Thomas Gleixner
@ 2026-04-23 17:38             ` Chris Kennelly
  2026-04-23 17:47               ` Mathieu Desnoyers
  2026-04-23 19:39               ` Thomas Gleixner
  2026-04-23 17:41             ` Linus Torvalds
  1 sibling, 2 replies; 41+ messages in thread
From: Chris Kennelly @ 2026-04-23 17:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel,
	Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler,
	Linus Torvalds

On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote:
>
> On Thu, Apr 23 2026 at 14:11, Mathias Stearn wrote:
>
> Cc+ Linus
>
> > Of course, even if we make that change, it will only apply to _future_
> > binaries. That's why we prefer a kernel fix so that users will be able
> > to run our existing releases (or any containers that use them) on a
> > modern kernel.
>
> I understand that and as everyone else I would be happy to do that, but
> the price everyone pays for proliferating the tcmalloc insanity is not
> cheap either.
>
> So let me recap the whole situation and how we got there:
>
>   1) The original RSEQ implementation updates the rseq::cpu_id_start
>      field in user space more or less unconditionally on every exit to
>      user, whether the CPU/MMCID have been changed or not.
>
>      That went unnoticed for years because nothing used rseq aside from
>      Google and tcmalloc. Once glibc registered rseq, this resulted in a
>      performance penalty of up to 15% for syscall-heavy workloads.
>
>   2) The rseq::cpu_id_start field is documented as read only for user
>      space in the ABI contract and guaranteed to be updated by the
>      kernel when a task is migrated to a different CPU.
>
>   3) The RO for userspace property has been enforced by RSEQ debugging
>      mode since day one. If such a debug enabled kernel detects user
>      space changing the field it kills the task/application.

The optimization in TCMalloc that you're describing has been available
since September 2023:
https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9

I thought the RSEQ debug checks were added in December 2024:
https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd,
but perhaps I misidentified the ones in question.

>
>   4) tcmalloc abused the suboptimal implementation (see #1) and
>      scribbled over rseq::cpu_id_start for their own nefarious purposes.
>
>   5) As a consequence of #4 tcmalloc cannot be used on a RSEQ debug
>      enabled kernel. Which means a developer cannot validate his RSEQ
>      code against a debug kernel when tcmalloc is in use on the system
>      as that would crash the tcmalloc dependent applications due to #3.
>
>   6) As a consequence of #4 tcmalloc cannot be used together with any
>      other facility/library which wants to utilize the ABI guaranteed
>      properties of rseq::cpu_id_start in the same application.
>
>   7) tcmalloc violates the ABI from day one and has since refused to
>      address the problem despite being offered a kernel side rseq
>      extension to solve it many years ago.

I know there was some discussion around a preemption notification
scheme, rseq_sched_state; but I thought the discussion moved in favor
of the timeslice extension interface that recently landed. Timeslice
extension solves some use cases, but I'm not sure it addresses this
one.

>
>   8) When addressing the performance issues of RSEQ the unconditional
>      update ceased to exist under the valid assumption that the kernel
>      only has to satisfy the guaranteed ABI properties, especially when
>      they are enforceable by RSEQ debug.
>
>      As a consequence this exposed the tcmalloc ABI violation because
>      the unconditional pointless overwriting of something which did not
>      change stopped happening.
>
> Due to #4 everyone is in a hard place and up a creek without a paddle.
>
> Here are the possible solutions:
>
>   A) Mathias suggested to force overwrite rseq::cpu_id_start every time
>      the rseq::rseq_cs field is cleared by the kernel under the not yet
>      validated theoretical assumption that this cures the problem for
>      tcmalloc.
>
>      If that's sufficient that would be harmless performance wise
>      because the write would be inside the already existing STAC/CLAC
>      section and just add some more noise to the rseq critical section
>      operations.
>
>      That would allow existing tcmalloc usage to continue, but
>      obviously would neither solve #5 and #6 above nor provide an
>      incentive for tcmalloc to actually fix their crap.
>
>   B) If that's not sufficient then keeping tcmalloc alive would require
>      to go back to the previous state and let everyone else pay the
>      price in terms of performance overhead.
>
>   C) Declare that this is not a regression because the ABI guarantee is
>      not violated and the RO property has been enforceable by RSEQ
>      debugging since day one.
>
> In my opinion #C is the right thing to do, but I can see a case being
> made for the lightweight fix Mathias suggested (#A) _if_ and only _if_
> that is sufficient. Picking #A would also mean that user space people
> have to take up the fight against tcmalloc when they want to use the
> RSEQ guaranteed ABI along with tcmalloc in the same application or use a
> RSEQ debug kernel to validate their own code.
>
> Going back to the full unconditional nightmare (#B) is not an option at
> all as everybody else would have to take the massive performance hit.
>
> Oh well...
>
> Thanks,
>
>         tglx



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:38             ` Chris Kennelly
@ 2026-04-23 17:47               ` Mathieu Desnoyers
  2026-04-23 19:39               ` Thomas Gleixner
  1 sibling, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2026-04-23 17:47 UTC (permalink / raw)
  To: Chris Kennelly, Thomas Gleixner
  Cc: Mathias Stearn, Peter Zijlstra, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Dmitry Vyukov, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland,
	Jinjie Ruan, Blake Oler, Linus Torvalds

On 2026-04-23 13:38, Chris Kennelly wrote:
> On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote:

[...]

>>
>>    3) The RO for userspace property has been enforced by RSEQ debugging
>>       mode since day one. If such a debug enabled kernel detects user
>>       space changing the field it kills the task/application.
> 
> The optimization in TCMalloc that you're describing has been available
> since September 2023:
> https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9
> 
> I thought the RSEQ debug checks were added in December 2024:
> https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd,
> but perhaps I misidentified the ones in question.

You are correct, I added the RSEQ field corruption validation under
debug config in Nov. 2024 when I noticed the world of pain we were
heading towards with incompatible tcmalloc vs glibc (and general) use
due to tcmalloc not respecting the ABI contract. RSEQ was
upstreamed in 2018. So that's not exactly a day one enforcement.
The ABI contract was clear about this being an invalid use from
day one though.

[...]

>>    7) tcmalloc violates the ABI from day one and has since refused to
>>       address the problem despite being offered a kernel side rseq
>>       extension to solve it many years ago.
> 
> I know there was some discussion around a preemption notification
> scheme, rseq_sched_state; but I thought the discussion moved in favor
> of the timeslice extension interface that recently landed. Timeslice
> extension solves some use cases, but I'm not sure it addresses this
> one.

I have actively engaged with the tcmalloc developers to
understand their needs and figure out a proper solution for the
past ~3-4 years, without success.

I have done a POC branch extending rseq with a "reset a linked list of
userspace areas on preemption" back in 2024 which would have solved
tcmalloc's issues cleanly. I never posted it publicly because the
tcmalloc devs told me they could not justify spending time even trying
this out to their managers.

I still have that feature branch gathering dust somewhere.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:38             ` Chris Kennelly
  2026-04-23 17:47               ` Mathieu Desnoyers
@ 2026-04-23 19:39               ` Thomas Gleixner
  1 sibling, 0 replies; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 19:39 UTC (permalink / raw)
  To: Chris Kennelly
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Dmitry Vyukov, regressions, linux-kernel, linux-arm-kernel,
	Ingo Molnar, Mark Rutland, Jinjie Ruan, Blake Oler,
	Linus Torvalds

On Thu, Apr 23 2026 at 13:38, Chris Kennelly wrote:
> On Thu, Apr 23, 2026 at 1:19 PM Thomas Gleixner <tglx@kernel.org> wrote:
>>   3) The RO for userspace property has been enforced by RSEQ debugging
>>      mode since day one. If such a debug enabled kernel detects user
>>      space changing the field it kills the task/application.
>
> The optimization in TCMalloc that you're describing has been available
> since September 2023:
> https://github.com/google/tcmalloc/commit/aaa4fbf6fcdce1b7f86fcadd659874645c75ddb9

And the github issue which requested glibc compatibility was opened in
Sept. 2022:

      https://github.com/google/tcmalloc/issues/144

> I thought the RSEQ debug checks were added in December 2024:
> https://github.com/torvalds/linux/commit/7d5265ffcd8b41da5e09066360540d6e0716e9cd,
> but perhaps I misidentified the ones in question.

I might have misread the git log. But that still does not justify the
violation of a documented ABI for the price that nobody else can use it
once tcmalloc is in play:

   x = tcmalloc();
   dostuff(x)
     evaluate(rseq::cpu_id_start, rseq::cpu_id) <- FAIL
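
Spelled out as a user-space model (a sketch only): struct rseq_model
and pick_percpu_slot() are hypothetical, and the consistency check is a
simplification of what a well-behaved rseq user could legitimately
assume about the two fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the two CPU fields of the rseq area. */
struct rseq_model {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/*
 * Hypothetical reader in some other library: use cpu_id_start as an
 * index into per-CPU data. Returns the slot, or -1 if the field is
 * unusable.
 */
static int pick_percpu_slot(const struct rseq_model *r, uint32_t nr_cpus)
{
	assert(r != NULL);

	if (r->cpu_id_start >= nr_cpus)
		return -1;	/* not a valid CPU number */
	if (r->cpu_id_start != r->cpu_id)
		return -1;	/* fields disagree */
	return (int)r->cpu_id_start;
}
```

Once anything else in the process has stored a private value into
cpu_id_start, this reader cannot get a usable slot until the kernel
rewrites the field.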

>>   7) tcmalloc violates the ABI from day one and has since refused to
>>      address the problem despite being offered a kernel side rseq
>>      extension to solve it many years ago.
>
> I know there was some discussion around a preemption notification
> scheme, rseq_sched_state; but I thought the discussion moved in favor
> of the timeslice extension interface that recently landed. Timeslice
> extension solves some use cases, but I'm not sure it addresses this
> one.

No it does not. That's an orthogonal optimization.

Thanks,

        tglx



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:19           ` Thomas Gleixner
  2026-04-23 17:38             ` Chris Kennelly
@ 2026-04-23 17:41             ` Linus Torvalds
  2026-04-23 18:35               ` Mathias Stearn
                                 ` (2 more replies)
  1 sibling, 3 replies; 41+ messages in thread
From: Linus Torvalds @ 2026-04-23 17:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan,
	Blake Oler

On Thu, 23 Apr 2026 at 10:19, Thomas Gleixner <tglx@kernel.org> wrote:
>
>   C) Declare that this is not a regression because the ABI guarantee is
>      not violated and the RO property has been enforceable by RSEQ
>      debugging since day one.

No, if this actually hits real users, that is not an option. If real
users never used RSEQ debugging options, those options are simply
irrelevant.

Regression rules have never been about "it wouldn't have worked in
some other configuration". That's like saying "that code would never
have worked on another architecture". It may be true, but it's
irrelevant for the people whose binaries no longer work.

We will have to fix this.

This is not some kind of gray area. It clearly violates our regression rules.

The only "ABI guarantee" is what people actually see and use, not some
debug option that wasn't enabled.

And I just checked - it's not enabled in at least the Fedora distro
kernels. Presumably other distros don't enable it either. So no actual
non-kernel developer would *ever* have hit it, and claiming it is
relevant is just garbage.

IOW, that debug option was always a complete no-op except for kernel developers.

In fact, that debug option is actively *hidden* - you have to enable
EXPERT to even see it. So it really is not a real option for normal
people AT ALL.

Christ, even *I* don't enable EXPERT except for build testing. It's
literally something that only embedded people doing odd things should
do.

If that rule was actually an important part of the ABI, it shouldn't
have been a debug thing.

So:

 (a) the debug code in question needs to just be removed, since it's
now actively detrimental, and means that any kernel developer who
*does* enable it can't actually test this case any more. It's checking
for something that has been shown to not be true.

 (b) we need to fix this (revert if it can't be fixed otherwise)

I see some patches flying around, but am not clear on whether there
was an actual patch that makes this work again?

             Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:41             ` Linus Torvalds
@ 2026-04-23 18:35               ` Mathias Stearn
  2026-04-23 18:53               ` Mark Rutland
  2026-04-23 21:03               ` Thomas Gleixner
  2 siblings, 0 replies; 41+ messages in thread
From: Mathias Stearn @ 2026-04-23 18:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan,
	Blake Oler

On Thu, Apr 23, 2026 at 7:48 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> I see some patches flying around, but am not clear on whether there
> was an actual patch that makes this work again?

Thomas's patch from upthread appears in initial testing to address the
arm64 preemption breakage. Thanks! I'm currently building with the
following patch on top of that and will test it once it is ready.

---

diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index a36b472627de..e26bf249bbd8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -300,12 +300,15 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c

         /* Invalidate the critical section */
         unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+        /* TCMalloc kludge - it relies on cpu_id_start being overwritten */
+        unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);
         /* Update the instruction pointer */
         instruction_pointer_set(regs, (unsigned long)abort_ip);
         rseq_stat_inc(rseq_stats.fixup);
         break;
     clear:
         unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+        unsafe_put_user((u32)task_cpu(t), &t->rseq.usrptr->cpu_id_start, efault);
         rseq_stat_inc(rseq_stats.clear);
         abort_ip = 0ULL;
     }
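
The intent of the two added unsafe_put_user() calls can be illustrated with a toy user-space model. This is a sketch under stated assumptions, not kernel or tcmalloc code: struct toy_rseq is not the real struct rseq layout, TOY_CACHE_VALID is a hypothetical marker standing in for whatever state the user-space cache keeps in cpu_id_start, and the rewrite_cpu_id_start parameter only exists to compare the behaviour with and without the extra write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the two rseq fields touched by the patch. */
struct toy_rseq {
	uint32_t cpu_id_start;
	uint64_t rseq_cs;
};

/* Hypothetical marker a user-space cache keeps in cpu_id_start. */
#define TOY_CACHE_VALID 0x80000000u

/* User space: record that a per-CPU pointer is currently cached. */
void toy_user_mark_cached(struct toy_rseq *r)
{
	assert(r != NULL);
	r->cpu_id_start |= TOY_CACHE_VALID;
}

/* User space: the cache is trusted only while the marker survives. */
bool toy_user_cache_valid(const struct toy_rseq *r)
{
	assert(r != NULL);
	return (r->cpu_id_start & TOY_CACHE_VALID) != 0;
}

/*
 * Models the abort/clear path: rseq_cs is always cleared. With
 * rewrite_cpu_id_start set (the patched behaviour), cpu_id_start is also
 * overwritten with the CPU number, which wipes the marker.
 */
void toy_kernel_clear_cs(struct toy_rseq *r, uint32_t cpu, bool rewrite_cpu_id_start)
{
	assert(r != NULL);
	r->rseq_cs = 0;
	if (rewrite_cpu_id_start)
		r->cpu_id_start = cpu;
}
```

Without the rewrite the marker survives the abort and the stale cache keeps being trusted; with it, the cache is invalidated even though the CPU number itself did not change.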


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:41             ` Linus Torvalds
  2026-04-23 18:35               ` Mathias Stearn
@ 2026-04-23 18:53               ` Mark Rutland
  2026-04-23 21:03               ` Thomas Gleixner
  2 siblings, 0 replies; 41+ messages in thread
From: Mark Rutland @ 2026-04-23 18:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Mathias Stearn, Peter Zijlstra,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Jinjie Ruan,
	Blake Oler

On Thu, Apr 23, 2026 at 10:41:02AM -0700, Linus Torvalds wrote:
> On Thu, 23 Apr 2026 at 10:19, Thomas Gleixner <tglx@kernel.org> wrote:
> I see some patches flying around, but am not clear on whether there
> was an actual patch that makes this work again?

There's not a patch yet.

The diffs sent so far were options for fixing the arm64-specific issue
(missing aborts on preemption), NOT the generic issue (missing
clobbering of cpu_id_start that tcmalloc was depending upon).

For the arm64 issue, I think we can have a fix tomorrow (as it's end of
day here in the UK). Now that I've pored over the entry code and the rseq
code, I think a variant of one of Thomas's proposed fixes will work, but
I'd like to make the naming/layering crystal clear so that it's harder
to break this by accident in future.

For the generic issue, hopefully the option Mathias proposed (clearing
cpu_id_start when rseq_cs is cleared) is sufficient. I'll work with
Mathias and Thomas for that.

I've also poked folk to make sure that CI systems run the rseq selftests
(which they evidently weren't), so that we catch this sort of thing
earlier.

Mark.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 17:41             ` Linus Torvalds
  2026-04-23 18:35               ` Mathias Stearn
  2026-04-23 18:53               ` Mark Rutland
@ 2026-04-23 21:03               ` Thomas Gleixner
  2026-04-23 21:28                 ` Linus Torvalds
  2 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 21:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan,
	Blake Oler

On Thu, Apr 23 2026 at 10:41, Linus Torvalds wrote:
> If that rule was actually an important part of the ABI, it shouldn't
> have been a debug thing.

It's a debug thing because it's too expensive to be enabled by
default. And it's actually valuable for validating RSEQ critical section
ABI correctness, as critical sections can't be single-stepped with a
debugger: the breakpoint interruption would immediately cancel them.

> So:
>
>  (a) the debug code in question needs to just be removed, since it's
> now actively detrimental, and means that any kernel developer who
> *does* enable it can't actually test this case any more. It's checking
> for something that has been shown to not be true.
>
>  (b) we need to fix this (revert if it can't be fixed otherwise)
>
> I see some patches flying around, but am not clear on whether there
> was an actual patch that makes this work again?

There are two issues:

  1) ARM64

     On ARM64 RSEQ got broken completely with the partial move to the
     generic entry code. There are patches flying around which "fix" it
     and Mark is working on a more complete solution as there are other
     subtle issues with that aside from the obvious RSEQ wreckage. The
     latter could have been detected with the existing RSEQ selftests if
     any CI would actually run them on -next.

     That's uninteresting and unrelated to the tcmalloc issue. It's just
     a boring bug which will be fixed in the next couple of days.


  2) The tcmalloc problem

     That has been a known problem for at least 6 years. tcmalloc assumes that
     it "owns" rseq and can do whatever it wants with it.

     In 2022 the glibc people requested that tcmalloc become
     interoperable with the reasonable expectation of glibc to utilize
     rseq as well:

          https://github.com/google/tcmalloc/issues/144

     Status unresolved.

     That means that using tcmalloc requires telling glibc to _NOT_ use
     rseq and at the same time precludes any other library which wants
     to use it for the documented purposes. So this code sequence blows
     up in your face:

        x = tcmalloc();
        dostuff(x)
          evaluate(rseq::cpu_id_start, rseq::cpu_id)

     because tcmalloc overwrites rseq::cpu_id_start and thereby breaks
     the ABI which evaluate() is rightfully depending on.

     That has absolutely nothing to do with the kernel as there is no
     kernel interaction between tcmalloc's abuse and the subsequent
     evaluation of rseq::cpu_id_start. The kernel has no way to fix that
     problem at all.

     Now back to your generally correct and agreed on "observed
     behaviour" rule.

     Feel free to enforce it, but be aware that you thereby set a
     precedent that a single abuser can then rightfully own a general
     shared interface of the kernel forever and force everybody else to
     give up.

     The tcmalloc developers actually documented that they own the
     world:

     // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.

     Do you seriously want to proliferate that?

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 21:03               ` Thomas Gleixner
@ 2026-04-23 21:28                 ` Linus Torvalds
  2026-04-23 23:08                   ` Linus Torvalds
  2026-04-27  7:06                   ` Florian Weimer
  0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2026-04-23 21:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan,
	Blake Oler

On Thu, 23 Apr 2026 at 14:03, Thomas Gleixner <tglx@kernel.org> wrote:
>
>      Feel free to enforce it, but be aware that you thereby set a
>      precedent that a single abuser can then rightfully own a general
>      shared interface of the kernel forever and force everybody else to
>      give up.

That's not a new precedent. That is *literally* the rule we have always had.

This is why system calls and ABI's need to have hard rules that they
actually check, because if they don't, they are stuck with the
semantics that people assume.

And no, "documented behavior" is BS. It has absolutely no relevance.
All that matters is hard harsh reality.

Yes, this has led to issues before.

Most new system calls have learnt their lesson, and they check for
unused bits in flags etc, and error out on bits that the kernel
doesn't really care about being randomly set - so that one day we
*can* extend on things and start caring about them.

But they do it because we've been burnt so many times before because
we haven't checked those bits, and then we were forced to just live
with the fact that people passed in random values.
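
The flag-checking pattern described above can be sketched in a few lines of C. The flag names and demo_check_flags() are hypothetical and belong to no real kernel interface; the sketch only shows the general shape of rejecting every bit the interface does not define yet.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for an imaginary interface; not a real ABI. */
#define DEMO_FLAG_A		(1u << 0)
#define DEMO_FLAG_B		(1u << 1)
#define DEMO_KNOWN_FLAGS	(DEMO_FLAG_A | DEMO_FLAG_B)

/*
 * Reject any bit the interface does not define. Callers can then never
 * come to depend on passing random values in the unused bits, which keeps
 * those bits available for later extensions.
 */
int demo_check_flags(uint32_t flags)
{
	if (flags & ~DEMO_KNOWN_FLAGS)
		return -EINVAL;
	return 0;
}

/* Small self-check of the intended behaviour. */
void demo_flags_selfcheck(void)
{
	assert(demo_check_flags(DEMO_FLAG_A) == 0);
	assert(demo_check_flags(1u << 31) == -EINVAL);
}
```

If the check is missing from day one, any value callers happened to pass in the unused bits becomes observable behaviour that later kernels have to keep accepting.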

>     // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>
>     Do you seriously want to proliferate that?

Absolutely.

That's how clever hacks work - they take advantage of things past
their design parameters. "If it works, it's not stupid".

We don't then turn around and say "you were clever, and we did
something stupid, so now we'll hurt you".

This is all 100% on the RSEQ kernel code, not on users who took advantage of it.

                Linus

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 21:28                 ` Linus Torvalds
@ 2026-04-23 23:08                   ` Linus Torvalds
  2026-04-27  7:06                   ` Florian Weimer
  1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2026-04-23 23:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Peter Zijlstra, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Ingo Molnar, Mark Rutland, Jinjie Ruan,
	Blake Oler

On Thu, 23 Apr 2026 at 14:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is all 100% on the RSEQ kernel code, not on users who took advantage of it.

Side note: when RSEQ was merged, the *primary* documented use case was
literally user space allocators with percpu caches. That's what I was
told at the time.

Now I think it was jemalloc(), not tcmalloc, but it's not like
tcmalloc is some odd minor use-case.

We are pretty much talking about the raison d'être of the whole rseq
feature, not some odd small corner case.

               Linus


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 21:28                 ` Linus Torvalds
  2026-04-23 23:08                   ` Linus Torvalds
@ 2026-04-27  7:06                   ` Florian Weimer
  1 sibling, 0 replies; 41+ messages in thread
From: Florian Weimer @ 2026-04-27  7:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Mathias Stearn, Peter Zijlstra,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Mark Rutland,
	Jinjie Ruan, Blake Oler

* Linus Torvalds:

>>     // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>>
>>     Do you seriously want to proliferate that?
>
> Absolutely.
>
> That's how clever hacks work - they take advantage of things past
> their design parameters. "If it works, it's not stupid".
>
> We don't then turn around and say "you were clever, and we did
> something stupid, so now we'll hurt you".
>
> This is all 100% on the RSEQ kernel code, not on users who took
> advantage of it.

RSEQ was intended to be modular, with more than one library using it
within a process, without coordination (beyond sticking to the RSEQ
protocol).  The tcmalloc approach is incompatible with that.  Once
tcmalloc starts using RSEQ in its peculiar way, nothing else in the
process can, and vice versa.  This is far from ideal because the
particular descheduling notification that tcmalloc uses could be
implemented in a much simpler way than full RSEQ, given its non-modular
nature.

Thanks,
Florian



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
       [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com>
  2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra
@ 2026-04-22 13:09 ` Mark Rutland
  2026-04-22 17:49   ` Thomas Gleixner
  2026-04-24 16:45 ` [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) " Mark Rutland
  2 siblings, 1 reply; 41+ messages in thread
From: Mark Rutland @ 2026-04-22 13:09 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Jinjie Ruan, Blake Oler

Hi Mathias,

On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
> TL;DR: As of 6.19, rseq no longer provides the documented atomicity
> guarantees on arm64 by failing to abort the critical section on same-core
> preemption/resumption. Additionally, it breaks tcmalloc specifically by
> failing to overwrite the cpu_id_start field at points where it was relied
> on for correctness.

Thanks for the report, and the test case.

As a holding reply, I'm looking into this now from the arm64 side.

I'll leave it to Thomas/Peter/Mathieu to comment w.r.t. the issue you
raise with cpu_id_start.

For some reason, this mail didn't make it to my inbox, and I had to grab
it from lore using b4. That might be a problem with my local mail
server; I'm just noting that in case others also didn't receive this.

Mark.

> This is a SEVERE breakage for MongoDB. We received several user reports of
> crashes on 6.19. I made a stress test that showed that 6.19 can cause
> malloc to return the same pointer twice without it being freed. Because
> that can cause arbitrary corruption, our latest releases have all been
> patched to refuse to start at all on 6.19+.
> 
> TCMalloc uses rseq in a "creative" way described at
> https://github.com/google/tcmalloc/blob/master/docs/rseq.md. In particular,
> the "Current CPU Slabs Pointer Caching" section describes an optimization
> that relies on an undocumented fact that the kernel was always overwriting
> cpu_id_start (even when it wouldn't change) to invalidate a user-space
> cache. Since the change to stop writing cpu_id_start seemed to be
> intentional as part of a refactoring merged in 2b09f480f0a1, I started
> working on a userspace patch to stop relying on that. Unfortunately when
> that was complete I ran into a wall that is impossible to work around from
> userspace.
> 
> On arm64, the kernel no longer meets the documented guarantee that rseq
> critical sections are atomic with respect to preemption. It seems to only
> abort the critical section when the thread is migrated to a different core.
> The attached test proves it and passes on x86 both before and after 6.19,
> and on arm before 6.19, but fails on arm with 6.19. It pins the process to
> a single core and then has an rseq critical section that observes a change
> made by another thread which is supposed to be impossible. I think this
> will break basically any real usage of rseq, other than just reading the
> current cpu_id.
> 
> An LLM pointed to these two specific commits in the refactor as causing
> this (oldest first):
> - 39a167560a61 rseq: Optimize event setting
> This assumed that user_irq would be set on preemption but it wasn't on
> arm64, so TIF_NOTIFY_RESUME isn't raised on same cpu preemption.
> - 566d8015f7ee rseq: Avoid CPU/MM CID updates when no event pending
> This broke TCMalloc slab caching trick by not overwriting cpu_id_start on
> every return to userspace
> 
> (I have a lot more analysis and suggested fixes from LLMs since I used them
> heavily in this testing and analysis, but I won't spam you with the slop
> unless requested)
> 
> The arm64 change is a clear breakage and I'm sure it will be
> uncontroversial to fix. I can imagine more resistance to reverting to the
> old behavior of always overwriting the cpu_id_start field since that seems
> to have been an intentional optimization choice. I have reached out to the
> TCMalloc maintainers (CC'd) and believe there is a solution that gets the
> vast majority of the optimization while still preserving the behavior that
> TCMalloc currently relies on[1].
> 
> Any time a critical section might be aborted (migration, preemption, signal
> delivery, and membarrier IPI), the kernel already must (but doesn't on
> arm64 at the moment) check the rseq_cs field to see if the thread is in a
> critical section, and is documented as nulling the pointer after (I assume
> to make later checks cheaper). It would be sufficient for tcmalloc's
> internal usage if every time the kernel nulled out rseq_cs, it also wrote
> the cpu id to cpu_id_start. That should be essentially free since you are
> already writing to the same cache line. It was pointed out that that could
> be an issue if another rseq user in the same thread nulled rseq_cs after
> its critical section, which would require the kernel to update cpu_id_start
> each time it checks rseq_cs, regardless of whether it nulls it. We aren't
> aware of any processes that mix tcmalloc with other rseq usages that null
> out the field from userspace, but we can't rule them out since it is open
> source. Either way, this preserves the property of not updating
> cpu_id_start on every syscall return and non-membarrier interrupts, which I
> assume is where the majority of the optimization win was from.
> 
> All testing of problematic versions was performed on x86_64 and
> aarch64 Ubuntu 24.04.4 with the kernel manually upgraded to
> 6.19.8-061908-generic. Source analysis was performed on the v6.19 tag. I
> had a few AI agents confirm that nothing in the relevant changes to master
> should have solved this, but I have not yet tested there.
> 
> $ cat /proc/version
> Linux version 6.19.8-061908-generic (kernel@balboa)
> (aarch64-linux-gnu-gcc-15 (Ubuntu 15.2.0-15ubuntu1) 15.2.0, GNU ld (GNU
> Binutils for Ubuntu) 2.46) #202603131837 SMP PREEMPT_DYNAMIC Sat Mar 14
> 00:00:07 UTC 2026
> 
> [1]  There is also an exploration of some options to make tcmalloc not rely
> on the cpu_id_start overwriting. However we would strongly prefer that
> existing binaries continue to work on 6.19 kernels, even if newer binaries
> don't need that. At least for a good while.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 13:09 ` Mark Rutland
@ 2026-04-22 17:49   ` Thomas Gleixner
  2026-04-22 18:11     ` Mark Rutland
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-22 17:49 UTC (permalink / raw)
  To: Mark Rutland, Mathias Stearn
  Cc: Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, Dmitry Vyukov, regressions,
	linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar,
	Jinjie Ruan, Blake Oler

On Wed, Apr 22 2026 at 14:09, Mark Rutland wrote:
> On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
>> TL;DR: As of 6.19, rseq no longer provides the documented atomicity
>> guarantees on arm64 by failing to abort the critical section on same-core
>> preemption/resumption. Additionally, it breaks tcmalloc specifically by
>> failing to overwrite the cpu_id_start field at points where it was relied
>> on for correctness.
>
> Thanks for the report, and the test case.
>
> As a holding reply, I'm looking into this now from the arm64 side.

I assume it's the partial conversion to the generic entry code which
screws that up. The problem reproduces with rseq selftests nicely.

The patch below fixes it as it puts ARM64 back to the non-optimized code
for now. Once ARM64 is fully converted it gets all the nice improvements.

Thanks,

        tglx
---
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 2266f4dc77b6..d55476e2a336 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -30,7 +30,7 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs);
  */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
 		/* '&' is intentional to spare one conditional branch */
 		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
 			__rseq_signal_deliver(ksig->sig, regs);
@@ -50,7 +50,7 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t)
 {
 	struct rseq_event *ev = &t->rseq.event;
 
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
 		/*
 		 * Avoid a boat load of conditionals by using simple logic
 		 * to determine whether NOTIFY_RESUME needs to be raised.
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index a36b472627de..8ccd464a108d 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -80,7 +80,7 @@ bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY))
 		current->rseq.event.user_irq = true;
 }
 
@@ -171,8 +171,8 @@ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
 		if (unlikely(usig != t->rseq.sig))
 			goto die;
 
-		/* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
-		if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/* rseq_event.user_irq is only valid if CONFIG_GENERIC_ENTRY=y */
+		if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
 			/* If not in interrupt from user context, let it die */
 			if (unlikely(!t->rseq.event.user_irq))
 				goto die;
@@ -387,7 +387,7 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r
 	 * allows to skip the critical section when the entry was not from
 	 * a user space interrupt, unless debug mode is enabled.
 	 */
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
 		if (!static_branch_unlikely(&rseq_debug_enabled)) {
 			if (likely(!t->rseq.event.user_irq))
 				return true;


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 17:49   ` Thomas Gleixner
@ 2026-04-22 18:11     ` Mark Rutland
  2026-04-22 19:47       ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Mark Rutland @ 2026-04-22 18:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Jinjie Ruan, Blake Oler

On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote:
> On Wed, Apr 22 2026 at 14:09, Mark Rutland wrote:
> > On Wed, Apr 22, 2026 at 11:50:26AM +0200, Mathias Stearn wrote:
> >> TL;DR: As of 6.19, rseq no longer provides the documented atomicity
> >> guarantees on arm64 by failing to abort the critical section on same-core
> >> preemption/resumption. Additionally, it breaks tcmalloc specifically by
> >> failing to overwrite the cpu_id_start field at points where it was relied
> >> on for correctness.
> >
> > Thanks for the report, and the test case.
> >
> > As a holding reply, I'm looking into this now from the arm64 side.
> 
> I assume it's the partial conversion to the generic entry code which
> screws that up. 

It's slightly more than that, but in a sense, yes. ;)

The fix is conceptually simple, but I'll need to do some refactoring.

Conceptually we just need to use syscall_enter_from_user_mode() and
irqentry_enter_from_user_mode() appropriately.

In practice, I can't use those as-is without introducing the exception
masking problems I just fixed up for irqentry_enter_from_kernel_mode(),
so I'll need to do some similar refactoring first.

That and I *think* a couple of the current checks for CONFIG_GENERIC_ENTRY
should be checking CONFIG_GENERIC_IRQ_ENTRY, since all of the relevant
bits are in the generic irqentry code rather than the GENERIC_SYSCALL
code (and GENERIC_ENTRY is just GENERIC_IRQ_ENTRY + GENERIC_SYSCALL).

> The problem reproduces with rseq selftests nicely.

Ah; that's both good to know, and worrying that we've never had a report
from all the automated testing people are supposedly running. :/

> The patch below fixes it as it puts ARM64 back to the non-optimized code
> for now. Once ARM64 is fully converted it gets all the nice improvements.

Thanks; I'll give that a test tomorrow.

I haven't paged everything in yet, so just to check, is there anything
that would behave incorrectly if current->rseq.event.user_irq were set
for syscall entry? IIUC it means we'll effectively do the slow path, and
I was wondering if that might be acceptable as a one-line bodge for
stable.

As above, I'd like if the actual fix could make this work for
GENERIC_IRQ_ENTRY rather than GENERIC_ENTRY, since that way we can make
this work as it was supposed to *before* moving to GENERIC_SYSCALL
(which has a whole lot more ABI impact to worry about).

I think that just needs a small amount of refactoring that arm64 will
need regardless.

Mark.

> 
> Thanks,
> 
>         tglx
> ---
> diff --git a/include/linux/rseq.h b/include/linux/rseq.h
> index 2266f4dc77b6..d55476e2a336 100644
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -30,7 +30,7 @@ void __rseq_signal_deliver(int sig, struct pt_regs *regs);
>   */
>  static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
>  {
> -	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>  		/* '&' is intentional to spare one conditional branch */
>  		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
>  			__rseq_signal_deliver(ksig->sig, regs);
> @@ -50,7 +50,7 @@ static __always_inline void rseq_sched_switch_event(struct task_struct *t)
>  {
>  	struct rseq_event *ev = &t->rseq.event;
>  
> -	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>  		/*
>  		 * Avoid a boat load of conditionals by using simple logic
>  		 * to determine whether NOTIFY_RESUME needs to be raised.
> diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
> index a36b472627de..8ccd464a108d 100644
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -80,7 +80,7 @@ bool rseq_debug_validate_ids(struct task_struct *t);
>  
>  static __always_inline void rseq_note_user_irq_entry(void)
>  {
> -	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> +	if (IS_ENABLED(CONFIG_GENERIC_ENTRY))
>  		current->rseq.event.user_irq = true;
>  }
>  
> @@ -171,8 +171,8 @@ bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
>  		if (unlikely(usig != t->rseq.sig))
>  			goto die;
>  
> -		/* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
> -		if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +		/* rseq_event.user_irq is only valid if CONFIG_GENERIC_ENTRY=y */
> +		if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>  			/* If not in interrupt from user context, let it die */
>  			if (unlikely(!t->rseq.event.user_irq))
>  				goto die;
> @@ -387,7 +387,7 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r
>  	 * allows to skip the critical section when the entry was not from
>  	 * a user space interrupt, unless debug mode is enabled.
>  	 */
> -	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>  		if (!static_branch_unlikely(&rseq_debug_enabled)) {
>  			if (likely(!t->rseq.event.user_irq))
>  				return true;


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 18:11     ` Mark Rutland
@ 2026-04-22 19:47       ` Thomas Gleixner
  2026-04-23  1:48         ` Jinjie Ruan
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-22 19:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Jinjie Ruan, Blake Oler

On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote:
> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote:
> Conceptually we just need to use syscall_enter_from_user_mode() and
> irqentry_enter_from_user_mode() appropriately.

Right. I figured that out.

> In practice, I can't use those as-is without introducing the exception
> masking problems I just fixed up for irqentry_enter_from_kernel_mode(),
> so I'll need to do some similar refactoring first.

See below.

> I haven't paged everything in yet, so just to check, is there anything
> that would behave incorrectly if current->rseq.event.user_irq were set
> for syscall entry? IIUC it means we'll effectively do the slow path, and
> I was wondering if that might be acceptable as a one-line bodge for
> stable.

It might work, but it's trivial enough to avoid that. See below. That on
top of 6.19.y makes the selftests pass too.

Thanks,

        tglx
---
 arch/arm64/kernel/entry-common.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode(
 	irqentry_exit(regs, state);
 }
 
+static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+	mte_disable_tco_entry(current);
+}
+
 /*
  * Handle IRQ/context state management when entering from user mode.
  * Before this function is called it is not safe to call regular kernel code,
@@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode(
  */
 static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
 {
-	enter_from_user_mode(regs);
-	mte_disable_tco_entry(current);
+	arm64_enter_from_user_mode_syscall(regs);
+	rseq_note_user_irq_entry();
 }
 
 /*
@@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_
 
 static void noinstr el0_svc(struct pt_regs *regs)
 {
-	arm64_enter_from_user_mode(regs);
+	arm64_enter_from_user_mode_syscall(regs);
 	cortex_a76_erratum_1463225_svc_handler();
 	fpsimd_syscall_enter();
 	local_daif_restore(DAIF_PROCCTX);
@@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r
 
 static void noinstr el0_svc_compat(struct pt_regs *regs)
 {
-	arm64_enter_from_user_mode(regs);
+	arm64_enter_from_user_mode_syscall(regs);
 	cortex_a76_erratum_1463225_svc_handler();
 	local_daif_restore(DAIF_PROCCTX);
 	do_el0_svc_compat(regs);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-22 19:47       ` Thomas Gleixner
@ 2026-04-23  1:48         ` Jinjie Ruan
  2026-04-23  5:53           ` Dmitry Vyukov
  0 siblings, 1 reply; 41+ messages in thread
From: Jinjie Ruan @ 2026-04-23  1:48 UTC (permalink / raw)
  To: Thomas Gleixner, Mark Rutland
  Cc: Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Blake Oler



On 4/23/2026 3:47 AM, Thomas Gleixner wrote:
> On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote:
>> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote:
>> Conceptually we just need to use syscall_enter_from_user_mode() and
>> irqentry_enter_from_user_mode() appropriately.
> 
> Right. I figured that out.
> 
>> In practice, I can't use those as-is without introducing the exception
>> masking problems I just fixed up for irqentry_enter_from_kernel_mode(),
>> so I'll need to do some similar refactoring first.
> 
> See below.
> 
>> I haven't paged everything in yet, so just to check, is there anything
>> that would behave incorrectly if current->rseq.event.user_irq were set
>> for syscall entry? IIUC it means we'll effectively do the slow path, and
>> I was wondering if that might be acceptable as a one-line bodge for
>> stable.
> 
> It might work, but it's trivial enough to avoid that. See below. That on
> top of 6.19.y makes the selftests pass too.

This aligns with my thoughts when converting arm64 to generic syscall
entry. Currently, the arm64 entry code does not distinguish between IRQ
and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ
entries as the generic entry framework does, because arm64 uses
enter_from_user_mode() exclusively instead of
irqentry_enter_from_user_mode().

https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/

> 
> Thanks,
> 
>         tglx
> ---
>  arch/arm64/kernel/entry-common.c |   14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode(
>  	irqentry_exit(regs, state);
>  }
>  
> +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs)
> +{
> +	enter_from_user_mode(regs);
> +	mte_disable_tco_entry(current);
> +}
> +
>  /*
>   * Handle IRQ/context state management when entering from user mode.
>   * Before this function is called it is not safe to call regular kernel code,
> @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode(
>   */
>  static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
>  {
> -	enter_from_user_mode(regs);
> -	mte_disable_tco_entry(current);
> +	arm64_enter_from_user_mode_syscall(regs);
> +	rseq_note_user_irq_entry();
>  }
>  
>  /*
> @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_
>  
>  static void noinstr el0_svc(struct pt_regs *regs)
>  {
> -	arm64_enter_from_user_mode(regs);
> +	arm64_enter_from_user_mode_syscall(regs);
>  	cortex_a76_erratum_1463225_svc_handler();
>  	fpsimd_syscall_enter();
>  	local_daif_restore(DAIF_PROCCTX);
> @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r
>  
>  static void noinstr el0_svc_compat(struct pt_regs *regs)
>  {
> -	arm64_enter_from_user_mode(regs);
> +	arm64_enter_from_user_mode_syscall(regs);
>  	cortex_a76_erratum_1463225_svc_handler();
>  	local_daif_restore(DAIF_PROCCTX);
>  	do_el0_svc_compat(regs);



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23  1:48         ` Jinjie Ruan
@ 2026-04-23  5:53           ` Dmitry Vyukov
  2026-04-23 10:39             ` Thomas Gleixner
                               ` (2 more replies)
  0 siblings, 3 replies; 41+ messages in thread
From: Dmitry Vyukov @ 2026-04-23  5:53 UTC (permalink / raw)
  To: Jinjie Ruan, linux-man
  Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, regressions, linux-kernel, linux-arm-kernel,
	Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote:
>
> On 4/23/2026 3:47 AM, Thomas Gleixner wrote:
> > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote:
> >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote:
> >> Conceptually we just need to use syscall_enter_from_user_mode() and
> >> irqentry_enter_from_user_mode() appropriately.
> >
> > Right. I figured that out.
> >
> >> In practice, I can't use those as-is without introducing the exception
> >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(),
> >> so I'll need to do some similar refactoring first.
> >
> > See below.
> >
> >> I haven't paged everything in yet, so just to check, is there anything
> >> that would behave incorrectly if current->rseq.event.user_irq were set
> >> for syscall entry? IIUC it means we'll effectively do the slow path, and
> >> I was wondering if that might be acceptable as a one-line bodge for
> >> stable.
> >
> > It might work, but it's trivial enough to avoid that. See below. That on
> > top of 6.19.y makes the selftests pass too.
>
> This aligns with my thoughts when converting arm64 to generic syscall
> entry. Currently, the arm64 entry code does not distinguish between IRQ
> and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ
> entries as the generic entry framework does, because arm64 uses
> enter_from_user_mode() exclusively instead of
> irqentry_enter_from_user_mode().
>
> https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/
>
> >
> > Thanks,
> >
> >         tglx
> > ---
> >  arch/arm64/kernel/entry-common.c |   14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> >
> > --- a/arch/arm64/kernel/entry-common.c
> > +++ b/arch/arm64/kernel/entry-common.c
> > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode(
> >       irqentry_exit(regs, state);
> >  }
> >
> > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs)
> > +{
> > +     enter_from_user_mode(regs);
> > +     mte_disable_tco_entry(current);
> > +}
> > +
> >  /*
> >   * Handle IRQ/context state management when entering from user mode.
> >   * Before this function is called it is not safe to call regular kernel code,
> > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode(
> >   */
> >  static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
> >  {
> > -     enter_from_user_mode(regs);
> > -     mte_disable_tco_entry(current);
> > +     arm64_enter_from_user_mode_syscall(regs);
> > +     rseq_note_user_irq_entry();
> >  }
> >
> >  /*
> > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_
> >
> >  static void noinstr el0_svc(struct pt_regs *regs)
> >  {
> > -     arm64_enter_from_user_mode(regs);
> > +     arm64_enter_from_user_mode_syscall(regs);
> >       cortex_a76_erratum_1463225_svc_handler();
> >       fpsimd_syscall_enter();
> >       local_daif_restore(DAIF_PROCCTX);
> > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r
> >
> >  static void noinstr el0_svc_compat(struct pt_regs *regs)
> >  {
> > -     arm64_enter_from_user_mode(regs);
> > +     arm64_enter_from_user_mode_syscall(regs);
> >       cortex_a76_erratum_1463225_svc_handler();
> >       local_daif_restore(DAIF_PROCCTX);
> >       do_el0_svc_compat(regs);


+linux-man

This part of the rseq man page needs to be fixed as well I think. The
kernel no longer reliably provides clearing of rseq_cs on preemption,
right?

https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241

"and set to NULL by the kernel when it restarts an assembly
instruction sequence block,
as well as when the kernel detects that it is preempting or delivering
a signal outside of the range targeted by the rseq_cs."


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23  5:53           ` Dmitry Vyukov
@ 2026-04-23 10:39             ` Thomas Gleixner
  2026-04-23 10:51               ` Mathias Stearn
  2026-04-23 12:11             ` Alejandro Colomar
  2026-04-23 12:29             ` Mathieu Desnoyers
  2 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 10:39 UTC (permalink / raw)
  To: Dmitry Vyukov, Jinjie Ruan, linux-man
  Cc: Mark Rutland, Mathias Stearn, Mathieu Desnoyers, Catalin Marinas,
	Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Blake Oler

On Thu, Apr 23 2026 at 07:53, Dmitry Vyukov wrote:
> On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote:
>
> This part of the rseq man page needs to be fixed as well I think. The
> kernel no longer reliably provides clearing of rseq_cs on preemption,
> right?
>
> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
>
> "and set to NULL by the kernel when it restarts an assembly
> instruction sequence block,
> as well as when the kernel detects that it is preempting or delivering
> a signal outside of the range targeted by the rseq_cs."

The kernel clears rseq_cs reliably when user space was interrupted and:

    the task was preempted
or
    the return from interrupt delivers a signal

If the task invoked a syscall then there is absolutely no reason to do
either of these, because syscalls from within a critical section are a
bug and are caught when rseq debugging is enabled.

The original code did this along with unconditionally updating CPU/MMCID
which resulted in ~15% performance regression on a syscall heavy
database benchmark once glibc started to register rseq.
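
As a rough illustration of the rule above, here is a user-space model in
C. It is a sketch, not kernel code: struct model_event and
model_must_handle_cs() are hypothetical names, and the three flags only
mirror the conditions named in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-task event state; not the kernel layout. */
struct model_event {
	bool user_irq;	/* entry was an interrupt/exception from user space */
	bool preempted;	/* the task was preempted */
	bool signal;	/* the return to user space delivers a signal */
};

/* Returns true if the exit path has to look at (and clear) rseq_cs. */
static bool model_must_handle_cs(const struct model_event *ev)
{
	assert(ev != NULL);

	/* Syscall entry: no critical section can be active, nothing to do. */
	if (!ev->user_irq)
		return false;

	return ev->preempted || ev->signal;
}
```

In this model a syscall entry never touches rseq_cs, even if the task is
preempted or a signal is pending, which is the case the rework optimizes.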

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 10:39             ` Thomas Gleixner
@ 2026-04-23 10:51               ` Mathias Stearn
  2026-04-23 12:24                 ` David Laight
  2026-04-23 19:31                 ` Thomas Gleixner
  0 siblings, 2 replies; 41+ messages in thread
From: Mathias Stearn @ 2026-04-23 10:51 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> The kernel clears rseq_cs reliably when user space was interrupted and:
>
>     the task was preempted
> or
>     the return from interrupt delivers a signal
>
> If the task invoked a syscall then there is absolutely no reason to do
> either of these, because syscalls from within a critical section are a
> bug and are caught when rseq debugging is enabled.
>
> The original code did this along with unconditionally updating CPU/MMCID
> which resulted in ~15% performance regression on a syscall heavy
> database benchmark once glibc started to register rseq.

Just to be clear TCMalloc does not need either rseq_cs to be cleared
or cpu_id_start to be written to on syscalls because it doesn't do
syscalls from critical sections. It will actually benefit (slightly)
from not updating cpu_id_start on syscalls.

It is specifically in the cases where an rseq would need to be aborted
(preemption, signals, migration, and membarrier IPI with the rseq
flag) that TCMalloc relies on cpu_id_start being written. It does rely
on that write even when not inside the critical section, because it
effectively uses that to detect if there were any would-cause-abort
events in between two critical sections. But since it leaves the
rseq_cs pointer non-null between critical sections, you don't need
to add _any_ overhead for programs that never make use of rseq after
registration, or add any overhead to syscalls even for those who do.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 10:51               ` Mathias Stearn
@ 2026-04-23 12:24                 ` David Laight
  2026-04-23 19:31                 ` Thomas Gleixner
  1 sibling, 0 replies; 41+ messages in thread
From: David Laight @ 2026-04-23 12:24 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Thomas Gleixner, Dmitry Vyukov, Jinjie Ruan, linux-man,
	Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar,
	Blake Oler

On Thu, 23 Apr 2026 12:51:22 +0200
Mathias Stearn <mathias@mongodb.com> wrote:

> On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > The kernel clears rseq_cs reliably when user space was interrupted and:
> >
> >     the task was preempted
> > or
> >     the return from interrupt delivers a signal
> >
> > If the task invoked a syscall then there is absolutely no reason to do
> > either of these, because syscalls from within a critical section are a
> > bug and are caught when rseq debugging is enabled.
> >
> > The original code did this along with unconditionally updating CPU/MMCID
> > which resulted in ~15% performance regression on a syscall heavy
> > database benchmark once glibc started to register rseq.  
> 
> Just to be clear TCMalloc does not need either rseq_cs to be cleared
> or cpu_id_start to be written to on syscalls because it doesn't do
> syscalls from critical sections. It will actually benefit (slightly)
> from not updating cpu_id_start on syscalls.
> 
> It is specifically in the cases where an rseq would need to be aborted
> (preemption, signals, migration, and membarrier IPI with the rseq
> flag) that TCMalloc relies on cpu_id_start being written. It does rely
> on that write even when not inside the critical section, because it
> effectively uses that to detect if there were any would-cause-abort
> events in between two critical sections. But since it leaves the
> rseq_cs pointer non-null between critical sections, you don't need
> to add _any_ overhead for programs that never make use of rseq after
> registration, or add any overhead to syscalls even for those who do.
> 

That sounds like one long rseq sequence where the 'restart' path
detects that some of the operations have already been done.

	David


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 10:51               ` Mathias Stearn
  2026-04-23 12:24                 ` David Laight
@ 2026-04-23 19:31                 ` Thomas Gleixner
  2026-04-24  7:56                   ` Dmitry Vyukov
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-23 19:31 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Dmitry Vyukov, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote:
> On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The kernel clears rseq_cs reliably when user space was interrupted and:
>>
>>     the task was preempted
>> or
>>     the return from interrupt delivers a signal
>>
>> If the task invoked a syscall then there is absolutely no reason to do
>> either of these, because syscalls from within a critical section are a
>> bug and are caught when rseq debugging is enabled.
>>
>> The original code did this along with unconditionally updating CPU/MMCID
>> which resulted in ~15% performance regression on a syscall heavy
>> database benchmark once glibc started to register rseq.
>
> Just to be clear TCMalloc does not need either rseq_cs to be cleared
> or cpu_id_start to be written to on syscalls because it doesn't do
> syscalls from critical sections. It will actually benefit (slightly)
> from not updating cpu_id_start on syscalls.

I know that it does not do syscalls from within critical sections, but
it relies on cpu_id_start being unconditionally updated in one way or
the other.

> It is specifically in the cases where an rseq would need to be aborted
> (preemption, signals, migration, and membarrier IPI with the rseq
> flag) that TCMalloc relies on cpu_id_start being written. It does rely
> on that write even when not inside the critical section, because it
> effectively uses that to detect if there were any would-cause-abort
> events in between two critical sections. But since it leaves the
> rseq_cs pointer non-null between critical sections, you don't need
> to add _any_ overhead for programs that never make use of rseq after
> registration, or add any overhead to syscalls even for those who do.

Well. According to the comment in the tcmalloc code:

// Calculation of the address of the current CPU slabs region is needed for
// allocation/deallocation fast paths, but is quite expensive. Due to variable
// shift and experimental support for "virtual CPUs", the calculation involves
// several additional loads and dependent calculations. Pseudo-code for the
// address calculation is as follows:
//
//   cpu_offset = TcmallocSlab.virtual_cpu_id_offset_;
//   cpu = *(&__rseq_abi + virtual_cpu_id_offset_);
//   slabs_and_shift = TcmallocSlab.slabs_and_shift_;
//   shift = slabs_and_shift & kShiftMask;
//   shifted_cpu = cpu << shift;
//   slabs = slabs_and_shift & kSlabsMask;
//   slabs += shifted_cpu;
//
// To remove this calculation from fast paths, we cache the slabs address
// for the current CPU in thread local storage. However, when a thread is
// rescheduled to another CPU, we somehow need to understand that the cached

                  ^^^^^^^^^^^

// address is not valid anymore. To achieve this, we overlap the top 4 bytes
// of the cached address with __rseq_abi.cpu_id_start. When a thread is
// rescheduled the kernel overwrites cpu_id_start with the current CPU number,
// which gives us the signal that the cached address is not valid anymore.
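
For readers who do not have the tcmalloc source at hand, the scheme that
comment describes can be modelled roughly as below. This is a sketch
under assumptions: the mask values, the marker bit and all model_* names
are hypothetical, and the kernel write is modelled with shifts so that
the example does not depend on byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for illustration; tcmalloc's real values may differ. */
#define MODEL_SHIFT_MASK	0xFFull
#define MODEL_SLABS_MASK	(~MODEL_SHIFT_MASK)
#define MODEL_CACHED_VALID	(1ull << 63)	/* assumed marker bit in the top 4 bytes */

/* The slow address calculation from the pseudo-code in the comment. */
static uint64_t model_slabs_for_cpu(uint64_t slabs_and_shift, uint32_t cpu)
{
	uint64_t shift = slabs_and_shift & MODEL_SHIFT_MASK;
	uint64_t slabs = slabs_and_shift & MODEL_SLABS_MASK;

	assert(shift < 32);	/* keep the shift in the model well defined */
	return slabs + ((uint64_t)cpu << shift);
}

/* User space caches the computed address and sets the marker bit. */
static uint64_t model_cache_store(uint64_t slabs)
{
	return slabs | MODEL_CACHED_VALID;
}

/* Model of the kernel storing a CPU number into the overlapping cpu_id_start. */
static uint64_t model_kernel_write(uint64_t word, uint32_t cpu)
{
	return (word & 0xFFFFFFFFull) | ((uint64_t)cpu << 32);
}

/* The cache is usable only while the marker bit survives. */
static bool model_cache_valid(uint64_t word)
{
	return (word & MODEL_CACHED_VALID) != 0;
}
```

In this model every kernel store to cpu_id_start invalidates the cached
word, whether or not the CPU number changed. A kernel that skips
redundant stores leaves the marker bit set.
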

The kernel still as of today (the arm64 bug aside) updates the
cpu_id_start and cpu_id fields in rseq when a task is rescheduled to
another CPU.

So if the code only requires to know when it got rescheduled to another
CPU then it still should work, no?

But it does not, which makes it clear that it relies on this
undocumented behaviour of the kernel to rewrite rseq::cpu_id_start
unconditionally. I'm not yet convinced that it relies on it only when
interrupted between two subsequent critical sections. We'll see.

....

Now we come to the best part of this comment:

// Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.

So any code sequence which ends up in:

   x = tcmalloc();
   dostuff(x)
     evaluate(rseq::cpu_id_start, rseq::cpu_id)

is doomed. This might be acceptable for Google internal usage where they
control the full stack and can prevent anyone else from using rseq, but
in an open ecosystem that's obviously a non-starter.

And they definitely forgot to add this to the comment:

// Never enable CONFIG_DEBUG_RSEQ in the kernel when you use tcmalloc as
// it will expose the blatant ABI abuse and therefore will kill your
// application.

If your assumption that the rewrite is only required when rseq::rseq_cs
is non NULL and user space was interrupted is correct, then the obvious
no-brainer would have been to add:

        __u64	rseq_usr_data;

to struct rseq and clear that unconditionally when rseq::rseq_cs is
cleared.
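
To make that suggestion concrete, a minimal user-space model follows. It
is a sketch only: rseq_usr_data is the field proposed in this mail and is
not part of the real struct rseq, and struct model_rseq and
model_clear_cs() are hypothetical names with a simplified layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct rseq; not the real uapi layout. */
struct model_rseq {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint64_t rseq_usr_data;	/* proposed field, does not exist in the ABI */
};

/* Model of the kernel clearing rseq_cs: the user data word goes with it. */
static void model_clear_cs(struct model_rseq *r)
{
	assert(r != NULL);

	r->rseq_cs = 0;
	r->rseq_usr_data = 0;
}
```

User space would store its cached value in rseq_usr_data and treat zero
as "invalidated", leaving cpu_id_start usable for its original purpose.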

But that would have been too simple, would work independent of endianness
and would not be in anybody else's way.

But I know that's incompatible with the features first, correctness
later and we own the world anyway mindset.

Just for giggles I asked Google Gemini about the implications of
tcmalloc's rseq abuse. The answer is pretty clear:

   "In short, TCMalloc treats RSEQ as a private optimization rather than
    a shared system resource, which compromises the stability and
    extensibility of any application that needs RSEQ for anything other
    than memory allocation."

It's also very clear about the wilful ignorance of the tcmalloc people:

   "In summary, the developers have known for at least 6 years that the
    implementation was non-standard and conflicting with other rseq
    usage. The github issue which requested glibc compatibility was
    opened in 2022 and has been unresolved since then."

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 19:31                 ` Thomas Gleixner
@ 2026-04-24  7:56                   ` Dmitry Vyukov
  2026-04-24  8:32                     ` Mathias Stearn
  0 siblings, 1 reply; 41+ messages in thread
From: Dmitry Vyukov @ 2026-04-24  7:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, 23 Apr 2026 at 21:31, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote:
> > On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> The kernel clears rseq_cs reliably when user space was interrupted and:
> >>
> >>     the task was preempted
> >> or
> >>     the return from interrupt delivers a signal
> >>
> >> If the task invoked a syscall then there is absolutely no reason to do
> >> either of these, because syscalls from within a critical section are a
> >> bug and are caught when rseq debugging is enabled.
> >>
> >> The original code did this along with unconditionally updating CPU/MMCID
> >> which resulted in ~15% performance regression on a syscall heavy
> >> database benchmark once glibc started to register rseq.
> >
> > Just to be clear TCMalloc does not need either rseq_cs to be cleared
> > or cpu_id_start to be written to on syscalls because it doesn't do
> > syscalls from critical sections. It will actually benefit (slightly)
> > from not updating cpu_id_start on syscalls.
>
> I know that it does not do syscalls from within critical sections, but
> it relies on cpu_id_start being unconditionally updated in one way or
> the other.
>
> > It is specifically in the cases where an rseq would need to be aborted
> > (preemption, signals, migration, and membarrier IPI with the rseq
> > flag) that TCMalloc relies on cpu_id_start being written. It does rely
> > on that write even when not inside the critical section, because it
> > effectively uses that to detect if there were any would-cause-abort
> > events in between two critical sections. But since it leaves the
> > rseq_cs pointer non-null between critical sections, you don't need
> > to add _any_ overhead for programs that never make use of rseq after
> > registration, or add any overhead to syscalls even for those who do.
>
> Well. According to the comment in the tcmalloc code:
>
> // Calculation of the address of the current CPU slabs region is needed for
> // allocation/deallocation fast paths, but is quite expensive. Due to variable
> // shift and experimental support for "virtual CPUs", the calculation involves
> // several additional loads and dependent calculations. Pseudo-code for the
> // address calculation is as follows:
> //
> //   cpu_offset = TcmallocSlab.virtual_cpu_id_offset_;
> //   cpu = *(&__rseq_abi + virtual_cpu_id_offset_);
> //   slabs_and_shift = TcmallocSlab.slabs_and_shift_;
> //   shift = slabs_and_shift & kShiftMask;
> //   shifted_cpu = cpu << shift;
> //   slabs = slabs_and_shift & kSlabsMask;
> //   slabs += shifted_cpu;
> //
> // To remove this calculation from fast paths, we cache the slabs address
> // for the current CPU in thread local storage. However, when a thread is
> // rescheduled to another CPU, we somehow need to understand that the cached
>
>                   ^^^^^^^^^^^
>
> // address is not valid anymore. To achieve this, we overlap the top 4 bytes
> // of the cached address with __rseq_abi.cpu_id_start. When a thread is
> // rescheduled the kernel overwrites cpu_id_start with the current CPU number,
> // which gives us the signal that the cached address is not valid anymore.
>
> The kernel still as of today (the arm64 bug aside) updates the
> cpu_id_start and cpu_id fields in rseq when a task is rescheduled to
> another CPU.
>
> So if the code only requires to know when it got rescheduled to another
> CPU then it still should work, no?

This was my first thought too:
https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/
The only problem is with membarrier (it used to force write to
__rseq_abi.cpu_id_start for all threads, but now it does not).
Otherwise the caching scheme works.

I have a tentative fix for tcmalloc:
https://github.com/dvyukov/tcmalloc/commit/58d0eca91503f539b26d20b6f55fb2f6f8bc0c37

The crux is as follows.
Tcmalloc needs to make all threads stop using old cached slab
pointers. The stopping procedure is now:

slab->stopped = true;
membarrier();

and all rseq critical sections now check the stopped flag in the
cached slab pointer. If it's set, the thread does not proceed to use
the slab.
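
A rough C model of that stopping procedure, for illustration only. All
model_* names are hypothetical, the membarrier() call is replaced by a
plain fence so that the sketch stays self-contained, and in the real fix
the flag check sits inside the rseq critical section rather than in an
ordinary function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slab with the stop flag described above. */
struct model_slab {
	atomic_bool stopped;
	int data;
};

/* Placeholder for membarrier(); the real call interrupts all threads. */
static void model_membarrier(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Stopping side: set the flag, then make every thread observe it. */
static void model_stop_slab(struct model_slab *s)
{
	assert(s != NULL);

	atomic_store(&s->stopped, true);
	model_membarrier();
}

/* Fast path side: returns NULL when the cached slab must not be used. */
static struct model_slab *model_use_cached(struct model_slab *cached)
{
	if (cached == NULL || atomic_load(&cached->stopped))
		return NULL;
	return cached;
}
```

With this shape the correctness no longer depends on the kernel rewriting
cpu_id_start; it only depends on membarrier aborting critical sections
that are in flight.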




> But it does not, which makes it clear that it relies on this
> undocumented behaviour of the kernel to rewrite rseq::cpu_id_start
> unconditionally. I'm not yet convinced that it relies on it only when
> interrupted between two subsequent critical sections. We'll see.
>
> ....
>
> Now we come to the best part of this comment:
>
> // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.
>
> So any code sequence which ends up in:
>
>    x = tcmalloc();
>    dostuff(x)
>      evaluate(rseq::cpu_id_start, rseq::cpu_id)
>
> is doomed. This might be acceptable for Google internal usage where they
> control the full stack and can prevent anyone else from using rseq, but
> in an open ecosystem that's obviously a non-starter.
>
> And they definitely forgot to add this to the comment:
>
> // Never enable CONFIG_DEBUG_RSEQ in the kernel when you use tcmalloc as
> // it will expose the blatant ABI abuse and therefore will kill your
> // application.
>
> If your assumption that the rewrite is only required when rseq::rseq_cs
> is non NULL and user space was interrupted is correct, then the obvious
> no-brainer would have been to add:
>
>         __u64   rseq_usr_data;
>
> to struct rseq and clear that unconditionally when rseq::rseq_cs is
> cleared.
>
> But that would have been too simple, would work independent of endianness
> and would not be in anybody else's way.
>
> But I know that's incompatible with the features first, correctness
> later and we own the world anyway mindset.
>
> Just for giggles I asked Google Gemini about the implications of
> tcmalloc's rseq abuse. The answer is pretty clear:
>
>    "In short, TCMalloc treats RSEQ as a private optimization rather than
>     a shared system resource, which compromises the stability and
>     extensibility of any application that needs RSEQ for anything other
>     than memory allocation."
>
> It's also very clear about the wilful ignorance of the tcmalloc people:
>
>    "In summary, the developers have known for at least 6 years that the
>     implementation was non-standard and conflicting with other rseq
>     usage. The github issue which requested glibc compatibility was
>     opened in 2022 and has been unresolved since then."
>
> Thanks,
>
>         tglx


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24  7:56                   ` Dmitry Vyukov
@ 2026-04-24  8:32                     ` Mathias Stearn
  2026-04-24  9:30                       ` Dmitry Vyukov
  2026-04-24 14:16                       ` Thomas Gleixner
  0 siblings, 2 replies; 41+ messages in thread
From: Mathias Stearn @ 2026-04-24  8:32 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > So if the code only requires to know when it got rescheduled to another
> > CPU then it still should work, no?
>
> This was my first thought too:
> https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/
> The only problem is with membarrier (it used to force write to
> __rseq_abi.cpu_id_start for all threads, but now it does not).
> Otherwise the caching scheme works.

I almost wrote a message last night saying that we didn't need
cpu_id_start invalidation on preemption. However, I remembered that
the Grow() function[1] does a load outside of a critical section then
stores a derived value inside the critical section, guarded only by
the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
should be doing a compare against the original value inside the
critical section (or just do the whole thing inside), but it doesn't.
I haven't reasoned through this end-to-end to prove that corruption
is possible, but I suspect that it is if another thread on the same CPU
preempts between the load and the store and updates the header before
the original thread resumes and writes its originally intended header
value. Ditto for signals, which sometimes allocate even though they
shouldn't.
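
To illustrate the suspected pattern, here is a minimal sketch in plain C.
It is not TCMalloc's code: `struct toy_header`, `toy_store_unchecked()` and
`toy_store_checked()` are hypothetical names, and preemption is modelled by
an explicit interleaving in the caller rather than by a real rseq critical
section. The only point is the difference between committing a value
derived from a stale snapshot and re-comparing the snapshot at commit time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU header, reduced to a single field for illustration. */
struct toy_header {
	uint32_t capacity;
};

/*
 * Racy pattern: the snapshot was loaded outside the critical section and
 * the derived value is stored without re-checking the header.
 */
static void toy_store_unchecked(struct toy_header *h, uint32_t snapshot,
				uint32_t delta)
{
	assert(h != NULL);
	h->capacity = snapshot + delta;
}

/*
 * Safer pattern: compare against the snapshot at commit time. In a real
 * rseq critical section the compare and the store would both sit inside
 * the sequence, so a preemption in between restarts the sequence.
 */
static bool toy_store_checked(struct toy_header *h, uint32_t snapshot,
			      uint32_t delta)
{
	assert(h != NULL);
	if (h->capacity != snapshot)
		return false;	/* header changed, caller has to retry */
	h->capacity = snapshot + delta;
	return true;
}
```

If another thread updates the header between the snapshot and the commit,
the unchecked variant silently overwrites that update, while the checked
variant reports the conflict and leaves the header alone.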

I was really hoping that the "redundant" cpu_id_start writes would only
be needed on membarrier_rseq IPIs, where it really is a
pay-for-what-you-use functionality, but I think existing binaries depend
on invalidation on preemption. Luckily that should be cheap enough to be
~free.

[1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980
[2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24  8:32                     ` Mathias Stearn
@ 2026-04-24  9:30                       ` Dmitry Vyukov
  2026-04-24 14:16                       ` Thomas Gleixner
  1 sibling, 0 replies; 41+ messages in thread
From: Dmitry Vyukov @ 2026-04-24  9:30 UTC (permalink / raw)
  To: Mathias Stearn
  Cc: Thomas Gleixner, Jinjie Ruan, linux-man, Mark Rutland,
	Mathieu Desnoyers, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Fri, 24 Apr 2026 at 10:32, Mathias Stearn <mathias@mongodb.com> wrote:
>
> On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> > > So if the code only requires to know when it got rescheduled to another
> > > CPU then it still should work, no?
> >
> > This was my first thought too:
> > https://lore.kernel.org/lkml/CACT4Y+a9GnOh3wHKSRwzoKF6_OSksQ8qehnHfpCgkQSt_OOmYg@mail.gmail.com/
> > The only problem is with membarrier (it used to force write to
> > __rseq_abi.cpu_id_start for all threads, but now it does not).
> > Otherwise the caching scheme works.
>
> I almost wrote a message last night saying that we didn't need
> cpu_id_start invalidation on preemption. However, I remembered that
> the Grow() function[1] does a load outside of a critical section then
> stores a derived value inside the critical section, guarded only by
> the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
> should be doing a compare against the original value inside the
> critical section (or just do the whole thing inside), but it doesn't.
> I haven't reasoned end-to-end through this fully to prove corruption
> is possible, but I suspect that it is if another thread same-cpu
> preempts between the loads and the store and updates the header before
> the original thread resumes and writes its original intended header
> value. Ditto for signals, which sometimes allocate even though they
> shouldn't.
>
> I was really hoping that we would only need to do the "redundant"
> cpu_id_start writes would only be needed on membarrier_rseq IPIs where
> it really is a pay-for-what-you-use functionality, I think existing
> binaries depend on invalidation on preemption. Luckily that should be
> cheap enough to be ~free.

I've prototyped this idea too:
https://github.com/dvyukov/linux/commit/1284e3723047cb5afd247f75c53de43efc18db82



> [1] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L964-L980
> [2] https://github.com/google/tcmalloc/blob/8e98046ec5639bffbe70a53770a2699dd355b26d/tcmalloc/internal/percpu_tcmalloc.h#L551-L605


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24  8:32                     ` Mathias Stearn
  2026-04-24  9:30                       ` Dmitry Vyukov
@ 2026-04-24 14:16                       ` Thomas Gleixner
  2026-04-24 15:03                         ` Peter Zijlstra
  1 sibling, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-24 14:16 UTC (permalink / raw)
  To: Mathias Stearn, Dmitry Vyukov
  Cc: Jinjie Ruan, linux-man, Mark Rutland, Mathieu Desnoyers,
	Catalin Marinas, Will Deacon, Boqun Feng, Paul E. McKenney,
	Chris Kennelly, regressions, linux-kernel, linux-arm-kernel,
	Peter Zijlstra, Ingo Molnar, Blake Oler

On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote:
> On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
>> The only problem is with membarrier (it used to force write to
>> __rseq_abi.cpu_id_start for all threads, but now it does not).
>> Otherwise the caching scheme works.
>
> I almost wrote a message last night saying that we didn't need
> cpu_id_start invalidation on preemption. However, I remembered that
> the Grow() function[1] does a load outside of a critical section then
> stores a derived value inside the critical section, guarded only by
> the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
> should be doing a compare against the original value inside the
> critical section (or just do the whole thing inside), but it doesn't.
> I haven't reasoned end-to-end through this fully to prove corruption
> is possible, but I suspect that it is if another thread same-cpu
> preempts between the loads and the store and updates the header before
> the original thread resumes and writes its original intended header
> value. Ditto for signals, which sometimes allocate even though they
> shouldn't.
>
> I was really hoping that we would only need to do the "redundant"
> cpu_id_start writes would only be needed on membarrier_rseq IPIs where
> it really is a pay-for-what-you-use functionality,

That's fine and can be solved without adding this sequence overhead into
the scheduler hotpath.

> I think existing binaries depend on invalidation on
> preemption. Luckily that should be cheap enough to be ~free.

That's only free when it can be buried in the rseq_cs update, which
means the ID update would not happen when rseq_cs is NULL.

If those two changes fix it w/o requiring additional tcmalloc changes,
I'm happy to hack that up tomorrow.

Thanks,

        tglx


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24 14:16                       ` Thomas Gleixner
@ 2026-04-24 15:03                         ` Peter Zijlstra
  2026-04-24 19:44                           ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2026-04-24 15:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man,
	Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler

On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote:
> On Fri, Apr 24 2026 at 10:32, Mathias Stearn wrote:
> > On Fri, Apr 24, 2026 at 9:57 AM Dmitry Vyukov <dvyukov@google.com> wrote:
> >> The only problem is with membarrier (it used to force write to
> >> __rseq_abi.cpu_id_start for all threads, but now it does not).
> >> Otherwise the caching scheme works.
> >
> > I almost wrote a message last night saying that we didn't need
> > cpu_id_start invalidation on preemption. However, I remembered that
> > the Grow() function[1] does a load outside of a critical section then
> > stores a derived value inside the critical section, guarded only by
> > the cpu_id_start invalidation check in StoreCurrentCpu[2]. It really
> > should be doing a compare against the original value inside the
> > critical section (or just do the whole thing inside), but it doesn't.
> > I haven't reasoned end-to-end through this fully to prove corruption
> > is possible, but I suspect that it is if another thread same-cpu
> > preempts between the loads and the store and updates the header before
> > the original thread resumes and writes its original intended header
> > value. Ditto for signals, which sometimes allocate even though they
> > shouldn't.
> >
> > I was really hoping that we would only need to do the "redundant"
> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where
> > it really is a pay-for-what-you-use functionality,
> 
> That's fine and can be solved without adding this sequence overhead into
> the scheduler hotpath.

Something like so? (probably needs help for !GENERIC bits)

---

diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index 528e6fc7efe9..1d786003e42a 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -48,7 +48,10 @@
 #define TIF_RSEQ		11	// Run RSEQ fast path
 #define _TIF_RSEQ		BIT(TIF_RSEQ)
 
-#define TIF_HRTIMER_REARM	12       // re-arm the timer
+#define TIF_RSEQ_FORCE_RESTART	12	// Reset RSEQ-CS from membarrier
+#define _TIF_RSEQ_FORCE_RESTART	BIT(TIF_RSEQ_FORCE_RESTART)
+
+#define TIF_HRTIMER_REARM	13       // re-arm the timer
 #define _TIF_HRTIMER_REARM	BIT(TIF_HRTIMER_REARM)
 
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index b9d62fc2140d..2cbee6d41198 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -158,6 +158,8 @@ static inline unsigned int rseq_alloc_align(void)
 	return 1U << get_count_order(offsetof(struct rseq, end));
 }
 
+extern void rseq_prepare_membarrier(struct mm_struct *mm);
+
 #else /* CONFIG_RSEQ */
 static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
@@ -167,6 +169,7 @@ static inline void rseq_force_update(void) { }
 static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
+static inline void rseq_prepare_membarrier(struct mm_struct *mm) { }
 #endif  /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index f11ebd34f8b9..3dfaca776971 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -686,7 +686,12 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg
 #ifdef CONFIG_HAVE_GENERIC_TIF_BITS
 static __always_inline bool test_tif_rseq(unsigned long ti_work)
 {
-	return ti_work & _TIF_RSEQ;
+	return ti_work & (_TIF_RSEQ | _TIF_RSEQ_FORCE_RESTART);
+}
+
+static __always_inline void clear_tif_rseq_force_restart(void)
+{
+	clear_thread_flag(TIF_RSEQ_FORCE_RESTART);
 }
 
 static __always_inline void clear_tif_rseq(void)
@@ -696,6 +701,7 @@ static __always_inline void clear_tif_rseq(void)
 }
 #else
 static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq_force_restart(void) { }
 static __always_inline void clear_tif_rseq(void) { }
 #endif
 
@@ -703,6 +709,11 @@ static __always_inline bool
 rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
 {
 	if (unlikely(test_tif_rseq(ti_work))) {
+		if (unlikely(ti_work & _TIF_RSEQ_FORCE_RESTART)) {
+			current->rseq.event.sched_switch = true;
+			current->rseq.event.ids_changed = true;
+			clear_tif_rseq_force_restart();
+		}
 		if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
 			current->rseq.event.slowpath = true;
 			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 38d3ef540760..9adc7f63adf5 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -255,6 +255,19 @@ static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
 	return false;
 }
 
+void rseq_prepare_membarrier(struct mm_struct *mm)
+{
+	struct task_struct *t;
+
+	guard(mutex)(&mm->mm_cid.mutex);
+
+	hlist_for_each_entry(t, &mm->mm_cid.user_list, mm_cid.node) {
+		if (t == current)
+			continue;
+		set_tsk_thread_flag(t, TIF_RSEQ_FORCE_RESTART);
+	}
+}
+
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
 	/*
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 623445603725..696988bb991b 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -334,6 +334,7 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_RSEQ_READY))
 			return -EPERM;
 		ipi_func = ipi_rseq;
+		rseq_prepare_membarrier(mm);
 	} else {
 		WARN_ON_ONCE(flags);
 		if (!(atomic_read(&mm->membarrier_state) &
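
For readers who want the control flow of the patch above without the
kernel context, the following is a toy user-space model. It is only a
sketch under stated assumptions: `struct toy_thread` and the `toy_*`
functions are hypothetical stand-ins for the thread flag and the rseq event
bits, and nothing here touches real kernel APIs, locking or IPIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task state the patch touches. */
struct toy_thread {
	bool force_restart;	/* models TIF_RSEQ_FORCE_RESTART */
	bool sched_switch;	/* models rseq.event.sched_switch */
	bool ids_changed;	/* models rseq.event.ids_changed */
};

/* Models rseq_prepare_membarrier(): flag every thread except the caller. */
static void toy_prepare_membarrier(struct toy_thread *threads, size_t n,
				   size_t self)
{
	assert(self < n);
	for (size_t i = 0; i < n; i++) {
		if (i == self)
			continue;
		threads[i].force_restart = true;
	}
}

/*
 * Models the added check on exit to user mode: a pending force-restart
 * flag is converted into the two event bits and then cleared. Returns
 * true when an ID rewrite was forced this way.
 */
static bool toy_exit_to_user(struct toy_thread *t)
{
	if (!t->force_restart)
		return false;
	t->sched_switch = true;
	t->ids_changed = true;
	t->force_restart = false;
	return true;
}
```

The model shows why the cost stays out of the scheduler hot path: threads
pay for the extra rewrite only after a membarrier caller has flagged them.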


* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24 15:03                         ` Peter Zijlstra
@ 2026-04-24 19:44                           ` Thomas Gleixner
  2026-04-26 22:04                             ` Thomas Gleixner
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-24 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man,
	Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler

On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote:
> On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote:
>> > I was really hoping that we would only need to do the "redundant"
>> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where
>> > it really is a pay-for-what-you-use functionality,
>> 
>> That's fine and can be solved without adding this sequence overhead into
>> the scheduler hotpath.
>
> Something like so? (probably needs help for !GENERIC bits)

Yes and yes :)

Let me stare at that !generic tif bits case.



* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-24 19:44                           ` Thomas Gleixner
@ 2026-04-26 22:04                             ` Thomas Gleixner
  2026-04-27  7:40                               ` Florian Weimer
  0 siblings, 1 reply; 41+ messages in thread
From: Thomas Gleixner @ 2026-04-26 22:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mathias Stearn, Dmitry Vyukov, Jinjie Ruan, linux-man,
	Mark Rutland, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Ingo Molnar, Blake Oler,
	Dmitry Vyukov, Florian Weimer, Rich Felker, Matthew Wilcox,
	Greg Kroah-Hartman, Linus Torvalds

On Fri, Apr 24 2026 at 21:44, Thomas Gleixner wrote:
> On Fri, Apr 24 2026 at 17:03, Peter Zijlstra wrote:
>> On Fri, Apr 24, 2026 at 04:16:08PM +0200, Thomas Gleixner wrote:
>>> > I was really hoping that we would only need to do the "redundant"
>>> > cpu_id_start writes would only be needed on membarrier_rseq IPIs where
>>> > it really is a pay-for-what-you-use functionality,
>>> 
>>> That's fine and can be solved without adding this sequence overhead into
>>> the scheduler hotpath.
>>
>> Something like so? (probably needs help for !GENERIC bits)
>
> Yes and yes :)
>
> Let me stare at that !generic tif bits case.

I stared at it and finally gave up because all of this is in a
completely FUBAR'ed state and ends up in a horrible pile of hacks and
duct tape with a way larger than zero probability that we chase the
nasty corner cases for quite some time just to add more duct tape and
hacks.

Contrary to that it's rather trivial to cleanly separate the behavioral
cases and guarantees without a massive runtime overhead and without a
pile of hard to maintain TCMalloc specific hacks.

All required code is already available to support the architectures
which do not utilize the generic entry code and therefore can neither
use the optimized mode nor time slice extensions. So instead of letting
the compiler optimize that code out for the generic entry code users, we
can keep it around and utilize one or the other depending on the
requested mode. I managed to get the required run-time conditionals down
to a minimum so that they are in the noise when analysing it with perf.

The real question is how to differentiate between the legacy and the
optimized mode. I have two working variants to achieve that:

   1) The fully safe option requires a new flag for RSEQ
      registration. It obviously requires a glibc update. (Suggested by
      PeterZ)

   2) Determine the requirements of the registering task via the size of
      the registered RSEQ area.

      The original implementation, which TCMalloc depends on, registers
      a 32 byte region (ORIG_RSEQ_SIZE). This region has a 32 byte
      alignment requirement.

      The extension safe newer variant exposes the kernel RSEQ feature
      size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment
      requirement via getauxval(AT_RSEQ_ALIGN). The alignment
      requirement is that the registered rseq region is aligned to the
      next power of two of the feature size. The kernel currently has a
      feature size of 33 bytes, which means the alignment requirement is
      64 bytes.

      The TCMalloc RSEQ region is embedded into a cache line aligned
      data structure starting at offset 32 bytes so that bytes 28-31 and
      the cpu_id_start field at bytes 32-35 form a 64-bit little endian
      pointer with the top-most bit (bit 63) set. That bit is used to
      check whether the kernel has overwritten cpu_id_start with an
      actual CPU id value, which is guaranteed to not have the top-most
      bit set.

      As this is part of their performance tuned magic, it's a pretty
      safe assumption that TCMalloc won't use a larger RSEQ size, which
      allows selecting optimized mode for registrations with a size
      greater than 32 bytes.

      That does not require any changes to glibc and works out of the
      box. (Suggested by Mathieu)
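
The tagged-pointer trick described under 2) can be modelled roughly as
follows. This is a sketch based only on the description above, not on
TCMalloc's sources: `struct toy_slot` and the `toy_*` helpers are
hypothetical, and the two 32-bit words are combined arithmetically instead
of through a single 64-bit load so that the example does not depend on the
host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TAG_BIT	(UINT64_C(1) << 63)

/* Hypothetical model of the two adjacent 32-bit words. */
struct toy_slot {
	uint32_t ptr_lo;	/* models bytes 28-31 of the enclosing structure */
	uint32_t cpu_id_start;	/* models bytes 32-35, the first rseq field */
};

/* Combine both words the way a little endian 64-bit load would see them. */
static uint64_t toy_load(const struct toy_slot *s)
{
	return (uint64_t)s->ptr_lo | ((uint64_t)s->cpu_id_start << 32);
}

/* User space caches a pointer and tags it with bit 63. */
static void toy_cache(struct toy_slot *s, uint64_t ptr)
{
	uint64_t tagged = ptr | TOY_TAG_BIT;

	s->ptr_lo = (uint32_t)tagged;
	s->cpu_id_start = (uint32_t)(tagged >> 32);
}

/* Models the kernel storing a CPU id, which never has the top bit set. */
static void toy_kernel_writes_cpu(struct toy_slot *s, uint32_t cpu)
{
	assert(!(cpu & UINT32_C(0x80000000)));
	s->cpu_id_start = cpu;
}

/* The cached pointer is usable only while the tag bit is still set. */
static bool toy_cache_valid(const struct toy_slot *s)
{
	return (toy_load(s) & TOY_TAG_BIT) != 0;
}
```

In this model any kernel write to cpu_id_start clears the tag, which is
why the scheme depends on the kernel rewriting the field even when the CPU
id itself did not change.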

In both cases the legacy non-optimized mode exposes the original
behaviour up to the mm_cid field and does not provide support for time
slice extensions.  Optimized mode restores the performance gains and
enables support for time slice extensions.
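
As a rough illustration of the size based selection in variant 2), the
sketch below reduces the decision to two helpers. The names are
hypothetical, and the sketch deliberately leaves out what the actual patch
also does: the CONFIG_GENERIC_IRQ_ENTRY condition and the validation of
rseq_len and of the pointer alignment.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ORIG_RSEQ_SIZE	32u	/* mirrors ORIG_RSEQ_SIZE */

enum toy_rseq_mode {
	TOY_RSEQ_LEGACY,
	TOY_RSEQ_OPTIMIZED,
};

/* Required alignment: the feature size rounded up to a power of two. */
static uint32_t toy_rseq_align(uint32_t feature_size)
{
	uint32_t align = 1;

	assert(feature_size > 0 && feature_size <= (UINT32_C(1) << 31));
	while (align < feature_size)
		align <<= 1;
	return align;
}

/* Registrations larger than the original 32 bytes get optimized mode. */
static enum toy_rseq_mode toy_select_mode(uint32_t rseq_len)
{
	return rseq_len > TOY_ORIG_RSEQ_SIZE ? TOY_RSEQ_OPTIMIZED
					     : TOY_RSEQ_LEGACY;
}
```

With a feature size of 33 bytes this yields the 64 byte alignment
mentioned above, and a 32 byte registration stays in legacy mode.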

I have no strong preference either way and have working code for both
variants. Though obviously avoiding an update of the libc world has its
charm. If that unexpectedly turns out to be insufficient, then
disabling it would be a trivial one-liner, which as a consequence would
require adding the flag and updating the libc world.

Combo patch for the auto-detection based on the registered size below as
that allows immediate testing without glibc dependencies. It applies
cleanly on Linus' tree and 7.0. 6.19 would need some fixups, but I
learned today that it's already EOL.

In the final version that's three separate patches plus a set of
selftest changes which validate legacy behaviour and run the full param
test suite in both legacy and optimized mode.

Thoughts, preferences?

Thanks,

        tglx
---
 Documentation/userspace-api/rseq.rst |   77 ++++++++++++++
 include/linux/rseq.h                 |   20 +++
 include/linux/rseq_entry.h           |  110 ++++++++++-----------
 include/linux/rseq_types.h           |    3 
 kernel/rseq.c                        |  183 ++++++++++++++++++++++-------------
 kernel/sched/membarrier.c            |   11 +-
 6 files changed, 280 insertions(+), 124 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,6 +9,11 @@
 
 void __rseq_handle_slowpath(struct pt_regs *regs);
 
+static __always_inline bool rseq_optimized(struct task_struct *t)
+{
+	return IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && likely(t->rseq.event.optimized);
+}
+
 /* Invoked from resume_user_mode_work() */
 static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
@@ -30,7 +35,7 @@ void __rseq_signal_deliver(int sig, stru
  */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(current)) {
 		/* '&' is intentional to spare one conditional branch */
 		if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
 			__rseq_signal_deliver(ksig->sig, regs);
@@ -50,15 +55,21 @@ static __always_inline void rseq_sched_s
 {
 	struct rseq_event *ev = &t->rseq.event;
 
-	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+	/*
+	 * Only apply the user_irq optimization for RSEQ ABI V2
+	 * registrations. Legacy users like TCMalloc rely on the historical ABI
+	 * V1 behaviour which updates IDs on every context switch.
+	 */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_optimized(t)) {
 		/*
 		 * Avoid a boat load of conditionals by using simple logic
 		 * to determine whether NOTIFY_RESUME needs to be raised.
 		 *
 		 * It's required when the CPU or MM CID has changed or
-		 * the entry was from user space.
+		 * the entry was from user space. ev->has_rseq does not
+		 * have to be evaluated because optimized implies has_rseq.
 		 */
-		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+		bool raise = ev->user_irq | ev->ids_changed;
 
 		if (raise) {
 			ev->sched_switch = true;
@@ -66,6 +77,7 @@ static __always_inline void rseq_sched_s
 		}
 	} else {
 		if (ev->has_rseq) {
+			t->rseq.event.ids_changed = true;
 			t->rseq.event.sched_switch = true;
 			rseq_raise_notify_resume(t);
 		}
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -111,6 +111,20 @@ static __always_inline void rseq_slice_c
 	t->rseq.slice.state.granted = false;
 }
 
+/*
+ * Open coded, so it can be invoked within a user access region.
+ *
+ * This clears the user space state of the time slice extensions field only when
+ * the task has registered the optimized RSEQ_ABI V2. Some legacy registrations,
+ * e.g. TCMalloc, have conflicting non-ABI fields in struct RSEQ, which would be
+ * overwritten by an unconditional write.
+ */
+#define rseq_slice_clear_user(rseq, efault)				\
+do {									\
+	if (rseq_slice_extension_enabled())				\
+		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);	\
+} while (0)
+
 static __always_inline bool __rseq_grant_slice_extension(bool work_pending)
 {
 	struct task_struct *curr = current;
@@ -230,10 +244,10 @@ static __always_inline bool rseq_slice_e
 static __always_inline bool rseq_arm_slice_extension_timer(void) { return false; }
 static __always_inline void rseq_slice_clear_grant(struct task_struct *t) { }
 static __always_inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
+#define rseq_slice_clear_user(rseq, efault) do { } while (0)
 #endif /* !CONFIG_RSEQ_SLICE_EXTENSION */
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -353,43 +367,6 @@ bool rseq_debug_update_user_cs(struct ta
 	return false;
 }
 
-/*
- * On debug kernels validate that user space did not mess with it if the
- * debug branch is enabled.
- */
-bool rseq_debug_validate_ids(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq.usrptr;
-	u32 cpu_id, uval, node_id;
-
-	/*
-	 * On the first exit after registering the rseq region CPU ID is
-	 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
-	 */
-	node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
-		  cpu_to_node(t->rseq.ids.cpu_id) : 0;
-
-	scoped_user_read_access(rseq, efault) {
-		unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
-		if (cpu_id != t->rseq.ids.cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->cpu_id, efault);
-		if (uval != cpu_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->node_id, efault);
-		if (uval != node_id)
-			goto die;
-		unsafe_get_user(uval, &rseq->mm_cid, efault);
-		if (uval != t->rseq.ids.mm_cid)
-			goto die;
-	}
-	return true;
-die:
-	t->rseq.event.fatal = true;
-efault:
-	return false;
-}
-
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
@@ -504,12 +481,32 @@ bool rseq_set_ids_get_csaddr(struct task
 {
 	struct rseq __user *rseq = t->rseq.usrptr;
 
-	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_ids(t))
-			return false;
-	}
-
 	scoped_user_rw_access(rseq, efault) {
+		/* Validate the R/O fields for debug and optimized mode */
+		if (static_branch_unlikely(&rseq_debug_enabled) || rseq_optimized(t)) {
+			u32 cpu_id, uval, node_id;
+
+			/*
+			 * On the first exit after registering the rseq region CPU ID is
+			 * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+			 */
+			node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+				cpu_to_node(t->rseq.ids.cpu_id) : 0;
+
+			unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+			if (cpu_id != t->rseq.ids.cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->cpu_id, efault);
+			if (uval != cpu_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->node_id, efault);
+			if (uval != node_id)
+				goto die;
+			unsafe_get_user(uval, &rseq->mm_cid, efault);
+			if (uval != t->rseq.ids.mm_cid)
+				goto die;
+		}
+
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
 		unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
 		unsafe_put_user(node_id, &rseq->node_id, efault);
@@ -517,11 +514,9 @@ bool rseq_set_ids_get_csaddr(struct task
 		if (csaddr)
 			unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
 
-		/* Open coded, so it's in the same user access region */
-		if (rseq_slice_extension_enabled()) {
-			/* Unconditionally clear it, no point in conditionals */
-			unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-		}
+		/* RSEQ ABI V2 only operations */
+		if (rseq_optimized(t))
+			rseq_slice_clear_user(rseq, efault);
 	}
 
 	rseq_slice_clear_grant(t);
@@ -530,6 +525,9 @@ bool rseq_set_ids_get_csaddr(struct task
 	rseq_stat_inc(rseq_stats.ids);
 	rseq_trace_update(t, ids);
 	return true;
+
+die:
+	t->rseq.event.fatal = true;
 efault:
 	return false;
 }
@@ -612,6 +610,14 @@ static __always_inline bool rseq_exit_us
 	 * interrupts disabled
 	 */
 	guard(pagefault)();
+	/*
+	 * This optimization is only valid when the task registered for the
+	 * optimized RSEQ_ABI_V2 variant. Some legacy users rely on the original
+	 * RSEQ implementation behaviour which unconditionally updated the IDs.
+	 * rseq_sched_switch_event() ensures that legacy registrations always
+	 * have both sched_switch and ids_changed set, which is compatible with
+	 * the historical TIF_NOTIFY_RESUME behaviour.
+	 */
 	if (likely(!t->rseq.event.ids_changed)) {
 		struct rseq __user *rseq = t->rseq.usrptr;
 		/*
@@ -623,11 +629,9 @@ static __always_inline bool rseq_exit_us
 		scoped_user_rw_access(rseq, efault) {
 			unsafe_get_user(csaddr, &rseq->rseq_cs, efault);
 
-			/* Open coded, so it's in the same user access region */
-			if (rseq_slice_extension_enabled()) {
-				/* Unconditionally clear it, no point in conditionals */
-				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
-			}
+			/* RSEQ ABI V2 only operations */
+			if (rseq_optimized(t))
+				rseq_slice_clear_user(rseq, efault);
 		}
 
 		rseq_slice_clear_grant(t);
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,7 @@ struct rseq;
  * @ids_changed:	Indicator that IDs need to be updated
  * @user_irq:		True on interrupt entry from user mode
  * @has_rseq:		True if the task has a rseq pointer installed
+ * @optimized:		RSEQ ABI V2 optimized mode
  * @error:		Compound error code for the slow path to analyze
  * @fatal:		User space data corrupted or invalid
  * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
@@ -41,7 +42,7 @@ struct rseq_event {
 			};
 
 			u8			has_rseq;
-			u8			__pad;
+			u8			optimized;
 			union {
 				u16		error;
 				struct {
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -258,11 +258,15 @@ static bool rseq_handle_cs(struct task_s
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
 	/*
-	 * Preserve rseq state and user_irq state. The generic entry code
-	 * clears user_irq on the way out, the non-generic entry
-	 * architectures are not having user_irq.
-	 */
-	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
+	 * Preserve has_rseq, optimized and user_irq state. The generic entry
+	 * code clears user_irq on the way out, the non-generic entry
+	 * architectures are not setting user_irq.
+	 */
+	const struct rseq_event evt_mask = {
+		.has_rseq	= true,
+		.user_irq	= true,
+		.optimized	= true,
+	};
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
@@ -335,8 +339,9 @@ void __rseq_handle_slowpath(struct pt_re
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)
 {
 	rseq_stat_inc(rseq_stats.signal);
+
 	/*
-	 * Don't update IDs, they are handled on exit to user if
+	 * Don't update IDs yet, they are handled on exit to user if
 	 * necessary. The important thing is to abort a critical section of
 	 * the interrupted context as after this point the instruction
 	 * pointer in @regs points to the signal handler.
@@ -349,6 +354,13 @@ void __rseq_signal_deliver(int sig, stru
 		current->rseq.event.error = 0;
 		force_sigsegv(sig);
 	}
+
+	/*
+	 * In legacy mode, force the update of IDs before returning to user
+	 * space to stay compatible.
+	 */
+	if (!rseq_optimized(current))
+		rseq_force_update();
 }
 
 /*
@@ -404,66 +416,19 @@ static bool rseq_reset_ids(void)
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
-/*
- * sys_rseq - setup restartable sequences for caller thread.
- */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
+static long rseq_register(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
 {
+	bool optimized = IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY) && rseq_len > ORIG_RSEQ_SIZE;
 	u32 rseqfl = 0;
 
-	if (flags & RSEQ_FLAG_UNREGISTER) {
-		if (flags & ~RSEQ_FLAG_UNREGISTER)
-			return -EINVAL;
-		/* Unregister rseq for current thread. */
-		if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
-			return -EINVAL;
-		if (rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		if (!rseq_reset_ids())
-			return -EFAULT;
-		rseq_reset(current);
-		return 0;
-	}
-
-	if (unlikely(flags & ~(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)))
-		return -EINVAL;
-
-	if (current->rseq.usrptr) {
-		/*
-		 * If rseq is already registered, check whether
-		 * the provided address differs from the prior
-		 * one.
-		 */
-		if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
-			return -EINVAL;
-		if (current->rseq.sig != sig)
-			return -EPERM;
-		/* Already registered. */
-		return -EBUSY;
-	}
-
-	/*
-	 * If there was no rseq previously registered, ensure the provided rseq
-	 * is properly aligned, as communcated to user-space through the ELF
-	 * auxiliary vector AT_RSEQ_ALIGN. If rseq_len is the original rseq
-	 * size, the required alignment is the original struct rseq alignment.
-	 *
-	 * The rseq_len is required to be greater or equal to the original rseq
-	 * size. In order to be valid, rseq_len is either the original rseq size,
-	 * or large enough to contain all supported fields, as communicated to
-	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
-	 */
-	if (rseq_len < ORIG_RSEQ_SIZE ||
-	    (rseq_len == ORIG_RSEQ_SIZE && !IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE)) ||
-	    (rseq_len != ORIG_RSEQ_SIZE && (!IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) ||
-					    rseq_len < offsetof(struct rseq, end))))
-		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
 
-	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION)) {
+	/*
+	 * The optimized check disables time slice extensions for legacy
+	 * registrations.
+	 */
+	if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION) && optimized) {
 		rseqfl |= RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE;
 		if (rseq_slice_extension_enabled() &&
 		    (flags & RSEQ_FLAG_SLICE_EXT_DEFAULT_ON))
@@ -485,7 +450,15 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
 		unsafe_put_user(0U, &rseq->node_id, efault);
 		unsafe_put_user(0U, &rseq->mm_cid, efault);
-		unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+
+		/*
+		 * All fields past mm_cid are only valid for non-legacy registrations
+		 * which register with rseq_len > ORIG_RSEQ_SIZE.
+		 */
+		if (optimized) {
+			if (IS_ENABLED(CONFIG_RSEQ_SLICE_EXTENSION))
+				unsafe_put_user(0U, &rseq->slice_ctrl.all, efault);
+		}
 	}
 
 	/*
@@ -501,11 +474,11 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 #endif
 
 	/*
-	 * If rseq was previously inactive, and has just been
-	 * registered, ensure the cpu_id_start and cpu_id fields
-	 * are updated before returning to user-space.
+	 * Ensure the cpu_id_start and cpu_id fields are updated before
+	 * returning to user-space.
 	 */
 	current->rseq.event.has_rseq = true;
+	current->rseq.event.optimized = optimized;
 	rseq_force_update();
 	return 0;
 
@@ -513,6 +486,86 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	return -EFAULT;
 }
 
+static long rseq_unregister(struct rseq __user * rseq, u32 rseq_len, int flags, u32 sig)
+{
+	if (flags & ~RSEQ_FLAG_UNREGISTER)
+		return -EINVAL;
+	if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
+		return -EINVAL;
+	if (rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	if (!rseq_reset_ids())
+		return -EFAULT;
+	rseq_reset(current);
+	return 0;
+}
+
+static long rseq_reregister(struct rseq __user * rseq, u32 rseq_len, u32 sig)
+{
+	/*
+	 * If rseq is already registered, check whether the provided address
+	 * differs from the prior one.
+	 */
+	if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
+		return -EINVAL;
+	if (current->rseq.sig != sig)
+		return -EPERM;
+	/* Already registered. */
+	return -EBUSY;
+}
+
+static bool rseq_length_valid(struct rseq __user *rseq, unsigned int rseq_len)
+{
+	if (rseq_len < ORIG_RSEQ_SIZE)
+		return false;
+
+	/*
+	 * Ensure the provided rseq is properly aligned, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_ALIGN. If
+	 * rseq_len is the original rseq size, the required alignment is the
+	 * original struct rseq alignment.
+	 *
+	 * The rseq_len is required to be greater than or equal to the original
+	 * rseq size.
+	 *
+	 * In order to be valid, rseq_len is either the original rseq size, or
+	 * large enough to contain all supported fields, as communicated to
+	 * user-space through the ELF auxiliary vector AT_RSEQ_FEATURE_SIZE.
+	 */
+	if (rseq_len < ORIG_RSEQ_SIZE)
+		return false;
+
+	if (rseq_len == ORIG_RSEQ_SIZE)
+		return IS_ALIGNED((unsigned long)rseq, ORIG_RSEQ_SIZE);
+
+	return IS_ALIGNED((unsigned long)rseq, rseq_alloc_align()) &&
+		rseq_len >= offsetof(struct rseq, end);
+}
+
+#define RSEQ_FLAGS_SUPPORTED	(RSEQ_FLAG_SLICE_EXT_DEFAULT_ON)
+
+/*
+ * sys_rseq - Register or unregister restartable sequences for the caller thread.
+ */
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
+{
+	if (flags & RSEQ_FLAG_UNREGISTER)
+		return rseq_unregister(rseq, rseq_len, flags, sig);
+
+	if (unlikely(flags & ~RSEQ_FLAGS_SUPPORTED))
+		return -EINVAL;
+
+	if (current->rseq.usrptr)
+		return rseq_reregister(rseq, rseq_len, sig);
+
+	if (!rseq_length_valid(rseq, rseq_len))
+		return -EINVAL;
+
+	return rseq_register(rseq, rseq_len, flags, sig);
+}
+
 #ifdef CONFIG_RSEQ_SLICE_EXTENSION
 struct slice_timer {
 	struct hrtimer	timer;
@@ -713,6 +766,8 @@ int rseq_slice_extension_prctl(unsigned
 			return -ENOTSUPP;
 		if (!current->rseq.usrptr)
 			return -ENXIO;
+		if (!current->rseq.event.optimized)
+			return -ENOTSUPP;
 
 		/* No change? */
 		if (enable == !!current->rseq.slice.state.enabled)
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,16 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_sched_switch_event(current);
+	/*
+	 * Legacy mode requires that IDs are written and the critical section is
+	 * evaluated. Optimized mode handles the critical section and IDs are
+	 * only updated if they change as a consequence of preemption after
+	 * return from this IPI.
+	 */
+	if (rseq_optimized(current))
+		rseq_sched_switch_event(current);
+	else
+		rseq_force_update();
 }
 
 static void ipi_sync_rq_state(void *info)
--- a/Documentation/userspace-api/rseq.rst
+++ b/Documentation/userspace-api/rseq.rst
@@ -24,6 +24,80 @@ Quick access to CPU number, node ID
 Allows to implement per CPU data efficiently. Documentation is in code and
 selftests. :(
 
+Optimized RSEQ V2
+-----------------
+
+On architectures which utilize the generic entry code and generic TIF bits
+the kernel supports runtime optimizations for RSEQ, which also enable
+enhanced features like scheduler time slice extensions.
+
+To enable them a task has to register the RSEQ region with at least the
+length advertised by getauxval(AT_RSEQ_FEATURE_SIZE).
+
+If existing binaries register with ORIG_RSEQ_SIZE (32 bytes), the kernel
+keeps the legacy low performance mode enabled to fulfil the expectations of
+existing users regarding the original RSEQ implementation behaviour.
+
+The following table documents the ABI and behavioral guarantees of the
+legacy and the optimized V2 mode.
+
+.. list-table:: RSEQ modes
+   :header-rows: 1
+
+   * - Nr
+     - What
+     - Legacy
+     - Optimized V2
+   * - 1
+     - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read
+       only)
+     - Updated by the kernel unconditionally after each context switch and
+       before signal delivery
+     - Updated by the kernel if and only if they change, i.e. if the task
+       is migrated or mm_cid changes
+   * - 2
+     - The rseq_cs critical section field
+     - Evaluated and handled unconditionally after each context switch and
+       before signal delivery
+     - Evaluated and handled conditionally only when user space was
+       interrupted. Either after being preempted or before signal delivery
+       in the interrupted context.
+   * - 3
+     - Read only fields
+     - No strict enforcement except in debug mode
+     - Strict enforcement
+   * - 4
+     - membarrier(...RSEQ)
+     - All running threads of the process are interrupted, the ID fields
+       are rewritten and any active critical sections are aborted before
+       the threads return to user space. All threads which are scheduled
+       out, whether voluntarily or not, are covered by #1/#2 above.
+     - All running threads of the process are interrupted and any active
+       critical sections are aborted before these threads return to user
+       space. The ID fields are only updated if changed as a consequence
+       of the interrupt. All threads which are scheduled out, whether
+       voluntarily or not, are covered by #1/#2 above.
+   * - 5
+     - Time slice extensions
+     - Not supported
+     - Supported
+
+The legacy mode is obviously less performant as it does unconditional
+updates and critical section checks even if not strictly required by the
+ABI contract. That can't be changed anymore as some users depend on that
+observed behavior, which in turn enables them to violate the ABI and
+overwrite the cpu_id_start field for their own purposes. This is obviously
+discouraged as it renders RSEQ incompatible with the intended usage and
+breaks the expectation of other libraries in the same application.
+
+The ABI compliant optimized mode, which respects the read only fields, does
+not require unconditional updates and therefore is way more performant. The
+kernel validates the read only fields for compliance. If user space
+modifies them, the process is killed. Compliant usage allows multiple
+libraries in the same application to benefit from the RSEQ functionality
+without disturbing each other.
+
+
 Scheduler time slice extensions
 -------------------------------
 
@@ -37,7 +111,8 @@ scheduled out inside of the critical sec
 
     * Enabled at boot time (default is enabled)
 
-    * A rseq userspace pointer has been registered for the thread
+    * A rseq userspace pointer has been registered for the thread in
+      optimized V2 mode
 
 The thread has to enable the functionality via prctl(2)::
 


^ permalink raw reply	[flat|nested] 41+ messages in thread
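
As an illustration of the registration checks in the patch above, here is a
minimal user-space sketch of the length/alignment validation and of the
legacy versus optimized classification. It is not the kernel code: the
sketch_* names are hypothetical, and SKETCH_ALLOC_ALIGN and
SKETCH_FEATURE_END are assumed stand-ins for rseq_alloc_align() and
offsetof(struct rseq, end), using the 64-byte alignment and 33-byte feature
size mentioned elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size of the original struct rseq, per the documentation hunk (32 bytes). */
#define ORIG_RSEQ_SIZE		32u

/*
 * Assumed stand-ins for rseq_alloc_align() and offsetof(struct rseq, end).
 * The real values depend on the kernel and its configuration.
 */
#define SKETCH_ALLOC_ALIGN	64u
#define SKETCH_FEATURE_END	33u

/* True if addr is a multiple of align; align must be a power of two. */
static bool sketch_is_aligned(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr & (align - 1)) == 0;
}

/* Mirrors the structure of rseq_length_valid() in the patch. */
static bool sketch_length_valid(uintptr_t addr, uint32_t rseq_len)
{
	if (rseq_len < ORIG_RSEQ_SIZE)
		return false;

	/* Original size: the original 32-byte alignment applies. */
	if (rseq_len == ORIG_RSEQ_SIZE)
		return sketch_is_aligned(addr, ORIG_RSEQ_SIZE);

	/* Extended size: stricter alignment, must cover all known fields. */
	return sketch_is_aligned(addr, SKETCH_ALLOC_ALIGN) &&
	       rseq_len >= SKETCH_FEATURE_END;
}

/*
 * Assumption based on the hunk comment: non-legacy registrations are those
 * which register with rseq_len > ORIG_RSEQ_SIZE.
 */
static bool sketch_is_optimized(uint32_t rseq_len)
{
	return rseq_len > ORIG_RSEQ_SIZE;
}
```

Under these assumptions, a 32-byte registration at a 32-byte aligned address
is accepted and treated as legacy, while a 64-byte registration is only
accepted at a 64-byte aligned address and is treated as optimized.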

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-26 22:04                             ` Thomas Gleixner
@ 2026-04-27  7:40                               ` Florian Weimer
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Weimer @ 2026-04-27  7:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Mathias Stearn, Dmitry Vyukov, Jinjie Ruan,
	linux-man, Mark Rutland, Mathieu Desnoyers, Catalin Marinas,
	Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Blake Oler, Rich Felker, Matthew Wilcox, Greg Kroah-Hartman,
	Linus Torvalds, criu

* Thomas Gleixner:

> The real question is how to differentiate between the legacy and the
> optimized mode. I have two working variants to achieve that:
>
>    1) The fully safe option requires a new flag for RSEQ
>       registration. It obviously requires a glibc update. (Suggested by
>       PeterZ)

Without glibc changes, RSEQ would keep working, but with the old,
problematic performance, right?

If we don't have a notification in the auxiliary vector, we'd have to do
two system calls at process start, which isn't ideal, but is probably
not a significant issue, either.

I haven't verified this, but it looks like introducing the flag breaks
CRIU?  In dump_thread_rseq, we have this:

        if (rseqc.flags != 0) {
                pr_err("something wrong with ptrace(PTRACE_GET_RSEQ_CONFIGURATION, %d) flags = 0x%x\n", tid,
                       rseqc.flags);
                return -1;
        }

I suppose a workaround could make this behavior flag a prctl flag.  CRIU
wouldn't dump and restore that until taught about it.  If the new
behavior is switched on explicitly by the flag, it would be
backwards-compatible, except that restoring with unpatched CRIU would
lead to a performance loss.

>    2) Determine the requirements of the registering task via the size of
>       the registered RSEQ area.
>
>       The original implementation, which TCMalloc depends on, registers
>       a 32 byte region (ORIG_RSEQ_SIZE). This region has 32 byte
>       alignment requirement.
>
>       The extension safe newer variant exposes the kernel RSEQ feature
>       size via getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment
>       requirement via getauxval(AT_RSEQ_ALIGN). The alignment
>       requirement is that the registered rseq region is aligned to the
>       next power of two of the feature size. The kernel currently has a
>       feature size of 33 bytes, which means the alignment requirement is
>       64 bytes.

There are still glibc builds in use that do not use AT_RSEQ_ALIGN, and
instead unconditionally reserve a size of 32.  In some builds, the RSEQ
area is not aligned to a multiple of 64, which makes glibc
indistinguishable from tcmalloc.  You could look at the location of the
thread pointer relative to the RSEQ area at registration to tell them
apart, but that is perhaps too nasty.
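
To make the alignment arithmetic from the quoted option 2 concrete, here is
a small sketch that rounds a feature size up to the next power of two. The
helper name sketch_roundup_pow2() is hypothetical; in a real program the
inputs would come from getauxval(AT_RSEQ_FEATURE_SIZE) and the result could
be cross-checked against getauxval(AT_RSEQ_ALIGN).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round n up to the next power of two. This is the rule the quoted text
 * gives for the rseq area alignment relative to the feature size.
 */
static uint32_t sketch_roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	/* Keep the shift below from overflowing. */
	assert(n != 0 && n <= (UINT32_C(1) << 31));

	while (p < n)
		p <<= 1;
	return p;
}
```

With the 33-byte feature size quoted above this yields 64, whereas a 32-byte
legacy area stays at 32. A glibc build that always reserves 32 bytes
therefore presents the same size and alignment as a tcmalloc registration.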

Switching to the new extensible RSEQ allocation code in older glibc
builds is not entirely trivial, and I would prefer not doing that.
Registering with a new flag is comparatively simple, and we could
backport it, except that it might not be compatible with CRIU.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23  5:53           ` Dmitry Vyukov
  2026-04-23 10:39             ` Thomas Gleixner
@ 2026-04-23 12:11             ` Alejandro Colomar
  2026-04-23 12:54               ` Mathieu Desnoyers
  2026-04-23 12:29             ` Mathieu Desnoyers
  2 siblings, 1 reply; 41+ messages in thread
From: Alejandro Colomar @ 2026-04-23 12:11 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland,
	Mathias Stearn, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney, Chris Kennelly, regressions,
	linux-kernel, linux-arm-kernel, Peter Zijlstra, Ingo Molnar,
	Blake Oler, Michael Jeanson

Hello Dmitry,

On 2026-04-23T07:53:55+0200, Dmitry Vyukov wrote:
> On Thu, 23 Apr 2026 at 03:48, Jinjie Ruan <ruanjinjie@huawei.com> wrote:
> >
> > On 4/23/2026 3:47 AM, Thomas Gleixner wrote:
> > > On Wed, Apr 22 2026 at 19:11, Mark Rutland wrote:
> > >> On Wed, Apr 22, 2026 at 07:49:30PM +0200, Thomas Gleixner wrote:
> > >> Conceptually we just need to use syscall_enter_from_user_mode() and
> > >> irqentry_enter_from_user_mode() appropriately.
> > >
> > > Right. I figured that out.
> > >
> > >> In practice, I can't use those as-is without introducing the exception
> > >> masking problems I just fixed up for irqentry_enter_from_kernel_mode(),
> > >> so I'll need to do some similar refactoring first.
> > >
> > > See below.
> > >
> > >> I haven't paged everything in yet, so just to check, is there anything
> > >> that would behave incorrectly if current->rseq.event.user_irq were set
> > >> for syscall entry? IIUC it means we'll effectively do the slow path, and
> > >> I was wondering if that might be acceptable as a one-line bodge for
> > >> stable.
> > >
> > > It might work, but it's trivial enough to avoid that. See below. That on
> > > top of 6.19.y makes the selftests pass too.
> >
> > This aligns with my thoughts when converting arm64 to generic syscall
> > entry. Currently, the arm64 entry code does not distinguish between IRQ
> > and syscall entries. It fails to call rseq_note_user_irq_entry() for IRQ
> > entries as the generic entry framework does, because arm64 uses
> > enter_from_user_mode() exclusively instead of
> > irqentry_enter_from_user_mode().
> >
> > https://lore.kernel.org/all/20260320102620.1336796-10-ruanjinjie@huawei.com/
> >
> > >
> > > Thanks,
> > >
> > >         tglx
> > > ---
> > >  arch/arm64/kernel/entry-common.c |   14 ++++++++++----
> > >  1 file changed, 10 insertions(+), 4 deletions(-)
> > >
> > > --- a/arch/arm64/kernel/entry-common.c
> > > +++ b/arch/arm64/kernel/entry-common.c
> > > @@ -58,6 +58,12 @@ static void noinstr exit_to_kernel_mode(
> > >       irqentry_exit(regs, state);
> > >  }
> > >
> > > +static __always_inline void arm64_enter_from_user_mode_syscall(struct pt_regs *regs)
> > > +{
> > > +     enter_from_user_mode(regs);
> > > +     mte_disable_tco_entry(current);
> > > +}
> > > +
> > >  /*
> > >   * Handle IRQ/context state management when entering from user mode.
> > >   * Before this function is called it is not safe to call regular kernel code,
> > > @@ -65,8 +71,8 @@ static void noinstr exit_to_kernel_mode(
> > >   */
> > >  static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
> > >  {
> > > -     enter_from_user_mode(regs);
> > > -     mte_disable_tco_entry(current);
> > > +     arm64_enter_from_user_mode_syscall(regs);
> > > +     rseq_note_user_irq_entry();
> > >  }
> > >
> > >  /*
> > > @@ -717,7 +723,7 @@ static void noinstr el0_brk64(struct pt_
> > >
> > >  static void noinstr el0_svc(struct pt_regs *regs)
> > >  {
> > > -     arm64_enter_from_user_mode(regs);
> > > +     arm64_enter_from_user_mode_syscall(regs);
> > >       cortex_a76_erratum_1463225_svc_handler();
> > >       fpsimd_syscall_enter();
> > >       local_daif_restore(DAIF_PROCCTX);
> > > @@ -869,7 +875,7 @@ static void noinstr el0_cp15(struct pt_r
> > >
> > >  static void noinstr el0_svc_compat(struct pt_regs *regs)
> > >  {
> > > -     arm64_enter_from_user_mode(regs);
> > > +     arm64_enter_from_user_mode_syscall(regs);
> > >       cortex_a76_erratum_1463225_svc_handler();
> > >       local_daif_restore(DAIF_PROCCTX);
> > >       do_el0_svc_compat(regs);
> 
> 
> +linux-man
> 
> This part of the rseq man page needs to be fixed as well I think. The
> kernel no longer reliably provides clearing of rseq_cs on preemption,
> right?
> 
> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241

+Michael Jeanson

That page seems to be maintained separately, as part of the librseq
project.


Have a lovely day!
Alex

> 
> "and set to NULL by the kernel when it restarts an assembly
> instruction sequence block,
> as well as when the kernel detects that it is preempting or delivering
> a signal outside of the range targeted by the rseq_cs."
> 

-- 
<https://www.alejandro-colomar.es>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 12:11             ` Alejandro Colomar
@ 2026-04-23 12:54               ` Mathieu Desnoyers
  0 siblings, 0 replies; 41+ messages in thread
From: Mathieu Desnoyers @ 2026-04-23 12:54 UTC (permalink / raw)
  To: Alejandro Colomar, Dmitry Vyukov
  Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland,
	Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler,
	Michael Jeanson

On 2026-04-23 08:11, Alejandro Colomar wrote:
[...]
>>
>> +linux-man
>>
>> This part of the rseq man page needs to be fixed as well I think. The
>> kernel no longer reliably provides clearing of rseq_cs on preemption,
>> right?
>>
>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
> 
> +Michael Jeanson
> 
> That page seems to be maintained separately, as part of the librseq
> project.

Yes, I maintain the librseq project, thanks Alejandro!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23  5:53           ` Dmitry Vyukov
  2026-04-23 10:39             ` Thomas Gleixner
  2026-04-23 12:11             ` Alejandro Colomar
@ 2026-04-23 12:29             ` Mathieu Desnoyers
  2026-04-23 12:36               ` Dmitry Vyukov
  2 siblings, 1 reply; 41+ messages in thread
From: Mathieu Desnoyers @ 2026-04-23 12:29 UTC (permalink / raw)
  To: Dmitry Vyukov, Jinjie Ruan, linux-man
  Cc: Thomas Gleixner, Mark Rutland, Mathias Stearn, Catalin Marinas,
	Will Deacon, Boqun Feng, Paul E. McKenney, Chris Kennelly,
	regressions, linux-kernel, linux-arm-kernel, Peter Zijlstra,
	Ingo Molnar, Blake Oler

On 2026-04-23 01:53, Dmitry Vyukov wrote:
[...]
> +linux-man
> 
> This part of the rseq man page needs to be fixed as well I think. The
> kernel no longer reliably provides clearing of rseq_cs on preemption,
> right?
> 
> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241

I'm maintaining this manual page in librseq.

> 
> "and set to NULL by the kernel when it restarts an assembly
> instruction sequence block,
> as well as when the kernel detects that it is preempting or delivering
> a signal outside of the range targeted by the rseq_cs."

I think you got two things confused here.

1) There is currently a bug on arm64 where it fails to honor the
    rseq ABI contract wrt critical section abort. AFAIU there is a
    fix proposed for this.

2) Thomas relaxed the implementation of cpu_id_start field updates
    so it only stores to the rseq area when the current cpu actually
    changes (migration).

So AFAIU the statement in the man page is still fine. It's just arm64
that needs fixing.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 12:29             ` Mathieu Desnoyers
@ 2026-04-23 12:36               ` Dmitry Vyukov
  2026-04-23 12:53                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 41+ messages in thread
From: Dmitry Vyukov @ 2026-04-23 12:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland,
	Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler

On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-04-23 01:53, Dmitry Vyukov wrote:
> [...]
> > +linux-man
> >
> > This part of the rseq man page needs to be fixed as well I think. The
> > kernel no longer reliably provides clearing of rseq_cs on preemption,
> > right?
> >
> > https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
>
> I'm maintaining this manual page in librseq.
>
> >
> > "and set to NULL by the kernel when it restarts an assembly
> > instruction sequence block,
> > as well as when the kernel detects that it is preempting or delivering
> > a signal outside of the range targeted by the rseq_cs."
>
> I think you got two things confused here.
>
> 1) There is currently a bug on arm64 where it fails to honor the
>     rseq ABI contract wrt critical section abort. AFAIU there is a
>     fix proposed for this.
>
> 2) Thomas relaxed the implementation of cpu_id_start field updates
>     so it only stores to the rseq area when the current cpu actually
>     changes (migration).
>
> So AFAIU the statement in the man page is still fine. It's just arm64
> that needs fixing.


My understanding was that due to the ev->user_irq check here:

+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
...
+               bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+               if (raise) {
+                       ev->sched_switch = true;
+                       rseq_raise_notify_resume(t);
+               }

There won't be any rseq-related processing for threads preempted in
syscalls, which means that rseq_cs won't be NULLed for threads
preempted inside of syscalls.
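
The condition can be modelled in isolation. The following is only a sketch:
struct sketch_rseq_event and sketch_sched_switch_event() are hypothetical
stand-ins that copy the quoted expression, and the call to
rseq_raise_notify_resume() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the event bits used in the quoted snippet. */
struct sketch_rseq_event {
	bool user_irq;		/* user space was interrupted */
	bool ids_changed;	/* CPU or mm_cid changed */
	bool has_rseq;		/* thread has a registered rseq area */
	bool sched_switch;	/* set when rseq handling is requested */
};

/* Returns true if rseq handling would be raised for this switch. */
static bool sketch_sched_switch_event(struct sketch_rseq_event *ev)
{
	bool raise;

	assert(ev);
	raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
	if (raise)
		ev->sched_switch = true;
	return raise;
}
```

In this model, a thread preempted inside a syscall without migrating has
user_irq and ids_changed both clear, so nothing is raised even though
has_rseq is set.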


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 12:36               ` Dmitry Vyukov
@ 2026-04-23 12:53                 ` Mathieu Desnoyers
  2026-04-23 12:58                   ` Dmitry Vyukov
  0 siblings, 1 reply; 41+ messages in thread
From: Mathieu Desnoyers @ 2026-04-23 12:53 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland,
	Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler,
	Michael Jeanson

On 2026-04-23 08:36, Dmitry Vyukov wrote:
> On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2026-04-23 01:53, Dmitry Vyukov wrote:
>> [...]
>>> +linux-man
>>>
>>> This part of the rseq man page needs to be fixed as well I think. The
>>> kernel no longer reliably provides clearing of rseq_cs on preemption,
>>> right?
>>>
>>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
>>
>> I'm maintaining this manual page in librseq.
>>
>>>
>>> "and set to NULL by the kernel when it restarts an assembly
>>> instruction sequence block,
>>> as well as when the kernel detects that it is preempting or delivering
>>> a signal outside of the range targeted by the rseq_cs."
>>
>> I think you got two things confused here.
>>
>> 1) There is currently a bug on arm64 where it fails to honor the
>>      rseq ABI contract wrt critical section abort. AFAIU there is a
>>      fix proposed for this.
>>
>> 2) Thomas relaxed the implementation of cpu_id_start field updates
>>      so it only stores to the rseq area when the current cpu actually
>>      changes (migration).
>>
>> So AFAIU the statement in the man page is still fine. It's just arm64
>> that needs fixing.
> 
> 
> My understanding was that due to the ev->user_irq check here:
> 
> +static __always_inline void rseq_sched_switch_event(struct task_struct *t)
> ...
> +               bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
> +
> +               if (raise) {
> +                       ev->sched_switch = true;
> +                       rseq_raise_notify_resume(t);
> +               }
> 
> There won't be any rseq-related processing for threads preempted in
> syscalls, which means that rseq_cs won't be NULLed for threads
> preempted inside of syscalls.

Let's see if I understand your concern correctly. Scenario:

A thread is within a rseq critical section. It exits the critical
section without clearing the rseq_cs pointer, expecting the kernel
to lazily clear the rseq_cs pointer eventually when it detects that
it's not nested on top of the userspace critical section anymore.
It then calls a system call _outside_ of the rseq critical section,
but with rseq_cs pointer set. Based on the rseq man page wording,
it would then expect the preemption within the system call to guarantee
clearing that pointer.

Here is the relevant comment block in the man page:

                      Updated by user-space, which sets the address of  the  cur‐
                      rently active rseq_cs at the beginning of assembly instruc‐
                      tion sequence block, and set to NULL by the kernel when  it
                      restarts an assembly instruction sequence block, as well as
>>>>>>>>>
                      when the kernel detects that it is preempting or delivering
                      a  signal  outside  of  the  range targeted by the rseq_cs.
>>>>>>>>>
                           ^^^ this

The whole point about lazy-clearing of rseq_cs is that it _may_ happen when
the kernel preempts or delivers a signal (or at any point really), but it's
just an optimization.
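
Since the clearing is lazy, user space that needs rseq_cs to be NULL at a
particular point (for example before freeing a descriptor) would have to
store NULL itself. A minimal sketch of that idea, using hypothetical mock
types rather than the real struct rseq and struct rseq_cs, and ignoring the
volatile/atomic access a real implementation would need:

```c
#include <assert.h>
#include <stdint.h>

/* Mock of the per-thread rseq area: only the rseq_cs pointer field. */
struct sketch_rseq_area {
	uint64_t rseq_cs;
};

/* Mock critical section descriptor; the real layout has more fields. */
struct sketch_cs_desc {
	uint64_t start_ip;
	uint64_t post_commit_offset;
	uint64_t abort_ip;
};

/*
 * Clear rseq_cs explicitly if it still points at desc, instead of relying
 * on the kernel having cleared it lazily. Call before releasing desc.
 */
static void sketch_release_descriptor(struct sketch_rseq_area *area,
				      const struct sketch_cs_desc *desc)
{
	assert(area && desc);

	if (area->rseq_cs == (uint64_t)(uintptr_t)desc)
		area->rseq_cs = 0;
}
```

The pointer is left alone when it refers to a different descriptor, so the
helper is safe to call unconditionally on the release path.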

Updating the manual page with this wording would match the intent:

                      Updated by user-space, which sets the address of  the  cur‐
                      rently active rseq_cs at the beginning of assembly instruc‐
                      tion sequence block, and set to NULL by the kernel when  it
                      restarts an assembly instruction sequence block. May be set
                      to NULL by the kernel when it detects that the current
                      instruction pointer is outside of the range targeted by
                      the rseq_cs.
                      Also needs to be set to NULL by user-space before  reclaim‐
                      ing memory that contains the targeted struct rseq_cs.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
  2026-04-23 12:53                 ` Mathieu Desnoyers
@ 2026-04-23 12:58                   ` Dmitry Vyukov
  0 siblings, 0 replies; 41+ messages in thread
From: Dmitry Vyukov @ 2026-04-23 12:58 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jinjie Ruan, linux-man, Thomas Gleixner, Mark Rutland,
	Mathias Stearn, Catalin Marinas, Will Deacon, Boqun Feng,
	Paul E. McKenney, Chris Kennelly, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Blake Oler,
	Michael Jeanson

On Thu, 23 Apr 2026 at 14:53, Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2026-04-23 08:36, Dmitry Vyukov wrote:
> > On Thu, 23 Apr 2026 at 14:29, Mathieu Desnoyers
> > <mathieu.desnoyers@efficios.com> wrote:
> >>
> >> On 2026-04-23 01:53, Dmitry Vyukov wrote:
> >> [...]
> >>> +linux-man
> >>>
> >>> This part of the rseq man page needs to be fixed as well I think. The
> >>> kernel no longer reliably provides clearing of rseq_cs on preemption,
> >>> right?
> >>>
> >>> https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2#n241
> >>
> >> I'm maintaining this manual page in librseq.
> >>
> >>>
> >>> "and set to NULL by the kernel when it restarts an assembly
> >>> instruction sequence block,
> >>> as well as when the kernel detects that it is preempting or delivering
> >>> a signal outside of the range targeted by the rseq_cs."
> >>
> >> I think you got two things confused here.
> >>
> >> 1) There is currently a bug on arm64 where it fails to honor the
> >>      rseq ABI contract wrt critical section abort. AFAIU there is a
> >>      fix proposed for this.
> >>
> >> 2) Thomas relaxed the implementation of cpu_id_start field updates
> >>      so it only stores to the rseq area when the current cpu actually
> >>      changes (migration).
> >>
> >> So AFAIU the statement in the man page is still fine. It's just arm64
> >> that needs fixing.
> >
> >
> > My understanding was that due to the ev->user_irq check here:
> >
> > +static __always_inline void rseq_sched_switch_event(struct task_struct *t)
> > ...
> > +               bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
> > +
> > +               if (raise) {
> > +                       ev->sched_switch = true;
> > +                       rseq_raise_notify_resume(t);
> > +               }
> >
> > There won't be any rseq-related processing for threads preempted in
> > syscalls, which means that rseq_cs won't be NULLed for threads
> > preempted inside of syscalls.
>
> Let's see if I understand your concern correctly. Scenario:
>
> A thread is within a rseq critical section. It exits the critical
> section without clearing the rseq_cs pointer, expecting the kernel
> to lazily clear the rseq_cs pointer eventually when it detects that
> it's not nested on top of the userspace critical section anymore.
> It then calls a system call _outside_ of the rseq critical section,
> but with rseq_cs pointer set. Based on the rseq man page wording,
> it would then expect the preemption within the system call to guarantee
> clearing that pointer.

Yes, this is the scenario I had in mind.

> Here is the relevant comment block in the man page:
>
>                       Updated by user-space, which sets the address of  the  cur‐
>                       rently active rseq_cs at the beginning of assembly instruc‐
>                       tion sequence block, and set to NULL by the kernel when  it
>                       restarts an assembly instruction sequence block, as well as
> >>>>>>>>>
>                       when the kernel detects that it is preempting or delivering
>                       a  signal  outside  of  the  range targeted by the rseq_cs.
> >>>>>>>>>
>                            ^^^ this
>
> The whole point about lazy-clearing of rseq_cs is that it _may_ happen when
> the kernel preempts or delivers a signal (or at any point really), but it's
> just an optimization.
>
> Updating the manual page with this wording would match the intent:
>
>                       Updated by user-space, which sets the address of  the  cur‐
>                       rently active rseq_cs at the beginning of assembly instruc‐
>                       tion sequence block, and set to NULL by the kernel when  it
>                       restarts an assembly instruction sequence block. May be set
>                       to NULL by the kernel when it detects that the current
>                       instruction pointer is outside of the range targeted by
>                       the rseq_cs.
>                       Also needs to be set to NULL by user-space before  reclaim‐
>                       ing memory that contains the targeted struct rseq_cs.
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
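
The proposed man page wording quoted above separates three cases: the kernel always clears rseq_cs when it restarts a critical section, it may clear it when the instruction pointer is outside the targeted range, and user-space must clear it before reclaiming the memory that holds the struct rseq_cs. A minimal user-space model of those rules is sketched below. All toy_* names are hypothetical and exist only for this illustration; the structures are simplified stand-ins and not the uapi definitions, and nothing here is the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a critical-section descriptor. */
struct toy_rseq_cs {
	uintptr_t start_ip;		/* first instruction of the section */
	uintptr_t post_commit_offset;	/* length of the section in bytes */
	uintptr_t abort_ip;		/* where to resume after an abort */
};

/* Hypothetical stand-in for the per-thread area shared with the kernel. */
struct toy_rseq {
	struct toy_rseq_cs *rseq_cs;	/* set by user-space, cleared as below */
};

/* True if ip lies in [start_ip, start_ip + post_commit_offset). */
static inline bool toy_ip_in_cs(const struct toy_rseq_cs *cs, uintptr_t ip)
{
	/* Unsigned wrap-around makes ip < start_ip compare as "outside". */
	return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Model of preemption or signal delivery at instruction pointer 'ip'.
 * Returns the ip at which user-space resumes. 'lazy_clear' models the
 * optional optimisation: user-space must not rely on it either way.
 */
static inline uintptr_t toy_kernel_preempt(struct toy_rseq *rs, uintptr_t ip,
					   bool lazy_clear)
{
	struct toy_rseq_cs *cs = rs->rseq_cs;

	if (!cs)
		return ip;

	if (toy_ip_in_cs(cs, ip)) {
		rs->rseq_cs = NULL;	/* guaranteed: a restart clears it */
		return cs->abort_ip;
	}

	if (lazy_clear)
		rs->rseq_cs = NULL;	/* permitted, but not guaranteed */

	return ip;
}

/* User-space side: clear the pointer before reclaiming the descriptor. */
static inline void toy_release_cs(struct toy_rseq *rs, struct toy_rseq_cs *cs)
{
	assert(rs != NULL);

	if (rs->rseq_cs == cs)
		rs->rseq_cs = NULL;
	free(cs);
}
```

In this model, a caller that leaves rseq_cs set, then runs outside the section and expects a preemption to clear the pointer, works only when lazy_clear happens to be true. That is the dependency the revised wording is meant to rule out.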



* [PATCH] arm64/entry: Fix arm64-specific rseq brokenness (was: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64) and tcmalloc everywhere
       [not found] <CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com>
  2026-04-22 12:56 ` [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Peter Zijlstra
  2026-04-22 13:09 ` Mark Rutland
@ 2026-04-24 16:45 ` Mark Rutland
  2 siblings, 0 replies; 41+ messages in thread
From: Mark Rutland @ 2026-04-24 16:45 UTC (permalink / raw)
  To: Mathias Stearn, Linus Torvalds, Catalin Marinas, Will Deacon,
	Thomas Gleixner, Mathieu Desnoyers, Peter Zijlstra
  Cc: Boqun Feng, Paul E. McKenney, Chris Kennelly, Dmitry Vyukov,
	regressions, linux-kernel, linux-arm-kernel, Ingo Molnar,
	Jinjie Ruan, Blake Oler

Patch for the arm64-specific issue below. This doesn't fix the generic
cpu_id_start issue, but it brings arm64 into line with everyone else,
and it's the shape we'll need going forwards for other stuff anyway.

I've given it light testing with Mathias's reproducer and the
kselftests, which all pass.

I've also pushed it to my arm64/rseq branch:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/rseq

Mark.

---->8----
From 79b65cbbfa20aa2cb0bc248591fab5459cdc101b Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Thu, 23 Apr 2026 16:51:12 +0100
Subject: [PATCH] arm64/entry: Fix arm64-specific rseq brokenness

Mathias Stearn reports that since v6.19, there are two big issues
affecting rseq:

(1) On arm64 specifically, rseq critical sections aren't aborted when
    they should be.

(2) The 'cpu_id_start' field is no longer written by the kernel in all
    cases it used to be, including some cases where TCMalloc depends on
    the kernel clobbering the field.

This patch fixes issue #1. This patch DOES NOT fix issue #2, which will
need to be addressed by other patches.

The arm64-specific brokenness is a result of commits:

  2fc0e4b4126c ("rseq: Record interrupt from user space")
  39a167560a61 ("rseq: Optimize event setting")

The first commit failed to add a call to rseq_note_user_irq_entry() on
arm64. Thus arm64 never sets rseq_event::user_irq to record that it may
be necessary to abort an active rseq critical section upon return to
userspace. On its own, this commit had no functional impact as the value
of rseq_event::user_irq was not consumed.

The second commit relied upon rseq_event::user_irq to determine whether
or not to bother to perform rseq work when returning to userspace. As
rseq_event::user_irq wasn't set on arm64, this work would be skipped,
and consequently an active rseq critical section would not be aborted.

Fix this by giving arm64 syscall-specific entry/exit paths, and
performing the relevant logic in syscall and non-syscall paths,
including calling rseq_note_user_irq_entry() for non-syscall entry.

Currently arm64 cannot use syscall_enter_from_user_mode(),
syscall_exit_to_user_mode(), and irqentry_exit_to_user_mode(), due to
ordering constraints with exception masking, and risk of ABI breakage
for syscall tracing/audit/etc. For the moment the entry/exit logic is
left as arm64-specific, but mirroring the generic code.

I intend to follow up with refactoring/cleanup, as we did for kernel
mode entry paths in commit:

  041aa7a85390 ("entry: Split preemption from irqentry_exit_to_kernel_mode()")

... which will allow arm64 to use the GENERIC_IRQ_ENTRY functions directly.

Fixes: 39a167560a61 ("rseq: Optimize event setting")
Reported-by: Mathias Stearn <mathias@mongodb.com>
Link: https://lore.kernel.org/regressions/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com/
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/entry-common.c | 29 ++++++++++++++++++++++-------
 include/linux/irq-entry-common.h |  8 --------
 include/linux/rseq_entry.h       | 19 -------------------
 3 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index cb54335465f66..65ade1f1544f6 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -62,6 +62,12 @@ static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs,
 	irqentry_exit_to_kernel_mode_after_preempt(regs, state);
 }
 
+static __always_inline void arm64_syscall_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+	mte_disable_tco_entry(current);
+}
+
 /*
  * Handle IRQ/context state management when entering from user mode.
  * Before this function is called it is not safe to call regular kernel code,
@@ -70,20 +76,29 @@ static void noinstr arm64_exit_to_kernel_mode(struct pt_regs *regs,
 static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
 {
 	enter_from_user_mode(regs);
+	rseq_note_user_irq_entry();
 	mte_disable_tco_entry(current);
 	sme_enter_from_user_mode();
 }
 
+static __always_inline void arm64_syscall_exit_to_user_mode(struct pt_regs *regs)
+{
+	local_irq_disable();
+	syscall_exit_to_user_mode_prepare(regs);
+	local_daif_mask();
+	mte_check_tfsr_exit();
+	exit_to_user_mode();
+}
+
 /*
  * Handle IRQ/context state management when exiting to user mode.
  * After this function returns it is not safe to call regular kernel code,
  * instrumentable code, or any code which may trigger an exception.
  */
-
 static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
 {
 	local_irq_disable();
-	exit_to_user_mode_prepare_legacy(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
 	local_daif_mask();
 	sme_exit_to_user_mode();
 	mte_check_tfsr_exit();
@@ -92,7 +107,7 @@ static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
 
 asmlinkage void noinstr asm_exit_to_user_mode(struct pt_regs *regs)
 {
-	arm64_exit_to_user_mode(regs);
+	arm64_syscall_exit_to_user_mode(regs);
 }
 
 /*
@@ -716,12 +731,12 @@ static void noinstr el0_brk64(struct pt_regs *regs, unsigned long esr)
 
 static void noinstr el0_svc(struct pt_regs *regs)
 {
-	arm64_enter_from_user_mode(regs);
+	arm64_syscall_enter_from_user_mode(regs);
 	cortex_a76_erratum_1463225_svc_handler();
 	fpsimd_syscall_enter();
 	local_daif_restore(DAIF_PROCCTX);
 	do_el0_svc(regs);
-	arm64_exit_to_user_mode(regs);
+	arm64_syscall_exit_to_user_mode(regs);
 	fpsimd_syscall_exit();
 }
 
@@ -868,11 +883,11 @@ static void noinstr el0_cp15(struct pt_regs *regs, unsigned long esr)
 
 static void noinstr el0_svc_compat(struct pt_regs *regs)
 {
-	arm64_enter_from_user_mode(regs);
+	arm64_syscall_enter_from_user_mode(regs);
 	cortex_a76_erratum_1463225_svc_handler();
 	local_daif_restore(DAIF_PROCCTX);
 	do_el0_svc_compat(regs);
-	arm64_exit_to_user_mode(regs);
+	arm64_syscall_exit_to_user_mode(regs);
 }
 
 static void noinstr el0_bkpt32(struct pt_regs *regs, unsigned long esr)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 167fba7dbf043..1fabf0f5ea8e7 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -218,14 +218,6 @@ static __always_inline void __exit_to_user_mode_validate(void)
 	lockdep_sys_exit();
 }
 
-/* Temporary workaround to keep ARM64 alive */
-static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
-{
-	__exit_to_user_mode_prepare(regs, EXIT_TO_USER_MODE_WORK);
-	rseq_exit_to_user_mode_legacy();
-	__exit_to_user_mode_validate();
-}
-
 /**
  * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index f11ebd34f8b95..a3762410c4ab6 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -753,24 +753,6 @@ static __always_inline void rseq_irqentry_exit_to_user_mode(void)
 	ev->events = 0;
 }
 
-/* Required to keep ARM64 working */
-static __always_inline void rseq_exit_to_user_mode_legacy(void)
-{
-	struct rseq_event *ev = &current->rseq.event;
-
-	rseq_stat_inc(rseq_stats.exit);
-
-	if (static_branch_unlikely(&rseq_debug_enabled))
-		WARN_ON_ONCE(ev->sched_switch);
-
-	/*
-	 * Ensure that event (especially user_irq) is cleared when the
-	 * interrupt did not result in a schedule and therefore the
-	 * rseq processing did not clear it.
-	 */
-	ev->events = 0;
-}
-
 void __rseq_debug_syscall_return(struct pt_regs *regs);
 
 static __always_inline void rseq_debug_syscall_return(struct pt_regs *regs)
@@ -786,7 +768,6 @@ static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned
 }
 static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
-static inline void rseq_exit_to_user_mode_legacy(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 static inline bool rseq_grant_slice_extension(unsigned long ti_work, unsigned long mask) { return false; }
 #endif /* !CONFIG_RSEQ */
-- 
2.30.2
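
The interaction between the two commits described in the commit message above can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions: the toy_* names are hypothetical, the event structure keeps just two flags, and the exit path is reduced to the single check that matters here. It is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the per-task rseq event state. */
struct toy_rseq_event {
	bool sched_switch;	/* task was scheduled out while in the kernel */
	bool user_irq;		/* kernel was entered from user-space by a
				 * non-syscall exception */
};

struct toy_task {
	struct toy_rseq_event ev;
	bool in_critical_section;	/* user ip was inside an rseq section */
	bool aborted;			/* section was aborted on return */
};

/*
 * Non-syscall entry from user-space. 'note_irq' models whether the
 * architecture calls rseq_note_user_irq_entry() on this path: false
 * for arm64 before the patch, true after it.
 */
static inline void toy_enter_from_user(struct toy_task *t, bool note_irq)
{
	if (note_irq)
		t->ev.user_irq = true;
}

/* The task is preempted while in the kernel. */
static inline void toy_schedule(struct toy_task *t)
{
	t->ev.sched_switch = true;
}

/*
 * Return to user-space. The critical section is only inspected when
 * user_irq is set, which models the optimisation the second commit
 * introduced. All events are cleared on the way out.
 */
static inline void toy_exit_to_user(struct toy_task *t)
{
	if (t->ev.sched_switch && t->ev.user_irq && t->in_critical_section)
		t->aborted = true;

	t->ev.sched_switch = false;
	t->ev.user_irq = false;
}
```

Running the sequence enter, schedule, exit with note_irq set to false leaves the section unaborted even though the task was preempted inside it. With note_irq set to true the same sequence aborts the section, which is the behaviour the patch restores on arm64.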




