public inbox for linux-arm-kernel@lists.infradead.org
From: Thomas Gleixner <tglx@linutronix.de>
To: Mathias Stearn <mathias@mongodb.com>
Cc: Dmitry Vyukov <dvyukov@google.com>,
	Jinjie Ruan <ruanjinjie@huawei.com>,
	linux-man@vger.kernel.org, Mark Rutland <mark.rutland@arm.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>, Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Chris Kennelly <ckennelly@google.com>,
	regressions@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Blake Oler <blake.oler@mongodb.com>
Subject: Re: [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
Date: Thu, 23 Apr 2026 21:31:30 +0200
Message-ID: <87a4ut1njh.ffs@tglx>
In-Reply-To: <CAHnCjA0UBNXfjHw=Y34OrAyGRNUtVF+zWd3ugyX6pd_mCk8K9w@mail.gmail.com>

On Thu, Apr 23 2026 at 12:51, Mathias Stearn wrote:
> On Thu, Apr 23, 2026 at 12:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> The kernel clears rseq_cs reliably when user space was interrupted and:
>>
>>     the task was preempted
>> or
>>     the return from interrupt delivers a signal
>>
>> If the task invoked a syscall then there is absolutely no reason to do
>> either of these because syscalls from within a critical section are a
>> bug and are caught when rseq debugging is enabled.
>>
>> The original code did this along with unconditionally updating CPU/MMCID
>> which resulted in ~15% performance regression on a syscall heavy
>> database benchmark once glibc started to register rseq.
>
> Just to be clear TCMalloc does not need either rseq_cs to be cleared
> or cpu_id_start to be written to on syscalls because it doesn't do
> syscalls from critical sections. It will actually benefit (slightly)
> from not updating cpu_id_start on syscalls.

I know that it does not do syscalls from within critical sections, but
it relies on cpu_id_start being unconditionally updated in one way or
the other.

> It is specifically in the cases where an rseq would need to be aborted
> (preemption, signals, migration, and membarrier IPI with the rseq
> flag) that TCMalloc relies on cpu_id_start being written. It does rely
> on that write even when not inside the critical section, because it
> effectively uses that to detect if there were any would-cause-abort
> events in between two critical sections. But since it leaves the
> rseq_cs pointer non-null between critical sections, you don't need
> to add _any_ overhead for programs that never make use of rseq after
> registration, or add any overhead to syscalls even for those that do.

Well. According to the comment in the tcmalloc code:

// Calculation of the address of the current CPU slabs region is needed for
// allocation/deallocation fast paths, but is quite expensive. Due to variable
// shift and experimental support for "virtual CPUs", the calculation involves
// several additional loads and dependent calculations. Pseudo-code for the
// address calculation is as follows:
//
//   cpu_offset = TcmallocSlab.virtual_cpu_id_offset_;
//   cpu = *(&__rseq_abi + virtual_cpu_id_offset_);
//   slabs_and_shift = TcmallocSlab.slabs_and_shift_;
//   shift = slabs_and_shift & kShiftMask;
//   shifted_cpu = cpu << shift;
//   slabs = slabs_and_shift & kSlabsMask;
//   slabs += shifted_cpu;
//
// To remove this calculation from fast paths, we cache the slabs address
// for the current CPU in thread local storage. However, when a thread is
// rescheduled to another CPU, we somehow need to understand that the cached

                  ^^^^^^^^^^^

// address is not valid anymore. To achieve this, we overlap the top 4 bytes
// of the cached address with __rseq_abi.cpu_id_start. When a thread is
// rescheduled the kernel overwrites cpu_id_start with the current CPU number,
// which gives us the signal that the cached address is not valid anymore.
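
To make the quoted comment concrete, here is a minimal user-space sketch of
the address calculation and of the overlap trick. It is only a model under
stated assumptions: SHIFT_MASK, SLABS_MASK and CACHED_BIT are hypothetical
values chosen for illustration (the comment does not give them), and
kernel_writes_cpu_id_start() merely simulates a 32-bit store into the upper
half of the cached value. It is not tcmalloc's code.

```c
#include <stdint.h>

/* Hypothetical layout: low byte holds the shift, the rest the slabs base. */
#define SHIFT_MASK 0xffULL
#define SLABS_MASK (~SHIFT_MASK)
/* Hypothetical marker bit: set while the cached address is considered valid. */
#define CACHED_BIT (1ULL << 63)

/* Follows the pseudo-code in the comment: slabs base plus (cpu << shift). */
static uint64_t slab_addr(uint64_t slabs_and_shift, uint32_t cpu)
{
	uint64_t shift = slabs_and_shift & SHIFT_MASK;
	uint64_t slabs = slabs_and_shift & SLABS_MASK;

	return slabs + ((uint64_t)cpu << shift);
}

/*
 * Simulates the kernel storing a CPU number into the 32 bits that overlap
 * the top half of the cached value. Bit arithmetic is used instead of a
 * union so that the model does not depend on host endianness.
 */
static uint64_t kernel_writes_cpu_id_start(uint64_t cached, uint32_t cpu)
{
	return ((uint64_t)cpu << 32) | (cached & 0xffffffffULL);
}

/* The cache is valid only while the marker bit has survived. */
static int cache_is_valid(uint64_t cached)
{
	return (cached & CACHED_BIT) != 0;
}
```

In this model any store of a small CPU number into the overlapping word
clears the marker bit, so the scheme notices every such store and not just
an actual change of CPU.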

The kernel still as of today (the arm64 bug aside) updates the
cpu_id_start and cpu_id fields in rseq when a task is rescheduled to
another CPU.

So if the code only needs to know when it got rescheduled to another
CPU, then it should still work, no?

But it does not, which makes it clear that it relies on this
undocumented behaviour of the kernel to rewrite rseq::cpu_id_start
unconditionally. I'm not yet convinced that it relies on it only when
interrupted between two subsequent critical sections. We'll see.

....

Now we come to the best part of this comment:

// Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose.

So any code sequence which ends up in:

   x = tcmalloc();
   dostuff(x)
     evaluate(rseq::cpu_id_start, rseq::cpu_id)

is doomed. This might be acceptable for Google internal usage where they
control the full stack and can prevent anyone else from utilizing rseq,
but in an open ecosystem that's obviously a non-starter.
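
A small model of why such a sequence breaks, under stated assumptions:
struct mock_rseq_ids, NR_CPUS_MODEL, allocator_caches_pointer() and
pick_slot() are hypothetical names for illustration only. The sketch assumes
the allocator stores the upper 32 bits of a cached pointer into the word
that other code expects to hold a CPU number.

```c
#include <stdint.h>

#define NR_CPUS_MODEL 8U	/* hypothetical CPU count for the model */

/* Reduced, hypothetical stand-in for the two CPU fields of struct rseq. */
struct mock_rseq_ids {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
};

/* Models the allocator overlaying the top half of a pointer on cpu_id_start. */
static void allocator_caches_pointer(struct mock_rseq_ids *rs, uint64_t ptr)
{
	rs->cpu_id_start = (uint32_t)(ptr >> 32);
}

/*
 * Models a second rseq user that treats cpu_id_start as an index into a
 * per-CPU array. Returns the slot, or -1 if the value is no CPU number.
 */
static int pick_slot(const struct mock_rseq_ids *rs)
{
	if (rs->cpu_id_start >= NR_CPUS_MODEL)
		return -1;
	return (int)rs->cpu_id_start;
}
```

Real users of cpu_id_start typically have no such bounds check, because the
field is expected to hold a usable CPU number at all times.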

And they definitely forgot to add this to the comment:

// Never enable CONFIG_RSEQ_DEBUG in the kernel when you use tcmalloc as
// it will expose the blatant ABI abuse and therefore will kill your
// application.

If your assumption that the rewrite is only required when rseq::rseq_cs
is non-NULL and user space was interrupted is correct, then the obvious
no-brainer would have been to add:

        __u64	rseq_usr_data;

to struct rseq and clear that unconditionally when rseq::rseq_cs is
cleared.

But that would have been too simple, would work independent of endianness
and would not be in the way of anybody else.
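
A rough user-space model of that idea, as a sketch only: struct
mock_rseq_usr is a reduced, hypothetical stand-in for struct rseq,
rseq_usr_data does not exist in the real ABI, and mock_abort_event() merely
imitates what the kernel side would do under this proposal.

```c
#include <stdint.h>

/* Hypothetical, reduced model. Not the real UAPI struct rseq. */
struct mock_rseq_usr {
	uint32_t cpu_id_start;
	uint32_t cpu_id;
	uint64_t rseq_cs;
	uint64_t rseq_usr_data;	/* proposed field, owned by user space */
};

/*
 * Models an abort-worthy event (preemption, signal delivery): the CPU
 * fields keep their documented meaning, and the user data word is cleared
 * together with rseq_cs.
 */
static void mock_abort_event(struct mock_rseq_usr *rs, uint32_t new_cpu)
{
	rs->cpu_id_start = new_cpu;
	rs->cpu_id = new_cpu;
	if (rs->rseq_cs) {
		rs->rseq_cs = 0;
		rs->rseq_usr_data = 0;
	}
}

/* User space: zero means "recompute the cached slabs address". */
static int usr_cache_valid(const struct mock_rseq_usr *rs)
{
	return rs->rseq_usr_data != 0;
}
```

Because the whole 64-bit word is cleared, the check does not care which
half of the word a 32-bit store would hit, and cpu_id_start stays usable
for everybody else.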

But I know that's incompatible with the "features first, correctness
later, and we own the world anyway" mindset.

Just for giggles I asked Google Gemini about the implications of
tcmalloc's rseq abuse. The answer is pretty clear:

   "In short, TCMalloc treats RSEQ as a private optimization rather than
    a shared system resource, which compromises the stability and
    extensibility of any application that needs RSEQ for anything other
    than memory allocation."

It's also very clear about the wilful ignorance of the tcmalloc people:

   "In summary, the developers have known for at least 6 years that the
    implementation was non-standard and conflicting with other rseq
    usage. The github issue which requested glibc compatibility was
    opened in 2022 and has been unresolved since then."

Thanks,

        tglx

