From: Thomas Gleixner <tglx@kernel.org>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Mathias Stearn <mathias@mongodb.com>,
Dmitry Vyukov <dvyukov@google.com>,
Peter Zijlstra <peterz@infradead.org>,
linux-man@vger.kernel.org, Mark Rutland <mark.rutland@arm.com>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Chris Kennelly <ckennelly@google.com>,
regressions@lists.linux.dev, Ingo Molnar <mingo@kernel.org>,
Blake Oler <blake.oler@mongodb.com>,
Florian Weimer <fweimer@redhat.com>,
Rich Felker <dalias@libc.org>,
Matthew Wilcox <willy@infradead.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Linus Torvalds <torvalds@linuxfoundation.org>
Subject: [patch 00/10] rseq: Cure refactoring regressions
Date: Wed, 29 Apr 2026 01:33:31 +0200
Message-ID: <20260428221058.149538293@kernel.org>
Mathias reported that as of Linux 6.19 TCMalloc fails to work with RSEQ,
which is related to the non-ABI conforming usage of RSEQ by TCMalloc:
https://lore.kernel.org/CAHnCjA25b+nO2n5CeifknSKHssJpPrjnf+dtr7UgzRw4Zgu=oA@mail.gmail.com
How we got there:
1) The original RSEQ implementation updates the rseq::cpu_id_start field
in user space more or less unconditionally on every exit to user
space, whether the CPU/MMCID have been changed or not.
That went unnoticed for years because nothing used rseq aside from
Google and TCMalloc. Once glibc registered rseq, this resulted in an
up to 15% performance penalty for syscall heavy workloads.
2) The rseq::cpu_id_start field is documented as read only for user
space in the ABI contract and guaranteed to be updated by the
kernel when a task is migrated to a different CPU.
3) TCMalloc abuses the sub-optimal implementation (see #1) and scribbles
over rseq::cpu_id_start for their own nefarious purposes.
4) As a consequence of #3 TCMalloc cannot be used together with any
other facility/library which wants to utilize the ABI guaranteed
properties of rseq::cpu_id_start in the same application.
5) TCMalloc has violated the ABI from day one and has since refused to
address the problem despite being offered a kernel side rseq
extension to solve it many years ago and despite claiming to
be happy to accommodate.
6) When addressing the performance issues of RSEQ the unconditional
update ceased to exist under the valid assumption that the kernel has
only to satisfy the guaranteed ABI properties, which in turn broke
TCMalloc.
Due to that everyone is in a hard place and up a creek without a paddle.
After various solutions had been discussed and explored, it turned out that
this can be solved sanely with the following focus points:
1) Keep it as simple as possible and avoid expensive workarounds for the
sake of TCMalloc
2) [Re]enable the wider ecosystem to leverage the full potential of RSEQ
There are some unavoidable downsides which come with this:
1) TCMalloc compatible mode is mutually exclusive with the performance
optimizations which broke it in the first place
2) As collateral damage, older glibc versions which do not implement
the variable sized RSEQ registration suffer from #1
3) Existing (time slice extensions) and future extended RSEQ features
can't be enabled for #1 and #2
The required effort for solving these problems, the resulting maintenance
burdens and the thereby inflicted road blocks for further improvements on
the v2 ABI requirements are severe enough to accept that the unwillingness
of a single entity to collaborate with the larger ecosystem for many years
causes the collateral damage described in #2 and therefore #3.
As Linus decreed, the onus is on the lack of ABI compliance enforcement in
the original RSEQ kernel implementation and the clever abuse is fine.
That's technically correct, but in the context of the larger ecosystem a
fundamentally flawed decision. Though that's a completely different
discussion to have as it affects the long term sustainability of the Open
Source ecosystem in general and the ability to protect it against rogue
actors, which are thereby officially entitled to hold a whole ecosystem
hostage and force the people who provide them their operational base to go
out on a limb to make progress. Despite public statements that they are
aware of the ABI violations and "happy" to adjust as the kernel side
evolves. Truly clever by some definition of clever.
Back to the technical issues and the solution.
Thanks to the laziness or lack of maintenance of various architectures the
uptake of the generic entry code infrastructure is limited.
That caused the optimization rework to keep the original code around pretty
much unmodified plus-minus a few trivial details. So far the optimized code
has been written so that the original code paths are optimized out by the
compiler via compile time constant conditions (Kconfig options).
That allows replacing these compile time constant conditions with runtime
evaluations depending on the ABI mode of RSEQ. With some effort it was
possible to reduce the impact of those runtime conditionals so that it
vanishes in the noise of performance counter based observations.
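For illustration only, the difference between the unconditional and the
conditional update can be modelled in a few lines of user space C. This is
a sketch and not the kernel code: struct task_model, register_task() and
exit_to_user() are hypothetical names, and the kernel side bookkeeping is
reduced to a single cached CPU number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ABI modes; the names are illustrative, not the kernel's. */
enum rseq_mode { RSEQ_MODE_LEGACY, RSEQ_MODE_V2 };

struct task_model {
	enum rseq_mode mode;   /* decided once, at registration time */
	uint32_t kernel_cpu;   /* what the kernel last published */
	uint32_t user_cpu_id;  /* stands in for the user space field */
	unsigned long writes;  /* number of stores to the user space field */
};

/* Registration always publishes the current CPU once. */
static void register_task(struct task_model *t, enum rseq_mode mode, uint32_t cpu)
{
	assert(t != NULL);
	t->mode = mode;
	t->kernel_cpu = cpu;
	t->user_cpu_id = cpu;
	t->writes = 1;
}

/* Called on every simulated exit to user space. */
static void exit_to_user(struct task_model *t, uint32_t cpu)
{
	/* Runtime check which takes the place of a compile time constant. */
	if (t->mode == RSEQ_MODE_V2 && t->kernel_cpu == cpu)
		return;        /* nothing changed, skip the store */

	t->kernel_cpu = cpu;
	t->user_cpu_id = cpu;
	t->writes++;
}
```

In this model a legacy task pays for one store per exit to user space,
while a v2 task pays only when the CPU actually changed.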
That's the easy part. The more interesting question was how to distinguish
between a v1 (legacy) and v2 (optimized and extended feature enabled) RSEQ
registration to avoid putting the whole ecosystem back into the performance
stone age.
This came with two options:
1) A new registration feature flag, which the whole ecosystem would
have to adopt, including the resulting consequences for CRIU
and others.
2) Make the ABI variant depend on the registration size.
#2 turned out to be the least horrible choice.
The original RSEQ implementation, which TCMalloc depends on, registers a 32
byte region (ORIG_RSEQ_SIZE). This region has a 32 byte alignment
requirement.
The extension safe newer variant exposes the kernel RSEQ feature size via
getauxval(AT_RSEQ_FEATURE_SIZE) and the alignment requirement via
getauxval(AT_RSEQ_ALIGN). The alignment requirement is that the registered
RSEQ region is aligned to the next power of two of the feature size. The
kernel currently has a feature size of 33 bytes, which means the alignment
requirement is 64 bytes.
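As a sketch of the arithmetic above, user space could derive the alignment
as follows. The two AT_* values are the ones from the kernel's uapi
auxvec.h and are only defined here as a fallback for older libc headers.
next_pow2() and rseq_required_align() are hypothetical helper names, and
the fallback to 32 bytes for kernels without these auxv entries is an
assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/auxv.h>

/* From the kernel's uapi <linux/auxvec.h>; older libc headers may lack them. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif
#ifndef AT_RSEQ_ALIGN
#define AT_RSEQ_ALIGN 28
#endif

/* Round up to the next power of two: 33 becomes 64, 32 stays 32. */
static size_t next_pow2(size_t size)
{
	size_t p = 1;

	assert(size > 0 && size <= SIZE_MAX / 2 + 1);
	while (p < size)
		p <<= 1;
	return p;
}

/* Hypothetical helper: alignment to use for the registered region. */
static size_t rseq_required_align(void)
{
	size_t align = getauxval(AT_RSEQ_ALIGN);
	size_t fsize = getauxval(AT_RSEQ_FEATURE_SIZE);

	if (align)
		return align;
	/* No auxv entries: assume the original 32 byte layout. */
	return fsize ? next_pow2(fsize) : 32;
}
```

The value reported via AT_RSEQ_ALIGN takes precedence. The computation
from the feature size is only there to show where the 64 bytes for a 33
byte feature size come from.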
The TCMalloc RSEQ region is embedded into a cache line aligned data
structure starting at offset 32 bytes so that bytes 28-31 and the
cpu_id_start field at bytes 32-35 form a 64-bit little endian pointer with
the top-most bit (63) set. That bit is used to check whether the kernel has
overwritten cpu_id_start with an actual CPU id value, which is guaranteed
to not have the top-most bit set.
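The overlay can be modelled byte by byte. The following is a sketch of the
layout as described above, not TCMalloc's actual code. struct slab_block
and the three helpers are hypothetical names, and the kernel's store to
cpu_id_start is simulated by a plain user space write.

```c
#include <stdbool.h>
#include <stdint.h>

#define RSEQ_OFFSET 32           /* region start; cpu_id_start is its first field */
#define TAG_BIT     (1ULL << 63)

/* Cache line sized and aligned block; bytes 32-63 stand in for the rseq region. */
struct slab_block {
	_Alignas(64) uint8_t bytes[64];
};

/* Store a 64-bit little endian value with bit 63 set at bytes 28-35. */
static void store_tagged(struct slab_block *b, uint64_t val)
{
	val |= TAG_BIT;
	for (int i = 0; i < 8; i++)
		b->bytes[RSEQ_OFFSET - 4 + i] = (uint8_t)(val >> (8 * i));
}

/* Simulates the kernel writing a CPU id into cpu_id_start (bytes 32-35). */
static void kernel_writes_cpu(struct slab_block *b, uint32_t cpu)
{
	for (int i = 0; i < 4; i++)
		b->bytes[RSEQ_OFFSET + i] = (uint8_t)(cpu >> (8 * i));
}

/* True as long as the upper half has not been replaced by a CPU id. */
static bool still_tagged(const struct slab_block *b)
{
	/* Bit 63 of the 64-bit value is the top bit of byte 35. */
	return (b->bytes[RSEQ_OFFSET + 3] & 0x80) != 0;
}
```

Any CPU id the kernel writes has the top bit clear, so a single test of
bit 63 tells whether the upper four bytes still belong to the stored
value.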
As this is part of TCMalloc's performance tuned magic, it's a pretty safe
assumption that TCMalloc won't use a larger RSEQ size.
This allows the kernel to treat registrations with a size greater than the
original size of 32 bytes, which is the case since time slice extensions
got introduced, as RSEQ ABI v2 with the following differences to the
original behaviour:
1) Unconditional updates of the user read only fields (CPU, node, MMCID)
are removed. Those fields are only updated on registration, task
migration and MMCID changes.
2) Unconditional evaluation of the critical section pointer is
removed. It's only evaluated when user space was interrupted and was
scheduled out or before delivering a signal in the interrupted
context.
3) The read only requirement of the ID fields is enforced. When the
kernel detects that userspace manipulated the fields, the process is
terminated. This ensures that multiple entities (libraries) can
utilize RSEQ without interfering.
4) Today's extended RSEQ feature (time slice extensions) and future
extensions are only enabled in the v2 enabled mode.
Registrations with the original size of 32 bytes operate in backwards
compatible legacy mode without performance improvements and extended
features.
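Reduced to its core, the size based selection could look like the sketch
below. The enum and the function name are hypothetical, and treating
lengths below 32 bytes as invalid is an assumption of this example rather
than something stated above. The alignment for the larger variant is
passed in by the caller, e.g. the value obtained via
getauxval(AT_RSEQ_ALIGN).

```c
#include <stddef.h>
#include <stdint.h>

#define ORIG_RSEQ_SIZE 32

/* Hypothetical result type; the names are illustrative only. */
enum rseq_abi { RSEQ_ABI_INVALID, RSEQ_ABI_LEGACY, RSEQ_ABI_V2 };

/*
 * Select the ABI variant from the registration length and check the
 * address against the alignment which applies to that variant.
 * v2_align must be a power of two.
 */
static enum rseq_abi rseq_abi_for_registration(uintptr_t addr, size_t len,
					       size_t v2_align)
{
	if (len < ORIG_RSEQ_SIZE)
		return RSEQ_ABI_INVALID;

	if (len == ORIG_RSEQ_SIZE) {
		/* Original layout: 32 bytes, 32 byte aligned, legacy mode. */
		return (addr % ORIG_RSEQ_SIZE) ? RSEQ_ABI_INVALID : RSEQ_ABI_LEGACY;
	}

	/* Anything larger is treated as v2. */
	return (addr & (v2_align - 1)) ? RSEQ_ABI_INVALID : RSEQ_ABI_V2;
}
```

A 32 byte registration at a 32 byte aligned address ends up in legacy
mode. The same address with a larger length is only accepted when it also
satisfies the larger alignment.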
The following series implements exactly this and is based on v7.1-rc1. It
applies cleanly to v7.0 and would need some minor tweaks to be backported
to the already EOL v6.19.y kernel.
It's also available from git. For the convenience of testers the git branch
also contains the related ARM64 fix, which affects not just the TCMalloc
user base.
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git core/rseq
The combination in this tree survives:
- The kernel side (adjusted) selftests
- The provided TCMalloc tests (only partial for ARM64 as the provided
main TCMalloc tests are x86 only and TCMalloc is impossible to build
for mere mortals).
Thanks to everyone involved for feedback, suggestions, testing and test
cases!
Thanks,
tglx
---
 Documentation/userspace-api/rseq.rst               |   94 +++++++++
 include/linux/rseq.h                               |   35 ++-
 include/linux/rseq_entry.h                         |  110 +++++------
 include/linux/rseq_types.h                         |    9
 include/uapi/linux/rseq.h                          |    5
 kernel/rseq.c                                      |  209 +++++++++++++--------
 kernel/sched/membarrier.c                          |   11 +
 tools/testing/selftests/rseq/Makefile              |    7
 tools/testing/selftests/rseq/check_optimized.c     |   17 +
 tools/testing/selftests/rseq/legacy_check.c        |   65 ++++++
 tools/testing/selftests/rseq/param_test.c          |   22 +-
 tools/testing/selftests/rseq/rseq-abi.h            |    7
 tools/testing/selftests/rseq/rseq.c                |   39 +--
 tools/testing/selftests/rseq/rseq.h                |    8
 tools/testing/selftests/rseq/run_legacy_check.sh   |    4
 tools/testing/selftests/rseq/run_param_test.sh     |   39 +++
 tools/testing/selftests/rseq/run_timeslice_test.sh |   14 +
 tools/testing/selftests/rseq/slice_test.c          |   12 -
 18 files changed, 521 insertions(+), 186 deletions(-)