public inbox for linux-kernel@vger.kernel.org
* [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere
@ 2026-04-22  9:50 Mathias Stearn
  2026-04-22 12:56 ` Peter Zijlstra
  2026-04-22 13:09 ` Mark Rutland
  0 siblings, 2 replies; 32+ messages in thread
From: Mathias Stearn @ 2026-04-22  9:50 UTC (permalink / raw)
  To: Thomas Gleixner, Mathieu Desnoyers, Catalin Marinas, Will Deacon,
	Boqun Feng, Paul E. McKenney
  Cc: Chris Kennelly, Dmitry Vyukov, regressions, linux-kernel,
	linux-arm-kernel, Peter Zijlstra, Ingo Molnar, Mark Rutland,
	Jinjie Ruan, Blake Oler


[-- Attachment #1.1: Type: text/plain, Size: 4968 bytes --]

TL;DR: As of 6.19, rseq no longer provides the documented atomicity
guarantees on arm64: it fails to abort the critical section on same-core
preemption/resumption. Additionally, it breaks tcmalloc specifically by no
longer overwriting the cpu_id_start field at points where tcmalloc relied
on that write for correctness.

This is a SEVERE breakage for MongoDB. We received several user reports of
crashes on 6.19. I made a stress test that showed that 6.19 can cause
malloc to return the same pointer twice without it being freed. Because
that can cause arbitrary corruption, our latest releases have all been
patched to refuse to start at all on 6.19+.

TCMalloc uses rseq in a "creative" way described at
https://github.com/google/tcmalloc/blob/master/docs/rseq.md. In particular,
the "Current CPU Slabs Pointer Caching" section describes an optimization
that relies on the (undocumented) fact that the kernel always overwrote
cpu_id_start (even when its value wouldn't change) to invalidate a
user-space cache. Since the change to stop writing cpu_id_start appears to
be intentional, part of a refactoring merged in 2b09f480f0a1, I started
working on a userspace patch to stop relying on that. Unfortunately, once
that was complete, I ran into a wall that is impossible to work around from
userspace.

On arm64, the kernel no longer meets the documented guarantee that rseq
critical sections are atomic with respect to preemption: it appears to
abort the critical section only when the thread is migrated to a different
core. The attached test demonstrates this. It passes on x86_64 both before
and after 6.19, and on arm64 before 6.19, but fails on arm64 with 6.19. It
pins the process to a single core and then enters an rseq critical section
that observes a change made by another thread, which is supposed to be
impossible. I think this will break basically any real usage of rseq other
than just reading the current cpu_id.

An LLM pointed to these two specific commits in the refactor as the cause
(oldest first):
- 39a167560a61 rseq: Optimize event setting
This assumed that user_irq would be set on preemption, but it isn't on
arm64, so TIF_NOTIFY_RESUME is not raised on same-cpu preemption.
- 566d8015f7ee rseq: Avoid CPU/MM CID updates when no event pending
This broke TCMalloc's slab-caching trick by no longer overwriting
cpu_id_start on every return to userspace.

(I have a lot more analysis and suggested fixes from LLMs since I used them
heavily in this testing and analysis, but I won't spam you with the slop
unless requested)

The arm64 change is a clear breakage and I'm sure it will be
uncontroversial to fix. I can imagine more resistance to reverting to the
old behavior of always overwriting the cpu_id_start field since that seems
to have been an intentional optimization choice. I have reached out to the
TCMalloc maintainers (CC'd) and believe there is a solution that gets the
vast majority of the optimization while still preserving the behavior that
TCMalloc currently relies on[1].

Any time a critical section might be aborted (migration, preemption, signal
delivery, and membarrier IPI), the kernel already must (but currently
doesn't on arm64) check the rseq_cs field to see whether the thread is in a
critical section, and is documented as nulling the pointer afterwards (I
assume to make later checks cheaper). It would be sufficient for tcmalloc's
internal usage if, every time the kernel nulled out rseq_cs, it also wrote
the cpu id to cpu_id_start. That should be essentially free since the
kernel is already writing to the same cache line.

It was pointed out that this could be an issue if another rseq user in the
same thread nulled rseq_cs itself after its critical section, which would
require the kernel to update cpu_id_start each time it checks rseq_cs,
regardless of whether it nulls it. We aren't aware of any processes that
mix tcmalloc with other rseq usages that null out the field from
userspace, but we can't rule them out since tcmalloc is open source.
Either way, this preserves the property of not updating cpu_id_start on
every syscall return and non-membarrier interrupt, which I assume is where
the majority of the optimization win came from.

All testing of problematic versions was performed on x86_64 and
aarch64 Ubuntu 24.04.4 with the kernel manually upgraded to
6.19.8-061908-generic. Source analysis was performed on the v6.19 tag. I
had a few AI agents confirm that nothing in the relevant changes to master
should have solved this, but I have not yet tested there.

$ cat /proc/version
Linux version 6.19.8-061908-generic (kernel@balboa)
(aarch64-linux-gnu-gcc-15 (Ubuntu 15.2.0-15ubuntu1) 15.2.0, GNU ld (GNU
Binutils for Ubuntu) 2.46) #202603131837 SMP PREEMPT_DYNAMIC Sat Mar 14
00:00:07 UTC 2026

[1]  There is also an exploration of some options to make tcmalloc not rely
on the cpu_id_start overwriting. However we would strongly prefer that
existing binaries continue to work on 6.19 kernels, even if newer binaries
don't need that. At least for a good while.


[-- Attachment #2: rseq_same_cpu_preempt_test.cc --]
[-- Type: application/octet-stream, Size: 8419 bytes --]

// Minimal single-file rseq repro for same-CPU preemption handling.
//
// Build:
//   g++ -O2 -std=c++20 -pthread rseq_same_cpu_preempt_test.cc -o rseq_same_cpu_preempt_test
//
// The main thread pins itself and a writer thread to one CPU. It then enters an
// rseq critical section that stores 0 to a shared flag and spins until it sees
// the flag become 1. If the critical section resumes after a preemption without
// being aborted, it will eventually observe the writer's 1 and abort.
//
// The writer thread wakes every 10 usec and stores 1 to the shared flag.
//
// Expected behavior if rseq preemption aborts work correctly:
//   the program runs for 10 seconds and exits 0.
//
// Expected behavior if same-CPU preemption can resume inside the CS:
//   the main thread eventually reads 1 inside the CS and aborts.
//
// Note to readers: the top of this file is boring setup code. The interesting
// code starts at run_one_rseq_attempt() so you should skip down there first.

#include <errno.h>
#include <linux/rseq.h>
#include <sched.h>
#include <sys/rseq.h>
#include <unistd.h>

#include <chrono>
#include <cstdarg>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

#if !defined(__aarch64__) && !defined(__x86_64__)
#error "This repro is currently implemented for aarch64 and x86_64 only."
#endif

namespace {

constexpr std::chrono::seconds kRuntime{10};
constexpr long kWriterSleepNs = 10'000;  // 10 usec

alignas(4) uint32_t g_shared_flag = 0;

struct rseq* current_rseq_abi() {
    auto* thread_ptr = reinterpret_cast<char*>(__builtin_thread_pointer());
    return reinterpret_cast<struct rseq*>(thread_ptr + __rseq_offset);
}

[[noreturn, gnu::format(printf, 1, 2)]] void die(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    std::vfprintf(stderr, fmt, args);
    va_end(args);
    std::fprintf(stderr, "\n");
    _Exit(1);
}

[[noreturn]] void die_errno(const char* what) {
    die("%s failed: %s", what, std::strerror(errno));
}

int pick_and_pin_first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        die_errno("sched_getaffinity");
    }

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                die_errno("sched_setaffinity");
            }
            return cpu;
        }
    }
    die("No allowed CPU found");
}

#define RSEQ_STR_1(x) #x
#define RSEQ_STR(x) RSEQ_STR_1(x)

#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \
    ".pushsection __rseq_cs, \"aw\"  \n\t"                                              \
    ".balign 32  \n\t"                                                                  \
    RSEQ_STR(label) ":  \n\t"                                                           \
    ".long 0  \n\t"  /* version */                                                      \
    ".long 0  \n\t"  /* flags */                                                        \
    ".quad " RSEQ_STR(start_ip) "  \n\t"  /* start_ip */                                \
    ".quad " RSEQ_STR((post_commit_ip) - (start_ip)) "  \n\t"  /* post_commit_offset */ \
    ".quad " RSEQ_STR(abort_ip) "  \n\t"  /* abort_ip */                                \
    ".popsection  \n\t"

int run_one_rseq_attempt(struct rseq* abi, uint32_t* shared_flag) {
    int result = 0;

#ifdef C_EQUIVALENT  // C equivalent (illustrative only; never compiled):
    // Critical section: store 0, then spin until the flag becomes 1
    *shared_flag = 0;
    while (*shared_flag == 0) {
        // spin
    }
    result = 1;   // Observed flag == 1
    goto done;    // Normal path skips the abort handler
abort:            // Abort handler (kernel jumps here if preempted inside CS)
    result = -1;  // We correctly observed a preemption inside the CS
done:;
#elif defined(__aarch64__)
    __asm__ __volatile__(
        RSEQ_ASM_DEFINE_TABLE(1, 2f, 3f, 4f)
        //  Store address of rseq_cs descriptor into abi->rseq_cs
        "  adrp x15, 1b  \n"
        "  add x15, x15, :lo12:1b  \n"
        "  str x15, %[rseq_cs]  \n"
        "2:  \n"                         //  Critical section start (label 2)
        "  str wzr, %[shared_flag]  \n"  //  *shared_flag = 0
        "5:  \n"                         //  Spin loop: while (*shared_flag == 0) {}
        "  ldr w15, %[shared_flag]  \n"  //  w15 = *shared_flag
        "  cbz w15, 5b  \n"              //  if (w15 == 0) goto 5 (spin)
        "  mov %w[result], #1  \n"       //  result = 1 (observed flag == 1)
        "3:  \n"                         //  Critical section end - fall through
        "  b 99f  \n"                    //  Jump past abort handler
        "  .long %c[sig]  \n"            //  RSEQ signature (magic bytes required by kernel)
        "4:  \n"                         //  Abort handler entry (label 4)
        "  mov %w[result], #-1  \n"      //  result = -1
        "99:  \n"                        //  End of abort handler
        : [result] "+r"(result), [rseq_cs] "=m"(abi->rseq_cs), [shared_flag] "+Q"(*shared_flag)
        : [sig] "i"(RSEQ_SIG)
        : "memory", "x15");
#elif defined(__x86_64__)
    __asm__ __volatile__(
        RSEQ_ASM_DEFINE_TABLE(1, 2f, 3f, 4f)
        //  Store address of rseq_cs descriptor into abi->rseq_cs
        "  leaq 1b(%%rip), %%rax  \n\t"
        "  movq %%rax, %[rseq_cs]  \n\t"
        "2:  \n\t"                            //  Critical section start (label 2)
        "  movl $0, %[shared_flag]  \n\t"     //  *shared_flag = 0
        "5:  \n\t"                            //  Spin loop: while (*shared_flag == 0) {}
        "  movl %[shared_flag], %%eax  \n\t"  //  eax = *shared_flag
        "  testl %%eax, %%eax  \n\t"          //  test eax == 0
        "  jz 5b  \n\t"                       //  if (eax == 0) goto 5 (spin)
        "  movl $1, %[result]  \n\t"          //  result = 1 (observed flag == 1)
        "3:  \n\t"                            //  Critical section end - fall through
        "  jmp 99f  \n\t"                     //  Jump past abort handler
        "  .long %c[sig]  \n\t"               //  RSEQ signature (magic bytes required by kernel)
        "4:  \n\t"                            //  Abort handler entry (label 4)
        "  movl $-1, %[result]  \n\t"         //  result = -1
        "99:  \n\t"                           //  End of abort handler
        : [result] "+r"(result), [rseq_cs] "=m"(abi->rseq_cs), [shared_flag] "+m"(*shared_flag)
        : [sig] "i"(RSEQ_SIG)
        : "memory", "cc", "rax");
#endif

    return result;
}

void writer_thread_main() {
    while (true) {
        std::this_thread::sleep_for(std::chrono::nanoseconds(kWriterSleepNs));
        __atomic_store_n(&g_shared_flag, 1u, __ATOMIC_RELAXED);
    }
}

}  // namespace

int main() {
    if (__rseq_size == 0) {
        die("rseq is not registered for this thread (glibc __rseq_size == 0); "
            "need glibc >= 2.35 with rseq support and a kernel that supports rseq");
    }
    const int cpu = pick_and_pin_first_allowed_cpu();
    if ((int)current_rseq_abi()->cpu_id != cpu) {
        die("rseq abi cpu_id is %d after pinning rather than %d",
            current_rseq_abi()->cpu_id,
            cpu);
    }

    std::thread(writer_thread_main).detach();

    const auto deadline = std::chrono::steady_clock::now() + kRuntime;
    uint64_t attempts = 0;
    uint64_t abort_retries = 0;

    while (std::chrono::steady_clock::now() < deadline) {
        ++attempts;
        const int rc = run_one_rseq_attempt(current_rseq_abi(), &g_shared_flag);
        if (rc == 1) {
            die("Observed shared_flag == 1 inside the rseq critical section "
                "after %llu attempts on cpu %d",
                static_cast<unsigned long long>(attempts),
                cpu);
        } else if (rc != -1) {
            die("Unexpected return value from rseq: %d after %llu attempts",
                rc,
                static_cast<unsigned long long>(attempts));
        }
        ++abort_retries;
    }

    std::fprintf(stderr,
                 "PASS: ran for %lld seconds on cpu %d, attempts=%llu abort_retries=%llu\n",
                 static_cast<long long>(kRuntime.count()),
                 cpu,
                 static_cast<unsigned long long>(attempts),
                 static_cast<unsigned long long>(abort_retries));
    _Exit(0);
}


end of thread, other threads:[~2026-04-23 23:08 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-22  9:50 [REGRESSION] rseq: refactoring in v6.19 broke everyone on arm64 and tcmalloc everywhere Mathias Stearn
2026-04-22 12:56 ` Peter Zijlstra
2026-04-22 13:13   ` Peter Zijlstra
2026-04-23 10:38     ` Mathias Stearn
     [not found]     ` <CAHnCjA2fa+dP1+yCYNQrTXQaW-JdtfMj7wMikwMeeCRg-3NhiA@mail.gmail.com>
2026-04-23 11:48       ` Thomas Gleixner
2026-04-23 12:11         ` Mathias Stearn
2026-04-23 17:19           ` Thomas Gleixner
2026-04-23 17:38             ` Chris Kennelly
2026-04-23 17:47               ` Mathieu Desnoyers
2026-04-23 19:39               ` Thomas Gleixner
2026-04-23 17:41             ` Linus Torvalds
2026-04-23 18:35               ` Mathias Stearn
2026-04-23 18:53               ` Mark Rutland
2026-04-23 21:03               ` Thomas Gleixner
2026-04-23 21:28                 ` Linus Torvalds
2026-04-23 23:08                   ` Linus Torvalds
2026-04-22 13:09 ` Mark Rutland
2026-04-22 17:49   ` Thomas Gleixner
2026-04-22 18:11     ` Mark Rutland
2026-04-22 19:47       ` Thomas Gleixner
2026-04-23  1:48         ` Jinjie Ruan
2026-04-23  5:53           ` Dmitry Vyukov
2026-04-23 10:39             ` Thomas Gleixner
2026-04-23 10:51               ` Mathias Stearn
2026-04-23 12:24                 ` David Laight
2026-04-23 19:31                 ` Thomas Gleixner
2026-04-23 12:11             ` Alejandro Colomar
2026-04-23 12:54               ` Mathieu Desnoyers
2026-04-23 12:29             ` Mathieu Desnoyers
2026-04-23 12:36               ` Dmitry Vyukov
2026-04-23 12:53                 ` Mathieu Desnoyers
2026-04-23 12:58                   ` Dmitry Vyukov
