linux-kernel.vger.kernel.org archive mirror
* [patch V2 00/37] rseq: Optimize exit to user space
@ 2025-08-23 16:39 Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
                   ` (37 more replies)
  0 siblings, 38 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt, linux-arch, Thomas Bogendoerfer, Michael Ellerman,
	Jonas Bonn

This is a follow-up to the initial series, which made a very basic attempt
to sanitize the RSEQ handling in the kernel:

   https://lore.kernel.org/all/20250813155941.014821755@linutronix.de

Further analysis turned up more problems beyond those initial ones:

  1) task::rseq_event_mask is a pointless bit-field given that the ABI
     flags it was meant to support were deprecated and functionally
     disabled three years ago.

  2) task::rseq_event_mask accumulates bits unless a critical section is
     discovered in the user space rseq memory. This results in pointless
     invocations of the rseq user space exit handler even if nothing has
     changed. As a matter of correctness these bits have to be clear when
     exiting to user space and therefore pristine when coming back into
     the kernel. Aside from correctness, this also avoids pointless
     evaluation of the user space memory, which is a performance benefit.

  3) The evaluation of critical sections does not differentiate between
     syscall and interrupt/exception exits. The current implementation
     silently tolerates and fixes up critical sections which invoked a
     syscall unless CONFIG_DEBUG_RSEQ is enabled.

     That's just wrong. If user space does that on a production kernel it
     can keep the pieces. The kernel is not there to proliferate mindless
     user space programming and to let everyone pay the performance
     penalty.

Additional findings:

  4) The decision when to raise the exit work is more than suboptimal.
     Basically every context switch raises it if the task has rseq, which
     is nowadays likely as glibc makes use of it when available.

     The consequence is that a lot of exits have to process RSEQ just for
     nothing. The only reasons to do so are:

       the task was interrupted in user space and schedules

     or

       the CPU or MM CID changes in schedule() independent of the entry
       mode

     That obviously reduces the invocation space significantly.
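
     A minimal sketch of that reduced decision; the helper name and the
     way the two conditions are derived are illustrative only and do not
     necessarily match the actual patches:

        static inline void rseq_note_sched_switch(struct task_struct *t,
                                                  bool user_irq_entry,
                                                  bool ids_changed)
        {
                if (!t->rseq)
                        return;
                /* Interrupted in user space and scheduled, or IDs changed */
                if (user_irq_entry || ids_changed) {
                        t->rseq_event_pending = true;
                        set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
                }
        }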

  5) Signal handling does the RSEQ update unconditionally.

     That's wrong as the only reason to do so is when the task was
     interrupted in user space independent of a schedule event.

     The only important task in that case is to handle the critical section,
     because after switching to the signal frame the original return IP is
     no longer available.

     The CPU/MM CID values do not need to be updated at that point as they
     can change again before the signal delivery goes out to user space.

     Again, if the task was in a critical section and issued a syscall then
     it can keep the pieces as that's a violation of the ABI contract.

  6) CPU and MM CID are updated unconditionally

     That's again a pointless exercise when they didn't change. In that
     case the only action required is to check the critical section, and
     that only if the entry came via an interrupt.

     That can obviously be avoided by caching the values written to user
     space and skipping that path if they haven't changed.
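
     A sketch of the caching idea; the struct and field names are made up
     for illustration and do not necessarily match the series:

        /* Values last written to the user space rseq area */
        struct rseq_ids {
                u32     cpu_id;
                u32     mm_cid;
        };

        static inline bool rseq_ids_changed(const struct rseq_ids *cached,
                                            u32 cpu_id, u32 mm_cid)
        {
                /* Skip the whole user access path when nothing changed */
                return cached->cpu_id != cpu_id || cached->mm_cid != mm_cid;
        }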

  7) The TIF_NOTIFY_RESUME mechanism is a horrorshow

     TIF_NOTIFY_RESUME is a multiplexing TIF bit and needs to invoke the
     world and some more. Depending on workloads this can be set by
     task_work, security, block and memory management, all unrelated to
     RSEQ, and quite a few of them are likely to cause a reschedule.
     But most of them are low frequency.

     So doing this work in the loop unconditionally is just waste. The
     correct point to do this is at the end of that loop once all other bits
     have been processed, because that's the point where the task is
     actually going out to user space.

  8) #7 caused another subtle work-for-nothing issue

     IO/URING and hypervisors invoke resume_user_mode_work() with a NULL
     pointer for pt_regs, which causes the RSEQ code to ignore the critical
     section check, but to update the CPU ID / MM CID values unconditionally.

     For IO/URING this invocation is irrelevant because the IO workers can
     never go out to user space and therefore do not have RSEQ memory in
     the first place. So it's a non-problem in the existing code as
     task::rseq is NULL in that case.

     Hypervisors are a different story. They need to drain task_work and
     other pending items, which are multiplexed by TIF_NOTIFY_RESUME,
     before entering guest mode.

     The invocation of resume_user_mode_work() clears TIF_NOTIFY_RESUME,
     which means that if rseq ignored that case it could miss a CPU or
     MM CID update on the way back to user space.

     The handling of that is just a horrible and mindless hack as the event
     might be re-raised between the time the ioctl() enters guest mode and
     the actual exit to user space.

     So the obvious thing is to ignore the regs=NULL call and let the
     offending hypervisor code check, when returning from the ioctl(),
     whether the event bit is set, and if so re-raise the notification.
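
     The shape of that fix, as implemented later in this series (patch 07),
     is a trivial re-check after the vcpu_run() loop:

        /* Invoked by the hypervisor ioctl() after the vcpu_run() loop */
        static inline void rseq_virt_userspace_exit(void)
        {
                if (current->rseq_event_pending)
                        set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
        }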

  9) Code efficiency

     RSEQ aims to improve performance for user space, but it completely
     ignores the fact that this needs to be implemented in a way which
     does not significantly impact the performance of the kernel.

     So far this did not pop up as only a few people used it, but that has
     changed because glibc started to use it widely.

     It's not so efficiOS^H^Hent as advertised:

     For a full kernel rebuild:

     	 exit to user:                  50106428
	 signal checks:                    24703
	 slowpath runs:                   943785 1.88%
	 id updates:                      968488 1.93%
	 cs checks:                       888768 1.77%
	   cs cleared:                      888768 100.00%
	   cs fixup:                             0 0.00%

     The cs_cleared/fixup numbers are relative to cs_checks, which means
     the code does pointless clears even if rseq_cs was already clear,
     which is the common case with glibc. glibc is only interested in the
     CPU/MM CID values so far. And no, it's not only a store, it's the
     whole dance of spectre-v1 mitigation plus user access enable/disable.
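
     For illustration, even a single 32-bit ID update in the rseq TLS area
     has to go through the full user access sequence. A minimal sketch,
     assuming the usual unsafe put_user pattern (x86 flavour; helper name
     and error label are arbitrary):

        static int rseq_set_cpu_id(struct rseq __user *rs, u32 cpu_id)
        {
                /* access_ok() + spectre-v1 barrier + STAC on x86 */
                if (!user_access_begin(rs, sizeof(*rs)))
                        return -EFAULT;
                /* The one store user space actually cares about */
                unsafe_put_user(cpu_id, &rs->cpu_id, efault);
                /* CLAC on x86 */
                user_access_end();
                return 0;
        efault:
                user_access_end();
                return -EFAULT;
        }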

     I really have to ask why all those people who deeply care about
     performance and have utilized this stuff for quite some time have not
     noticed what a performance sh*tshow this inflicts on the kernel.

     Seriously?

     Aside from that, did anyone ever look at the resulting assembly code?

     While the user space counterpart is carefully crafted in hand written
     assembly for efficiency the kernel implementation is a complete and
     utter nightmare.

     C allows writing very efficient code, but the fact that the compiler
     requires help from the programmer to turn it into actually efficient
     assembly is not a totally new finding.

     The absence of properly designed data structures, which would allow
     building efficient and comprehensible code, is amazing.

     Slapping random storage members into struct task_struct, implementing
     a conditional maze around them and then hoping that the compiler will
     create sensible code out of it, is wishful thinking at best.

     I completely understand that all the sanity checks and loops and hoops
     were required to get this actually working. But I'm seriously grumpy
     about the fact that once the PoC code was shoved into the kernel, the
     'performance work' was considered complete and just the next fancy
     'performance' features were piled on top of it.

     Once this is addressed the above picture changes completely:

          exit to user:                  50071399
	  signal checks:                      182
	  slowpath runs:                     8149 0.02%
	  fastpath runs:                   763366 1.52%
	  id updates:                       78419 0.16%
	  cs checks:                            0 0.00%
	  cs cleared:                           0
	  cs fixup:                             0

     And according to perf that very same kernel build consistently gets
     faster from:

           92.500894648 seconds time elapsed (upstream)

     to

           91.693778124 seconds time elapsed

     Not a lot but not in the noise either, i.e. >= 1%

     For a 1e9 gettid() loop this results in going from:

           49.782703878 seconds time elapsed (upstream)

     to

      	   49.327360851 seconds time elapsed

     Not a lot either, but consistently >= 1% like the above.

     For actual rseq critical section usage this makes even more of a
     difference. Aside from avoiding pointless work in the kernel, this
     also does not abort critical sections when not necessary, which
     improves user space performance too. The kernel selftests magically
     complete six seconds faster, which is again not a lot compared to a
     total run time of ~540s. But those tests do not really reflect real
     workloads; they emphasize stress testing the kernel implementation,
     which is perfectly fine. But still, the reduced amount of pointless
     kernel work is what makes the difference:

     Upstream:

	exit to user:                 736568321
	signal checks:                 62002182
	slowpath runs:                358362121 57.07%
	fastpath runs:                        0 0.00%
	id updates:                   358362129 57.07%
	cs checks:                    358362110 57.07%
	  cs cleared:                   347611246 97.01%
	  cs fixup:                      10750863  2.99%

     Upstream + simple obvious fixes:

	exit to user:                 736462043
	signal checks:                 61902154
	slowpath runs:                223394362 30.33%
	fastpath runs:                        0 0.00%
	id updates:                   285296516 38.74%
	cs checks:                    284430371 38.62%
	  cs cleared:                   277110571 97.43%
	  cs fixup:                       7319800  2.57%

     Fully patched:

        exit to user:                 736477630
	signal checks:                        0
	slowpath runs:                     1367 0.00%
	fastpath runs:                114230240 15.51%
	id updates:                   106498492 14.46%
	cs checks:                      7665511 1.04%
	  cs cleared:                     2092386 27.30%
	  cs fixup:                       5573118 72.70%

     Perf confirms:

       46,242,994,988,432      cycles
        8,390,996,405,969      instructions

           1949.923762000 seconds user
          16111.947776000 seconds sys

       versus:

       44,910,362,743,328      cycles		-3 %
        8,305,138,767,914      instructions     -1 %

           1985.342453000 seconds user		+2 %
          15653.639218000 seconds sys		-3 %

     Running the same while looking only at the kernel counts:

      1) Upstream:

         39,530,448,181,974      cycles:khH
          2,872,985,410,904      instructions:khH
            640,968,094,639      branches:khH

      2) Upstream + simple obvious fixes:
                                                            -> #1
         39,005,607,758,104      cycles:khH                 -1.5 %
          2,853,676,629,923      instructions:khH           -0.8 %
            635,538,256,875      branches:khH               -1.0 %

      3) Fully patched:
						   -> #2    -> #1
         38,786,867,052,870      cycles:khH        -0.6 %   -2.3 %
          2,784,773,121,491      instructions:khH  -2.5 %   -3.1 %
            618,677,047,160      branches:khH      -2.7 %   -3.6 %

     Looking at the kernel build that way:

      1) Upstream:

        474,191,665,063      cycles:khH
        164,176,072,767      instructions:khH
         33,037,122,843      branches:khH

      2) Upstream + simple obvious fixes:
                                                            -> #1
         473,423,488,450      cycles:khH                 ~0.0 %
         162,749,873,914      instructions:khH              -1.0 %
          32,264,397,605      branches:khH                  -2.5 %

      3) Fully patched:
                                                   -> #2    -> #1
         468,658,348,098      cycles:khH           -1.1 %   -1.2 %
         160,941,028,283      instructions:khH     -1.2 %   -2.0 %
          31,893,296,267      branches:khH         -1.5 %   -3.7 %

     That's pretty much in line with the 2-3% regressions observed by Jens.

     TBH, I absolutely do not care about the way performance is evaluated
     by the entities who pushed for this, but I very much care about the
     general impact of this.

     It seems to me the performance evaluation by the $corp power users of
     RSEQ stopped at the point where a significant improvement in their
     user space was observed. It's interesting that the resulting overhead
     in the kernel seems not to be that relevant for the overall performance
     evaluation. That's left to others to mop up.

     The tragedy of the commons in action.

TBH, what surprised me most when looking into this in detail was the large
amount of low-hanging fruit which was sitting there in plain sight.

That said, this series addresses the overall problems by:

  1) Limiting the RSEQ work to the actual conditions where it is
     required. The full benefit is only available for architectures using
     the generic entry infrastructure. All others get at least the basic
     improvements.

  2) Re-implementing the whole user space handling based on proper data
     structures and by actually looking at the impact it creates in the
     fast path.

  3) Moving the actual handling of RSEQ out to the latest point in the exit
     path, where possible. This is fully inlined into the fast path to keep
     the impact confined.

     The initial attempt to make it completely independent of TIF bits and
     just handle it with a quick check unconditionally on exit to user
     space turned out to be not feasible. On workloads which are doing a
     lot of quick syscalls the extra four instructions add up
     significantly.

     So instead I ended up doing it at the end of the exit-to-user TIF
     work loop, once all other TIF bits have been processed. At this
     point interrupts are disabled and there is no way that the state
     can change before the task goes out to user space for real.
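
     Conceptually (a sketch modelled on the existing generic entry exit
     loop; the rseq hook name is illustrative, not the final one):

        static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
                                                    unsigned long ti_work)
        {
                while (ti_work & EXIT_TO_USER_MODE_WORK) {
                        local_irq_enable_exit_to_user(ti_work);

                        if (ti_work & _TIF_NEED_RESCHED)
                                schedule();
                        if (ti_work & _TIF_NOTIFY_RESUME)
                                resume_user_mode_work(regs);
                        /* ... other TIF work ... */

                        local_irq_disable_exit_to_user();
                        ti_work = read_thread_flags();
                }

                /*
                 * All other TIF work is done and interrupts are disabled:
                 * nothing can change the CPU/MM CID or raise a new rseq
                 * event before the task really goes out to user space.
                 */
                rseq_exit_to_user_mode_work(regs);      /* illustrative name */

                return ti_work;
        }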

  Regarding the limitations of #1 and #3:

   I wasted several days of my so copious time to figure out how to not
   break all the architectures, which still insist on benefiting from core
   code improvements by pulling everything and the world into their
   architecture specific hackery.

   It's more than five years now since the generic entry code infrastructure
   was introduced, for the very reason of lowering the burden for core
   code developers and maintainers and sharing the common functionality
   across the architecture zoo.

   Aside from the initial x86 move, which started this effort, there are
   only three architectures which actually made the effort to utilize this.
   Two of them were new ones, which were asked to use it right away.

   The only pre-existing one which converted over since then is S390, and
   I'm truly grateful that they significantly improved the generic
   infrastructure in that process.

   On ARM[64] there are at least serious efforts underway to move their
   code over.

   Does everybody else think that core code improvements come for free and
   the architecture specific hackery does not put any burden on others?

   Here is the hall of fame as far as RSEQ goes:

   	arch/mips/Kconfig:      select HAVE_RSEQ
	arch/openrisc/Kconfig:  select HAVE_RSEQ
	arch/powerpc/Kconfig:   select HAVE_RSEQ

   Two of them are barely maintained and shouldn't have RSEQ in the first
   place....

   I was very forthcoming in the past to accommodate that and went out of
   my way to enable stuff for everyone, but I'm drawing a line now.

   All extra improvements which are enabled by #1/#3 depend hard on the
   generic infrastructure.

   I know that it's quite some effort to move an architecture over, but
   it's a one time effort and investment into the future. This 'my
   architecture is special for no reason' mindset is not sustainable and
   just pushes the burden on others. There is zero justification for this.

   Not converging on common infrastructure is not only a burden for the
   core people, it's also a missed opportunity for the architectures to
   lower their own burden of chasing core improvements and implementing
   them each with a different set of bugs.

   This is not the first time this happens. There are enough other examples
   where it took ages to consolidate on common code. This just accumulates
   technical debt and needless complexity, which everyone suffers from.

   I have happily converted the four architectures which use the generic
   entry code over to a shared generic TIF bit header, so that adding the
   TIF_RSEQ bit becomes a two-line change and all four get the benefit
   immediately. That was more consistent than just adding the bits for
   each of them and it makes further maintenance of core infrastructure
   simpler for all sides. See?


That said, as with the first version, these patches have a pile of dependencies:

The series depends on the separately posted rseq bugfix:

   https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/

and the uaccess generic helper series:

   https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/

and a related futex fix in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent

The combination of all of them and some other related fixes (rseq
selftests) are available here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

The diffstat looks large, but a lot of that is due to extensive comments
and the extra hackery to accommodate random architecture code.

I did not yet get around to testing this on anything other than x86. Help
with that would be truly appreciated.

Thanks,

	tglx

  "Additional problems are the offspring of poor solutions." - Mark Twain
	
---
 Documentation/admin-guide/kernel-parameters.txt |    4 
 arch/Kconfig                                    |    4 
 arch/loongarch/Kconfig                          |    1 
 arch/loongarch/include/asm/thread_info.h        |   76 +-
 arch/riscv/Kconfig                              |    1 
 arch/riscv/include/asm/thread_info.h            |   29 -
 arch/s390/Kconfig                               |    1 
 arch/s390/include/asm/thread_info.h             |   44 -
 arch/x86/Kconfig                                |    1 
 arch/x86/entry/syscall_32.c                     |    3 
 arch/x86/include/asm/thread_info.h              |   74 +-
 b/include/asm-generic/thread_info_tif.h         |   51 ++
 b/include/linux/rseq_entry.h                    |  601 +++++++++++++++++++++++
 b/include/linux/rseq_types.h                    |   72 ++
 drivers/hv/mshv_root_main.c                     |    2 
 fs/binfmt_elf.c                                 |    2 
 fs/exec.c                                       |    2 
 include/linux/entry-common.h                    |   38 -
 include/linux/irq-entry-common.h                |   68 ++
 include/linux/mm.h                              |   25 
 include/linux/resume_user_mode.h                |    2 
 include/linux/rseq.h                            |  216 +++++---
 include/linux/sched.h                           |   50 +
 include/linux/thread_info.h                     |    5 
 include/trace/events/rseq.h                     |    4 
 include/uapi/linux/rseq.h                       |   21 
 init/Kconfig                                    |   28 +
 kernel/entry/common.c                           |   37 -
 kernel/entry/syscall-common.c                   |    8 
 kernel/rseq.c                                   |  610 ++++++++++--------------
 kernel/sched/core.c                             |   10 
 kernel/sched/membarrier.c                       |    8 
 kernel/sched/sched.h                            |    5 
 virt/kvm/kvm_main.c                             |    3 
 34 files changed, 1406 insertions(+), 700 deletions(-)

^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 15:39   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 02/37] rseq: Condense the inline stubs Thomas Gleixner
                   ` (36 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

From: Thomas Gleixner <tglx@linutronix.de>

The RSEQ critical section mechanism only clears the event mask when a
critical section is registered in user space; otherwise the mask stays
stale and collects bits.

That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.

This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite a lot of infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.

Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.

Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.

This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.

If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.

Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.

While at it, reword the convoluted comment explaining why the pt_regs
pointer can be NULL under certain circumstances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>

---
 include/linux/irq-entry-common.h |    7 ++--
 include/linux/rseq.h             |   10 +++++
 kernel/rseq.c                    |   66 ++++++++++++++++++++++++++-------------
 3 files changed, 58 insertions(+), 25 deletions(-)
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
 #ifndef __LINUX_IRQENTRYCOMMON_H
 #define __LINUX_IRQENTRYCOMMON_H
 
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
 #include <linux/static_call_types.h>
 #include <linux/syscalls.h>
-#include <linux/context_tracking.h>
 #include <linux/tick.h>
-#include <linux/kmsan.h>
 #include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
+	rseq_exit_to_user_mode();
+
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
 	rseq_set_notify_resume(t);
 }
 
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+			current->rseq_event_mask = 0;
+	}
+}
+
 /*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
 static inline void rseq_execve(struct task_struct *t)
 {
 }
-
+static inline void rseq_exit_to_user_mode(void) { }
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
 	return true;
 }
 
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
 {
-	u32 flags, event_mask;
+	u32 flags;
 	int ret;
 
 	if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
 
 	if (rseq_warn_flags("rseq", flags))
 		return -EINVAL;
-
-	/*
-	 * Load and clear event mask atomically with respect to
-	 * scheduler preemption and membarrier IPIs.
-	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event_mask = t->rseq_event_mask;
-		t->rseq_event_mask = 0;
-	}
-
-	return !!event_mask;
+	return 0;
 }
 
 static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
 	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
 }
 
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
 	 */
 	if (!in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t->rseq);
-	ret = rseq_need_restart(t, rseq_cs.flags);
-	if (ret <= 0)
+	ret = rseq_check_flags(t, rseq_cs.flags);
+	if (ret < 0)
 		return ret;
+	if (!abort)
+		return 0;
 	ret = clear_rseq_cs(t->rseq);
 	if (ret)
 		return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 	/*
-	 * regs is NULL if and only if the caller is in a syscall path.  Skip
-	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
-	 * kill a misbehaving userspace on debug kernels.
+	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
+	 * pointer, so fixup cannot be done. If the syscall which led to
+	 * this invocation was invoked inside a critical section, then it
+	 * will either end up in this code again or a possible violation of
+	 * a syscall inside a critical region can only be detected by the
+	 * debug code in rseq_syscall() in a debug enabled kernel.
 	 */
 	if (regs) {
-		ret = rseq_ip_fixup(regs);
-		if (unlikely(ret < 0))
-			goto error;
+		/*
+		 * Read and clear the event mask first. If the task was not
+		 * preempted or migrated or a signal is on the way, there
+		 * is no point in doing any of the heavy lifting here on
+		 * production kernels. In that case TIF_NOTIFY_RESUME was
+		 * raised by some other functionality.
+		 *
+		 * This is correct because the read/clear operation is
+		 * guarded against scheduler preemption, which makes it CPU
+		 * local atomic. If the task is preempted right after
+		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+		 * again and this function is invoked another time _before_
+		 * the task is able to return to user mode.
+		 *
+		 * On a debug kernel, invoke the fixup code unconditionally
+		 * with the result handed in to allow the detection of
+		 * inconsistencies.
+		 */
+		u32 event_mask;
+
+		scoped_guard(RSEQ_EVENT_GUARD) {
+			event_mask = t->rseq_event_mask;
+			t->rseq_event_mask = 0;
+		}
+
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+			ret = rseq_ip_fixup(regs, !!event_mask);
+			if (unlikely(ret < 0))
+				goto error;
+		}
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 02/37] rseq: Condense the inline stubs
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 15:40   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 03/37] rseq: Move algorithm comment to top Thomas Gleixner
                   ` (35 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

From: Thomas Gleixner <tglx@linutronix.de>

Scrolling over tons of pointless

{
}

lines to find the actual code is annoying at best.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>

---
 include/linux/rseq.h |   47 ++++++++++++-----------------------------------
 1 file changed, 12 insertions(+), 35 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
 	t->rseq_event_mask = 0;
 }
 
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif  /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
-
 void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 03/37] rseq: Move algorithm comment to top
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 02/37] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 15:41   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
                   ` (34 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rseq.c |  119 ++++++++++++++++++++++++++++------------------------------
 1 file changed, 59 insertions(+), 60 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
  * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  */
 
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ *                     init(rseq_cs)
+ *                     cpu = TLS->rseq::cpu_id_start
+ *   [1]               TLS->rseq::rseq_cs = rseq_cs
+ *   [start_ip]        ----------------------------
+ *   [2]               if (cpu != TLS->rseq::cpu_id)
+ *                             goto abort_ip;
+ *   [3]               <last_instruction_in_cs>
+ *   [post_commit_ip]  ----------------------------
+ *
+ *   The address of jump target abort_ip must be outside the critical
+ *   region, i.e.:
+ *
+ *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
+ *
+ *   Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ *   userspace that can handle being interrupted between any of those
+ *   instructions, and then resumed to the abort_ip.
+ *
+ *   1.  Userspace stores the address of the struct rseq_cs assembly
+ *       block descriptor into the rseq_cs field of the registered
+ *       struct rseq TLS area. This update is performed through a single
+ *       store within the inline assembly instruction sequence.
+ *       [start_ip]
+ *
+ *   2.  Userspace tests to check whether the current cpu_id field match
+ *       the cpu number loaded before start_ip, branching to abort_ip
+ *       in case of a mismatch.
+ *
+ *       If the sequence is preempted or interrupted by a signal
+ *       at or after start_ip and before post_commit_ip, then the kernel
+ *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ *       ip to abort_ip before returning to user-space, so the preempted
+ *       execution resumes at abort_ip.
+ *
+ *   3.  Userspace critical section final instruction before
+ *       post_commit_ip is the commit. The critical section is
+ *       self-terminating.
+ *       [post_commit_ip]
+ *
+ *   4.  <success>
+ *
+ *   On failure at [2], or if interrupted by preempt or signal delivery
+ *   between [1] and [3]:
+ *
+ *       [abort_ip]
+ *   F1. <failure>
+ */
+
 #include <linux/sched.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struc
 	unsafe_put_user(value, &t->rseq->field, error_label)
 #endif
 
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- *                     init(rseq_cs)
- *                     cpu = TLS->rseq::cpu_id_start
- *   [1]               TLS->rseq::rseq_cs = rseq_cs
- *   [start_ip]        ----------------------------
- *   [2]               if (cpu != TLS->rseq::cpu_id)
- *                             goto abort_ip;
- *   [3]               <last_instruction_in_cs>
- *   [post_commit_ip]  ----------------------------
- *
- *   The address of jump target abort_ip must be outside the critical
- *   region, i.e.:
- *
- *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
- *
- *   Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- *   userspace that can handle being interrupted between any of those
- *   instructions, and then resumed to the abort_ip.
- *
- *   1.  Userspace stores the address of the struct rseq_cs assembly
- *       block descriptor into the rseq_cs field of the registered
- *       struct rseq TLS area. This update is performed through a single
- *       store within the inline assembly instruction sequence.
- *       [start_ip]
- *
- *   2.  Userspace tests to check whether the current cpu_id field match
- *       the cpu number loaded before start_ip, branching to abort_ip
- *       in case of a mismatch.
- *
- *       If the sequence is preempted or interrupted by a signal
- *       at or after start_ip and before post_commit_ip, then the kernel
- *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- *       ip to abort_ip before returning to user-space, so the preempted
- *       execution resumes at abort_ip.
- *
- *   3.  Userspace critical section final instruction before
- *       post_commit_ip is the commit. The critical section is
- *       self-terminating.
- *       [post_commit_ip]
- *
- *   4.  <success>
- *
- *   On failure at [2], or if interrupted by preempt or signal delivery
- *   between [1] and [3]:
- *
- *       [abort_ip]
- *   F1. <failure>
- */
-
 static int rseq_update_cpu_node_id(struct task_struct *t)
 {
 	struct rseq __user *rseq = t->rseq;


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 03/37] rseq: Move algorithm comment to top Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 15:43   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 05/37] rseq: Simplify registration Thomas Gleixner
                   ` (33 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

There is no point in this being visible in the resume_user_mode_work()
handling.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |   13 +++++++------
 2 files changed, 8 insertions(+), 7 deletions(-)

--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(NULL, regs);
+	rseq_handle_notify_resume(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resum
 
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq)
-		__rseq_handle_notify_resume(ksig, regs);
+		__rseq_handle_notify_resume(NULL, regs);
 }
 
 static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
-	scoped_guard(RSEQ_EVENT_GUARD)
-		__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
-	rseq_handle_notify_resume(ksig, regs);
+	if (current->rseq) {
+		scoped_guard(RSEQ_EVENT_GUARD)
+			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+		__rseq_handle_notify_resume(ksig, regs);
+	}
 }
 
 /* rseq_preempt() requires preemption to be disabled. */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 05/37] rseq: Simplify registration
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 15:44   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 06/37] rseq: Simplify the event notification Thomas Gleixner
                   ` (32 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

There is no point in reading the critical section element of the newly
registered user space RSEQ struct first in order to clear it.

Just clear it and be done with it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rseq.c |   10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
-		int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
 	int ret;
-	u64 rseq_cs;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * avoid a potential segfault on return to user-space. The proper thing
 	 * to do would have been to fail the registration but this would break
 	 * older libcs that reuse the rseq area for new threads without
-	 * clearing the fields.
+	 * clearing the fields. Don't bother reading it, just reset it.
 	 */
-	if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
-	        return -EFAULT;
-	if (rseq_cs && clear_rseq_cs(rseq))
+	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
 #ifdef CONFIG_DEBUG_RSEQ


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 06/37] rseq: Simplify the event notification
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 05/37] rseq: Simplify registration Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 17:36   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
                   ` (31 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and setting them
individually is just extra work.

Aside from that, the only relevant point where an event has to be raised is
a context switch. Neither the CPU nor the MM CID can change without going
through a context switch.

Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
V2: Reduce it to the sched switch event.
---
 fs/exec.c                 |    2 -
 include/linux/rseq.h      |   66 +++++++++-------------------------------------
 include/linux/sched.h     |   10 +++---
 include/uapi/linux/rseq.h |   21 ++++----------
 kernel/rseq.c             |   28 +++++++++++--------
 kernel/sched/core.c       |    5 ---
 kernel/sched/membarrier.c |    8 ++---
 7 files changed, 48 insertions(+), 92 deletions(-)
---
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 	current->in_execve = 0;
 
 	return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
 #define _LINUX_RSEQ_H
 
 #ifdef CONFIG_RSEQ
-
-#include <linux/preempt.h>
 #include <linux/sched.h>
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
-	RSEQ_EVENT_PREEMPT_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
-	RSEQ_EVENT_SIGNAL_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
-	RSEQ_EVENT_MIGRATE_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
-	RSEQ_EVENT_PREEMPT	= (1U << RSEQ_EVENT_PREEMPT_BIT),
-	RSEQ_EVENT_SIGNAL	= (1U << RSEQ_EVENT_SIGNAL_BIT),
-	RSEQ_EVENT_MIGRATE	= (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-	if (t->rseq)
-		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_re
 		__rseq_handle_notify_resume(NULL, regs);
 }
 
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
 	if (current->rseq) {
-		scoped_guard(RSEQ_EVENT_GUARD)
-			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
+		current->rseq_event_pending = true;
 		__rseq_handle_notify_resume(ksig, regs);
 	}
 }
 
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
 {
-	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
-	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		t->rseq_event_pending = true;
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	}
 }
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
-			current->rseq_event_mask = 0;
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+			current->rseq_event_pending = false;
 	}
 }
 
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_mask = 0;
+		t->rseq_event_pending = false;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_event_pending = current->rseq_event_pending;
 	}
 }
 
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_mask = 0;
+	t->rseq_event_pending = false;
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,14 +1401,14 @@ struct task_struct {
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
-	struct rseq __user *rseq;
-	u32 rseq_len;
-	u32 rseq_sig;
+	struct rseq __user		*rseq;
+	u32				rseq_len;
+	u32				rseq_sig;
 	/*
-	 * RmW on rseq_event_mask must be performed atomically
+	 * RmW on rseq_event_pending must be performed atomically
 	 * with respect to preemption.
 	 */
-	unsigned long rseq_event_mask;
+	bool				rseq_event_pending;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
 	/*
 	 * Restartable sequences flags field.
 	 *
-	 * This field should only be updated by the thread which
-	 * registered this data structure. Read by the kernel.
-	 * Mainly used for single-stepping through rseq critical sections
-	 * with debuggers.
-	 *
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
-	 *     Inhibit instruction sequence block restart on preemption
-	 *     for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
-	 *     Inhibit instruction sequence block restart on signal
-	 *     delivery for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
-	 *     Inhibit instruction sequence block restart on migration for
-	 *     this thread.
+	 * This field was initially intended to allow event masking for
+	 * single-stepping through rseq critical sections with debuggers.
+	 * The kernel does not support this anymore and the relevant bits
+	 * are checked for being always false:
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 */
 	__u32 flags;
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD	irq
+#else
+# define RSEQ_EVENT_GUARD	preempt
+#endif
+
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct
 	 */
 	if (regs) {
 		/*
-		 * Read and clear the event mask first. If the task was not
-		 * preempted or migrated or a signal is on the way, there
-		 * is no point in doing any of the heavy lifting here on
-		 * production kernels. In that case TIF_NOTIFY_RESUME was
-		 * raised by some other functionality.
+		 * Read and clear the event pending bit first. If the task
+		 * was not preempted or migrated or a signal is on the way,
+		 * there is no point in doing any of the heavy lifting here
+		 * on production kernels. In that case TIF_NOTIFY_RESUME
+		 * was raised by some other functionality.
 		 *
 		 * This is correct because the read/clear operation is
 		 * guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct
 		 * with the result handed in to allow the detection of
 		 * inconsistencies.
 		 */
-		u32 event_mask;
+		bool event;
 
 		scoped_guard(RSEQ_EVENT_GUARD) {
-			event_mask = t->rseq_event_mask;
-			t->rseq_event_mask = 0;
+			event = t->rseq_event_pending;
+			t->rseq_event_pending = false;
 		}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
-			ret = rseq_ip_fixup(regs, !!event_mask);
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+			ret = rseq_ip_fixup(regs, event);
 			if (unlikely(ret < 0))
 				goto error;
 		}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
 	 */
-	rseq_set_notify_resume(current);
+	rseq_sched_switch_event(current);
 
 	return 0;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p,
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
-		rseq_migrate(p);
 		sched_mm_cid_migrate_from(p);
 		perf_event_task_migrate(p);
 	}
@@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct
 		p->sched_task_group = tg;
 	}
 #endif
-	rseq_migrate(p);
 	/*
 	 * We're setting the CPU for the first time, we don't migrate,
 	 * so use __set_task_cpu().
@@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	rseq_migrate(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -5153,7 +5150,7 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_preempt(prev);
+	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_preempt(current);
+	rseq_sched_switch_event(current);
 }
 
 static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(
 		 * membarrier, we will end up with some thread in the mm
 		 * running without a core sync.
 		 *
-		 * For RSEQ, don't rseq_preempt() the caller.  User code
-		 * is not supposed to issue syscalls at all from inside an
-		 * rseq critical section.
+		 * For RSEQ, don't invoke rseq_sched_switch_event() on the
+		 * caller.  User code is not supposed to issue syscalls at
+		 * all from inside an rseq critical section.
 		 */
 		if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
 			preempt_disable();


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 06/37] rseq: Simplify the event notification Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 17:54   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
                   ` (30 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that a possibly pending rseq event is not lost on the
way to user space.

This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.

It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().

This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.

Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
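
For illustration, the resulting pattern in a hypervisor vcpu ioctl path looks
roughly like the sketch below (condensed and not part of the patch; the
wrapper function name is made up, kvm_arch_vcpu_ioctl_run() and
rseq_virt_userspace_exit() are the calls used in the diff):

static long vcpu_run_sketch(struct kvm_vcpu *vcpu)
{
	long r;

	/*
	 * resume_user_mode_work(NULL) invocations inside the run loop have
	 * cleared TIF_NOTIFY_RESUME without touching the user space rseq
	 * area.
	 */
	r = kvm_arch_vcpu_ioctl_run(vcpu);

	/*
	 * Re-raise TIF_NOTIFY_RESUME if an rseq event is still pending so
	 * that the IDs are written out on the final exit to user space.
	 */
	rseq_virt_userspace_exit();

	return r;
}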

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Dexuan Cui <decui@microsoft.com>
---
 drivers/hv/mshv_root_main.c |    2 +
 include/linux/rseq.h        |   17 +++++++++
 kernel/rseq.c               |   76 +++++++++++++++++++++++---------------------
 virt/kvm/kvm_main.c         |    3 +
 4 files changed, 62 insertions(+), 36 deletions(-)

--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
 		}
 	} while (!vp->run.flags.intercept_suspend);
 
+	rseq_virt_userspace_exit();
+
 	return ret;
 }
 
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
 }
 
 /*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+	if (current->rseq_event_pending)
+		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
  */
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
 {
 	struct task_struct *t = current;
 	int ret, sig;
+	bool event;
+
+	/*
+	 * If invoked from hypervisors before entering the guest via
+	 * resume_user_mode_work(), then @regs is a NULL pointer.
+	 *
+	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+	 * it before returning from the ioctl() to user space when
+	 * rseq_event.sched_switch is set.
+	 *
+	 * So it's safe to ignore here instead of pointlessly updating it
+	 * in the vcpu_run() loop.
+	 */
+	if (!regs)
+		return;
 
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
 	/*
-	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
-	 * pointer, so fixup cannot be done. If the syscall which led to
-	 * this invocation was invoked inside a critical section, then it
-	 * will either end up in this code again or a possible violation of
-	 * a syscall inside a critical region can only be detected by the
-	 * debug code in rseq_syscall() in a debug enabled kernel.
+	 * Read and clear the event pending bit first. If the task
+	 * was not preempted or migrated or a signal is on the way,
+	 * there is no point in doing any of the heavy lifting here
+	 * on production kernels. In that case TIF_NOTIFY_RESUME
+	 * was raised by some other functionality.
+	 *
+	 * This is correct because the read/clear operation is
+	 * guarded against scheduler preemption, which makes it CPU
+	 * local atomic. If the task is preempted right after
+	 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+	 * again and this function is invoked another time _before_
+	 * the task is able to return to user mode.
+	 *
+	 * On a debug kernel, invoke the fixup code unconditionally
+	 * with the result handed in to allow the detection of
+	 * inconsistencies.
 	 */
-	if (regs) {
-		/*
-		 * Read and clear the event pending bit first. If the task
-		 * was not preempted or migrated or a signal is on the way,
-		 * there is no point in doing any of the heavy lifting here
-		 * on production kernels. In that case TIF_NOTIFY_RESUME
-		 * was raised by some other functionality.
-		 *
-		 * This is correct because the read/clear operation is
-		 * guarded against scheduler preemption, which makes it CPU
-		 * local atomic. If the task is preempted right after
-		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
-		 * again and this function is invoked another time _before_
-		 * the task is able to return to user mode.
-		 *
-		 * On a debug kernel, invoke the fixup code unconditionally
-		 * with the result handed in to allow the detection of
-		 * inconsistencies.
-		 */
-		bool event;
-
-		scoped_guard(RSEQ_EVENT_GUARD) {
-			event = t->rseq_event_pending;
-			t->rseq_event_pending = false;
-		}
+	scoped_guard(RSEQ_EVENT_GUARD) {
+		event = t->rseq_event_pending;
+		t->rseq_event_pending = false;
+	}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
-			ret = rseq_ip_fixup(regs, event);
-			if (unlikely(ret < 0))
-				goto error;
-		}
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+		ret = rseq_ip_fixup(regs, event);
+		if (unlikely(ret < 0))
+			goto error;
 	}
+
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;
 	return;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
 #include <linux/suspend.h>
+#include <linux/rseq.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
 		r = kvm_arch_vcpu_ioctl_run(vcpu);
 		vcpu->wants_to_run = false;
 
+		rseq_virt_userspace_exit();
+
 		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
 		break;
 	}


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (6 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:02   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
                   ` (29 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

There is no need to update these values unconditionally if there is no
event pending.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rseq.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct
 		t->rseq_event_pending = false;
 	}
 
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
-		ret = rseq_ip_fixup(regs, event);
-		if (unlikely(ret < 0))
-			goto error;
-	}
+	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+		return;
+
+	ret = rseq_ip_fixup(regs, event);
+	if (unlikely(ret < 0))
+		goto error;
 
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 09/37] rseq: Introduce struct rseq_event
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (7 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:11   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 10/37] entry: Cleanup header Thomas Gleixner
                   ` (28 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

In preparation for a major rewrite of this code, provide a data structure
for event management.

Put the sched_switch event and an indicator for RSEQ on a task into it as a
start. That uses a union, which allows masking and clearing the whole lot
efficiently.

The indicators are explicitly not a bit field. Bit fields generate abysmal
code.

The boolean members are defined as u8 as that actually guarantees that they
fit. There seem to be strange architecture ABIs which need more than 8 bits
for a boolean.

The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.

This struct will be extended over time to carry more information.
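
As a minimal sketch of what the union layout buys (illustrative only, the
helper names are made up):

/* Clear all pending events with a single 16-bit store */
static inline void rseq_clear_events(struct task_struct *t)
{
	t->rseq_event.events = 0;
}

/* Wipe events and has_rseq at once, e.g. on execve() */
static inline void rseq_reset_event_state(struct task_struct *t)
{
	t->rseq_event.all = 0;
}

/* Individual flags remain plain byte accesses, no bit fiddling */
static inline bool rseq_needs_notify(struct task_struct *t)
{
	return t->rseq_event.sched_switch;
}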

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq.h       |   23 ++++++++++++-----------
 include/linux/rseq_types.h |   30 ++++++++++++++++++++++++++++++
 include/linux/sched.h      |    7 ++-----
 kernel/rseq.c              |    6 ++++--
 4 files changed, 48 insertions(+), 18 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (current->rseq_event.has_rseq)
 		__rseq_handle_notify_resume(NULL, regs);
 }
 
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (current->rseq) {
-		current->rseq_event_pending = true;
+	if (current->rseq_event.has_rseq) {
+		current->rseq_event.sched_switch = true;
 		__rseq_handle_notify_resume(ksig, regs);
 	}
 }
 
 static inline void rseq_sched_switch_event(struct task_struct *t)
 {
-	if (t->rseq) {
-		t->rseq_event_pending = true;
+	if (t->rseq_event.has_rseq) {
+		t->rseq_event.sched_switch = true;
 		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 	}
 }
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_eve
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
-			current->rseq_event_pending = false;
+		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
+				 current->rseq_event.events))
+			current->rseq_event.events = 0;
 	}
 }
 
@@ -49,7 +50,7 @@ static __always_inline void rseq_exit_to
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (current->rseq_event_pending)
+	if (current->rseq_event.sched_switch)
 		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
 }
 
@@ -63,12 +64,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_pending = false;
+		t->rseq_event.all = 0;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_pending = current->rseq_event_pending;
+		t->rseq_event = current->rseq_event;
 	}
 }
 
@@ -77,7 +78,7 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_pending = false;
+	t->rseq_event.all = 0;
 }
 
 #else /* CONFIG_RSEQ */
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include <linux/types.h>
+
+/*
+ * struct rseq_event - Storage for rseq related event management
+ * @all:		Compound to initialize and clear the data efficiently
+ * @events:		Compound to access events with a single load/store
+ * @sched_switch:	True if the task was scheduled out
+ * @has_rseq:		True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+	union {
+		u32				all;
+		struct {
+			union {
+				u16		events;
+				struct {
+					u8	sched_switch;
+				};
+			};
+
+			u8			has_rseq;
+		};
+	};
+};
+
+#endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
 #include <linux/task_io_accounting.h>
 #include <linux/posix-timers_types.h>
 #include <linux/restart_block.h>
+#include <linux/rseq_types.h>
 #include <uapi/linux/rseq.h>
 #include <linux/seqlock_types.h>
 #include <linux/kcsan.h>
@@ -1404,11 +1405,7 @@ struct task_struct {
 	struct rseq __user		*rseq;
 	u32				rseq_len;
 	u32				rseq_sig;
-	/*
-	 * RmW on rseq_event_pending must be performed atomically
-	 * with respect to preemption.
-	 */
-	bool				rseq_event_pending;
+	struct rseq_event		rseq_event;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct
 	 * inconsistencies.
 	 */
 	scoped_guard(RSEQ_EVENT_GUARD) {
-		event = t->rseq_event_pending;
-		t->rseq_event_pending = false;
+		event = t->rseq_event.sched_switch;
+		t->rseq_event.sched_switch = false;
 	}
 
 	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -523,6 +523,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
+		current->rseq_event.all = 0;
 		return 0;
 	}
 
@@ -595,6 +596,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
 	 */
+	current->rseq_event.has_rseq = true;
 	rseq_sched_switch_event(current);
 
 	return 0;


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 10/37] entry: Cleanup header
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (8 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:13   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
                   ` (27 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

From: Thomas Gleixner <tglx@linutronix.de>

Clean up the include ordering, kernel-doc and other trivialities before
making further changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>

---
 include/linux/entry-common.h     |    8 ++++----
 include/linux/irq-entry-common.h |    2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)
---
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
 #define __LINUX_ENTRYCOMMON_H
 
 #include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
 #include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
 #include <linux/seccomp.h>
 #include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -37,6 +37,7 @@
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_ENTER)
+
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
@@ -61,8 +62,7 @@
  */
 void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-			 unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
 
 /**
  * syscall_enter_from_user_mode_work - Check and handle work before invoking
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
 
 /**
  * enter_from_user_mode - Establish state when coming from user mode
+ * @regs:	Pointer to currents pt_regs
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
  * Conditional reschedule with additional sanity checks.
  */
 void raw_irqentry_exit_cond_resched(void);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (9 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 10/37] entry: Cleanup header Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, x86, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Sean Christopherson,
	Wei Liu, Dexuan Cui, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Open code the only user in the x86 syscall code and reduce the zoo of
functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
---
 arch/x86/entry/syscall_32.c   |    3 ++-
 include/linux/entry-common.h  |   26 +++++---------------------
 kernel/entry/syscall-common.c |    8 --------
 3 files changed, 7 insertions(+), 30 deletions(-)

--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32
 	 * fetch EBP before invoking any of the syscall entry work
 	 * functions.
 	 */
-	syscall_enter_from_user_mode_prepare(regs);
+	enter_from_user_mode(regs);
 
 	instrumentation_begin();
+	local_irq_enable();
 	/* Fetch EBP from where the vDSO stashed it. */
 	if (IS_ENABLED(CONFIG_X86_64)) {
 		/*
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -45,23 +45,6 @@
 				 SYSCALL_WORK_SYSCALL_EXIT_TRAP	|	\
 				 ARCH_SYSCALL_WORK_EXIT)
 
-/**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs:	Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
 long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
 
 /**
@@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs
  * @syscall:	The syscall number
  *
  * Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after invoking enter_from_user_mode(), enabling interrupts and
+ * extra architecture specific work.
  *
  * Returns: The original or a modified syscall number
  *
@@ -108,8 +91,9 @@ static __always_inline long syscall_ente
  * function returns all state is correct, interrupts are enabled and the
  * subsequent functions can be instrumented.
  *
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
+ * This is the combination of enter_from_user_mode() and
+ * syscall_enter_from_user_mode_work() to be used when there is no
+ * architecture specific work to be done between the two.
  *
  * Returns: The original or a modified syscall number. See
  * syscall_enter_from_user_mode_work() for further explanation.
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs
 	return ret ? : syscall;
 }
 
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
-	enter_from_user_mode(regs);
-	instrumentation_begin();
-	local_irq_enable();
-	instrumentation_end();
-}
-
 /*
  * If SYSCALL_EMU is set, then the only reason to report is when
  * SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP).  This syscall


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (10 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-23 16:39 ` [patch V2 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

There is no point in having this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq-entry-common.h |   13 +++++++++++--
 kernel/entry/common.c            |   13 -------------
 2 files changed, 11 insertions(+), 15 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -278,7 +278,10 @@ static __always_inline void exit_to_user
  *
  * The function establishes state (lockdep, RCU (context tracking), tracing)
  */
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+	enter_from_user_mode(regs);
+}
 
 /**
  * irqentry_exit_to_user_mode - Interrupt exit work
@@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struc
  * Interrupt exit is not invoking #1 which is the syscall specific one time
  * work.
  */
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+	instrumentation_begin();
+	exit_to_user_mode_prepare(regs);
+	instrumentation_end();
+	exit_to_user_mode();
+}
 
 #ifndef irqentry_state
 /**
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -62,19 +62,6 @@ void __weak arch_do_signal_or_restart(st
 	return ti_work;
 }
 
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
-{
-	enter_from_user_mode(regs);
-}
-
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
-{
-	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
-	instrumentation_end();
-	exit_to_user_mode();
-}
-
 noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
 {
 	irqentry_state_t ret = {


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 13/37] sched: Move MM CID related functions to sched.h
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (11 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:14   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
                   ` (24 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

There is nothing mm specific about these functions, and including mm.h can
cause header recursion hell.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/mm.h    |   25 -------------------------
 include/linux/sched.h |   26 ++++++++++++++++++++++++++
 2 files changed, 26 insertions(+), 25 deletions(-)

--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2310,31 +2310,6 @@ struct zap_details {
 /* Set in unmap_vmas() to indicate a final unmap call.  Only used by hugetlb */
 #define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
 
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
-	return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
-	/*
-	 * Use the processor id as a fall-back when the mm cid feature is
-	 * disabled. This provides functional per-cpu data structure accesses
-	 * in user-space, althrough it won't provide the memory usage benefits.
-	 */
-	return raw_smp_processor_id();
-}
-#endif
-
 #ifdef CONFIG_MMU
 extern bool can_do_mlock(void);
 #else
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2309,4 +2309,30 @@ static __always_inline void alloc_tag_re
 #define alloc_tag_restore(_tag, _old)		do {} while (0)
 #endif
 
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+	return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+	/*
+	 * Use the processor id as a fall-back when the mm cid feature is
+	 * disabled. This provides functional per-cpu data structure accesses
+	 * in user-space, although it won't provide the memory usage benefits.
+	 */
+	return task_cpu(t);
+}
+#endif
+
 #endif


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 14/37] rseq: Cache CPU ID and MM CID values
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (12 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:19   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 15/37] rseq: Record interrupt from user space Thomas Gleixner
                   ` (23 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

In preparation for rewriting the RSEQ exit to user space handling, provide
storage to cache the CPU ID and MM CID values which were last written to
user space. That prepares for a quick check which avoids the update when
nothing has changed.
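
The quick check which this enables looks conceptually like the sketch below.
The helper name is made up and the actual comparison is only added by a
later patch in the series:

static inline bool rseq_ids_changed(struct task_struct *t, u32 cpu_id, u32 mm_cid)
{
	struct rseq_ids cur;

	cur.cpu_id = cpu_id;
	cur.mm_cid = mm_cid;

	/* A single 64-bit compare due to the cpu_cid compound */
	return cur.cpu_cid != t->rseq_ids.cpu_cid;
}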

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq.h        |    3 +++
 include/linux/rseq_types.h  |   19 +++++++++++++++++++
 include/linux/sched.h       |    1 +
 include/trace/events/rseq.h |    4 ++--
 kernel/rseq.c               |    4 ++++
 5 files changed, 29 insertions(+), 2 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -64,11 +64,13 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
+		t->rseq_ids.cpu_cid = ~0ULL;
 		t->rseq_event.all = 0;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
+		t->rseq_ids.cpu_cid = ~0ULL;
 		t->rseq_event = current->rseq_event;
 	}
 }
@@ -78,6 +80,7 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
+	t->rseq_ids.cpu_cid = ~0ULL;
 	t->rseq_event.all = 0;
 }
 
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -27,4 +27,23 @@ struct rseq_event {
 	};
 };
 
+/*
+ * struct rseq_ids - Cache for ids, which need to be updated
+ * @cpu_cid:	Compound of @cpu_id and @mm_cid to make the
+ *		compiler emit a single compare on 64-bit
+ * @cpu_id:	The CPU ID which was written last to user space
+ * @mm_cid:	The MM CID which was written last to user space
+ *
+ * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */
+struct rseq_ids {
+	union {
+		u64		cpu_cid;
+		struct {
+			u32	cpu_id;
+			u32	mm_cid;
+		};
+	};
+};
+
 #endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1406,6 +1406,7 @@ struct task_struct {
 	u32				rseq_len;
 	u32				rseq_sig;
 	struct rseq_event		rseq_event;
+	struct rseq_ids			rseq_ids;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
 	),
 
 	TP_fast_assign(
-		__entry->cpu_id = raw_smp_processor_id();
+		__entry->cpu_id = t->rseq_ids.cpu_id;
 		__entry->node_id = cpu_to_node(__entry->cpu_id);
-		__entry->mm_cid = task_mm_cid(t);
+		__entry->mm_cid = t->rseq_ids.mm_cid;
 	),
 
 	TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struc
 	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
 	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
 
+	/* Cache the user space values */
+	t->rseq_ids.cpu_id = cpu_id;
+	t->rseq_ids.mm_cid = mm_cid;
+
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 15/37] rseq: Record interrupt from user space
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (13 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:29   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
                   ` (22 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

For RSEQ the only relevant reason to inspect and possibly fix up (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.

If the user to kernel entry was from a syscall, no fixup is required. If
user space invokes a syscall from inside a critical section, it can keep
the pieces as documented.

This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
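
A sketch of how the recorded flag is meant to be consumed later in the
series (the helper name is made up):

static __always_inline bool rseq_needs_cs_check(struct task_struct *t)
{
	/*
	 * Only an interrupt/exception entry from user space can have hit a
	 * critical section. A syscall entry never requires a fixup.
	 */
	return t->rseq_event.user_irq && t->rseq_event.sched_switch;
}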

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq-entry-common.h |    3 ++-
 include/linux/rseq.h             |   16 +++++++++++-----
 include/linux/rseq_entry.h       |   18 ++++++++++++++++++
 include/linux/rseq_types.h       |    2 ++
 4 files changed, 33 insertions(+), 6 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
 
 #include <linux/context_tracking.h>
 #include <linux/kmsan.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
 #include <linux/static_call_types.h>
 #include <linux/syscalls.h>
 #include <linux/tick.h>
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user
 static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
 {
 	enter_from_user_mode(regs);
+	rseq_note_user_irq_entry();
 }
 
 /**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_eve
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
-				 current->rseq_event.events))
-			current->rseq_event.events = 0;
-	}
+	struct rseq_event *ev = &current->rseq_event;
+
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+		WARN_ON_ONCE(ev->sched_switch);
+
+	/*
+	 * Ensure that event (especially user_irq) is cleared when the
+	 * interrupt did not result in a schedule and therefore the
+	 * rseq processing did not clear it.
+	 */
+	ev->events = 0;
 }
 
 /*
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include <linux/rseq.h>
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+		current->rseq_event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -9,6 +9,7 @@
  * @all:		Compound to initialize and clear the data efficiently
  * @events:		Compound to access events with a single load/store
  * @sched_switch:	True if the task was scheduled out
+ * @user_irq:		True on interrupt entry from user mode
  * @has_rseq:		True if the task has a rseq pointer installed
  */
 struct rseq_event {
@@ -19,6 +20,7 @@ struct rseq_event {
 				u16		events;
 				struct {
 					u8	sched_switch;
+					u8	user_irq;
 				};
 			};
 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (14 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 15/37] rseq: Record interrupt from user space Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:32   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
                   ` (21 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |   30 ++++++++++++++++++++++++++++++
 kernel/rseq.c              |   17 +++++++++++++++++
 2 files changed, 47 insertions(+)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,36 @@
 #ifdef CONFIG_RSEQ
 #include <linux/rseq.h>
 
+#include <linux/tracepoint-defs.h>
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+			   unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+	if (tracepoint_enabled(rseq_update)) {
+		if (ids)
+			__rseq_trace_update(t);
+	}
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+				       unsigned long offset, unsigned long abort_ip)
+{
+	if (tracepoint_enabled(rseq_ip_fixup))
+		__rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINTS */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+				       unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINTS */
+
 static __always_inline void rseq_note_user_irq_entry(void)
 {
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -91,6 +91,23 @@
 				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
 				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+	trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+			   unsigned long offset, unsigned long abort_ip)
+{
+	trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
 #ifdef CONFIG_DEBUG_RSEQ
 static struct rseq *rseq_kernel_fields(struct task_struct *t)
 {


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 17/37] rseq: Expose lightweight statistics in debugfs
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (15 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:34   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
                   ` (20 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Being able to analyze the call frequency without actually using tracing is
helpful for analyzing this infrastructure. The overhead is minimal as it
just increments a per-CPU counter associated with each operation.

The debugfs readout provides a racy sum of all counters.
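
Purely as an illustration of how the counters can be consumed (not part of
the patch, the helper below is made up), the fraction of user space exits
which actually ran the slow path can be derived like this:

#include <linux/seq_file.h>

static void rseq_stats_ratio(struct seq_file *m, const struct rseq_stats *s)
{
	unsigned long exits = s->exit ? s->exit : 1;

	/* Percentage of exits which took the notify_resume slow path */
	seq_printf(m, "slowp/exit: %lu%%\n", s->slowpath * 100UL / exits);
}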

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq.h       |   16 ---------
 include/linux/rseq_entry.h |   49 +++++++++++++++++++++++++++
 init/Kconfig               |   12 ++++++
 kernel/rseq.c              |   79 +++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 133 insertions(+), 23 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -29,21 +29,6 @@ static inline void rseq_sched_switch_eve
 	}
 }
 
-static __always_inline void rseq_exit_to_user_mode(void)
-{
-	struct rseq_event *ev = &current->rseq_event;
-
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
-		WARN_ON_ONCE(ev->sched_switch);
-
-	/*
-	 * Ensure that event (especially user_irq) is cleared when the
-	 * interrupt did not result in a schedule and therefore the
-	 * rseq processing did not clear it.
-	 */
-	ev->events = 0;
-}
-
 /*
  * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
  * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
@@ -97,7 +82,6 @@ static inline void rseq_sched_switch_eve
 static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
-static inline void rseq_exit_to_user_mode(void) { }
 #endif  /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -2,6 +2,37 @@
 #ifndef _LINUX_RSEQ_ENTRY_H
 #define _LINUX_RSEQ_ENTRY_H
 
+/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
+#ifdef CONFIG_RSEQ_STATS
+#include <linux/percpu.h>
+
+struct rseq_stats {
+	unsigned long	exit;
+	unsigned long	signal;
+	unsigned long	slowpath;
+	unsigned long	ids;
+	unsigned long	cs;
+	unsigned long	clear;
+	unsigned long	fixup;
+};
+
+DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
+
+/*
+ * Slow path has interrupts and preemption enabled, but the fast path
+ * runs with interrupts disabled so there is no point in having the
+ * preemption checks implied in __this_cpu_inc() for every operation.
+ */
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_stat_inc(which)	this_cpu_inc((which))
+#else
+#define rseq_stat_inc(which)	raw_cpu_inc((which))
+#endif
+
+#else /* CONFIG_RSEQ_STATS */
+#define rseq_stat_inc(x)	do { } while (0)
+#endif /* !CONFIG_RSEQ_STATS */
+
 #ifdef CONFIG_RSEQ
 #include <linux/rseq.h>
 
@@ -41,8 +72,26 @@ static __always_inline void rseq_note_us
 		current->rseq_event.user_irq = true;
 }
 
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	struct rseq_event *ev = &current->rseq_event;
+
+	rseq_stat_inc(rseq_stats.exit);
+
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+		WARN_ON_ONCE(ev->sched_switch);
+
+	/*
+	 * Ensure that event (especially user_irq) is cleared when the
+	 * interrupt did not result in a schedule and therefore the
+	 * rseq processing did not clear it.
+	 */
+	ev->events = 0;
+}
+
 #else /* CONFIG_RSEQ */
 static inline void rseq_note_user_irq_entry(void) { }
+static inline void rseq_exit_to_user_mode(void) { }
 #endif /* !CONFIG_RSEQ */
 
 #endif /* _LINUX_RSEQ_ENTRY_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1883,6 +1883,18 @@ config RSEQ
 
 	  If unsure, say Y.
 
+config RSEQ_STATS
+	default n
+	bool "Enable lightweight statistics of restartable sequences" if EXPERT
+	depends on RSEQ && DEBUG_FS
+	help
+	  Enable lightweight counters which expose information about the
+	  frequency of RSEQ operations via debugfs. Mostly interesting for
+	  kernel debugging or performance analysis. While lightweight it's
+	  still adding code into the user/kernel mode transitions.
+
+	  If unsure, say N.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -67,12 +67,16 @@
  *   F1. <failure>
  */
 
+/* Required to select the proper per_cpu ops for rseq_stat_inc() */
+#define RSEQ_BUILD_SLOW_PATH
+
+#include <linux/debugfs.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq_entry.h>
 #include <linux/sched.h>
-#include <linux/uaccess.h>
 #include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/uaccess.h>
 #include <linux/types.h>
-#include <linux/ratelimit.h>
 #include <asm/ptrace.h>
 
 #define CREATE_TRACE_POINTS
@@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long
 }
 #endif /* CONFIG_TRACEPOINTS */
 
+#ifdef CONFIG_RSEQ_STATS
+DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+	struct rseq_stats stats = { };
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
+		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
+		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
+		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
+		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
+		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
+		stats.fixup	+= data_race(per_cpu(rseq_stats.fixup, cpu));
+	}
+
+	seq_printf(m, "exit:   %16lu\n", stats.exit);
+	seq_printf(m, "signal: %16lu\n", stats.signal);
+	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
+	seq_printf(m, "ids:    %16lu\n", stats.ids);
+	seq_printf(m, "cs:     %16lu\n", stats.cs);
+	seq_printf(m, "clear:  %16lu\n", stats.clear);
+	seq_printf(m, "fixup:  %16lu\n", stats.fixup);
+	return 0;
+}
+
+static int rseq_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, rseq_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+	.open		= rseq_debug_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init rseq_debugfs_init(void)
+{
+	struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
+
+	debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+	return 0;
+}
+__initcall(rseq_debugfs_init);
+#endif /* CONFIG_RSEQ_STATS */
+
 #ifdef CONFIG_DEBUG_RSEQ
 static struct rseq *rseq_kernel_fields(struct task_struct *t)
 {
@@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struc
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
 
-	/*
-	 * Validate read-only rseq fields.
-	 */
+	rseq_stat_inc(rseq_stats.ids);
+
+	/* Validate read-only rseq fields on debug kernels */
 	if (rseq_validate_ro_fields(t))
 		goto efault;
 	WARN_ON_ONCE((int) mm_cid < 0);
+
 	if (!user_write_access_begin(rseq, t->rseq_len))
 		goto efault;
 
@@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs
 	struct rseq_cs rseq_cs;
 	int ret;
 
+	rseq_stat_inc(rseq_stats.cs);
+
 	ret = rseq_get_rseq_cs(t, &rseq_cs);
 	if (ret)
 		return ret;
@@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs
 	 * If not nested over a rseq critical section, restart is useless.
 	 * Clear the rseq_cs pointer and return.
 	 */
-	if (!in_rseq_cs(ip, &rseq_cs))
+	if (!in_rseq_cs(ip, &rseq_cs)) {
+		rseq_stat_inc(rseq_stats.clear);
 		return clear_rseq_cs(t->rseq);
+	}
 	ret = rseq_check_flags(t, rseq_cs.flags);
 	if (ret < 0)
 		return ret;
@@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs
 	ret = clear_rseq_cs(t->rseq);
 	if (ret)
 		return ret;
+	rseq_stat_inc(rseq_stats.fixup);
 	trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
 			    rseq_cs.abort_ip);
 	instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
@@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
+	if (ksig)
+		rseq_stat_inc(rseq_stats.signal);
+	else
+		rseq_stat_inc(rseq_stats.slowpath);
+
 	/*
 	 * Read and clear the event pending bit first. If the task
 	 * was not preempted or migrated or a signal is on the way,


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 18/37] rseq: Provide static branch for runtime debugging
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (16 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 18:36   ` Mathieu Desnoyers
  2025-08-25 20:30   ` Michael Jeanson
  2025-08-23 16:39 ` [patch V2 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
                   ` (19 subsequent siblings)
  37 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Config-based debug is rarely turned on and is not easily available when
things go wrong.

Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
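
The intended consumer is the usual static key gate, roughly as sketched
below. rseq_debug_update_user_cs() only appears in the next patch and
rseq_fast_update_user_cs() is a made up placeholder; the real call sites
come with the subsequent patches:

static __always_inline bool rseq_check_user_cs(struct task_struct *t,
					       struct pt_regs *regs,
					       unsigned long csaddr)
{
	/* Full validation only when debug mode is switched on */
	if (static_branch_unlikely(&rseq_debug_enabled))
		return rseq_debug_update_user_cs(t, regs, csaddr);

	return rseq_fast_update_user_cs(t, regs, csaddr);
}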

Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/kernel-parameters.txt |    4 +
 include/linux/rseq_entry.h                      |    3 
 init/Kconfig                                    |   14 ++++
 kernel/rseq.c                                   |   73 ++++++++++++++++++++++--
 4 files changed, 90 insertions(+), 4 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6443,6 +6443,10 @@
 			Memory area to be used by remote processor image,
 			managed by CMA.
 
+	rseq_debug=	[KNL] Enable or disable restartable sequence
+			debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+			Format: <bool>
+
 	rt_group_sched=	[KNL] Enable or disable SCHED_RR/FIFO group scheduling
 			when CONFIG_RT_GROUP_SCHED=y. Defaults to
 			!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #endif /* !CONFIG_RSEQ_STATS */
 
 #ifdef CONFIG_RSEQ
+#include <linux/jump_label.h>
 #include <linux/rseq.h>
 
 #include <linux/tracepoint-defs.h>
@@ -66,6 +67,8 @@ static inline void rseq_trace_ip_fixup(u
 				       unsigned long offset, unsigned long abort_ip) { }
 #endif /* !CONFIG_TRACEPOINT */
 
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
 static __always_inline void rseq_note_user_irq_entry(void)
 {
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1893,10 +1893,24 @@ config RSEQ_STATS
 
 	  If unsure, say N.
 
+config RSEQ_DEBUG_DEFAULT_ENABLE
+	default n
+	bool "Enable restartable sequences debug mode by default" if EXPERT
+	depends on RSEQ
+	help
+	  This enables the static branch for debug mode of restartable
+	  sequences.
+
+	  This also can be controlled on the kernel command line via the
+	  command line parameter "rseq_debug=0/1" and through debugfs.
+
+	  If unsure, say N.
+
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
 	depends on RSEQ && DEBUG_KERNEL
+	select RSEQ_DEBUG_DEFAULT_ENABLE
 	help
 	  Enable extra debugging checks for the rseq system call.
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -95,6 +95,27 @@
 				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
 				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
 
+DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
+static inline void rseq_control_debug(bool on)
+{
+	if (on)
+		static_branch_enable(&rseq_debug_enabled);
+	else
+		static_branch_disable(&rseq_debug_enabled);
+}
+
+static int __init rseq_setup_debug(char *str)
+{
+	bool on;
+
+	if (kstrtobool(str, &on))
+		return -EINVAL;
+	rseq_control_debug(on);
+	return 0;
+}
+__setup("rseq_debug=", rseq_setup_debug);
+
 #ifdef CONFIG_TRACEPOINTS
 /*
  * Out of line, so the actual update functions can be in a header to be
@@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long
 }
 #endif /* CONFIG_TRACEPOINTS */
 
+#ifdef CONFIG_DEBUG_FS
 #ifdef CONFIG_RSEQ_STATS
 DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
 
-static int rseq_debug_show(struct seq_file *m, void *p)
+static int rseq_stats_show(struct seq_file *m, void *p)
 {
 	struct rseq_stats stats = { };
 	unsigned int cpu;
@@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi
 	return 0;
 }
 
+static int rseq_stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, rseq_stats_show, inode->i_private);
+}
+
+static const struct file_operations stat_ops = {
+	.open		= rseq_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+static int __init rseq_stats_init(struct dentry *root_dir)
+{
+	debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
+	return 0;
+}
+#else
+static inline void rseq_stats_init(struct dentry *root_dir) { }
+#endif /* CONFIG_RSEQ_STATS */
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+	bool on = static_branch_unlikely(&rseq_debug_enabled);
+
+	seq_printf(m, "%d\n", on);
+	return 0;
+}
+
+static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
+			    size_t count, loff_t *ppos)
+{
+	bool on;
+
+	if (kstrtobool_from_user(ubuf, count, &on))
+		return -EINVAL;
+
+	rseq_control_debug(on);
+	return count;
+}
+
 static int rseq_debug_open(struct inode *inode, struct file *file)
 {
 	return single_open(file, rseq_debug_show, inode->i_private);
 }
 
-static const struct file_operations dfs_ops = {
+static const struct file_operations debug_ops = {
 	.open		= rseq_debug_open,
 	.read		= seq_read,
+	.write		= rseq_debug_write,
 	.llseek		= seq_lseek,
 	.release	= single_release,
 };
@@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void
 {
 	struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
 
-	debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+	debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
+	rseq_stats_init(root_dir);
 	return 0;
 }
 __initcall(rseq_debugfs_init);
-#endif /* CONFIG_RSEQ_STATS */
+#endif /* CONFIG_DEBUG_FS */
 
 #ifdef CONFIG_DEBUG_RSEQ
 static struct rseq *rseq_kernel_fields(struct task_struct *t)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 19/37] rseq: Provide and use rseq_update_user_cs()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (17 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-25 19:16   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 20/37] rseq: Replace the debug crud Thomas Gleixner
                   ` (18 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Provide a straightforward implementation to check for and, where required,
clear or fix up critical sections in user space.

The non-debug version does not do any sanity checks and aims for efficiency.

The only attack vector which needs to be reliably prevented is an abort IP
that points into the kernel address space. That would cause at least x86 to
return to kernel space via IRET. Instead of a check, just mask the address
and be done with it.

The magic signature check along with its obscure "possible attack" printk
is just voodoo security. If an attacker manages to manipulate the abort_ip
member in the critical section descriptor, then it can equally manipulate
any other indirection in the application. If user space truly cares about
the security of the critical section descriptors, then it should set them
up once and map the descriptor memory read only. There is no justification
for voodoo security in the kernel fast path to encourage user space to be
careless under a completely nonsensical "security" claim.
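
What the suggested hardening looks like from user space, as an illustrative
sketch only (the helper is made up, struct rseq_cs comes from the rseq UAPI
header):

#include <string.h>
#include <sys/mman.h>
#include <linux/rseq.h>

/* Set up the descriptor once in its own page and seal it read only */
static struct rseq_cs *setup_sealed_descriptor(const struct rseq_cs *tmpl)
{
	struct rseq_cs *cs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (cs == MAP_FAILED)
		return NULL;
	memcpy(cs, tmpl, sizeof(*cs));
	mprotect(cs, 4096, PROT_READ);
	return cs;
}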

If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernel's problem.

The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks (except the silly attack message) including
termination of the task when one of the gazillion conditions is not met.

Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remnants into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
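
For an exit path caller running with interrupts disabled, the calling
convention documented in the patch below boils down to the following
pattern (sketch only; the wrapper name is made up, the signature of
rseq_update_user_cs() is assumed from its debug counterpart and the fatal
flag is only added by a later change):

static bool rseq_exit_fixup(struct task_struct *t, struct pt_regs *regs,
			    unsigned long csaddr)
{
	bool success;

	/* Interrupts are disabled, so faults cannot be resolved here */
	pagefault_disable();
	success = rseq_update_user_cs(t, regs, csaddr);
	pagefault_enable();

	if (success)
		return true;

	if (t->rseq_event.fatal) {
		/* Invalid critical section: terminate the task */
		force_sig(SIGSEGV);
		return false;
	}

	/* Unresolved page fault: retry from preemptible task context */
	return false;
}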

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |  194 ++++++++++++++++++++++++++++++++++++
 include/linux/rseq_types.h |   11 +-
 kernel/rseq.c              |  238 +++++++++++++--------------------------------
 3 files changed, 273 insertions(+), 170 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #ifdef CONFIG_RSEQ
 #include <linux/jump_label.h>
 #include <linux/rseq.h>
+#include <linux/uaccess.h>
 
 #include <linux/tracepoint-defs.h>
 
@@ -69,12 +70,205 @@ static inline void rseq_trace_ip_fixup(u
 
 DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
 
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
 static __always_inline void rseq_note_user_irq_entry(void)
 {
 	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
 		current->rseq_event.user_irq = true;
 }
 
+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ *  - If the critical section is invalid, terminate the task.
+ *
+ *  - If valid and the instruction pointer is inside, set it to the abort IP
+ *
+ *  - If valid and the instruction pointer is outside, clear the critical
+ *    section address.
+ *
+ * Returns true, if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when an invalid
+ * section was found. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+	u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+	unsigned long ip = instruction_pointer(regs);
+	u64 __user *uc_head = (u64 __user *) ucs;
+	u32 usig, __user *uc_sig;
+
+	if (!user_rw_masked_begin(ucs))
+		return false;
+
+	/*
+	 * Evaluate the user pile and exit if one of the conditions is not
+	 * fulfilled.
+	 */
+	unsafe_get_user(start_ip, &ucs->start_ip, fail);
+	if (unlikely(start_ip >= tasksize))
+		goto die;
+	/* If outside, just clear the critical section. */
+	if (ip < start_ip)
+		goto clear;
+
+	unsafe_get_user(offset, &ucs->post_commit_offset, fail);
+	cs_end = start_ip + offset;
+	/* Check for overflow and wraparound */
+	if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+		goto die;
+
+	/* If not inside, clear it. */
+	if (ip >= cs_end)
+		goto clear;
+
+	unsafe_get_user(abort_ip, &ucs->abort_ip, fail);
+	/* Ensure it's "valid" */
+	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+		goto die;
+	/* Validate that the abort IP is not in the critical section */
+	if (unlikely(abort_ip - start_ip < offset))
+		goto die;
+
+	/*
+	 * Check version and flags for 0. No point in emitting deprecated
+	 * warnings before dying. That could be done in the slow path
+	 * eventually, but *shrug*.
+	 */
+	unsafe_get_user(head, uc_head, fail);
+	if (unlikely(head))
+		goto die;
+
+	/* abort_ip - 4 is >= 0. See abort_ip check above */
+	uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+	unsafe_get_user(usig, uc_sig, fail);
+	if (unlikely(usig != t->rseq_sig))
+		goto die;
+
+	/* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/* If not in interrupt from user context, let it die */
+		if (unlikely(!t->rseq_event.user_irq))
+			goto die;
+	}
+
+	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
+	user_access_end();
+
+	instruction_pointer_set(regs, (unsigned long)abort_ip);
+
+	rseq_stat_inc(rseq_stats.fixup);
+	rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+	return true;
+clear:
+	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
+	user_access_end();
+	rseq_stat_inc(rseq_stats.clear);
+	return true;
+die:
+	t->rseq_event.fatal = true;
+fail:
+	user_access_end();
+	return false;
+}
+#endif /* RSEQ_BUILD_SLOW_PATH */
+
+/*
+ * This only ensures that abort_ip is in the user address space by masking it.
+ * No other sanity checks are done here, that's what the debug code is for.
+ */
+static rseq_inline bool
+rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+	unsigned long ip = instruction_pointer(regs);
+	u64 start_ip, abort_ip, offset;
+
+	rseq_stat_inc(rseq_stats.cs);
+
+	if (unlikely(csaddr >= TASK_SIZE)) {
+		t->rseq_event.fatal = true;
+		return false;
+	}
+
+	if (static_branch_unlikely(&rseq_debug_enabled))
+		return rseq_debug_update_user_cs(t, regs, csaddr);
+
+	if (!user_rw_masked_begin(ucs))
+		return false;
+
+	unsafe_get_user(start_ip, &ucs->start_ip, fail);
+	unsafe_get_user(offset, &ucs->post_commit_offset, fail);
+	unsafe_get_user(abort_ip, &ucs->abort_ip, fail);
+
+	/*
+	 * No sanity checks. If user space screwed it up, it can
+	 * keep the pieces. That's what debug code is for.
+	 *
+	 * If outside, just clear the critical section.
+	 */
+	if (ip - start_ip >= offset)
+		goto clear;
+
+	/*
+	 * Force it to be in user space as x86 IRET would happily return to
+	 * the kernel. Can't use TASK_SIZE as a mask because that's not
+	 * necessarily a power of two. Just make sure it's in the user
+	 * address space. Let the pagefault handler sort it out.
+	 *
+	 * Use LONG_MAX and not LLONG_MAX to keep it correct for 32 and 64
+	 * bit architectures.
+	 */
+	abort_ip &= (u64)LONG_MAX;
+
+	/* Invalidate the critical section */
+	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
+	user_access_end();
+
+	/* Update the instruction pointer */
+	instruction_pointer_set(regs, (unsigned long)abort_ip);
+
+	rseq_stat_inc(rseq_stats.fixup);
+	rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+	return true;
+clear:
+	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
+	user_access_end();
+	rseq_stat_inc(rseq_stats.clear);
+	return true;
+
+fail:
+	user_access_end();
+	return false;
+}
+
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -11,10 +11,12 @@
  * @sched_switch:	True if the task was scheduled out
  * @user_irq:		True on interrupt entry from user mode
  * @has_rseq:		True if the task has a rseq pointer installed
+ * @error:		Compound error code for the slow path to analyze
+ * @fatal:		User space data corrupted or invalid
  */
 struct rseq_event {
 	union {
-		u32				all;
+		u64				all;
 		struct {
 			union {
 				u16		events;
@@ -25,6 +27,13 @@ struct rseq_event {
 			};
 
 			u8			has_rseq;
+			u8			__pad;
+			union {
+				u16		error;
+				struct {
+					u8	fatal;
+				};
+			};
 		};
 	};
 };
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -382,175 +382,15 @@ static int rseq_reset_rseq_cpu_node_id(s
 	return -EFAULT;
 }
 
-/*
- * Get the user-space pointer value stored in the 'rseq_cs' field.
- */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
-{
-	if (!rseq_cs)
-		return -EFAULT;
-
-#ifdef CONFIG_64BIT
-	if (get_user(*rseq_cs, &rseq->rseq_cs))
-		return -EFAULT;
-#else
-	if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
-		return -EFAULT;
-#endif
-
-	return 0;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
-	struct rseq_cs __user *urseq_cs;
-	u64 ptr;
-	u32 __user *usig;
-	u32 sig;
-	int ret;
-
-	ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
-	if (ret)
-		return ret;
-
-	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
-	if (!ptr) {
-		memset(rseq_cs, 0, sizeof(*rseq_cs));
-		return 0;
-	}
-	/* Check that the pointer value fits in the user-space process space. */
-	if (ptr >= TASK_SIZE)
-		return -EINVAL;
-	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
-	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
-		return -EFAULT;
-
-	if (rseq_cs->start_ip >= TASK_SIZE ||
-	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
-	    rseq_cs->abort_ip >= TASK_SIZE ||
-	    rseq_cs->version > 0)
-		return -EINVAL;
-	/* Check for overflow. */
-	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
-		return -EINVAL;
-	/* Ensure that abort_ip is not in the critical section. */
-	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
-		return -EINVAL;
-
-	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
-	ret = get_user(sig, usig);
-	if (ret)
-		return ret;
-
-	if (current->rseq_sig != sig) {
-		printk_ratelimited(KERN_WARNING
-			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
-			sig, current->rseq_sig, current->pid, usig);
-		return -EINVAL;
-	}
-	return 0;
-}
-
-static bool rseq_warn_flags(const char *str, u32 flags)
+static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
 {
-	u32 test_flags;
+	u64 csaddr;
 
-	if (!flags)
+	if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs))
 		return false;
-	test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
-	if (test_flags)
-		pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
-	test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
-	if (test_flags)
-		pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
-	return true;
-}
-
-static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
-{
-	u32 flags;
-	int ret;
-
-	if (rseq_warn_flags("rseq_cs", cs_flags))
-		return -EINVAL;
-
-	/* Get thread flags. */
-	ret = get_user(flags, &t->rseq->flags);
-	if (ret)
-		return ret;
-
-	if (rseq_warn_flags("rseq", flags))
-		return -EINVAL;
-	return 0;
-}
-
-static int clear_rseq_cs(struct rseq __user *rseq)
-{
-	/*
-	 * The rseq_cs field is set to NULL on preemption or signal
-	 * delivery on top of rseq assembly block, as well as on top
-	 * of code outside of the rseq assembly block. This performs
-	 * a lazy clear of the rseq_cs field.
-	 *
-	 * Set rseq_cs to NULL.
-	 */
-#ifdef CONFIG_64BIT
-	return put_user(0UL, &rseq->rseq_cs);
-#else
-	if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
-		return -EFAULT;
-	return 0;
-#endif
-}
-
-/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
-	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
-{
-	unsigned long ip = instruction_pointer(regs);
-	struct task_struct *t = current;
-	struct rseq_cs rseq_cs;
-	int ret;
-
-	rseq_stat_inc(rseq_stats.cs);
-
-	ret = rseq_get_rseq_cs(t, &rseq_cs);
-	if (ret)
-		return ret;
-
-	/*
-	 * Handle potentially not being within a critical section.
-	 * If not nested over a rseq critical section, restart is useless.
-	 * Clear the rseq_cs pointer and return.
-	 */
-	if (!in_rseq_cs(ip, &rseq_cs)) {
-		rseq_stat_inc(rseq_stats.clear);
-		return clear_rseq_cs(t->rseq);
-	}
-	ret = rseq_check_flags(t, rseq_cs.flags);
-	if (ret < 0)
-		return ret;
-	if (!abort)
-		return 0;
-	ret = clear_rseq_cs(t->rseq);
-	if (ret)
-		return ret;
-	rseq_stat_inc(rseq_stats.fixup);
-	trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
-			    rseq_cs.abort_ip);
-	instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
-	return 0;
+	if (likely(!csaddr))
+		return true;
+	return rseq_update_user_cs(t, regs, csaddr);
 }
 
 /*
@@ -567,8 +407,8 @@ static int rseq_ip_fixup(struct pt_regs
 void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 {
 	struct task_struct *t = current;
-	int ret, sig;
 	bool event;
+	int sig;
 
 	/*
 	 * If invoked from hypervisors before entering the guest via
@@ -618,8 +458,7 @@ void __rseq_handle_notify_resume(struct
 	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
 		return;
 
-	ret = rseq_ip_fixup(regs, event);
-	if (unlikely(ret < 0))
+	if (!rseq_handle_cs(t, regs))
 		goto error;
 
 	if (unlikely(rseq_update_cpu_node_id(t)))
@@ -632,6 +471,67 @@ void __rseq_handle_notify_resume(struct
 }
 
 #ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Unsigned comparison will be true when ip >= start_ip, and when
+ * ip < start_ip + post_commit_offset.
+ */
+static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
+{
+	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
+}
+
+/*
+ * If the rseq_cs field of 'struct rseq' contains a valid pointer to
+ * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
+ */
+static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
+{
+	struct rseq_cs __user *urseq_cs;
+	u64 ptr;
+	u32 __user *usig;
+	u32 sig;
+	int ret;
+
+	if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs))
+		return -EFAULT;
+
+	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
+	if (!ptr) {
+		memset(rseq_cs, 0, sizeof(*rseq_cs));
+		return 0;
+	}
+	/* Check that the pointer value fits in the user-space process space. */
+	if (ptr >= TASK_SIZE)
+		return -EINVAL;
+	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
+		return -EFAULT;
+
+	if (rseq_cs->start_ip >= TASK_SIZE ||
+	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
+	    rseq_cs->abort_ip >= TASK_SIZE ||
+	    rseq_cs->version > 0)
+		return -EINVAL;
+	/* Check for overflow. */
+	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
+		return -EINVAL;
+	/* Ensure that abort_ip is not in the critical section. */
+	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
+		return -EINVAL;
+
+	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
+	ret = get_user(sig, usig);
+	if (ret)
+		return ret;
+
+	if (current->rseq_sig != sig) {
+		printk_ratelimited(KERN_WARNING
+			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+			sig, current->rseq_sig, current->pid, usig);
+		return -EINVAL;
+	}
+	return 0;
+}
 
 /*
  * Terminate the process if a syscall is issued within a restartable


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 20/37] rseq: Replace the debug crud
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (18 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-26 14:21   ` Mathieu Desnoyers
  2025-08-23 16:39 ` [patch V2 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
                   ` (17 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Just utilize the new infrastructure and put the original one to rest.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rseq.c |   80 ++++++++--------------------------------------------------
 1 file changed, 12 insertions(+), 68 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -472,83 +472,27 @@ void __rseq_handle_notify_resume(struct
 
 #ifdef CONFIG_DEBUG_RSEQ
 /*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
-	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
-	struct rseq_cs __user *urseq_cs;
-	u64 ptr;
-	u32 __user *usig;
-	u32 sig;
-	int ret;
-
-	if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs))
-		return -EFAULT;
-
-	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
-	if (!ptr) {
-		memset(rseq_cs, 0, sizeof(*rseq_cs));
-		return 0;
-	}
-	/* Check that the pointer value fits in the user-space process space. */
-	if (ptr >= TASK_SIZE)
-		return -EINVAL;
-	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
-	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
-		return -EFAULT;
-
-	if (rseq_cs->start_ip >= TASK_SIZE ||
-	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
-	    rseq_cs->abort_ip >= TASK_SIZE ||
-	    rseq_cs->version > 0)
-		return -EINVAL;
-	/* Check for overflow. */
-	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
-		return -EINVAL;
-	/* Ensure that abort_ip is not in the critical section. */
-	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
-		return -EINVAL;
-
-	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
-	ret = get_user(sig, usig);
-	if (ret)
-		return ret;
-
-	if (current->rseq_sig != sig) {
-		printk_ratelimited(KERN_WARNING
-			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
-			sig, current->rseq_sig, current->pid, usig);
-		return -EINVAL;
-	}
-	return 0;
-}
-
-/*
  * Terminate the process if a syscall is issued within a restartable
  * sequence.
  */
 void rseq_syscall(struct pt_regs *regs)
 {
-	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
-	struct rseq_cs rseq_cs;
+	u64 csaddr;
 
-	if (!t->rseq)
+	if (!t->rseq_event.has_rseq)
+		return;
+	if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs))
+		goto fail;
+	if (likely(!csaddr))
 		return;
-	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
-		force_sig(SIGSEGV);
+	if (unlikely(csaddr >= TASK_SIZE))
+		goto fail;
+	if (rseq_debug_update_user_cs(t, regs, csaddr))
+		return;
+fail:
+	force_sig(SIGSEGV);
 }
-
 #endif
 
 /*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 21/37] rseq: Make exit debugging static branch based
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (19 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 20/37] rseq: Replace the debug crud Thomas Gleixner
@ 2025-08-23 16:39 ` Thomas Gleixner
  2025-08-26 14:23   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
                   ` (16 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:39 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. Eventually this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks that terminate misbehaving tasks instead of letting them
keep the hard-to-diagnose pieces.
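
For reference, the debugfs knob added earlier in the series allows flipping
the branch at run time, e.g. "echo 1 >/sys/kernel/debug/rseq/debug"
(assuming debugfs is mounted at the usual place).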

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -275,7 +275,7 @@ static __always_inline void rseq_exit_to
 
 	rseq_stat_inc(rseq_stats.exit);
 
-	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+	if (static_branch_unlikely(&rseq_debug_enabled))
 		WARN_ON_ONCE(ev->sched_switch);
 
 	/*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (20 preceding siblings ...)
  2025-08-23 16:39 ` [patch V2 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 14:28   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 23/37] rseq: Provide and use rseq_set_uids() Thomas Gleixner
                   ` (15 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/entry-common.h |    2 +-
 include/linux/rseq_entry.h   |    9 +++++++++
 kernel/rseq.c                |   19 +++++++++++++------
 3 files changed, 23 insertions(+), 7 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -146,7 +146,7 @@ static __always_inline void syscall_exit
 			local_irq_enable();
 	}
 
-	rseq_syscall(regs);
+	rseq_debug_syscall_return(regs);
 
 	/*
 	 * Do one-time syscall specific work. If these work items are
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -286,9 +286,18 @@ static __always_inline void rseq_exit_to
 	ev->events = 0;
 }
 
+void __rseq_debug_syscall_return(struct pt_regs *regs);
+
+static inline void rseq_debug_syscall_return(struct pt_regs *regs)
+{
+	if (static_branch_unlikely(&rseq_debug_enabled))
+		__rseq_debug_syscall_return(regs);
+}
+
 #else /* CONFIG_RSEQ */
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 #endif /* !CONFIG_RSEQ */
 
 #endif /* _LINUX_RSEQ_ENTRY_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -470,12 +470,7 @@ void __rseq_handle_notify_resume(struct
 	force_sigsegv(sig);
 }
 
-#ifdef CONFIG_DEBUG_RSEQ
-/*
- * Terminate the process if a syscall is issued within a restartable
- * sequence.
- */
-void rseq_syscall(struct pt_regs *regs)
+void __rseq_debug_syscall_return(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 	u64 csaddr;
@@ -493,6 +488,18 @@ void rseq_syscall(struct pt_regs *regs)
 fail:
 	force_sig(SIGSEGV);
 }
+
+#ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Kept around to support architectures with GENERIC_ENTRY=n.
+ *
+ * Terminate the process if a syscall is issued within a restartable
+ * sequence.
+ */
+void rseq_syscall(struct pt_regs *regs)
+{
+	__rseq_debug_syscall_return(regs);
+}
 #endif
 
 /*


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 23/37] rseq: Provide and use rseq_set_uids()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (21 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 14:52   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 24/37] rseq: Seperate the signal delivery path Thomas Gleixner
                   ` (14 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Provide a new and straightforward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can later be inlined into the fast path.

It does all operations in one user_rw_masked_begin() section and also
retrieves the critical section member (rseq::rseq_cs) from user space to
avoid another user_*_begin()/end() pair. This is in preparation for
optimizing the fast path to avoid extra work when not required.

Use it to replace the whole related zoo in rseq.c
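
For context, a hedged user space sketch of how these IDs are consumed on the
other side (editor's illustration; assumes glibc >= 2.35, which registers
rseq and exports __rseq_offset/__rseq_size, and a compiler providing
__builtin_thread_pointer() for the target):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <sys/rseq.h>	/* struct rseq, __rseq_offset, __rseq_size */

	int main(void)
	{
		if (!__rseq_size) {
			puts("rseq not registered by glibc");
			return 0;
		}
		/* glibc places the registered rseq area at TP + __rseq_offset */
		struct rseq *rs = (struct rseq *)
			((char *)__builtin_thread_pointer() + __rseq_offset);

		printf("cpu_id_start=%u cpu_id=%u\n", rs->cpu_id_start, rs->cpu_id);
		return 0;
	}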

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/binfmt_elf.c            |    2 
 include/linux/rseq_entry.h |   95 ++++++++++++++++++++
 include/linux/rseq_types.h |    2 
 include/linux/sched.h      |   10 --
 kernel/rseq.c              |  208 ++++++---------------------------------------
 5 files changed, 130 insertions(+), 187 deletions(-)

--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
 #include <linux/cred.h>
 #include <linux/dax.h>
 #include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
 #include <asm/param.h>
 #include <asm/page.h>
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -38,6 +38,8 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
 #include <linux/rseq.h>
 #include <linux/uaccess.h>
 
+#include <uapi/linux/rseq.h>
+
 #include <linux/tracepoint-defs.h>
 
 #ifdef CONFIG_TRACEPOINTS
@@ -77,6 +79,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #endif
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_uids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -198,6 +201,44 @@ bool rseq_debug_update_user_cs(struct ta
 	user_access_end();
 	return false;
 }
+
+/*
+ * On debug kernels validate that user space did not mess with it if
+ * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
+ * that case cpu_cid is ~0. See fork/execve.
+ */
+bool rseq_debug_validate_uids(struct task_struct *t)
+{
+	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
+	struct rseq __user *rseq = t->rseq;
+
+	if (t->rseq_ids.cpu_cid == ~0)
+		return true;
+
+	if (!user_read_masked_begin(rseq))
+		return false;
+
+	unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+	if (cpu_id != t->rseq_ids.cpu_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->cpu_id, efault);
+	if (uval != cpu_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->node_id, efault);
+	if (uval != node_id)
+		goto die;
+	unsafe_get_user(uval, &rseq->mm_cid, efault);
+	if (uval != t->rseq_ids.mm_cid)
+		goto die;
+	user_access_end();
+	return true;
+die:
+	t->rseq_event.fatal = true;
+efault:
+	user_access_end();
+	return false;
+}
+
 #endif /* RSEQ_BUILD_SLOW_PATH */
 
 /*
@@ -268,6 +309,60 @@ rseq_update_user_cs(struct task_struct *
 	user_access_end();
 	return false;
 }
+
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows putting the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_uids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+			      u32 node_id, u64 *csaddr)
+{
+	struct rseq __user *rseq = t->rseq;
+
+	if (static_branch_unlikely(&rseq_debug_enabled)) {
+		if (!rseq_debug_validate_uids(t))
+			return false;
+	}
+
+	if (!user_rw_masked_begin(rseq))
+		return false;
+
+	unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+	unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+	unsafe_put_user(node_id, &rseq->node_id, efault);
+	unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+	if (csaddr)
+		unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+	user_access_end();
+
+	/* Cache the new values */
+	t->rseq_ids.cpu_cid = ids->cpu_cid;
+	rseq_stat_inc(rseq_stats.ids);
+	rseq_trace_update(t, ids);
+	return true;
+efault:
+	user_access_end();
+	return false;
+}
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -3,6 +3,8 @@
 #define _LINUX_RSEQ_TYPES_H
 
 #include <linux/types.h>
+/* Forward declaration for the sched.h */
+struct rseq;
 
 /*
  * struct rseq_event - Storage for rseq related event management
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
 #include <linux/posix-timers_types.h>
 #include <linux/restart_block.h>
 #include <linux/rseq_types.h>
-#include <uapi/linux/rseq.h>
 #include <linux/seqlock_types.h>
 #include <linux/kcsan.h>
 #include <linux/rv.h>
@@ -1407,15 +1406,6 @@ struct task_struct {
 	u32				rseq_sig;
 	struct rseq_event		rseq_event;
 	struct rseq_ids			rseq_ids;
-# ifdef CONFIG_DEBUG_RSEQ
-	/*
-	 * This is a place holder to save a copy of the rseq fields for
-	 * validation of read-only fields. The struct rseq has a
-	 * variable-length array at the end, so it cannot be used
-	 * directly. Reserve a size large enough for the known fields.
-	 */
-	char				rseq_fields[sizeof(struct rseq)];
-# endif
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
 # define RSEQ_EVENT_GUARD	preempt
 #endif
 
-/* The original rseq structure size (including padding) is 32 bytes. */
-#define ORIG_RSEQ_SIZE		32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
-				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
-				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
 DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
 
 static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void
 __initcall(rseq_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
-	return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
-	static DEFINE_RATELIMIT_STATE(_rs,
-				      DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
-	u32 cpu_id_start, cpu_id, node_id, mm_cid;
-	struct rseq __user *rseq = t->rseq;
-
-	/*
-	 * Validate fields which are required to be read-only by
-	 * user-space.
-	 */
-	if (!user_read_access_begin(rseq, t->rseq_len))
-		goto efault;
-	unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
-	unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
-	unsafe_get_user(node_id, &rseq->node_id, efault_end);
-	unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
-	user_read_access_end();
-
-	if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
-	    cpu_id != rseq_kernel_fields(t)->cpu_id ||
-	    node_id != rseq_kernel_fields(t)->node_id ||
-	    mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
-		pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
-			"\tcpu_id_start: %u ?= %u\n"
-			"\tcpu_id:       %u ?= %u\n"
-			"\tnode_id:      %u ?= %u\n"
-			"\tmm_cid:       %u ?= %u\n",
-			t->pid, t->comm,
-			cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
-			cpu_id, rseq_kernel_fields(t)->cpu_id,
-			node_id, rseq_kernel_fields(t)->node_id,
-			mm_cid, rseq_kernel_fields(t)->mm_cid);
-	}
-
-	/* For now, only print a console warning on mismatch. */
-	return 0;
-
-efault_end:
-	user_read_access_end();
-efault:
-	return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label)		\
-	do {								\
-		unsafe_put_user(value, &t->rseq->field, error_label);	\
-		rseq_kernel_fields(t)->field = value;			\
-	} while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
-	return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label)		\
-	unsafe_put_user(value, &t->rseq->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
-	struct rseq __user *rseq = t->rseq;
-	u32 cpu_id = raw_smp_processor_id();
-	u32 node_id = cpu_to_node(cpu_id);
-	u32 mm_cid = task_mm_cid(t);
-
-	rseq_stat_inc(rseq_stats.ids);
-
-	/* Validate read-only rseq fields on debug kernels */
-	if (rseq_validate_ro_fields(t))
-		goto efault;
-	WARN_ON_ONCE((int) mm_cid < 0);
-
-	if (!user_write_access_begin(rseq, t->rseq_len))
-		goto efault;
-
-	rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
-	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
-	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
-	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
-	/* Cache the user space values */
-	t->rseq_ids.cpu_id = cpu_id;
-	t->rseq_ids.mm_cid = mm_cid;
-
-	/*
-	 * Additional feature fields added after ORIG_RSEQ_SIZE
-	 * need to be conditionally updated only if
-	 * t->rseq_len != ORIG_RSEQ_SIZE.
-	 */
-	user_write_access_end();
-	trace_rseq_update(t);
-	return 0;
-
-efault_end:
-	user_write_access_end();
-efault:
-	return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
+static bool rseq_set_uids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
 {
-	struct rseq __user *rseq = t->rseq;
-	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
-	    mm_cid = 0;
-
-	/*
-	 * Validate read-only rseq fields.
-	 */
-	if (rseq_validate_ro_fields(t))
-		goto efault;
-
-	if (!user_write_access_begin(rseq, t->rseq_len))
-		goto efault;
-
-	/*
-	 * Reset all fields to their initial state.
-	 *
-	 * All fields have an initial state of 0 except cpu_id which is set to
-	 * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
-	 * unregistration can figure out that rseq needs to be registered
-	 * again.
-	 */
-	rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
-	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
-	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
-	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
-	/*
-	 * Additional feature fields added after ORIG_RSEQ_SIZE
-	 * need to be conditionally reset only if
-	 * t->rseq_len != ORIG_RSEQ_SIZE.
-	 */
-	user_write_access_end();
-	return 0;
-
-efault_end:
-	user_write_access_end();
-efault:
-	return -EFAULT;
+	return rseq_set_uids_get_csaddr(t, ids, node_id, NULL);
 }
 
 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -407,6 +250,8 @@ static bool rseq_handle_cs(struct task_s
 void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 {
 	struct task_struct *t = current;
+	struct rseq_ids ids;
+	u32 node_id;
 	bool event;
 	int sig;
 
@@ -453,6 +298,8 @@ void __rseq_handle_notify_resume(struct
 	scoped_guard(RSEQ_EVENT_GUARD) {
 		event = t->rseq_event.sched_switch;
 		t->rseq_event.sched_switch = false;
+		ids.cpu_id = task_cpu(t);
+		ids.mm_cid = task_mm_cid(t);
 	}
 
 	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -461,7 +308,8 @@ void __rseq_handle_notify_resume(struct
 	if (!rseq_handle_cs(t, regs))
 		goto error;
 
-	if (unlikely(rseq_update_cpu_node_id(t)))
+	node_id = cpu_to_node(ids.cpu_id);
+	if (!rseq_set_uids(t, &ids, node_id))
 		goto error;
 	return;
 
@@ -502,13 +350,33 @@ void rseq_syscall(struct pt_regs *regs)
 }
 #endif
 
+static bool rseq_reset_ids(void)
+{
+	struct rseq_ids ids = {
+		.cpu_id		= RSEQ_CPU_ID_UNINITIALIZED,
+		.mm_cid		= 0,
+	};
+
+	/*
+	 * If this fails, terminate it because this leaves the kernel in
+	 * stupid state as exit to user space will try to fixup the ids
+	 * again.
+	 */
+	if (rseq_set_uids(current, &ids, 0))
+		return true;
+
+	force_sig(SIGSEGV);
+	return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE		32
+
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
 SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
 {
-	int ret;
-
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
 			return -EINVAL;
@@ -519,9 +387,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 			return -EINVAL;
 		if (current->rseq_sig != sig)
 			return -EPERM;
-		ret = rseq_reset_rseq_cpu_node_id(current);
-		if (ret)
-			return ret;
+		if (!rseq_reset_ids())
+			return -EFAULT;
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
@@ -574,17 +441,6 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
 		return -EFAULT;
 
-#ifdef CONFIG_DEBUG_RSEQ
-	/*
-	 * Initialize the in-kernel rseq fields copy for validation of
-	 * read-only fields.
-	 */
-	if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
-	    get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
-	    get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
-	    get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
-		return -EFAULT;
-#endif
 	/*
 	 * Activate the registration by setting the rseq area address, length
 	 * and signature in the task struct.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 24/37] rseq: Seperate the signal delivery path
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (22 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 23/37] rseq: Provide and use rseq_set_uids() Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:08   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
                   ` (13 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Completely separate the signal delivery path from the notify handler as
they have different semantics with respect to event handling.

Signal delivery only needs to ensure that the interrupted user context
either was not in a critical section, or that the section is aborted before
switching to the signal frame context. The signal frame context no longer
has the original instruction pointer, so this cannot be handled on exit to
user space.

There is no point in updating the CPU and MM CID values as they might
change again before the task returns to user space for real.

The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq.h       |   21 ++++++++++++++++-----
 include/linux/rseq_entry.h |   29 +++++++++++++++++++++++++++++
 kernel/rseq.c              |   30 ++++++++++++++++++++++--------
 3 files changed, 67 insertions(+), 13 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,22 +5,33 @@
 #ifdef CONFIG_RSEQ
 #include <linux/sched.h>
 
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq_event.has_rseq)
-		__rseq_handle_notify_resume(NULL, regs);
+		__rseq_handle_notify_resume(regs);
 }
 
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
-	if (current->rseq_event.has_rseq) {
-		current->rseq_event.sched_switch = true;
-		__rseq_handle_notify_resume(ksig, regs);
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq_event.has_rseq & current->rseq_event.user_irq)
+			__rseq_signal_deliver(ksig->sig, regs);
+	} else {
+		if (current->rseq_event.has_rseq)
+			__rseq_signal_deliver(ksig->sig, regs);
 	}
 }
 
+/* Raised from context switch and execve to force evaluation on exit to user */
 static inline void rseq_sched_switch_event(struct task_struct *t)
 {
 	if (t->rseq_event.has_rseq) {
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -364,6 +364,35 @@ bool rseq_set_uids_get_csaddr(struct tas
 	return false;
 }
 
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+					struct rseq_ids *ids, u32 node_id)
+{
+	u64 csaddr;
+
+	if (!rseq_set_uids_get_csaddr(t, ids, node_id, &csaddr))
+		return false;
+
+	/*
+	 * On architectures which utilize the generic entry code this
+	 * allows to skip the critical section when the entry was not from
+	 * a user space interrupt, unless debug mode is enabled.
+	 */
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		if (!static_branch_unlikely(&rseq_debug_enabled)) {
+			if (likely(!t->rseq_event.user_irq))
+				return true;
+		}
+	}
+	if (likely(!csaddr))
+		return true;
+	/* Sigh, this really needs to do work */
+	return rseq_update_user_cs(t, regs, csaddr);
+}
+
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -247,13 +247,12 @@ static bool rseq_handle_cs(struct task_s
  * respect to other threads scheduled on the same CPU, and with respect
  * to signal handlers.
  */
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
 	bool event;
-	int sig;
 
 	/*
 	 * If invoked from hypervisors before entering the guest via
@@ -272,10 +271,7 @@ void __rseq_handle_notify_resume(struct
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
-	if (ksig)
-		rseq_stat_inc(rseq_stats.signal);
-	else
-		rseq_stat_inc(rseq_stats.slowpath);
+	rseq_stat_inc(rseq_stats.slowpath);
 
 	/*
 	 * Read and clear the event pending bit first. If the task
@@ -314,8 +310,26 @@ void __rseq_handle_notify_resume(struct
 	return;
 
 error:
-	sig = ksig ? ksig->sig : 0;
-	force_sigsegv(sig);
+	force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+	rseq_stat_inc(rseq_stats.signal);
+	/*
+	 * Don't update IDs, they are handled on exit to user if
+	 * necessary. The important thing is to abort a critical section of
+	 * the interrupted context as after this point the instruction
+	 * pointer in @regs points to the signal handler.
+	 */
+	if (unlikely(!rseq_handle_cs(current, regs))) {
+		/*
+		 * Clear the errors just in case this might survive
+		 * magically, but leave the rest intact.
+		 */
+		current->rseq_event.error = 0;
+		force_sigsegv(sig);
+	}
 }
 
 void __rseq_debug_syscall_return(struct pt_regs *regs)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (23 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 24/37] rseq: Seperate the signal delivery path Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:12   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 26/37] rseq: Optimize event setting Thomas Gleixner
                   ` (12 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Replace the whole logic with the new implementation, which is shared with
signal delivery and the upcoming exit fast path.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/rseq.c |   78 +++++++++++++++++++++++++---------------------------------
 1 file changed, 34 insertions(+), 44 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
 DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
 
 static inline void rseq_control_debug(bool on)
@@ -236,38 +230,15 @@ static bool rseq_handle_cs(struct task_s
 	return rseq_update_user_cs(t, regs, csaddr);
 }
 
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
+	/* Preserve rseq state and user_irq state for exit to user */
+	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
 	struct task_struct *t = current;
 	struct rseq_ids ids;
 	u32 node_id;
 	bool event;
 
-	/*
-	 * If invoked from hypervisors before entering the guest via
-	 * resume_user_mode_work(), then @regs is a NULL pointer.
-	 *
-	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
-	 * it before returning from the ioctl() to user space when
-	 * rseq_event.sched_switch is set.
-	 *
-	 * So it's safe to ignore here instead of pointlessly updating it
-	 * in the vcpu_run() loop.
-	 */
-	if (!regs)
-		return;
-
 	if (unlikely(t->flags & PF_EXITING))
 		return;
 
@@ -291,26 +262,45 @@ void __rseq_handle_notify_resume(struct
 	 * with the result handed in to allow the detection of
 	 * inconsistencies.
 	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event = t->rseq_event.sched_switch;
-		t->rseq_event.sched_switch = false;
+	scoped_guard(irq) {
 		ids.cpu_id = task_cpu(t);
 		ids.mm_cid = task_mm_cid(t);
+		event = t->rseq_event.sched_switch;
+		t->rseq_event.all &= evt_mask.all;
 	}
 
-	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+	if (!event)
 		return;
 
-	if (!rseq_handle_cs(t, regs))
-		goto error;
-
 	node_id = cpu_to_node(ids.cpu_id);
-	if (!rseq_set_uids(t, &ids, node_id))
-		goto error;
-	return;
 
-error:
-	force_sig(SIGSEGV);
+	if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+		/*
+		 * Clear the errors just in case this might survive magically, but
+		 * leave the rest intact.
+		 */
+		t->rseq_event.error = 0;
+		force_sig(SIGSEGV);
+	}
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+	/*
+	 * If invoked from hypervisors before entering the guest via
+	 * resume_user_mode_work(), then @regs is a NULL pointer.
+	 *
+	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+	 * it before returning from the ioctl() to user space when
+	 * rseq_event.sched_switch is set.
+	 *
+	 * So it's safe to ignore here instead of pointlessly updating it
+	 * in the vcpu_run() loop.
+	 */
+	if (!regs)
+		return;
+
+	rseq_slowpath_update_usr(regs);
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 26/37] rseq: Optimize event setting
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (24 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:26   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
                   ` (11 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

After removing the various condition bits earlier, it turns out that one
extra piece of information is needed to avoid setting event::sched_switch
and TIF_NOTIFY_RESUME unconditionally on every context switch.

The update of the RSEQ user space memory is only required, when either

  the task was interrupted in user space and schedules

or

  the CPU or MM CID changes in schedule() independent of the entry mode

Right now only the interrupt from user information is available.

Add an event flag, which is set when the CPU or the MM CID or both change.

Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.

It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
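
As a small stand-alone illustration of that boolean trick (editor's sketch
mirroring the rseq_sched_switch_event() hunk below; the struct and function
names here are made up): bitwise '|' and '&' on the flag bytes evaluate all
three conditions without short-circuit branches and leave a single
conditional on the combined result:

	struct example_event {
		unsigned char	sched_switch;
		unsigned char	ids_changed;
		unsigned char	user_irq;
		unsigned char	has_rseq;
	};

	static int need_notify_resume(const struct example_event *ev)
	{
		/* No intermediate branches, one test on the result */
		return (ev->user_irq | ev->ids_changed) & ev->has_rseq;
	}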

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 fs/exec.c                  |    2 -
 include/linux/rseq.h       |   81 +++++++++++++++++++++++++++++++++++++++++----
 include/linux/rseq_types.h |   11 +++++-
 kernel/rseq.c              |    2 -
 kernel/sched/core.c        |    7 +++
 kernel/sched/sched.h       |    5 ++
 6 files changed, 95 insertions(+), 13 deletions(-)

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
 		force_fatal_sig(SIGSEGV);
 
 	sched_mm_cid_after_execve(current);
-	rseq_sched_switch_event(current);
+	rseq_force_update();
 	current->in_execve = 0;
 
 	return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,7 +9,8 @@ void __rseq_handle_notify_resume(struct
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq_event.has_rseq)
+	/* '&' is intentional to spare one conditional branch */
+	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
 		__rseq_handle_notify_resume(regs);
 }
 
@@ -31,12 +32,75 @@ static inline void rseq_signal_deliver(s
 	}
 }
 
-/* Raised from context switch and execve to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
 {
-	if (t->rseq_event.has_rseq) {
-		t->rseq_event.sched_switch = true;
-		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+	struct rseq_event *ev = &t->rseq_event;
+
+	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+		/*
+		 * Avoid a boat load of conditionals by using simple logic
+		 * to determine whether NOTIFY_RESUME needs to be raised.
+		 *
+		 * It's required when the CPU or MM CID has changed or
+		 * the entry was from user space.
+		 */
+		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+		if (raise) {
+			ev->sched_switch = true;
+			rseq_raise_notify_resume(t);
+		}
+	} else {
+		if (ev->has_rseq) {
+			t->rseq_event.sched_switch = true;
+			rseq_raise_notify_resume(t);
+		}
+	}
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+	t->rseq_event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+	/*
+	 * Requires a comparison as the switch_mm_cid() code does not
+	 * provide a conditional for it readily. So avoid excessive updates
+	 * when nothing changes.
+	 */
+	if (t->rseq_ids.mm_cid != cid)
+		t->rseq_event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+	if (current->rseq_event.has_rseq) {
+		current->rseq_event.ids_changed = true;
+		current->rseq_event.sched_switch = true;
+		rseq_raise_notify_resume(current);
 	}
 }
 
@@ -53,7 +117,7 @@ static inline void rseq_sched_switch_eve
 static inline void rseq_virt_userspace_exit(void)
 {
 	if (current->rseq_event.sched_switch)
-		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		rseq_raise_notify_resume(current);
 }
 
 /*
@@ -90,6 +154,9 @@ static inline void rseq_execve(struct ta
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
 static inline void rseq_virt_userspace_exit(void) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
 static inline void rseq_execve(struct task_struct *t) { }
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -10,20 +10,27 @@ struct rseq;
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
  * @events:		Compund to access events with a single load/store
- * @sched_switch:	True if the task was scheduled out
+ * @sched_switch:	True if the task was scheduled and needs update on
+ *			exit to user
+ * @ids_changed:	Indicator that IDs need to be updated
  * @user_irq:		True on interrupt entry from user mode
  * @has_rseq:		True if the task has a rseq pointer installed
  * @error:		Compound error code for the slow path to analyze
  * @fatal:		User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
  */
 struct rseq_event {
 	union {
 		u64				all;
 		struct {
 			union {
-				u16		events;
+				u32		events;
 				struct {
 					u8	sched_switch;
+					u8	ids_changed;
 					u8	user_irq;
 				};
 			};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -459,7 +459,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * are updated before returning to user-space.
 	 */
 	current->rseq_event.has_rseq = true;
-	rseq_sched_switch_event(current);
+	rseq_force_update();
 
 	return 0;
 }
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5150,7 +5150,6 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_sched_switch_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
@@ -5348,6 +5347,12 @@ context_switch(struct rq *rq, struct tas
 	/* switch_mm_cid() requires the memory barriers above. */
 	switch_mm_cid(rq, prev, next);
 
+	/*
+	 * Tell rseq that the task was scheduled in. Must be after
+	 * switch_mm_cid() to get the TIF flag set.
+	 */
+	rseq_sched_switch_event(next);
+
 	prepare_lock_switch(rq, next, rf);
 
 	/* Here we just switch the register state and the stack. */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2181,6 +2181,7 @@ static inline void __set_task_cpu(struct
 	smp_wmb();
 	WRITE_ONCE(task_thread_info(p)->cpu, cpu);
 	p->wake_cpu = cpu;
+	rseq_sched_set_task_cpu(p, cpu);
 #endif /* CONFIG_SMP */
 }
 
@@ -3778,8 +3779,10 @@ static inline void switch_mm_cid(struct
 		mm_cid_put_lazy(prev);
 		prev->mm_cid = -1;
 	}
-	if (next->mm_cid_active)
+	if (next->mm_cid_active) {
 		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+		rseq_sched_set_task_mm_cid(next, next->mm_cid);
+	}
 }
 
 #else /* !CONFIG_SCHED_MM_CID: */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 27/37] rseq: Implement fast path for exit to user
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (25 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 26/37] rseq: Optimize event setting Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:33   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
                   ` (10 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.

This is the right point to do that because at this point the CPU and the MM
CID are stable and can no longer change due to yet another reschedule. Such
a reschedule can still happen as long as the task handles this via
TIF_NOTIFY_RESUME in resume_user_mode_work(), which is invoked from the
exit to user mode work loop.

The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults, and if the access to the user space
memory faults, it:

  - notes the fail in the event struct
  - raises TIF_NOTIFY_RESUME
  - returns true to the caller

The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW. That will be optimized by setting the
failure flag and raising TIF_NOTIFY_RESUME right on fork to avoid the
otherwise unavoidable round trip.
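
For illustration, a minimal sketch of that fork time shortcut; the helper
name is made up here, the real hook is the rseq_fork() change later in
this series:

  /* Force the child into the slow path right at fork time */
  static inline void rseq_fork_force_slowpath(struct task_struct *t)
  {
  	/* The child's rseq TLS page is COW, so the first access will fault */
  	if (t->rseq_event.has_rseq) {
  		t->rseq_event.slowpath = true;
  		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
  	}
  }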

If the user memory inspection finds invalid data, the function fails in the
same way and additionally sets the fatal flag in the event struct along
with TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that
flag and terminate the task with SIGSEGV as documented.

The initial decision to invoke any of this is based on two flags in the
event struct: @has_rseq and @sched_switch. The decision is in pseudo ASM:

      load	tsk::event::has_rseq
      and	tsk::event::sched_switch
      jnz	inspect_user_space
      mov	$0, tsk::event::events
      ...
      leave

So for the common case where the task was not scheduled out, this really
boils down to four instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
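
Expressed as a minimal C sketch, mirroring the hunk below (the helper name
is purely illustrative):

  static __always_inline bool rseq_fastpath_has_work(struct task_struct *t)
  {
  	/* Byte-wise '&' instead of '&&' avoids a second conditional branch */
  	if (likely(!(t->rseq_event.sched_switch & t->rseq_event.has_rseq))) {
  		/* Nothing to do: clear the events for the next entry */
  		t->rseq_event.events = 0;
  		return false;
  	}
  	return true;
  }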

If the condition is true, it checks whether the CPU ID or the MM CID have
changed. If so, the CPU/MM IDs are updated and thereby cached for the next
round. The update unconditionally retrieves the user space critical section
address to spare another user*begin/end() pair. If that address is not zero
and tsk::event::user_irq is set, the critical section is analyzed and acted
upon. If it is zero or the entry came via syscall, the critical section
analysis is skipped.

If the IDs did not change, the critical section still has to be analyzed,
because in that case the event flag can only be true when the entry from
user space happened via interrupt.
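
Condensed into C, a sketch of the hunk below with the debug static branch
and the statistics left out:

  if (likely(!t->rseq_event.ids_changed)) {
  	/* IDs unchanged: the entry must have come via interrupt */
  	u64 csaddr;

  	if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs))
  		goto fail;
  	if (csaddr && !rseq_update_user_cs(t, regs, csaddr))
  		goto fail;
  } else {
  	/* IDs changed: update CPU/node/MM CID and handle the critical section */
  	struct rseq_ids ids = {
  		.cpu_id = task_cpu(t),
  		.mm_cid = task_mm_cid(t),
  	};

  	if (!rseq_update_usr(t, regs, &ids, cpu_to_node(ids.cpu_id)))
  		goto fail;
  }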

This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.

Note: As with quite a few other optimizations this depends on the generic
entry infrastructure and is not made available to be sucked into random
architecture implementations.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |  137 ++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/rseq_types.h |    3 
 kernel/rseq.c              |    2 
 3 files changed, 139 insertions(+), 3 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
 	unsigned long	exit;
 	unsigned long	signal;
 	unsigned long	slowpath;
+	unsigned long	fastpath;
 	unsigned long	ids;
 	unsigned long	cs;
 	unsigned long	clear;
@@ -204,8 +205,8 @@ bool rseq_debug_update_user_cs(struct ta
 
 /*
  * On debug kernels validate that user space did not mess with it if
- * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
- * that case cpu_cid is ~0. See fork/execve.
+ * debugging is enabled, but don't do that on the first exit to user
+ * space. In that case cpu_cid is ~0. See fork/execve.
  */
 bool rseq_debug_validate_uids(struct task_struct *t)
 {
@@ -393,6 +394,131 @@ static rseq_inline bool rseq_update_usr(
 	return rseq_update_user_cs(t, regs, csaddr);
 }
 
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ *    handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ *    so the straight inline operation is:
+ *
+ *	- Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ *	- One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ *     - One 64-bit load to retrieve the start IP
+ *     - One 64-bit load to retrieve the offset for calculating the end
+ *     - One 64-bit load to retrieve the abort IP
+ *     - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking and
+ * protection against a rogue abort IP in kernel space, which would be
+ * exploitable at least on x86. Any fallout from invalid critical section
+ * descriptors is a user space problem. The debug case provides the full
+ * set of checks and terminates the task if a condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the fail.
+ */
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+	struct task_struct *t = current;
+
+	/*
+	 * If the task did not go through schedule or got the flag enforced
+	 * by the rseq syscall or execve, then nothing to do here.
+	 *
+	 * CPU ID and MM CID can only change when going through a context
+	 * switch.
+	 *
+	 * This can only be done when rseq_event::has_rseq is true.
+	 * rseq_sched_switch_event() sets rseq_event::sched_switch
+	 * unconditionally true to avoid a load of rseq_event::has_rseq in
+	 * the context switch path.
+	 *
+	 * This check uses a '&' and not a '&&' to force the compiler to do
+	 * an actual AND operation instead of two separate conditionals.
+	 *
+	 * A sane compiler requires four instructions for the nothing to do
+	 * case including clearing the events, but your mileage might vary.
+	 */
+	if (likely(!(t->rseq_event.sched_switch & t->rseq_event.has_rseq)))
+		goto done;
+
+	rseq_stat_inc(rseq_stats.fastpath);
+
+	pagefault_disable();
+
+	if (likely(!t->rseq_event.ids_changed)) {
+		/*
+		 * If IDs have not changed rseq_event::user_irq must be true
+		 * See rseq_sched_switch_event().
+		 */
+		u64 csaddr;
+
+		if (unlikely(get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)))
+			goto fail;
+
+		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+				goto fail;
+		}
+	} else {
+		struct rseq_ids ids = {
+			.cpu_id = task_cpu(t),
+			.mm_cid = task_mm_cid(t),
+		};
+		u32 node_id = cpu_to_node(ids.cpu_id);
+
+		if (unlikely(!rseq_update_usr(t, regs, &ids, node_id)))
+			goto fail;
+	}
+
+	pagefault_enable();
+
+done:
+	/* Clear state so next entry starts from a clean slate */
+	t->rseq_event.events = 0;
+	return false;
+
+fail:
+	pagefault_enable();
+	/* Force it into the slow path. Don't clear the state! */
+	t->rseq_event.slowpath = true;
+	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	return true;
+}
+
+static __always_inline unsigned long
+rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
+{
+	/*
+	 * Check if all work bits have been cleared before handling rseq.
+	 */
+	if ((ti_work & mask) != 0)
+		return ti_work;
+
+	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
+		return ti_work;
+
+	return ti_work | _TIF_NOTIFY_RESUME;
+}
+
+#endif /* !CONFIG_GENERIC_ENTRY */
+
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;
@@ -417,8 +543,13 @@ static inline void rseq_debug_syscall_re
 	if (static_branch_unlikely(&rseq_debug_enabled))
 		__rseq_debug_syscall_return(regs);
 }
-
 #else /* CONFIG_RSEQ */
+static inline unsigned long rseq_exit_to_user_mode_work(struct pt_regs *regs,
+							unsigned long ti_work,
+							const unsigned long mask)
+{
+	return ti_work;
+}
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -17,6 +17,8 @@ struct rseq;
  * @has_rseq:		True if the task has a rseq pointer installed
  * @error:		Compound error code for the slow path to analyze
  * @fatal:		User space data corrupted or invalid
+ * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
+ *			is required
  *
  * @sched_switch and @ids_changed must be adjacent and the combo must be
  * 16bit aligned to allow a single store, when both are set at the same
@@ -41,6 +43,7 @@ struct rseq_event {
 				u16		error;
 				struct {
 					u8	fatal;
+					u8	slowpath;
 				};
 			};
 		};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
 		stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
 		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
 		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
+		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
 		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "exit:   %16lu\n", stats.exit);
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
+	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
 	seq_printf(m, "ids:    %16lu\n", stats.ids);
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 28/37] rseq: Switch to fast path processing on exit to user
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (26 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:40   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
                   ` (9 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.

This only works for architectures which use the generic entry code.
Architectures which still have their own incomplete hacks are not supported
and won't be.

This results in the following improvements:

  Kernel build            Before               After             Reduction

  exit to user:         80692981            80514451
  signal checks:           32581                 121                  99%
  slowpath runs:         1201408   1.49%         198  0.00%          100%
  fastpath runs:                              675941  0.84%           N/A
  id updates:            1233989   1.53%       50541  0.06%           96%
  cs checks:             1125366   1.39%           0  0.00%          100%
    cs cleared:          1125366    100%           0                 100%
    cs fixup:                  0      0%           0

  RSEQ selftests          Before               After             Reduction

  exit to user:        386281778           387373750
  signal checks:        35661203                   0                 100%
  slowpath runs:       140542396  36.38%         100  0.00%          100%
  fastpath runs:                             9509789  2.51%           N/A
  id updates:          176203599  45.62%     9087994  2.35%           95%
  cs checks:           175587856  45.46%     4728394  1.22%           98%
    cs cleared:        172359544  98.16%     1319307 27.90%           99%
    cs fixup:            3228312   1.84%     3409087 72.10%

The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations; they are relative to the actual 'cs check' invocations.
For example, in the selftest run after the change 1319307 + 3409087 =
4728394, which is exactly the 'cs checks' count.

While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.

The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads which do a lot of quick and short system
calls. Even if the fast path decision is only four instructions (including
a conditional branch), this adds up quickly and becomes measurable when the
rate of actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq-entry-common.h |    7 ++-----
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |   24 ++++++++++++++++++------
 include/linux/rseq_entry.h       |    2 +-
 init/Kconfig                     |    2 +-
 kernel/entry/common.c            |   17 ++++++++++++++---
 kernel/rseq.c                    |    8 ++++++--
 7 files changed, 43 insertions(+), 19 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);
 
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(regs);
+	rseq_handle_slowpath(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,13 +5,19 @@
 #ifdef CONFIG_RSEQ
 #include <linux/sched.h>
 
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
 {
-	/* '&' is intentional to spare one conditional branch */
-	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
-		__rseq_handle_notify_resume(regs);
+	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+		if (current->rseq_event.slowpath)
+			__rseq_handle_slowpath(regs);
+	} else {
+		/* '&' is intentional to spare one conditional branch */
+		if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
+			__rseq_handle_slowpath(regs);
+	}
 }
 
 void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -138,6 +144,12 @@ static inline void rseq_fork(struct task
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_ids.cpu_cid = ~0ULL;
 		t->rseq_event = current->rseq_event;
+		/*
+		 * If it has rseq, force it into the slow path right away
+		 * because it is guaranteed to fault.
+		 */
+		if (t->rseq_event.has_rseq)
+			t->rseq_event.slowpath = true;
 	}
 }
 
@@ -151,7 +163,7 @@ static inline void rseq_execve(struct ta
 }
 
 #else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_sched_switch_event(struct task_struct *t) { }
 static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -433,7 +433,7 @@ static rseq_inline bool rseq_update_usr(
  * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
  * slow path there will handle the fail.
  */
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
 {
 	struct task_struct *t = current;
 
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
 config DEBUG_RSEQ
 	default n
 	bool "Enable debugging of rseq() system call" if EXPERT
-	depends on RSEQ && DEBUG_KERNEL
+	depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
 	select RSEQ_DEBUG_DEFAULT_ENABLE
 	help
 	  Enable extra debugging checks for the rseq system call.
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -23,8 +23,7 @@ void __weak arch_do_signal_or_restart(st
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
 	 */
-	while (ti_work & EXIT_TO_USER_MODE_WORK) {
-
+	do {
 		local_irq_enable_exit_to_user(ti_work);
 
 		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
@@ -56,7 +55,19 @@ void __weak arch_do_signal_or_restart(st
 		tick_nohz_user_enter_prepare();
 
 		ti_work = read_thread_flags();
-	}
+
+		/*
+		 * This returns the unmodified ti_work when ti_work is not
+		 * empty. In that case it waits for the next round to avoid
+		 * multiple updates in case of rescheduling.
+		 *
+		 * When it handles rseq it returns either with empty work
+		 * on success or with TIF_NOTIFY_RESUME set on failure to
+		 * kick the handling into the slow path.
+		 */
+		ti_work = rseq_exit_to_user_mode_work(regs, ti_work, EXIT_TO_USER_MODE_WORK);
+
+	} while (ti_work & EXIT_TO_USER_MODE_WORK);
 
 	/* Return the latest work state for arch_exit_to_user_mode() */
 	return ti_work;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
 
 static void rseq_slowpath_update_usr(struct pt_regs *regs)
 {
-	/* Preserve rseq state and user_irq state for exit to user */
+	/*
+	 * Preserve rseq state and user_irq state. The generic entry code
+	 * clears user_irq on the way out, the non-generic entry
+	 * architectures do not have user_irq.
+	 */
 	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
 	struct task_struct *t = current;
 	struct rseq_ids ids;
@@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
 	}
 }
 
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
 {
 	/*
 	 * If invoked from hypervisors before entering the guest via


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 29/37] entry: Split up exit_to_user_mode_prepare()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (27 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:41   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
                   ` (8 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work which is only required in the interrupt exit
case.

Split up the function and provide wrappers for syscalls and interrupts,
which allows separating the rseq exit work in the next step.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/entry-common.h     |    2 -
 include/linux/irq-entry-common.h |   42 ++++++++++++++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 6 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	syscall_exit_to_user_mode_prepare(regs);
 }
 
 /**
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt
 unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
 
 /**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
  *
  * 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(str
  * 3) call exit_to_user_mode_loop() if any flags from
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	unsigned long ti_work;
 
@@ -224,15 +226,45 @@ static __always_inline void exit_to_user
 		ti_work = exit_to_user_mode_loop(regs, ti_work);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
+}
 
-	rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
 	lockdep_sys_exit();
 }
 
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs:	Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	__exit_to_user_mode_prepare(regs);
+	rseq_exit_to_user_mode();
+	__exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs:	Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	__exit_to_user_mode_prepare(regs);
+	rseq_exit_to_user_mode();
+	__exit_to_user_mode_validate();
+}
+
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
@@ -297,7 +329,7 @@ static __always_inline void irqentry_ent
 static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	exit_to_user_mode();
 }


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode()
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (28 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-26 15:45   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
                   ` (7 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Separate the interrupt and syscall exit handling. Syscall exit does not
require clearing the user_irq bit as it cannot be set there. On interrupt
exit it can still be set when the interrupt did not result in a scheduling
event and the return path therefore did not invoke the TIF work handling,
which would have cleared it.

The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode largely aids user
space by enabling a larger set of validation checks, which cause a segfault
when a malformed critical section is detected. In production mode the
critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.

For kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is a required test scenario for
fundamental changes anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/irq-entry-common.h |    4 ++--
 include/linux/rseq_entry.h       |   21 +++++++++++++++++----
 2 files changed, 19 insertions(+), 6 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -247,7 +247,7 @@ static __always_inline void __exit_to_us
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_syscall_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
 
@@ -261,7 +261,7 @@ static __always_inline void syscall_exit
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
 {
 	__exit_to_user_mode_prepare(regs);
-	rseq_exit_to_user_mode();
+	rseq_irqentry_exit_to_user_mode();
 	__exit_to_user_mode_validate();
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -519,19 +519,31 @@ rseq_exit_to_user_mode_work(struct pt_re
 
 #endif /* !CONFIG_GENERIC_ENTRY */
 
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
 {
 	struct rseq_event *ev = &current->rseq_event;
 
 	rseq_stat_inc(rseq_stats.exit);
 
-	if (static_branch_unlikely(&rseq_debug_enabled))
+	/* Needed to remove the store for the !lockdep case */
+	if (IS_ENABLED(CONFIG_LOCKDEP)) {
 		WARN_ON_ONCE(ev->sched_switch);
+		ev->events = 0;
+	}
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+	struct rseq_event *ev = &current->rseq_event;
+
+	rseq_stat_inc(rseq_stats.exit);
+
+	lockdep_assert_once(!ev->sched_switch);
 
 	/*
 	 * Ensure that event (especially user_irq) is cleared when the
 	 * interrupt did not result in a schedule and therefore the
-	 * rseq processing did not clear it.
+	 * rseq processing could not clear it.
 	 */
 	ev->events = 0;
 }
@@ -551,7 +563,8 @@ static inline unsigned long rseq_exit_to
 	return ti_work;
 }
 static inline void rseq_note_user_irq_entry(void) { }
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
 static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
 #endif /* !CONFIG_RSEQ */
 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 31/37] asm-generic: Provide generic TIF infrastructure
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (29 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-23 20:37   ` Arnd Bergmann
  2025-08-25 19:33   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 32/37] x86: Use generic TIF bits Thomas Gleixner
                   ` (6 subsequent siblings)
  37 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Arnd Bergmann, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Sean Christopherson,
	Wei Liu, Dexuan Cui, x86, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

Common TIF bits do not have to be defined by every architecture. They can
be defined in a generic header.

That allows adding generic TIF bits without chasing a gazillion
architecture headers, which is again an unjustified burden on anyone who
works on generic infrastructure, as it always needs a boatload of work to
keep existing architecture code working when adding new stuff.

While it is not as horrible as the ignorance of the generic entry
infrastructure, it is a welcome mechanism to make architecture people
rethink their approach of just leeching generic improvements into
architecture code and thereby making it ever harder to maintain and
improve generic code. It's about time that this changes.

Provide the infrastructure and split the TIF space in half, 16 generic and
16 architecture specific bits.

This could probably be extended by TIF_SINGLESTEP and BLOCKSTEP, but those
are only used in architecture specific code. So leave them alone for now.
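
For illustration, this is roughly what an architecture conversion looks
like; 'foo' and the extra bit are made up, the real conversions follow in
the subsequent patches:

  /* arch/foo/Kconfig */
  	select HAVE_GENERIC_TIF_BITS

  /* arch/foo/include/asm/thread_info.h */
  #define HAVE_TIF_NEED_RESCHED_LAZY
  #define HAVE_TIF_RESTORE_SIGMASK

  #include <asm-generic/thread_info_tif.h>

  /* Architecture specific bits start at 16 */
  #define TIF_FOO_FEATURE		16
  #define _TIF_FOO_FEATURE	BIT(TIF_FOO_FEATURE)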

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
---
 arch/Kconfig                          |    4 ++
 include/asm-generic/thread_info_tif.h |   48 ++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1730,6 +1730,10 @@ config ARCH_VMLINUX_NEEDS_RELOCS
 	  relocations preserved. This is used by some architectures to
 	  construct bespoke relocation tables for KASLR.
 
+# Select if architecture uses the common generic TIF bits
+config HAVE_GENERIC_TIF_BITS
+       bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
--- /dev/null
+++ b/include/asm-generic/thread_info_tif.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_THREAD_INFO_TIF_H_
+#define _ASM_GENERIC_THREAD_INFO_TIF_H_
+
+#include <vdso/bits.h>
+
+/* Bits 16-31 are reserved for architecture specific purposes */
+
+#define TIF_NOTIFY_RESUME	0	// callback before returning to user
+#define _TIF_NOTIFY_RESUME	BIT(TIF_NOTIFY_RESUME)
+
+#define TIF_SIGPENDING		1	// signal pending
+#define _TIF_SIGPENDING		BIT(TIF_SIGPENDING)
+
+#define TIF_NOTIFY_SIGNAL	2	// signal notifications exist
+#define _TIF_NOTIFY_SIGNAL	BIT(TIF_NOTIFY_SIGNAL)
+
+#define TIF_MEMDIE		3	// is terminating due to OOM killer
+#define _TIF_MEMDIE		BIT(TIF_MEMDIE)
+
+#define TIF_NEED_RESCHED	4	// rescheduling necessary
+#define _TIF_NEED_RESCHED	BIT(TIF_NEED_RESCHED)
+
+#ifdef HAVE_TIF_NEED_RESCHED_LAZY
+# define TIF_NEED_RESCHED_LAZY	5	// Lazy rescheduling needed
+# define _TIF_NEED_RESCHED_LAZY	BIT(TIF_NEED_RESCHED_LAZY)
+#endif
+
+#ifdef HAVE_TIF_POLLING_NRFLAG
+# define TIF_POLLING_NRFLAG	6	// idle is polling for TIF_NEED_RESCHED
+# define _TIF_POLLING_NRFLAG	BIT(TIF_POLLING_NRFLAG)
+#endif
+
+#define TIF_USER_RETURN_NOTIFY	7	// notify kernel of userspace return
+#define _TIF_USER_RETURN_NOTIFY	BIT(TIF_USER_RETURN_NOTIFY)
+
+#define TIF_UPROBE		8	// breakpointed or singlestepping
+#define _TIF_UPROBE		BIT(TIF_UPROBE)
+
+#define TIF_PATCH_PENDING	9	// pending live patching update
+#define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
+
+#ifdef HAVE_TIF_RESTORE_SIGMASK
+# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal()
+# define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
+#endif
+
+#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 32/37] x86: Use generic TIF bits
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (30 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-25 19:34   ` Mathieu Desnoyers
  2025-08-23 16:40 ` [patch V2 33/37] s390: " Thomas Gleixner
                   ` (5 subsequent siblings)
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, x86, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Sean Christopherson,
	Wei Liu, Dexuan Cui, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

No point in defining generic items and the upcoming RSEQ optimizations are
only available with this _and_ the generic entry infrastructure, which is
already used by x86. So no further action required here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
---
 arch/x86/Kconfig                   |    1 
 arch/x86/include/asm/thread_info.h |   74 +++++++++++++++----------------------
 2 files changed, 31 insertions(+), 44 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -239,6 +239,7 @@ config X86
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EISA			if X86_32
 	select HAVE_EXIT_THREAD
+	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST
 	select HAVE_FENTRY			if X86_64 || DYNAMIC_FTRACE
 	select HAVE_FTRACE_GRAPH_FUNC		if HAVE_FUNCTION_GRAPH_TRACER
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -80,56 +80,42 @@ struct thread_info {
 #endif
 
 /*
- * thread information flags
- * - these are process state flags that various assembly files
- *   may need to access
+ * Tell the generic TIF infrastructure which bits x86 supports
  */
-#define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
-#define TIF_SIGPENDING		2	/* signal pending */
-#define TIF_NEED_RESCHED	3	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling needed */
-#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
-#define TIF_SSBD		6	/* Speculative store bypass disable */
-#define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
-#define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
-#define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
-#define TIF_UPROBE		12	/* breakpointed or singlestepping */
-#define TIF_PATCH_PENDING	13	/* pending live patching update */
-#define TIF_NEED_FPU_LOAD	14	/* load FPU on return to userspace */
-#define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
-#define TIF_NOTSC		16	/* TSC is not accessible in userland */
-#define TIF_NOTIFY_SIGNAL	17	/* signal notifications exist */
-#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
-#define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_POLLING_NRFLAG
+#define HAVE_TIF_SINGLESTEP
+
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific TIF space starts at 16 */
+#define TIF_SSBD		16	/* Speculative store bypass disable */
+#define TIF_SPEC_IB		17	/* Indirect branch speculation mitigation */
+#define TIF_SPEC_L1D_FLUSH	18	/* Flush L1D on mm switches (processes) */
+#define TIF_NEED_FPU_LOAD	19	/* load FPU on return to userspace */
+#define TIF_NOCPUID		20	/* CPUID is not accessible in userland */
+#define TIF_NOTSC		21	/* TSC is not accessible in userland */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
 #define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
 #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
-#define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
+#define TIF_SINGLESTEP		25	/* reenable singlestep on user return*/
+#define TIF_BLOCKSTEP		26	/* set when we want DEBUGCTLMSR_BTF */
 #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
-#define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
+#define TIF_ADDR32		28	/* 32-bit address space on 64 bits */
 
-#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
-#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
-#define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
-#define _TIF_SSBD		(1 << TIF_SSBD)
-#define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
-#define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
-#define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
-#define _TIF_UPROBE		(1 << TIF_UPROBE)
-#define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
-#define _TIF_NEED_FPU_LOAD	(1 << TIF_NEED_FPU_LOAD)
-#define _TIF_NOCPUID		(1 << TIF_NOCPUID)
-#define _TIF_NOTSC		(1 << TIF_NOTSC)
-#define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
-#define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
-#define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
-#define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
-#define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
-#define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
-#define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
-#define _TIF_ADDR32		(1 << TIF_ADDR32)
+#define _TIF_SSBD		BIT(TIF_SSBD)
+#define _TIF_SPEC_IB		BIT(TIF_SPEC_IB)
+#define _TIF_SPEC_L1D_FLUSH	BIT(TIF_SPEC_L1D_FLUSH)
+#define _TIF_NEED_FPU_LOAD	BIT(TIF_NEED_FPU_LOAD)
+#define _TIF_NOCPUID		BIT(TIF_NOCPUID)
+#define _TIF_NOTSC		BIT(TIF_NOTSC)
+#define _TIF_IO_BITMAP		BIT(TIF_IO_BITMAP)
+#define _TIF_SPEC_FORCE_UPDATE	BIT(TIF_SPEC_FORCE_UPDATE)
+#define _TIF_FORCED_TF		BIT(TIF_FORCED_TF)
+#define _TIF_BLOCKSTEP		BIT(TIF_BLOCKSTEP)
+#define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
+#define _TIF_LAZY_MMU_UPDATES	BIT(TIF_LAZY_MMU_UPDATES)
+#define _TIF_ADDR32		BIT(TIF_ADDR32)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 33/37] s390: Use generic TIF bits
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (31 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 32/37] x86: Use generic TIF bits Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-23 16:40 ` [patch V2 34/37] loongarch: " Thomas Gleixner
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Heiko Carstens, Christian Borntraeger, Sven Schnelle,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Huacai Chen, Paul Walmsley, Palmer Dabbelt

No point in defining generic items and the upcoming RSEQ optimizations are
only available with this _and_ the generic entry infrastructure, which is
already used by s390. So no further action required here.

This leaves a comment about the AUDIT/TRACE/SECCOMP bits which are handled
by SYSCALL_WORK in the generic code, so they seem redundant, but that's a
problem for the s390 wizards to think about.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
---
 arch/s390/Kconfig                   |    1 
 arch/s390/include/asm/thread_info.h |   44 ++++++++++++++----------------------
 2 files changed, 19 insertions(+), 26 deletions(-)

--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -199,6 +199,7 @@ config S390
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EBPF_JIT if HAVE_MARCH_Z196_FEATURES
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
+	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST
 	select HAVE_FENTRY
 	select HAVE_FTRACE_GRAPH_FUNC
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -56,43 +56,35 @@ void arch_setup_new_exec(void);
 
 /*
  * thread information flags bit numbers
+ *
+ * Tell the generic TIF infrastructure which special bits s390 supports
  */
-#define TIF_NOTIFY_RESUME	0	/* callback before returning to user */
-#define TIF_SIGPENDING		1	/* signal pending */
-#define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	3	/* lazy rescheduling needed */
-#define TIF_UPROBE		4	/* breakpointed or single-stepping */
-#define TIF_PATCH_PENDING	5	/* pending live patching update */
-#define TIF_ASCE_PRIMARY	6	/* primary asce is kernel asce */
-#define TIF_NOTIFY_SIGNAL	7	/* signal notifications exist */
-#define TIF_GUARDED_STORAGE	8	/* load guarded storage control block */
-#define TIF_ISOLATE_BP_GUEST	9	/* Run KVM guests with isolated BP */
-#define TIF_PER_TRAP		10	/* Need to handle PER trap on exit to usermode */
-#define TIF_31BIT		16	/* 32bit process */
-#define TIF_MEMDIE		17	/* is terminating due to OOM killer */
-#define TIF_RESTORE_SIGMASK	18	/* restore signal mask in do_signal() */
-#define TIF_SINGLE_STEP		19	/* This task is single stepped */
-#define TIF_BLOCK_STEP		20	/* This task is block stepped */
-#define TIF_UPROBE_SINGLESTEP	21	/* This task is uprobe single stepped */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
+
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific bits */
+#define TIF_ASCE_PRIMARY	16	/* primary asce is kernel asce */
+#define TIF_GUARDED_STORAGE	17	/* load guarded storage control block */
+#define TIF_ISOLATE_BP_GUEST	18	/* Run KVM guests with isolated BP */
+#define TIF_PER_TRAP		19	/* Need to handle PER trap on exit to usermode */
+#define TIF_31BIT		20	/* 32bit process */
+#define TIF_SINGLE_STEP		21	/* This task is single stepped */
+#define TIF_BLOCK_STEP		22	/* This task is block stepped */
+#define TIF_UPROBE_SINGLESTEP	23	/* This task is uprobe single stepped */
+
+/* These could move over to SYSCALL_WORK bits, no? */
 #define TIF_SYSCALL_TRACE	24	/* syscall trace active */
 #define TIF_SYSCALL_AUDIT	25	/* syscall auditing active */
 #define TIF_SECCOMP		26	/* secure computing */
 #define TIF_SYSCALL_TRACEPOINT	27	/* syscall tracepoint instrumentation */
 
-#define _TIF_NOTIFY_RESUME	BIT(TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		BIT(TIF_SIGPENDING)
-#define _TIF_NEED_RESCHED	BIT(TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	BIT(TIF_NEED_RESCHED_LAZY)
-#define _TIF_UPROBE		BIT(TIF_UPROBE)
-#define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
 #define _TIF_ASCE_PRIMARY	BIT(TIF_ASCE_PRIMARY)
-#define _TIF_NOTIFY_SIGNAL	BIT(TIF_NOTIFY_SIGNAL)
 #define _TIF_GUARDED_STORAGE	BIT(TIF_GUARDED_STORAGE)
 #define _TIF_ISOLATE_BP_GUEST	BIT(TIF_ISOLATE_BP_GUEST)
 #define _TIF_PER_TRAP		BIT(TIF_PER_TRAP)
 #define _TIF_31BIT		BIT(TIF_31BIT)
-#define _TIF_MEMDIE		BIT(TIF_MEMDIE)
-#define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #define _TIF_SINGLE_STEP	BIT(TIF_SINGLE_STEP)
 #define _TIF_BLOCK_STEP		BIT(TIF_BLOCK_STEP)
 #define _TIF_UPROBE_SINGLESTEP	BIT(TIF_UPROBE_SINGLESTEP)


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 34/37] loongarch: Use generic TIF bits
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (32 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 33/37] s390: " Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-23 16:40 ` [patch V2 35/37] riscv: " Thomas Gleixner
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Huacai Chen, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Sean Christopherson,
	Wei Liu, Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Paul Walmsley,
	Palmer Dabbelt

No point in defining generic items and the upcoming RSEQ optimizations are
only available with this _and_ the generic entry infrastructure, which is
already used by loongarch. So no further action required here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
---
 arch/loongarch/Kconfig                   |    1 
 arch/loongarch/include/asm/thread_info.h |   76 +++++++++++++------------------
 2 files changed, 35 insertions(+), 42 deletions(-)

--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -140,6 +140,7 @@ config LOONGARCH
 	select HAVE_EBPF_JIT
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS if !ARCH_STRICT_ALIGN
 	select HAVE_EXIT_THREAD
+	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST
 	select HAVE_FTRACE_GRAPH_FUNC
 	select HAVE_FUNCTION_ARG_ACCESS_API
--- a/arch/loongarch/include/asm/thread_info.h
+++ b/arch/loongarch/include/asm/thread_info.h
@@ -65,50 +65,42 @@ register unsigned long current_stack_poi
  *   access
  * - pending work-to-be-done flags are in LSW
  * - other flags in MSW
+ *
+ * Tell the generic TIF infrastructure which special bits loongarch supports
  */
-#define TIF_NEED_RESCHED	0	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	1	/* lazy rescheduling necessary */
-#define TIF_SIGPENDING		2	/* signal pending */
-#define TIF_NOTIFY_RESUME	3	/* callback before returning to user */
-#define TIF_NOTIFY_SIGNAL	4	/* signal notifications exist */
-#define TIF_RESTORE_SIGMASK	5	/* restore signal mask in do_signal() */
-#define TIF_NOHZ		6	/* in adaptive nohz mode */
-#define TIF_UPROBE		7	/* breakpointed or singlestepping */
-#define TIF_USEDFPU		8	/* FPU was used by this task this quantum (SMP) */
-#define TIF_USEDSIMD		9	/* SIMD has been used this quantum */
-#define TIF_MEMDIE		10	/* is terminating due to OOM killer */
-#define TIF_FIXADE		11	/* Fix address errors in software */
-#define TIF_LOGADE		12	/* Log address errors to syslog */
-#define TIF_32BIT_REGS		13	/* 32-bit general purpose registers */
-#define TIF_32BIT_ADDR		14	/* 32-bit address space */
-#define TIF_LOAD_WATCH		15	/* If set, load watch registers */
-#define TIF_SINGLESTEP		16	/* Single Step */
-#define TIF_LSX_CTX_LIVE	17	/* LSX context must be preserved */
-#define TIF_LASX_CTX_LIVE	18	/* LASX context must be preserved */
-#define TIF_USEDLBT		19	/* LBT was used by this task this quantum (SMP) */
-#define TIF_LBT_CTX_LIVE	20	/* LBT context must be preserved */
-#define TIF_PATCH_PENDING	21	/* pending live patching update */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
 
-#define _TIF_NEED_RESCHED	(1<<TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	(1<<TIF_NEED_RESCHED_LAZY)
-#define _TIF_SIGPENDING		(1<<TIF_SIGPENDING)
-#define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
-#define _TIF_NOTIFY_SIGNAL	(1<<TIF_NOTIFY_SIGNAL)
-#define _TIF_NOHZ		(1<<TIF_NOHZ)
-#define _TIF_UPROBE		(1<<TIF_UPROBE)
-#define _TIF_USEDFPU		(1<<TIF_USEDFPU)
-#define _TIF_USEDSIMD		(1<<TIF_USEDSIMD)
-#define _TIF_FIXADE		(1<<TIF_FIXADE)
-#define _TIF_LOGADE		(1<<TIF_LOGADE)
-#define _TIF_32BIT_REGS		(1<<TIF_32BIT_REGS)
-#define _TIF_32BIT_ADDR		(1<<TIF_32BIT_ADDR)
-#define _TIF_LOAD_WATCH		(1<<TIF_LOAD_WATCH)
-#define _TIF_SINGLESTEP		(1<<TIF_SINGLESTEP)
-#define _TIF_LSX_CTX_LIVE	(1<<TIF_LSX_CTX_LIVE)
-#define _TIF_LASX_CTX_LIVE	(1<<TIF_LASX_CTX_LIVE)
-#define _TIF_USEDLBT		(1<<TIF_USEDLBT)
-#define _TIF_LBT_CTX_LIVE	(1<<TIF_LBT_CTX_LIVE)
-#define _TIF_PATCH_PENDING	(1<<TIF_PATCH_PENDING)
+#include <asm-generic/thread_info_tif.h>
+
+/* Architecture specific bits */
+#define TIF_NOHZ		16	/* in adaptive nohz mode */
+#define TIF_USEDFPU		17	/* FPU was used by this task this quantum (SMP) */
+#define TIF_USEDSIMD		18	/* SIMD has been used this quantum */
+#define TIF_FIXADE		19	/* Fix address errors in software */
+#define TIF_LOGADE		20	/* Log address errors to syslog */
+#define TIF_32BIT_REGS		21	/* 32-bit general purpose registers */
+#define TIF_32BIT_ADDR		22	/* 32-bit address space */
+#define TIF_LOAD_WATCH		23	/* If set, load watch registers */
+#define TIF_SINGLESTEP		24	/* Single Step */
+#define TIF_LSX_CTX_LIVE	25	/* LSX context must be preserved */
+#define TIF_LASX_CTX_LIVE	26	/* LASX context must be preserved */
+#define TIF_USEDLBT		27	/* LBT was used by this task this quantum (SMP) */
+#define TIF_LBT_CTX_LIVE	28	/* LBT context must be preserved */
+
+#define _TIF_NOHZ		BIT(TIF_NOHZ)
+#define _TIF_USEDFPU		BIT(TIF_USEDFPU)
+#define _TIF_USEDSIMD		BIT(TIF_USEDSIMD)
+#define _TIF_FIXADE		BIT(TIF_FIXADE)
+#define _TIF_LOGADE		BIT(TIF_LOGADE)
+#define _TIF_32BIT_REGS		BIT(TIF_32BIT_REGS)
+#define _TIF_32BIT_ADDR		BIT(TIF_32BIT_ADDR)
+#define _TIF_LOAD_WATCH		BIT(TIF_LOAD_WATCH)
+#define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
+#define _TIF_LSX_CTX_LIVE	BIT(TIF_LSX_CTX_LIVE)
+#define _TIF_LASX_CTX_LIVE	BIT(TIF_LASX_CTX_LIVE)
+#define _TIF_USEDLBT		BIT(TIF_USEDLBT)
+#define _TIF_LBT_CTX_LIVE	BIT(TIF_LBT_CTX_LIVE)
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_THREAD_INFO_H */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 35/37] riscv: Use generic TIF bits
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (33 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 34/37] loongarch: " Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Paul Walmsley, Palmer Dabbelt, Mathieu Desnoyers,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Paolo Bonzini,
	Sean Christopherson, Wei Liu, Dexuan Cui, x86, Arnd Bergmann,
	Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen

No point in defining generic items and the upcoming RSEQ optimizations are
only available with this _and_ the generic entry infrastructure, which is
already used by RISCV. So no further action required here.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
---
 arch/riscv/Kconfig                   |    1 +
 arch/riscv/include/asm/thread_info.h |   29 ++++++++++++-----------------
 2 files changed, 13 insertions(+), 17 deletions(-)

--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -161,6 +161,7 @@ config RISCV
 	select HAVE_FUNCTION_GRAPH_FREGS
 	select HAVE_FUNCTION_TRACER if !XIP_KERNEL && HAVE_DYNAMIC_FTRACE
 	select HAVE_EBPF_JIT if MMU
+	select HAVE_GENERIC_TIF_BITS
 	select HAVE_GUP_FAST if MMU
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_FUNCTION_ERROR_INJECTION
--- a/arch/riscv/include/asm/thread_info.h
+++ b/arch/riscv/include/asm/thread_info.h
@@ -107,23 +107,18 @@ int arch_dup_task_struct(struct task_str
  * - pending work-to-be-done flags are in lowest half-word
  * - other flags in upper half-word(s)
  */
-#define TIF_NEED_RESCHED	0	/* rescheduling necessary */
-#define TIF_NEED_RESCHED_LAZY	1       /* Lazy rescheduling needed */
-#define TIF_NOTIFY_RESUME	2	/* callback before returning to user */
-#define TIF_SIGPENDING		3	/* signal pending */
-#define TIF_RESTORE_SIGMASK	4	/* restore signal mask in do_signal() */
-#define TIF_MEMDIE		5	/* is terminating due to OOM killer */
-#define TIF_NOTIFY_SIGNAL	9	/* signal notifications exist */
-#define TIF_UPROBE		10	/* uprobe breakpoint or singlestep */
-#define TIF_32BIT		11	/* compat-mode 32bit process */
-#define TIF_RISCV_V_DEFER_RESTORE	12 /* restore Vector before returing to user */
 
-#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
-#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
-#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
-#define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
-#define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
-#define _TIF_UPROBE		(1 << TIF_UPROBE)
-#define _TIF_RISCV_V_DEFER_RESTORE	(1 << TIF_RISCV_V_DEFER_RESTORE)
+/*
+ * Tell the generic TIF infrastructure which bits riscv supports
+ */
+#define HAVE_TIF_NEED_RESCHED_LAZY
+#define HAVE_TIF_RESTORE_SIGMASK
+
+#include <asm-generic/thread_info_tif.h>
+
+#define TIF_32BIT			16	/* compat-mode 32bit process */
+#define TIF_RISCV_V_DEFER_RESTORE	17	/* restore Vector before returning to user */
+
+#define _TIF_RISCV_V_DEFER_RESTORE	BIT(TIF_RISCV_V_DEFER_RESTORE)
 
 #endif /* _ASM_RISCV_THREAD_INFO_H */


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (34 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 35/37] riscv: " Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-25 19:39   ` Mathieu Desnoyers
  2025-08-25 20:02   ` Sean Christopherson
  2025-08-23 16:40 ` [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner
  2025-08-25 15:10 ` [patch V2 00/37] rseq: Optimize exit to user space Mathieu Desnoyers
  37 siblings, 2 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.

Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize that.

That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the re-evaluation
required at the end of vcpu_run() a NOOP on architectures which utilize the
generic TIF space and have a separate TIF_RSEQ.

The hypervisor TIF handling does not include the separate TIF_RSEQ as there
is no point in doing so. The guest neither knows nor cares about the VMM
host application's RSEQ state. That state is only relevant when the ioctl()
returns to user space.

The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.

This allows further optimizations for blocking syscall heavy workloads in a
subsequent step.
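
As a reading aid, the gist of the hunks below condensed into one place. This
is a restatement for readability, not additional code on top of the series:

#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
/* Dedicated bit: the scheduler raises TIF_RSEQ and nothing else uses it */
# define CHECK_TIF_RSEQ		_TIF_RSEQ
#else
/* Fallback: TIF_RSEQ aliases the TIF_NOTIFY_RESUME multiplexer */
# define TIF_RSEQ		TIF_NOTIFY_RESUME
# define _TIF_RSEQ		_TIF_NOTIFY_RESUME
# define CHECK_TIF_RSEQ		0UL
#endif

static inline void rseq_raise_notify_resume(struct task_struct *t)
{
	/* With a dedicated bit this no longer disturbs TIF_NOTIFY_RESUME */
	set_tsk_thread_flag(t, TIF_RSEQ);
}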

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/asm-generic/thread_info_tif.h |    3 +++
 include/linux/irq-entry-common.h      |    2 +-
 include/linux/rseq.h                  |   13 ++++++++++---
 include/linux/rseq_entry.h            |   23 +++++++++++++++++++----
 include/linux/thread_info.h           |    5 +++++
 5 files changed, 38 insertions(+), 8 deletions(-)

--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
 # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
 #endif
 
+#define TIF_RSEQ		11	// Run RSEQ fast path
+#define _TIF_RSEQ		BIT(TIF_RSEQ)
+
 #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
 	 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |			\
-	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |			\
+	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -40,7 +40,7 @@ static inline void rseq_signal_deliver(s
 
 static inline void rseq_raise_notify_resume(struct task_struct *t)
 {
-	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	set_tsk_thread_flag(t, TIF_RSEQ);
 }
 
 /* Invoked from context switch to force evaluation on exit to user */
@@ -122,7 +122,7 @@ static inline void rseq_force_update(voi
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (current->rseq_event.sched_switch)
+	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
 		rseq_raise_notify_resume(current);
 }
 
@@ -147,9 +147,16 @@ static inline void rseq_fork(struct task
 		/*
 		 * If it has rseq, force it into the slow path right away
 		 * because it is guaranteed to fault.
+		 *
+		 * Setting TIF_NOTIFY_RESUME is redundant but harmless for
+		 * architectures which do not have a separate TIF_RSEQ, but
+		 * for those who do it's required to enforce the slow path
+		 * as the scheduler sets only TIF_RSEQ.
 		 */
-		if (t->rseq_event.has_rseq)
+		if (t->rseq_event.has_rseq) {
 			t->rseq_event.slowpath = true;
+			set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+		}
 	}
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -502,18 +502,33 @@ static __always_inline bool __rseq_exit_
 	return true;
 }
 
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+# define CHECK_TIF_RSEQ		_TIF_RSEQ
+static __always_inline void clear_tif_rseq(void)
+{
+	clear_thread_flag(TIF_RSEQ);
+}
+#else
+# define CHECK_TIF_RSEQ		0UL
+static inline void clear_tif_rseq(void) { }
+#endif
+
 static __always_inline unsigned long
 rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
 {
 	/*
 	 * Check if all work bits have been cleared before handling rseq.
+	 *
+	 * In case of a separate TIF_RSEQ this checks for all other bits to
+	 * be cleared and TIF_RSEQ to be set.
 	 */
-	if ((ti_work & mask) != 0)
-		return ti_work;
-
-	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
+	if ((ti_work & mask) != CHECK_TIF_RSEQ)
 		return ti_work;
 
+	if (likely(!__rseq_exit_to_user_mode_restart(regs))) {
+		clear_tif_rseq();
+		return ti_work & ~CHECK_TIF_RSEQ;
+	}
 	return ti_work | _TIF_NOTIFY_RESUME;
 }
 
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
 #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
 #endif
 
+#ifndef TIF_RSEQ
+# define TIF_RSEQ	TIF_NOTIFY_RESUME
+# define _TIF_RSEQ	_TIF_NOTIFY_RESUME
+#endif
+
 #ifdef __KERNEL__
 
 #ifndef arch_set_restart_data


^ permalink raw reply	[flat|nested] 91+ messages in thread

* [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (35 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
@ 2025-08-23 16:40 ` Thomas Gleixner
  2025-08-25 19:43   ` Mathieu Desnoyers
  2025-08-25 15:10 ` [patch V2 00/37] rseq: Optimize exit to user space Mathieu Desnoyers
  37 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-08-23 16:40 UTC (permalink / raw)
  To: LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

Further analysis of the exit path with the separate TIF_RSEQ showed that,
depending on the workload, a significant number of invocations of
resume_user_mode_work() end up with no bit other than TIF_RSEQ set.

On architectures with a separate TIF_RSEQ this can be distinguished and
checked right at the beginning of the function before entering the loop.

The quick check is lightweight, so it does not impose a massive penalty on
non-RSEQ use cases. It just checks whether the work is empty except for
TIF_RSEQ and jumps right into the handling fast path.

This is truly the only TIF bit there which can be optimized that way
because the handling runs only when all the other work has been done. The
optimization spares a full round trip through the other conditionals and an
interrupt enable/disable pair. The generated code looks reasonable enough
to justify this and the resulting numbers do so as well.

The main beneficiaries are blocking-syscall-heavy workloads, where the
tasks often end up being scheduled on a different CPU or get a different MM
CID, but have no other work to handle on return.

A futex benchmark showed up to 90% shortcut utilization and a measurable
performance improvement of ~1%. Non-scheduling workloads neither see an
improvement nor degrade. A full kernel build shows about 15% shortcuts,
but no measurable side effects in either direction.
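
In pseudo code the shortcut boils down to the check below. This is a hedged
sketch mirroring the hunk further down; the loop internals are elided:

static unsigned long exit_work_sketch(struct pt_regs *regs, unsigned long ti_work)
{
	/* Nothing pending except TIF_RSEQ? Skip the whole work loop. */
	if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK)) {
		rseq_stat_inc(rseq_stats.quicktif);
		goto do_rseq;
	}

	do {
		/* ... signals, rescheduling, notify resume, etc. ... */
		ti_work = read_thread_flags();
do_rseq:
		/* Runs with interrupts disabled right before the exit */
		ti_work = rseq_exit_to_user_mode_work(regs, ti_work,
						      EXIT_TO_USER_MODE_WORK);
	} while (ti_work & EXIT_TO_USER_MODE_WORK);

	return ti_work;
}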

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/rseq_entry.h |   14 ++++++++++++++
 kernel/entry/common.c      |   13 +++++++++++--
 kernel/rseq.c              |    2 ++
 3 files changed, 27 insertions(+), 2 deletions(-)

--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -11,6 +11,7 @@ struct rseq_stats {
 	unsigned long	signal;
 	unsigned long	slowpath;
 	unsigned long	fastpath;
+	unsigned long	quicktif;
 	unsigned long	ids;
 	unsigned long	cs;
 	unsigned long	clear;
@@ -532,6 +533,14 @@ rseq_exit_to_user_mode_work(struct pt_re
 	return ti_work | _TIF_NOTIFY_RESUME;
 }
 
+static __always_inline bool
+rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	if (IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS))
+		return (ti_work & mask) == CHECK_TIF_RSEQ;
+	return false;
+}
+
 #endif /* !CONFIG_GENERIC_ENTRY */
 
 static __always_inline void rseq_syscall_exit_to_user_mode(void)
@@ -577,6 +586,11 @@ static inline unsigned long rseq_exit_to
 {
 	return ti_work;
 }
+
+static inline bool rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
+{
+	return false;
+}
 static inline void rseq_note_user_irq_entry(void) { }
 static inline void rseq_syscall_exit_to_user_mode(void) { }
 static inline void rseq_irqentry_exit_to_user_mode(void) { }
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -22,7 +22,14 @@ void __weak arch_do_signal_or_restart(st
 	/*
 	 * Before returning to user space ensure that all pending work
 	 * items have been completed.
+	 *
+	 * Optimize for TIF_RSEQ being the only bit set.
 	 */
+	if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK)) {
+		rseq_stat_inc(rseq_stats.quicktif);
+		goto do_rseq;
+	}
+
 	do {
 		local_irq_enable_exit_to_user(ti_work);
 
@@ -56,10 +63,12 @@ void __weak arch_do_signal_or_restart(st
 
 		ti_work = read_thread_flags();
 
+	do_rseq:
 		/*
 		 * This returns the unmodified ti_work, when ti_work is not
-		 * empty. In that case it waits for the next round to avoid
-		 * multiple updates in case of rescheduling.
+		 * empty (except for TIF_RSEQ). In that case it waits for
+		 * the next round to avoid multiple updates in case of
+		 * rescheduling.
 		 *
 		 * When it handles rseq it returns either with empty work
 		 * on success or with TIF_NOTIFY_RESUME set on failure to
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -134,6 +134,7 @@ static int rseq_stats_show(struct seq_fi
 		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
 		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
 		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
+		stats.quicktif	+= data_race(per_cpu(rseq_stats.quicktif, cpu));
 		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
 		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
 		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
@@ -144,6 +145,7 @@ static int rseq_stats_show(struct seq_fi
 	seq_printf(m, "signal: %16lu\n", stats.signal);
 	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
 	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
+	seq_printf(m, "quickt: %16lu\n", stats.quicktif);
 	seq_printf(m, "ids:    %16lu\n", stats.ids);
 	seq_printf(m, "cs:     %16lu\n", stats.cs);
 	seq_printf(m, "clear:  %16lu\n", stats.clear);


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 31/37] asm-generic: Provide generic TIF infrastructure
  2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
@ 2025-08-23 20:37   ` Arnd Bergmann
  2025-08-25 19:33   ` Mathieu Desnoyers
  1 sibling, 0 replies; 91+ messages in thread
From: Arnd Bergmann @ 2025-08-23 20:37 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Sat, Aug 23, 2025, at 18:40, Thomas Gleixner wrote:
> Common TIF bits do not have to be defined by every architecture. They can
> be defined in a generic header.
>
> That allows adding generic TIF bits without chasing a gazillion of
> architecture headers, which is again a unjustified burden on anyone who
> works on generic infrastructure as it always needs a boat load of work to
> keep existing architecture code working when adding new stuff.
>
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Arnd Bergmann <arnd@arndb.de>

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 00/37] rseq: Optimize exit to user space
  2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
                   ` (36 preceding siblings ...)
  2025-08-23 16:40 ` [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner
@ 2025-08-25 15:10 ` Mathieu Desnoyers
  37 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:10 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt,
	linux-arch, Thomas Bogendoerfer, Michael Ellerman, Jonas Bonn,
	Florian Weimer

On 2025-08-23 12:39, Thomas Gleixner wrote:
> OA
> This is a follow up on the initial series, which did a very basic attempt
> to sanitize the RSEQ handling in the kernel:
> 
>     https://lore.kernel.org/all/20250813155941.014821755@linutronix.de
> 
> Further analysis turned up more than these initial problems:

Thanks Thomas for looking into this. Sorry for the delayed reply,
I was on vacation.

> 
>    1) task::rseq_event_mask is a pointless bit-field despite the fact that
>       the ABI flags it was meant to support have been deprecated and
>       functionally disabled three years ago.

Yes, this should be converted to a simple boolean now.

> 
>    2) task::rseq_event_mask is accumulating bits unless there is a critical
>       section discovered in the user space rseq memory. This results in
>       pointless invocations of the rseq user space exit handler even if
>       there had nothing changed. As a matter of correctness these bits have
>       to be clear when exiting to user space and therefore pristine when
>       coming back into the kernel. Aside of correctness, this also avoids
>       pointless evaluation of the user space memory, which is a performance
>       benefit.

Thanks for catching this, that's indeed not the intended behavior.

> 
>    3) The evaluation of critical sections does not differentiate between
>       syscall and interrupt/exception exits. The current implementation
>       silently tolerates and fixes up critical sections which invoked a
>       syscall unless CONFIG_DEBUG_RSEQ is enabled.
> 
>       That's just wrong. If user space does that on a production kernel it
>       can keep the pieces. The kernel is not there to proliferate mindless
>       user space programming and letting everyone pay the performance
>       penalty.

Agreed. There is no point in supporting a userspace behavior on
production kernels that is prevented on debug kernels, especially
if this adds overhead to production kernels.

> 
> Additional findings:
> 
>    4) The decision to raise the work for exit is more than suboptimal.

Terminology-wise, just making sure we are on the same page: here
you are talking about "exit to usermode", and *not* the exit(2) system
call. I'm clarifying because I know mm people care about fork/clone/exit
syscall overhead as well.

>       Basically every context switch does so if the task has rseq, which is
>       nowadays likely as glibc makes use of it if available.

Correct.

> 
>       The consequence is that a lot of exits have to process RSEQ just for
>       nothing. The only reasons to do so are:
> 
>         the task was interrupted in user space and schedules

interrupted or takes a trap/exception (I will assume you consider traps and
exceptions as interrupt classes within this discussion).

> 
>       or
> 
>         the CPU or MM CID changes in schedule() independent of the entry
>         mode

or the numa node id, which is typically tied to the CPU number, except
on powerpc AFAIR where numa node id to cpu mapping can be reconfigured
dynamically.
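
For reference, these are the user space visible fields in question, as laid
out in the upstream uapi header (include/uapi/linux/rseq.h); the comments
are abbreviated here:

struct rseq {
	__u32 cpu_id_start;	/* always a valid CPU number, never read back by the kernel */
	__u32 cpu_id;		/* current CPU number */
	__u64 rseq_cs;		/* pointer to the critical section descriptor */
	__u32 flags;		/* deprecated */
	__u32 node_id;		/* NUMA node id */
	__u32 mm_cid;		/* memory map concurrency id */
	char end[];
} __attribute__((aligned(4 * sizeof(__u64))));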

> 
>       That reduces the invocation space obviously significantly.

Yes.

> 
>    5) Signal handling does the RSEQ update unconditionally.
> 
>       That's wrong as the only reason to do so is when the task was
>       interrupted in user space independent of a schedule event.
> 
>       The only important task in that case is to handle the critical section
>       because after switching to the signal frame the original return IP is
>       not longer available.
> 
>       The CPU/MM CID values do not need to be updated at that point as they
>       can change again before the signal delivery goes out to user space.
> 
>       Again, if the task was in a critical section and issued a syscall then
>       it can keep the pieces as that's a violation of the ABI contract.

The key things here are that the cpu/mm cid/numa node id fields are
updated before returning to the userspace signal handler, and that a rseq
critical section within a signal handler works. From your description here
we can indeed do less work on signal delivery and still meet those
requirements. That's good.

> 
>    6) CPU and MM CID are updated unconditionally
> 
>       That's again a pointless exercise when they didn't change. Then the
>       only action required is to check the critical section if and only if
>       the entry came via an interrupt.
> 
>       That can obviously be avoided by caching the values written to user
>       space and avoiding that path if they haven't changed

This is a good performance improvement, I agree. Note that this
will likely break Google's tcmalloc hack of re-using the cpu_id_start
field as a way to get notified about preemption. But they were warned
not to do that, and it breaks the documented userspace ABI contract,
so they will need to adapt. This situation is documented here:

commit 7d5265ffcd8b
     rseq: Validate read-only fields under DEBUG_RSEQ config

> 
>    7) The TIF_NOTIFY_RESUME mechanism is a horrorshow
> 
>       TIF_NOTIFY_RESUME is a multiplexing TIF bit and needs to invoke the
>       world and some more. Depending on workloads this can be set by
>       task_work, security, block and memory management. All unrelated to
>       RSEQ and quite some of them are likely to cause a reschedule.
>       But most of them are low frequency.
> 
>       So doing this work in the loop unconditionally is just waste. The
>       correct point to do this is at the end of that loop once all other bits
>       have been processed, because that's the point where the task is
>       actually going out to user space.

Note that the rseq work can trigger a page fault and a reschedule, which
makes me wonder whether moving this work out of the work loop is OK.
I'll need to have a closer look at the relevant patches to understand
this better.

I suspect that the underlying problem here is not that the notify resume
adds too much overhead, but rather that we're setting the
TIF_NOTIFY_RESUME bit way more often than is good for us.

Initially I focused on minimizing the scheduler overhead, at the
expense of doing more work than strictly needed on exit to usermode.
But of course reality is not as simple, and we need to find the right
balance.

If your goal is to have fewer calls to the resume notifier triggered
by rseq, we could perhaps change the way rseq_set_notify_resume()
is implemented to make it conditional on either:

- migration,
- preemption AND
     current->rseq->rseq_cs userspace value is set OR
     would require a page fault to be read.
- preemption AND mm_cid changes.

I'm not sure what to do about the powerpc numa node id reconfiguration
fringe use-case though.
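
A rough illustration of those conditions (the helper, its arguments and the
pagefault-disabled peek at the user value are assumptions about how this
could be wired up, not code from the series):

static void rseq_maybe_raise(struct task_struct *t, bool migrated,
			     bool preempted, bool mm_cid_changed)
{
	bool raise = false;
	u64 csaddr;

	if (!t->rseq)
		return;

	if (migrated || (preempted && mm_cid_changed)) {
		raise = true;
	} else if (preempted) {
		/*
		 * Peek at the user space rseq_cs pointer without taking a
		 * fault. If the read would fault, err on the side of
		 * raising the work.
		 */
		pagefault_disable();
		if (get_user(csaddr, &t->rseq->rseq_cs) || csaddr)
			raise = true;
		pagefault_enable();
	}

	if (raise)
		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}

Whether such a peek from the scheduler path is acceptable is of course part
of the trade-off to be discussed.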

> 
>    8) #7 caused another subtle work for nothing issue
> 
>       IO/URING and hypervisors invoke resume_user_mode_work() with a NULL
>       pointer for pt_regs, which causes the RSEQ code to ignore the critical
>       section check, but updating the CPU ID/ MM CID values unconditionally.
> 
>       For IO/URING this invocation is irrelevant because the IO workers can
>       never go out to user space and therefore do not have RSEQ memory in
>       the first place. So it's a non problem in the existing code as
>       task::rseq is NULL in that case.

OK

> 
>       Hypervisors are a different story. They need to drain task_work and
>       other pending items, which are multiplexed by TIF_NOTIFY_RESUME,
>       before entering guest mode.
> 
>       The invocation of resume_user_mode_work() clears TIF_NOTIFY_RESUME,
>       which means if rseq would ignore that case then it could miss a CPU
>       or MM CID update on the way back to user space.
> 
>       The handling of that is just a horrible and mindless hack as the event
>       might be re-raised between the time the ioctl() enters guest mode and
>       the actual exit to user space.
> 
>       So the obvious thing is to ignore the regs=NULL call and let the
>       offending hypervisor calls check when returning from the ioctl()
>       whether the event bit is set and re-raise the notification again.

Is this a useless work issue or a correctness issue ?

Also, AFAIU, your proposed approach will ensure that the rseq fields
are up to date when returning to userspace from the ioctl in the host
process, but there are no guarantees about having up to date fields
when running the guest VM. I'm fine with this, but I think it needs to
be clearly spelled out.

> 
>    9) Code efficiency
> 
>       RSEQ aims to improve performance for user space, but it completely
>       ignores the fact, that this needs to be implemented in a way which
>       does not impact the performance of the kernel significantly.
> 
>       So far this did not pop up as just a few people used it, but that has
>       changed because glibc started to use it widely.

The history of incremental rseq upstreaming into the kernel and then glibc can
help understand how this situation came to be:

Linux commit d7822b1e24f2 ("rseq: Introduce restartable sequences system call")

(2018)

     This includes microbenchmarks of specific use-cases, which clearly
     stress neither the scheduler nor exit to usermode.

     It includes jemalloc memory allocator (Facebook) latency benchmarks,
     which depend heavily on the userspace fast-paths.

     It includes hackbench results, covering the scheduling and return
     to usermode overhead. Given that there was no libc integration
     back then, the hackbench results do not include any rseq-triggered
     resume notifier work.

     Based on the rseq use at that point in time, the overhead of
     rseq was acceptable for it to be merged.

glibc commit 95e114a0919d ("nptl: Add rseq registration")

(2021)

     Florian Weimer added this feature to glibc. At this point
     the overhead you observe now should have become visible.
     This integration and performance testing was done by the
     glibc maintainers and distribution vendors who packaged this
     libc.

[...]

> That said, this series addresses the overall problems by:
> 
>    1) Limiting the RSEQ work to the actual conditions where it is
>       required. The full benefit is only available for architectures using
>       the generic entry infrastructure. All others get at least the basic
>       improvements.
> 
>    2) Re-implementing the whole user space handling based on proper data
>       structures and by actually looking at the impact it creates in the
>       fast path.
> 
>    3) Moving the actual handling of RSEQ out to the latest point in the exit
>       path, where possible. This is fully inlined into the fast path to keep
>       the impact confined.
> 
>       The initial attempt to make it completely independent of TIF bits and
>       just handle it with a quick check unconditionally on exit to user
>       space turned out to be not feasible. On workloads which are doing a
>       lot of quick syscalls the extra four instructions add up
>       significantly.
> 
>       So instead I ended up doing it at the end of the exit to user TIF
>       work loop once when all other TIF bits have been processed. At this
>       point interrupts are disabled and there is no way that the state
>       can change before the task goes out to user space for real.

I'll need to review what happens in case rseq needs to take a page fault
after your proposed changes. More discussion to come around the specific
patch in the series.

> 
>    Versus the limitations of #1 and #3:
> 
>     I wasted several days of my so copious time to figure out how to not
>     break all the architectures, which still insist on benefiting from core
>     code improvements by pulling everything and the world into their
>     architecture specific hackery.
> 
>     It's more than five years now that the generic entry code infrastructure
>     has been introduced for the very reason to lower the burden for core
>     code developers and maintainers and to share the common functionality
>     across the architecture zoo.
> 
>     Aside of the initial x86 move, which started this effort, there are only
>     three architectures who actually made the effort to utilize this. Two of
>     them were new ones, which were asked to use it right away.
> 
>     The only existing one, which converted over since then is S390 and I'm
>     truly grateful that they improved the generic infrastructure in that
>     process significantly.
> 
>     On ARM[64] there are at least serious efforts underway to move their
>     code over.
> 
>     Does everybody else think that core code improvements come for free and
>     the architecture specific hackery does not put any burden on others?
> 
>     Here is the hall of fame as far as RSEQ goes:
> 
>     	arch/mips/Kconfig:      select HAVE_RSEQ
> 	arch/openrisc/Kconfig:  select HAVE_RSEQ
> 	arch/powerpc/Kconfig:   select HAVE_RSEQ
> 
>     Two of them are barely maintained and shouldn't have RSEQ in the first
>     place....
> 
>     I was very forthcoming in the past to accommodate that and went out
>     of my way to enable stuff for everyone, but I'm drawing a line now.
> 
>     All extra improvements which are enabled by #1/#3 depend hard on the
>     generic infrastructure.
> 
>     I know that it's quite some effort to move an architecture over, but
>     it's a one time effort and investment into the future. This 'my
>     architecture is special for no reason' mindset is not sustainable and
>     just pushes the burden on others. There is zero justification for this.
> 
>     Not converging on common infrastructure is not only a burden for the
>     core people, it's also a missed opportunity for the architectures to
>     lower their own burden of chasing core improvements and implementing
>     them each with a different set of bugs.
> 
>     This is not the first time this happens. There are enough other examples
>     where it took ages to consolidate on common code. This just accumulates
>     technical debt and needless complexity, which everyone suffers from.
> 
>     I have happily converted the four architectures, which use the generic
>     entry code over, to utilize a shared generic TIF bit header so that
>     adding the TIF_RSEQ bit becomes a two line change and all four get the
>     benefit immediately. That was more consistent than just adding the bits
>     for each of them and it makes further maintenance of core
>     infrastructure simpler for all sides. See?


Can we make RSEQ depend on CONFIG_GENERIC_ENTRY (if that is the correct
option) ?

This would force additional architectures to move to the generic entry
infrastructure if they want to benefit from rseq.

I'll start looking into the series now.

Thanks,

Mathieu

> 
> 
> That said, as for the first version these patches have a pile of dependencies:
> 
> The series depends on the separately posted rseq bugfix:
> 
>     https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
> 
> and the uaccess generic helper series:
> 
>     https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
> 
> and a related futex fix in
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
> 
> The combination of all of them and some other related fixes (rseq
> selftests) are available here:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base
> 
> For your convenience all of it is also available as a conglomerate from
> git:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
> 
> The diffstat looks large, but a lot of that is due to extensive comments
> and the extra hackery to accommodate for random architecture code.
> 
> I did not yet come around to test this on anything else than x86. Help with
> that would be truly appreciated.
> 
> Thanks,
> 
> 	tglx
> 
>    "Additional problems are the offspring of poor solutions." - Mark Twain
> 	
> ---
>   Documentation/admin-guide/kernel-parameters.txt |    4
>   arch/Kconfig                                    |    4
>   arch/loongarch/Kconfig                          |    1
>   arch/loongarch/include/asm/thread_info.h        |   76 +-
>   arch/riscv/Kconfig                              |    1
>   arch/riscv/include/asm/thread_info.h            |   29 -
>   arch/s390/Kconfig                               |    1
>   arch/s390/include/asm/thread_info.h             |   44 -
>   arch/x86/Kconfig                                |    1
>   arch/x86/entry/syscall_32.c                     |    3
>   arch/x86/include/asm/thread_info.h              |   74 +-
>   b/include/asm-generic/thread_info_tif.h         |   51 ++
>   b/include/linux/rseq_entry.h                    |  601 +++++++++++++++++++++++
>   b/include/linux/rseq_types.h                    |   72 ++
>   drivers/hv/mshv_root_main.c                     |    2
>   fs/binfmt_elf.c                                 |    2
>   fs/exec.c                                       |    2
>   include/linux/entry-common.h                    |   38 -
>   include/linux/irq-entry-common.h                |   68 ++
>   include/linux/mm.h                              |   25
>   include/linux/resume_user_mode.h                |    2
>   include/linux/rseq.h                            |  216 +++++---
>   include/linux/sched.h                           |   50 +
>   include/linux/thread_info.h                     |    5
>   include/trace/events/rseq.h                     |    4
>   include/uapi/linux/rseq.h                       |   21
>   init/Kconfig                                    |   28 +
>   kernel/entry/common.c                           |   37 -
>   kernel/entry/syscall-common.c                   |    8
>   kernel/rseq.c                                   |  610 ++++++++++--------------
>   kernel/sched/core.c                             |   10
>   kernel/sched/membarrier.c                       |    8
>   kernel/sched/sched.h                            |    5
>   virt/kvm/kvm_main.c                             |    3
>   34 files changed, 1406 insertions(+), 700 deletions(-)


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume()
  2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-08-25 15:39   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The RSEQ critical section mechanism only clears the event mask when a
> critical section is registered, otherwise it is stale and collects
> bits.
> 
> That means once a critical section is installed the first invocation of
> that code when TIF_NOTIFY_RESUME is set will abort the critical section,
> even when the TIF bit was not raised by the rseq preempt/migrate/signal
> helpers.
> 
> This also has a performance implication because TIF_NOTIFY_RESUME is a
> multiplexing TIF bit, which is utilized by quite some infrastructure. That
> means every invocation of __rseq_notify_resume() goes unconditionally
> through the heavy lifting of user space access and consistency checks even
> if there is no reason to do so.

Even worse in terms of overhead implication: given a userspace that
has no rseq critical sections, all kernel rseq events end up setting
the rseq event mask bits, and they are never cleared, which then makes
all TIF_NOTIFY_RESUME work slower. This is not intended and should
indeed be fixed.

> 
> Keeping the stale event mask around when exiting to user space also
> prevents it from being utilized by the upcoming time slice extension
> mechanism.
> 
> Avoid this by reading and clearing the event mask before doing the user
> space critical section access with interrupts or preemption disabled, which
> ensures that the read and clear operation is CPU local atomic versus
> scheduling and the membarrier IPI.
> 
> This is correct as after re-enabling interrupts/preemption any relevant
> event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
> user space exit code take another round of TIF bit clearing.
> 
> If the event mask was non-zero, invoke the slow path. On debug kernels the
> slow path is invoked unconditionally and the result of the event mask
> evaluation is handed in.
> 
> Add an exit path check after the TIF bit loop, which validates on debug
> kernels that the event mask is zero before exiting to user space.
> 
> While at it reword the convoluted comment why the pt_regs pointer can be
> NULL under certain circumstances.
> 

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> 
> ---
>   include/linux/irq-entry-common.h |    7 ++--
>   include/linux/rseq.h             |   10 +++++
>   kernel/rseq.c                    |   66 ++++++++++++++++++++++++++-------------
>   3 files changed, 58 insertions(+), 25 deletions(-)
> ---
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -2,11 +2,12 @@
>   #ifndef __LINUX_IRQENTRYCOMMON_H
>   #define __LINUX_IRQENTRYCOMMON_H
>   
> +#include <linux/context_tracking.h>
> +#include <linux/kmsan.h>
> +#include <linux/rseq.h>
>   #include <linux/static_call_types.h>
>   #include <linux/syscalls.h>
> -#include <linux/context_tracking.h>
>   #include <linux/tick.h>
> -#include <linux/kmsan.h>
>   #include <linux/unwind_deferred.h>
>   
>   #include <asm/entry-common.h>
> @@ -226,6 +227,8 @@ static __always_inline void exit_to_user
>   
>   	arch_exit_to_user_mode_prepare(regs, ti_work);
>   
> +	rseq_exit_to_user_mode();
> +
>   	/* Ensure that kernel state is sane for a return to userspace */
>   	kmap_assert_nomap();
>   	lockdep_assert_irqs_disabled();
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
>   	rseq_set_notify_resume(t);
>   }
>   
> +static __always_inline void rseq_exit_to_user_mode(void)
> +{
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
> +		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
> +			current->rseq_event_mask = 0;
> +	}
> +}
> +
>   /*
>    * If parent process has a registered restartable sequences area, the
>    * child inherits. Unregister rseq for a clone with CLONE_VM set.
> @@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
>   static inline void rseq_execve(struct task_struct *t)
>   {
>   }
> -
> +static inline void rseq_exit_to_user_mode(void) { }
>   #endif
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
>   	return true;
>   }
>   
> -static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
> +static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
>   {
> -	u32 flags, event_mask;
> +	u32 flags;
>   	int ret;
>   
>   	if (rseq_warn_flags("rseq_cs", cs_flags))
> @@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
>   
>   	if (rseq_warn_flags("rseq", flags))
>   		return -EINVAL;
> -
> -	/*
> -	 * Load and clear event mask atomically with respect to
> -	 * scheduler preemption and membarrier IPIs.
> -	 */
> -	scoped_guard(RSEQ_EVENT_GUARD) {
> -		event_mask = t->rseq_event_mask;
> -		t->rseq_event_mask = 0;
> -	}
> -
> -	return !!event_mask;
> +	return 0;
>   }
>   
>   static int clear_rseq_cs(struct rseq __user *rseq)
> @@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
>   	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
>   }
>   
> -static int rseq_ip_fixup(struct pt_regs *regs)
> +static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
>   {
>   	unsigned long ip = instruction_pointer(regs);
>   	struct task_struct *t = current;
> @@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
>   	 */
>   	if (!in_rseq_cs(ip, &rseq_cs))
>   		return clear_rseq_cs(t->rseq);
> -	ret = rseq_need_restart(t, rseq_cs.flags);
> -	if (ret <= 0)
> +	ret = rseq_check_flags(t, rseq_cs.flags);
> +	if (ret < 0)
>   		return ret;
> +	if (!abort)
> +		return 0;
>   	ret = clear_rseq_cs(t->rseq);
>   	if (ret)
>   		return ret;
> @@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
>   		return;
>   
>   	/*
> -	 * regs is NULL if and only if the caller is in a syscall path.  Skip
> -	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> -	 * kill a misbehaving userspace on debug kernels.
> +	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
> +	 * pointer, so fixup cannot be done. If the syscall which led to
> +	 * this invocation was invoked inside a critical section, then it
> +	 * will either end up in this code again or a possible violation of
> +	 * a syscall inside a critical region can only be detected by the
> +	 * debug code in rseq_syscall() in a debug enabled kernel.
>   	 */
>   	if (regs) {
> -		ret = rseq_ip_fixup(regs);
> -		if (unlikely(ret < 0))
> -			goto error;
> +		/*
> +		 * Read and clear the event mask first. If the task was not
> +		 * preempted or migrated or a signal is on the way, there
> +		 * is no point in doing any of the heavy lifting here on
> +		 * production kernels. In that case TIF_NOTIFY_RESUME was
> +		 * raised by some other functionality.
> +		 *
> +		 * This is correct because the read/clear operation is
> +		 * guarded against scheduler preemption, which makes it CPU
> +		 * local atomic. If the task is preempted right after
> +		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> +		 * again and this function is invoked another time _before_
> +		 * the task is able to return to user mode.
> +		 *
> +		 * On a debug kernel, invoke the fixup code unconditionally
> +		 * with the result handed in to allow the detection of
> +		 * inconsistencies.
> +		 */
> +		u32 event_mask;
> +
> +		scoped_guard(RSEQ_EVENT_GUARD) {
> +			event_mask = t->rseq_event_mask;
> +			t->rseq_event_mask = 0;
> +		}
> +
> +		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
> +			ret = rseq_ip_fixup(regs, !!event_mask);
> +			if (unlikely(ret < 0))
> +				goto error;
> +		}
>   	}
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 02/37] rseq: Condense the inline stubs
  2025-08-23 16:39 ` [patch V2 02/37] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-08-25 15:40   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:40 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Scrolling over tons of pointless
> 
> {
> }
> 
> lines to find the actual code is annoying at best.
> 

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> 
> ---
>   include/linux/rseq.h |   47 ++++++++++++-----------------------------------
>   1 file changed, 12 insertions(+), 35 deletions(-)
> ---
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
>   	t->rseq_event_mask = 0;
>   }
>   
> -#else
> -
> -static inline void rseq_set_notify_resume(struct task_struct *t)
> -{
> -}
> -static inline void rseq_handle_notify_resume(struct ksignal *ksig,
> -					     struct pt_regs *regs)
> -{
> -}
> -static inline void rseq_signal_deliver(struct ksignal *ksig,
> -				       struct pt_regs *regs)
> -{
> -}
> -static inline void rseq_preempt(struct task_struct *t)
> -{
> -}
> -static inline void rseq_migrate(struct task_struct *t)
> -{
> -}
> -static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> -{
> -}
> -static inline void rseq_execve(struct task_struct *t)
> -{
> -}
> +#else /* CONFIG_RSEQ */
> +static inline void rseq_set_notify_resume(struct task_struct *t) { }
> +static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
> +static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
> +static inline void rseq_preempt(struct task_struct *t) { }
> +static inline void rseq_migrate(struct task_struct *t) { }
> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
> +static inline void rseq_execve(struct task_struct *t) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> -#endif
> +#endif  /* !CONFIG_RSEQ */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> -
>   void rseq_syscall(struct pt_regs *regs);
> -
> -#else
> -
> -static inline void rseq_syscall(struct pt_regs *regs)
> -{
> -}
> -
> -#endif
> +#else /* CONFIG_DEBUG_RSEQ */
> +static inline void rseq_syscall(struct pt_regs *regs) { }
> +#endif /* !CONFIG_DEBUG_RSEQ */
>   
>   #endif /* _LINUX_RSEQ_H */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 03/37] resq: Move algorithm comment to top
  2025-08-23 16:39 ` [patch V2 03/37] resq: Move algorithm comment to top Thomas Gleixner
@ 2025-08-25 15:41   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:41 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Move the comment which documents the RSEQ algorithm to the top of the file,
> so it does not create horrible diffs later when the actual implementation
> is fed into the mincer.

Typo in the subject:

resq -> rseq

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Other than this nit:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   kernel/rseq.c |  119 ++++++++++++++++++++++++++++------------------------------
>   1 file changed, 59 insertions(+), 60 deletions(-)
> 
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -8,6 +8,65 @@
>    * Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
>    */
>   
> +/*
> + * Restartable sequences are a lightweight interface that allows
> + * user-level code to be executed atomically relative to scheduler
> + * preemption and signal delivery. Typically used for implementing
> + * per-cpu operations.
> + *
> + * It allows user-space to perform update operations on per-cpu data
> + * without requiring heavy-weight atomic operations.
> + *
> + * Detailed algorithm of rseq user-space assembly sequences:
> + *
> + *                     init(rseq_cs)
> + *                     cpu = TLS->rseq::cpu_id_start
> + *   [1]               TLS->rseq::rseq_cs = rseq_cs
> + *   [start_ip]        ----------------------------
> + *   [2]               if (cpu != TLS->rseq::cpu_id)
> + *                             goto abort_ip;
> + *   [3]               <last_instruction_in_cs>
> + *   [post_commit_ip]  ----------------------------
> + *
> + *   The address of jump target abort_ip must be outside the critical
> + *   region, i.e.:
> + *
> + *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
> + *
> + *   Steps [2]-[3] (inclusive) need to be a sequence of instructions in
> + *   userspace that can handle being interrupted between any of those
> + *   instructions, and then resumed to the abort_ip.
> + *
> + *   1.  Userspace stores the address of the struct rseq_cs assembly
> + *       block descriptor into the rseq_cs field of the registered
> + *       struct rseq TLS area. This update is performed through a single
> + *       store within the inline assembly instruction sequence.
> + *       [start_ip]
> + *
> + *   2.  Userspace tests to check whether the current cpu_id field match
> + *       the cpu number loaded before start_ip, branching to abort_ip
> + *       in case of a mismatch.
> + *
> + *       If the sequence is preempted or interrupted by a signal
> + *       at or after start_ip and before post_commit_ip, then the kernel
> + *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
> + *       ip to abort_ip before returning to user-space, so the preempted
> + *       execution resumes at abort_ip.
> + *
> + *   3.  Userspace critical section final instruction before
> + *       post_commit_ip is the commit. The critical section is
> + *       self-terminating.
> + *       [post_commit_ip]
> + *
> + *   4.  <success>
> + *
> + *   On failure at [2], or if interrupted by preempt or signal delivery
> + *   between [1] and [3]:
> + *
> + *       [abort_ip]
> + *   F1. <failure>
> + */
> +
>   #include <linux/sched.h>
>   #include <linux/uaccess.h>
>   #include <linux/syscalls.h>
> @@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struc
>   	unsafe_put_user(value, &t->rseq->field, error_label)
>   #endif
>   
> -/*
> - *
> - * Restartable sequences are a lightweight interface that allows
> - * user-level code to be executed atomically relative to scheduler
> - * preemption and signal delivery. Typically used for implementing
> - * per-cpu operations.
> - *
> - * It allows user-space to perform update operations on per-cpu data
> - * without requiring heavy-weight atomic operations.
> - *
> - * Detailed algorithm of rseq user-space assembly sequences:
> - *
> - *                     init(rseq_cs)
> - *                     cpu = TLS->rseq::cpu_id_start
> - *   [1]               TLS->rseq::rseq_cs = rseq_cs
> - *   [start_ip]        ----------------------------
> - *   [2]               if (cpu != TLS->rseq::cpu_id)
> - *                             goto abort_ip;
> - *   [3]               <last_instruction_in_cs>
> - *   [post_commit_ip]  ----------------------------
> - *
> - *   The address of jump target abort_ip must be outside the critical
> - *   region, i.e.:
> - *
> - *     [abort_ip] < [start_ip]  || [abort_ip] >= [post_commit_ip]
> - *
> - *   Steps [2]-[3] (inclusive) need to be a sequence of instructions in
> - *   userspace that can handle being interrupted between any of those
> - *   instructions, and then resumed to the abort_ip.
> - *
> - *   1.  Userspace stores the address of the struct rseq_cs assembly
> - *       block descriptor into the rseq_cs field of the registered
> - *       struct rseq TLS area. This update is performed through a single
> - *       store within the inline assembly instruction sequence.
> - *       [start_ip]
> - *
> - *   2.  Userspace tests to check whether the current cpu_id field match
> - *       the cpu number loaded before start_ip, branching to abort_ip
> - *       in case of a mismatch.
> - *
> - *       If the sequence is preempted or interrupted by a signal
> - *       at or after start_ip and before post_commit_ip, then the kernel
> - *       clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
> - *       ip to abort_ip before returning to user-space, so the preempted
> - *       execution resumes at abort_ip.
> - *
> - *   3.  Userspace critical section final instruction before
> - *       post_commit_ip is the commit. The critical section is
> - *       self-terminating.
> - *       [post_commit_ip]
> - *
> - *   4.  <success>
> - *
> - *   On failure at [2], or if interrupted by preempt or signal delivery
> - *   between [1] and [3]:
> - *
> - *       [abort_ip]
> - *   F1. <failure>
> - */
> -
>   static int rseq_update_cpu_node_id(struct task_struct *t)
>   {
>   	struct rseq __user *rseq = t->rseq;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume()
  2025-08-23 16:39 ` [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
@ 2025-08-25 15:43   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:43 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> There is no point for this being visible in the resume_to_user_mode()
> handling.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/linux/resume_user_mode.h |    2 +-
>   include/linux/rseq.h             |   13 +++++++------
>   2 files changed, 8 insertions(+), 7 deletions(-)
> 
> --- a/include/linux/resume_user_mode.h
> +++ b/include/linux/resume_user_mode.h
> @@ -59,7 +59,7 @@ static inline void resume_user_mode_work
>   	mem_cgroup_handle_over_high(GFP_KERNEL);
>   	blkcg_maybe_throttle_current();
>   
> -	rseq_handle_notify_resume(NULL, regs);
> +	rseq_handle_notify_resume(regs);
>   }
>   
>   #endif /* LINUX_RESUME_USER_MODE_H */
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -37,19 +37,20 @@ static inline void rseq_set_notify_resum
>   
>   void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
>   
> -static inline void rseq_handle_notify_resume(struct ksignal *ksig,
> -					     struct pt_regs *regs)
> +static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>   {
>   	if (current->rseq)
> -		__rseq_handle_notify_resume(ksig, regs);
> +		__rseq_handle_notify_resume(NULL, regs);
>   }
>   
>   static inline void rseq_signal_deliver(struct ksignal *ksig,
>   				       struct pt_regs *regs)
>   {
> -	scoped_guard(RSEQ_EVENT_GUARD)
> -		__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
> -	rseq_handle_notify_resume(ksig, regs);
> +	if (current->rseq) {
> +		scoped_guard(RSEQ_EVENT_GUARD)
> +			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
> +		__rseq_handle_notify_resume(ksig, regs);
> +	}
>   }
>   
>   /* rseq_preempt() requires preemption to be disabled. */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 05/37] rseq: Simplify registration
  2025-08-23 16:39 ` [patch V2 05/37] rseq: Simplify registration Thomas Gleixner
@ 2025-08-25 15:44   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 15:44 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> There is no point to read the critical section element in the newly
> registered user space RSEQ struct first in order to clear it.
> 
> Just clear it and be done with it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   kernel/rseq.c |   10 +++-------
>   1 file changed, 3 insertions(+), 7 deletions(-)
> 
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
>   /*
>    * sys_rseq - setup restartable sequences for caller thread.
>    */
> -SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
> -		int, flags, u32, sig)
> +SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
>   {
>   	int ret;
> -	u64 rseq_cs;
>   
>   	if (flags & RSEQ_FLAG_UNREGISTER) {
>   		if (flags & ~RSEQ_FLAG_UNREGISTER)
> @@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	 * avoid a potential segfault on return to user-space. The proper thing
>   	 * to do would have been to fail the registration but this would break
>   	 * older libcs that reuse the rseq area for new threads without
> -	 * clearing the fields.
> +	 * clearing the fields. Don't bother reading it, just reset it.
>   	 */
> -	if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
> -	        return -EFAULT;
> -	if (rseq_cs && clear_rseq_cs(rseq))
> +	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>   		return -EFAULT;
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 06/37] rseq: Simplify the event notification
  2025-08-23 16:39 ` [patch V2 06/37] rseq: Simplify the event notification Thomas Gleixner
@ 2025-08-25 17:36   ` Mathieu Desnoyers
  2025-09-02 13:39     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 17:36 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
> flags") the bits in task::rseq_event_mask are meaningless and just extra
> work in terms of setting them individually.
> 
> Aside of that the only relevant point where an event has to be raised is
> context switch. Neither the CPU nor MM CID can change without going through
> a context switch.

Note: we may want to include the numa node id field as well in this
list of fields.

> 
> Collapse them all into a single boolean which simplifies the code a lot and
> remove the pointless invocations which have been sprinkled all over the
> place for no value.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
> V2: Reduce it to the sched switch event.
> ---
>   fs/exec.c                 |    2 -
>   include/linux/rseq.h      |   66 +++++++++-------------------------------------
>   include/linux/sched.h     |   10 +++---
>   include/uapi/linux/rseq.h |   21 ++++----------
>   kernel/rseq.c             |   28 +++++++++++--------
>   kernel/sched/core.c       |    5 ---
>   kernel/sched/membarrier.c |    8 ++---
>   7 files changed, 48 insertions(+), 92 deletions(-)
> ---
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
>   		force_fatal_sig(SIGSEGV);
>   
>   	sched_mm_cid_after_execve(current);
> -	rseq_set_notify_resume(current);
> +	rseq_sched_switch_event(current);
>   	current->in_execve = 0;
>   
>   	return retval;
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -3,38 +3,8 @@
>   #define _LINUX_RSEQ_H
>   
>   #ifdef CONFIG_RSEQ
> -
> -#include <linux/preempt.h>
>   #include <linux/sched.h>
>   
> -#ifdef CONFIG_MEMBARRIER
> -# define RSEQ_EVENT_GUARD	irq
> -#else
> -# define RSEQ_EVENT_GUARD	preempt
> -#endif
> -
> -/*
> - * Map the event mask on the user-space ABI enum rseq_cs_flags
> - * for direct mask checks.
> - */
> -enum rseq_event_mask_bits {
> -	RSEQ_EVENT_PREEMPT_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
> -	RSEQ_EVENT_SIGNAL_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
> -	RSEQ_EVENT_MIGRATE_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
> -};
> -
> -enum rseq_event_mask {
> -	RSEQ_EVENT_PREEMPT	= (1U << RSEQ_EVENT_PREEMPT_BIT),
> -	RSEQ_EVENT_SIGNAL	= (1U << RSEQ_EVENT_SIGNAL_BIT),
> -	RSEQ_EVENT_MIGRATE	= (1U << RSEQ_EVENT_MIGRATE_BIT),
> -};
> -
> -static inline void rseq_set_notify_resume(struct task_struct *t)
> -{
> -	if (t->rseq)
> -		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> -}
> -
>   void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
>   
>   static inline void rseq_handle_notify_resume(struct pt_regs *regs)
> @@ -43,35 +13,27 @@ static inline void rseq_handle_notify_re
>   		__rseq_handle_notify_resume(NULL, regs);
>   }
>   
> -static inline void rseq_signal_deliver(struct ksignal *ksig,
> -				       struct pt_regs *regs)
> +static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
>   {
>   	if (current->rseq) {
> -		scoped_guard(RSEQ_EVENT_GUARD)
> -			__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
> +		current->rseq_event_pending = true;
>   		__rseq_handle_notify_resume(ksig, regs);
>   	}
>   }
>   
> -/* rseq_preempt() requires preemption to be disabled. */
> -static inline void rseq_preempt(struct task_struct *t)
> +static inline void rseq_sched_switch_event(struct task_struct *t)
>   {
> -	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
> -	rseq_set_notify_resume(t);
> -}
> -
> -/* rseq_migrate() requires preemption to be disabled. */
> -static inline void rseq_migrate(struct task_struct *t)
> -{
> -	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
> -	rseq_set_notify_resume(t);
> +	if (t->rseq) {
> +		t->rseq_event_pending = true;
> +		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +	}
>   }
>   
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
>   	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
> -		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
> -			current->rseq_event_mask = 0;
> +		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
> +			current->rseq_event_pending = false;
>   	}
>   }
>   
> @@ -85,12 +47,12 @@ static inline void rseq_fork(struct task
>   		t->rseq = NULL;
>   		t->rseq_len = 0;
>   		t->rseq_sig = 0;
> -		t->rseq_event_mask = 0;
> +		t->rseq_event_pending = false;
>   	} else {
>   		t->rseq = current->rseq;
>   		t->rseq_len = current->rseq_len;
>   		t->rseq_sig = current->rseq_sig;
> -		t->rseq_event_mask = current->rseq_event_mask;
> +		t->rseq_event_pending = current->rseq_event_pending;
>   	}
>   }
>   
> @@ -99,15 +61,13 @@ static inline void rseq_execve(struct ta
>   	t->rseq = NULL;
>   	t->rseq_len = 0;
>   	t->rseq_sig = 0;
> -	t->rseq_event_mask = 0;
> +	t->rseq_event_pending = false;
>   }
>   
>   #else /* CONFIG_RSEQ */
> -static inline void rseq_set_notify_resume(struct task_struct *t) { }
>   static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
> -static inline void rseq_preempt(struct task_struct *t) { }
> -static inline void rseq_migrate(struct task_struct *t) { }
> +static inline void rseq_sched_switch_event(struct task_struct *t) { }
>   static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
>   static inline void rseq_execve(struct task_struct *t) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1401,14 +1401,14 @@ struct task_struct {
>   #endif /* CONFIG_NUMA_BALANCING */
>   
>   #ifdef CONFIG_RSEQ
> -	struct rseq __user *rseq;
> -	u32 rseq_len;
> -	u32 rseq_sig;
> +	struct rseq __user		*rseq;
> +	u32				rseq_len;
> +	u32				rseq_sig;
>   	/*
> -	 * RmW on rseq_event_mask must be performed atomically
> +	 * RmW on rseq_event_pending must be performed atomically
>   	 * with respect to preemption.
>   	 */
> -	unsigned long rseq_event_mask;
> +	bool				rseq_event_pending;

AFAIU, this rseq_event_pending field is now concurrently set from:

- rseq_signal_deliver (without any preempt or irqoff guard)
- rseq_sched_switch_event (with preemption disabled)

Is it safe to concurrently store to a "bool" field within a structure
without any protection against concurrent stores ? Typically I've used
an integer field just to be on the safe side in that kind of situation.

AFAIR, a bool type needs to be at least 1 byte. Do all architectures
supported by Linux have a single byte store instruction, or can we end
up incorrectly storing to other nearby fields ? (for instance, DEC
Alpha ?)
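
For illustration, a minimal userspace sketch of the hazard being asked
about (an assumption about what a compiler has to emit on a target
without single-byte stores, not kernel code):

#include <stdint.h>

/*
 * Two byte-sized flags sharing one 32-bit word. Without a single-byte
 * store instruction, setting one flag becomes a read-modify-write of
 * the whole word, which can lose a racing update of the other flag.
 */
union byte_flags {
	struct {
		uint8_t event_pending;	/* written from one context */
		uint8_t other_flag;	/* written from another     */
		uint8_t pad[2];
	};
	uint32_t word;
};

/* Conceptual code generation on a target lacking byte stores
 * (little-endian layout assumed): */
static void set_event_pending_wordwise(union byte_flags *f)
{
	uint32_t val = f->word;		/* load the containing word   */

	val |= 0x01;			/* modify only byte 0         */
	f->word = val;			/* store the whole word back: */
					/* a concurrent update of     */
					/* other_flag can be lost     */
}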

>   # ifdef CONFIG_DEBUG_RSEQ
>   	/*
>   	 * This is a place holder to save a copy of the rseq fields for
> --- a/include/uapi/linux/rseq.h
> +++ b/include/uapi/linux/rseq.h
> @@ -114,20 +114,13 @@ struct rseq {
>   	/*
>   	 * Restartable sequences flags field.
>   	 *
> -	 * This field should only be updated by the thread which
> -	 * registered this data structure. Read by the kernel.
> -	 * Mainly used for single-stepping through rseq critical sections
> -	 * with debuggers.
> -	 *
> -	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> -	 *     Inhibit instruction sequence block restart on preemption
> -	 *     for this thread.
> -	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> -	 *     Inhibit instruction sequence block restart on signal
> -	 *     delivery for this thread.
> -	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
> -	 *     Inhibit instruction sequence block restart on migration for
> -	 *     this thread.
> +	 * This field was initialy intended to allow event masking for for

initially

for for -> for

> +	 * single-stepping through rseq critical sections with debuggers.
> +	 * The kernel does not support this anymore and the relevant bits
> +	 * are checked for being always false:
> +	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
> +	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
> +	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
>   	 */
>   	__u32 flags;
>   
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -78,6 +78,12 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/rseq.h>
>   
> +#ifdef CONFIG_MEMBARRIER
> +# define RSEQ_EVENT_GUARD	irq
> +#else
> +# define RSEQ_EVENT_GUARD	preempt
> +#endif
> +
>   /* The original rseq structure size (including padding) is 32 bytes. */
>   #define ORIG_RSEQ_SIZE		32
>   
> @@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct
>   	 */
>   	if (regs) {
>   		/*
> -		 * Read and clear the event mask first. If the task was not
> -		 * preempted or migrated or a signal is on the way, there
> -		 * is no point in doing any of the heavy lifting here on
> -		 * production kernels. In that case TIF_NOTIFY_RESUME was
> -		 * raised by some other functionality.
> +		 * Read and clear the event pending bit first. If the task
> +		 * was not preempted or migrated or a signal is on the way,
> +		 * there is no point in doing any of the heavy lifting here
> +		 * on production kernels. In that case TIF_NOTIFY_RESUME
> +		 * was raised by some other functionality.
>   		 *
>   		 * This is correct because the read/clear operation is
>   		 * guarded against scheduler preemption, which makes it CPU
> @@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct
>   		 * with the result handed in to allow the detection of
>   		 * inconsistencies.
>   		 */
> -		u32 event_mask;
> +		bool event;
>   
>   		scoped_guard(RSEQ_EVENT_GUARD) {
> -			event_mask = t->rseq_event_mask;
> -			t->rseq_event_mask = 0;
> +			event = t->rseq_event_pending;
> +			t->rseq_event_pending = false;
>   		}
>   
> -		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
> -			ret = rseq_ip_fixup(regs, !!event_mask);
> +		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> +			ret = rseq_ip_fixup(regs, event);
>   			if (unlikely(ret < 0))
>   				goto error;
>   		}
> @@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	 * registered, ensure the cpu_id_start and cpu_id fields
>   	 * are updated before returning to user-space.
>   	 */
> -	rseq_set_notify_resume(current);
> +	rseq_sched_switch_event(current);
>   
>   	return 0;
>   }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p,
>   		if (p->sched_class->migrate_task_rq)
>   			p->sched_class->migrate_task_rq(p, new_cpu);
>   		p->se.nr_migrations++;
> -		rseq_migrate(p);

OK yes, all the rseq_migrate() invocations can go away because a
migration always goes along with a context switch, and those events are
now combined into the same state.

Thanks,

Mathieu

>   		sched_mm_cid_migrate_from(p);
>   		perf_event_task_migrate(p);
>   	}
> @@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct
>   		p->sched_task_group = tg;
>   	}
>   #endif
> -	rseq_migrate(p);
>   	/*
>   	 * We're setting the CPU for the first time, we don't migrate,
>   	 * so use __set_task_cpu().
> @@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct
>   	 * as we're not fully set-up yet.
>   	 */
>   	p->recent_used_cpu = task_cpu(p);
> -	rseq_migrate(p);
>   	__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
>   	rq = __task_rq_lock(p, &rf);
>   	update_rq_clock(rq);
> @@ -5153,7 +5150,7 @@ prepare_task_switch(struct rq *rq, struc
>   	kcov_prepare_switch(prev);
>   	sched_info_switch(rq, prev, next);
>   	perf_event_task_sched_out(prev, next);
> -	rseq_preempt(prev);
> +	rseq_sched_switch_event(prev);
>   	fire_sched_out_preempt_notifiers(prev, next);
>   	kmap_local_sched_out();
>   	prepare_task(next);
> --- a/kernel/sched/membarrier.c
> +++ b/kernel/sched/membarrier.c
> @@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
>   	 * is negligible.
>   	 */
>   	smp_mb();
> -	rseq_preempt(current);
> +	rseq_sched_switch_event(current);
>   }
>   
>   static void ipi_sync_rq_state(void *info)
> @@ -407,9 +407,9 @@ static int membarrier_private_expedited(
>   		 * membarrier, we will end up with some thread in the mm
>   		 * running without a core sync.
>   		 *
> -		 * For RSEQ, don't rseq_preempt() the caller.  User code
> -		 * is not supposed to issue syscalls at all from inside an
> -		 * rseq critical section.
> +		 * For RSEQ, don't invoke rseq_sched_switch_event() on the
> +		 * caller.  User code is not supposed to issue syscalls at
> +		 * all from inside an rseq critical section.
>   		 */
>   		if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
>   			preempt_disable();
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
  2025-08-23 16:39 ` [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
@ 2025-08-25 17:54   ` Mathieu Desnoyers
  2025-08-25 20:24     ` Sean Christopherson
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 17:54 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, Peter Zijlstra, Paul E. McKenney, Boqun Feng, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Hypervisors invoke resume_user_mode_work() before entering the guest, which
> clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> space context available to them, so the rseq notify handler skips
> inspecting the critical section, but updates the CPU/MM CID values
> unconditionally so that the eventual pending rseq event is not lost on the
> way to user space.
> 
> This is a pointless exercise as the task might be rescheduled before
> actually returning to user space and it creates unnecessary work in the
> vcpu_run() loops.

One question here: AFAIU, this removes the updates to the cpu_id_start,
cpu_id, mm_cid, and node_id fields on exit to virt usermode. This means
that while the virt guest is running in usermode, the host hypervisor
process has stale rseq fields, until it eventually returns to the
hypervisor's host userspace (from ioctl).

Considering the rseq uapi documentation, this should not matter.
Each of those fields have this statement:

"This field should only be read by the thread which registered this data
structure."

I can however think of use-cases for reading the rseq fields from other
hypervisor threads to figure out information about thread placement.
Doing so would go against the documented uapi, though.
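
As a hypothetical illustration of that usage (assumed, not taken from
any existing hypervisor), a monitoring thread could peek at the struct
rseq registered by a vCPU thread; with this patch the value may stay
stale for as long as that thread runs in guest mode:

#include <stdint.h>
#include <linux/rseq.h>		/* uapi struct rseq */

static inline uint32_t peek_last_cpu(const volatile struct rseq *other)
{
	/* Racy by design; the uapi only guarantees this for the owner */
	return other->cpu_id;
}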

I'd rather ask whether anyone is misusing this uapi in that way before
going ahead with the change, just to prevent surprises.

I'm OK with the re-trigger of rseq, as it does indeed appear to fix
an issue, but I'm concerned about the ABI impact of skipping the
rseq_update_cpu_node_id() on return to virt userspace.

Thoughts ?

Thanks,

Mathieu

> 
> It's way more efficient to ignore that invocation based on @regs == NULL
> and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
> vcpu_run() loop before returning from the ioctl().
> 
> This ensures that a pending RSEQ update is not lost and the IDs are updated
> before returning to user space.
> 
> Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
> a NOOP.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Dexuan Cui <decui@microsoft.com>
> ---
>   drivers/hv/mshv_root_main.c |    2 +
>   include/linux/rseq.h        |   17 +++++++++
>   kernel/rseq.c               |   76 +++++++++++++++++++++++---------------------
>   virt/kvm/kvm_main.c         |    3 +
>   4 files changed, 62 insertions(+), 36 deletions(-)
> 
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
>   		}
>   	} while (!vp->run.flags.intercept_suspend);
>   
> +	rseq_virt_userspace_exit();
> +
>   	return ret;
>   }
>   
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
>   }
>   
>   /*
> + * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
> + * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
> + * that case just to do it eventually again before returning to user space,
> + * the entry resume_user_mode_work() invocation is ignored as the register
> + * argument is NULL.
> + *
> + * After returning from guest mode, they have to invoke this function to
> + * re-raise TIF_NOTIFY_RESUME if necessary.
> + */
> +static inline void rseq_virt_userspace_exit(void)
> +{
> +	if (current->rseq_event_pending)
> +		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +}
> +
> +/*
>    * If parent process has a registered restartable sequences area, the
>    * child inherits. Unregister rseq for a clone with CLONE_VM set.
>    */
> @@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
>   static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
> +static inline void rseq_virt_userspace_exit(void) { }
>   static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
>   static inline void rseq_execve(struct task_struct *t) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
>   {
>   	struct task_struct *t = current;
>   	int ret, sig;
> +	bool event;
> +
> +	/*
> +	 * If invoked from hypervisors before entering the guest via
> +	 * resume_user_mode_work(), then @regs is a NULL pointer.
> +	 *
> +	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
> +	 * it before returning from the ioctl() to user space when
> +	 * rseq_event.sched_switch is set.
> +	 *
> +	 * So it's safe to ignore here instead of pointlessly updating it
> +	 * in the vcpu_run() loop.
> +	 */
> +	if (!regs)
> +		return;
>   
>   	if (unlikely(t->flags & PF_EXITING))
>   		return;
>   
>   	/*
> -	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
> -	 * pointer, so fixup cannot be done. If the syscall which led to
> -	 * this invocation was invoked inside a critical section, then it
> -	 * will either end up in this code again or a possible violation of
> -	 * a syscall inside a critical region can only be detected by the
> -	 * debug code in rseq_syscall() in a debug enabled kernel.
> +	 * Read and clear the event pending bit first. If the task
> +	 * was not preempted or migrated or a signal is on the way,
> +	 * there is no point in doing any of the heavy lifting here
> +	 * on production kernels. In that case TIF_NOTIFY_RESUME
> +	 * was raised by some other functionality.
> +	 *
> +	 * This is correct because the read/clear operation is
> +	 * guarded against scheduler preemption, which makes it CPU
> +	 * local atomic. If the task is preempted right after
> +	 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> +	 * again and this function is invoked another time _before_
> +	 * the task is able to return to user mode.
> +	 *
> +	 * On a debug kernel, invoke the fixup code unconditionally
> +	 * with the result handed in to allow the detection of
> +	 * inconsistencies.
>   	 */
> -	if (regs) {
> -		/*
> -		 * Read and clear the event pending bit first. If the task
> -		 * was not preempted or migrated or a signal is on the way,
> -		 * there is no point in doing any of the heavy lifting here
> -		 * on production kernels. In that case TIF_NOTIFY_RESUME
> -		 * was raised by some other functionality.
> -		 *
> -		 * This is correct because the read/clear operation is
> -		 * guarded against scheduler preemption, which makes it CPU
> -		 * local atomic. If the task is preempted right after
> -		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> -		 * again and this function is invoked another time _before_
> -		 * the task is able to return to user mode.
> -		 *
> -		 * On a debug kernel, invoke the fixup code unconditionally
> -		 * with the result handed in to allow the detection of
> -		 * inconsistencies.
> -		 */
> -		bool event;
> -
> -		scoped_guard(RSEQ_EVENT_GUARD) {
> -			event = t->rseq_event_pending;
> -			t->rseq_event_pending = false;
> -		}
> +	scoped_guard(RSEQ_EVENT_GUARD) {
> +		event = t->rseq_event_pending;
> +		t->rseq_event_pending = false;
> +	}
>   
> -		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> -			ret = rseq_ip_fixup(regs, event);
> -			if (unlikely(ret < 0))
> -				goto error;
> -		}
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> +		ret = rseq_ip_fixup(regs, event);
> +		if (unlikely(ret < 0))
> +			goto error;
>   	}
> +
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
>   	return;
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -49,6 +49,7 @@
>   #include <linux/lockdep.h>
>   #include <linux/kthread.h>
>   #include <linux/suspend.h>
> +#include <linux/rseq.h>
>   
>   #include <asm/processor.h>
>   #include <asm/ioctl.h>
> @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
>   		r = kvm_arch_vcpu_ioctl_run(vcpu);
>   		vcpu->wants_to_run = false;
>   
> +		rseq_virt_userspace_exit();
> +
>   		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>   		break;
>   	}
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending
  2025-08-23 16:39 ` [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
@ 2025-08-25 18:02   ` Mathieu Desnoyers
  2025-09-02 13:41     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:02 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> There is no need to update these values unconditionally if there is no
> event pending.

I agree with this change.

On a related note, I wonder if arch/powerpc/mm/numa.c:
find_and_update_cpu_nid() should set the rseq_event pending bool to true
for each thread in the system ?
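
A rough sketch (an assumption about how that could look, not part of
the posted series) which reuses the helper introduced earlier in the
series to mark every rseq-using thread after the CPU-to-node mapping
changed:

#include <linux/rseq.h>
#include <linux/sched/signal.h>

static void rseq_force_update_all_threads(void)
{
	struct task_struct *g, *t;

	rcu_read_lock();
	for_each_process_thread(g, t) {
		/* Sets the pending state and TIF_NOTIFY_RESUME for rseq users */
		rseq_sched_switch_event(t);
	}
	rcu_read_unlock();
}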

Thanks,

Mathieu

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/rseq.c |   11 ++++++-----
>   1 file changed, 6 insertions(+), 5 deletions(-)
> 
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct
>   		t->rseq_event_pending = false;
>   	}
>   
> -	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> -		ret = rseq_ip_fixup(regs, event);
> -		if (unlikely(ret < 0))
> -			goto error;
> -	}
> +	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
> +		return;
> +
> +	ret = rseq_ip_fixup(regs, event);
> +	if (unlikely(ret < 0))
> +		goto error;
>   
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 09/37] rseq: Introduce struct rseq_event
  2025-08-23 16:39 ` [patch V2 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
@ 2025-08-25 18:11   ` Mathieu Desnoyers
  2025-09-02 13:45     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> In preparation for a major rewrite of this code, provide a data structure
> for event management.
> 
> Put the sched_switch event and a indicator for RSEQ on a task into it as a
> start. That uses a union, which allows to mask and clear the whole lot
> efficiently.
> 
> The indicators are explicitely not a bit field. Bit fields generate abysmal

explicitly

> code.
> 
> The boolean members are defined as u8 as that actually guarantees that it
> fits. There seem to be strange architecture ABIs which need more than 8bits
> for a boolean.
> 
> The has_rseq member is redudandant vs. task::rseq, but it turns out that

redundant

> boolean operations and quick checks on the union generate better code than
> fiddling with seperate entities and data types.

separate

> 
> This struct will be extended over time to carry more information.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq.h       |   23 ++++++++++++-----------
>   include/linux/rseq_types.h |   30 ++++++++++++++++++++++++++++++
>   include/linux/sched.h      |    7 ++-----
>   kernel/rseq.c              |    6 ++++--
>   4 files changed, 48 insertions(+), 18 deletions(-)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct
>   
>   static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>   {
> -	if (current->rseq)
> +	if (current->rseq_event.has_rseq)
>   		__rseq_handle_notify_resume(NULL, regs);
>   }
>   
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
>   {
> -	if (current->rseq) {
> -		current->rseq_event_pending = true;
> +	if (current->rseq_event.has_rseq) {
> +		current->rseq_event.sched_switch = true;
>   		__rseq_handle_notify_resume(ksig, regs);
>   	}
>   }
>   
>   static inline void rseq_sched_switch_event(struct task_struct *t)
>   {
> -	if (t->rseq) {
> -		t->rseq_event_pending = true;
> +	if (t->rseq_event.has_rseq) {
> +		t->rseq_event.sched_switch = true;
>   		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
>   	}
>   }
> @@ -32,8 +32,9 @@ static inline void rseq_sched_switch_eve
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
>   	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
> -		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
> -			current->rseq_event_pending = false;
> +		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
> +				 current->rseq_event.events))
> +			current->rseq_event.events = 0;
>   	}
>   }
>   
> @@ -49,7 +50,7 @@ static __always_inline void rseq_exit_to
>    */
>   static inline void rseq_virt_userspace_exit(void)
>   {
> -	if (current->rseq_event_pending)
> +	if (current->rseq_event.sched_switch)
>   		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
>   }
>   
> @@ -63,12 +64,12 @@ static inline void rseq_fork(struct task
>   		t->rseq = NULL;
>   		t->rseq_len = 0;
>   		t->rseq_sig = 0;
> -		t->rseq_event_pending = false;
> +		t->rseq_event.all = 0;
>   	} else {
>   		t->rseq = current->rseq;
>   		t->rseq_len = current->rseq_len;
>   		t->rseq_sig = current->rseq_sig;
> -		t->rseq_event_pending = current->rseq_event_pending;
> +		t->rseq_event = current->rseq_event;
>   	}
>   }
>   
> @@ -77,7 +78,7 @@ static inline void rseq_execve(struct ta
>   	t->rseq = NULL;
>   	t->rseq_len = 0;
>   	t->rseq_sig = 0;
> -	t->rseq_event_pending = false;
> +	t->rseq_event.all = 0;
>   }
>   
>   #else /* CONFIG_RSEQ */
> --- /dev/null
> +++ b/include/linux/rseq_types.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_RSEQ_TYPES_H
> +#define _LINUX_RSEQ_TYPES_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * struct rseq_event - Storage for rseq related event management
> + * @all:		Compound to initialize and clear the data efficiently
> + * @events:		Compund to access events with a single load/store

Compound

> + * @sched_switch:	True if the task was scheduled out
> + * @has_rseq:		True if the task has a rseq pointer installed
> + */
> +struct rseq_event {
> +	union {
> +		u32				all;
> +		struct {
> +			union {
> +				u16		events;
> +				struct {
> +					u8	sched_switch;
> +				};

Is Alpha still supported, or can we assume bytewise loads/stores ?

Are those events meant to each consume 1 byte (which limits us to 2
events for a 2-byte "events"/4-byte "all"), or is the plan to update
them with bitwise OR/AND-NOT ?
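
For reference, a small sketch of the two access patterns the question
distinguishes, assuming the struct rseq_event from this patch is in
scope; individual events are single-byte members, while clearing goes
through the compound members instead of bitwise masking:

static inline void rseq_event_access_patterns(struct rseq_event *ev)
{
	ev->sched_switch = 1;	/* one byte store per event             */
	ev->events = 0;		/* clear all event bytes in one store   */
	ev->all = 0;		/* clear events and has_rseq together   */
}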

Thanks,

Mathieu

> +			};
> +
> +			u8			has_rseq;
> +		};
> +	};
> +};
> +
> +#endif
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -41,6 +41,7 @@
>   #include <linux/task_io_accounting.h>
>   #include <linux/posix-timers_types.h>
>   #include <linux/restart_block.h>
> +#include <linux/rseq_types.h>
>   #include <uapi/linux/rseq.h>
>   #include <linux/seqlock_types.h>
>   #include <linux/kcsan.h>
> @@ -1404,11 +1405,7 @@ struct task_struct {
>   	struct rseq __user		*rseq;
>   	u32				rseq_len;
>   	u32				rseq_sig;
> -	/*
> -	 * RmW on rseq_event_pending must be performed atomically
> -	 * with respect to preemption.
> -	 */
> -	bool				rseq_event_pending;
> +	struct rseq_event		rseq_event;
>   # ifdef CONFIG_DEBUG_RSEQ
>   	/*
>   	 * This is a place holder to save a copy of the rseq fields for
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct
>   	 * inconsistencies.
>   	 */
>   	scoped_guard(RSEQ_EVENT_GUARD) {
> -		event = t->rseq_event_pending;
> -		t->rseq_event_pending = false;
> +		event = t->rseq_event.sched_switch;
> +		t->rseq_event.sched_switch = false;
>   	}
>   
>   	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
> @@ -523,6 +523,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   		current->rseq = NULL;
>   		current->rseq_sig = 0;
>   		current->rseq_len = 0;
> +		current->rseq_event.all = 0;
>   		return 0;
>   	}
>   
> @@ -595,6 +596,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	 * registered, ensure the cpu_id_start and cpu_id fields
>   	 * are updated before returning to user-space.
>   	 */
> +	current->rseq_event.has_rseq = true;
>   	rseq_sched_switch_event(current);
>   
>   	return 0;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 10/37] entry: Cleanup header
  2025-08-23 16:39 ` [patch V2 10/37] entry: Cleanup header Thomas Gleixner
@ 2025-08-25 18:13   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:13 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Cleanup the include ordering, kernel-doc and other trivialities before
> making further changes.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> 
> ---
>   include/linux/entry-common.h     |    8 ++++----
>   include/linux/irq-entry-common.h |    2 ++
>   2 files changed, 6 insertions(+), 4 deletions(-)
> ---
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -3,11 +3,11 @@
>   #define __LINUX_ENTRYCOMMON_H
>   
>   #include <linux/irq-entry-common.h>
> +#include <linux/livepatch.h>
>   #include <linux/ptrace.h>
> +#include <linux/resume_user_mode.h>
>   #include <linux/seccomp.h>
>   #include <linux/sched.h>
> -#include <linux/livepatch.h>
> -#include <linux/resume_user_mode.h>
>   
>   #include <asm/entry-common.h>
>   #include <asm/syscall.h>
> @@ -37,6 +37,7 @@
>   				 SYSCALL_WORK_SYSCALL_AUDIT |		\
>   				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
>   				 ARCH_SYSCALL_WORK_ENTER)
> +
>   #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
>   				 SYSCALL_WORK_SYSCALL_TRACE |		\
>   				 SYSCALL_WORK_SYSCALL_AUDIT |		\
> @@ -61,8 +62,7 @@
>    */
>   void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
>   
> -long syscall_trace_enter(struct pt_regs *regs, long syscall,
> -			 unsigned long work);
> +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
>   
>   /**
>    * syscall_enter_from_user_mode_work - Check and handle work before invoking
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
>   
>   /**
>    * enter_from_user_mode - Establish state when coming from user mode
> + * @regs:	Pointer to currents pt_regs
>    *
>    * Syscall/interrupt entry disables interrupts, but user mode is traced as
>    * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
> @@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
>    * Conditional reschedule with additional sanity checks.
>    */
>   void raw_irqentry_exit_cond_resched(void);
> +
>   #ifdef CONFIG_PREEMPT_DYNAMIC
>   #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>   #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 13/37] sched: Move MM CID related functions to sched.h
  2025-08-23 16:39 ` [patch V2 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
@ 2025-08-25 18:14   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:14 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> There is nothing mm specific in that and including mm.h can cause header
> recursion hell.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/mm.h    |   25 -------------------------
>   include/linux/sched.h |   26 ++++++++++++++++++++++++++
>   2 files changed, 26 insertions(+), 25 deletions(-)
> 
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2310,31 +2310,6 @@ struct zap_details {
>   /* Set in unmap_vmas() to indicate a final unmap call.  Only used by hugetlb */
>   #define  ZAP_FLAG_UNMAP              ((__force zap_flags_t) BIT(1))
>   
> -#ifdef CONFIG_SCHED_MM_CID
> -void sched_mm_cid_before_execve(struct task_struct *t);
> -void sched_mm_cid_after_execve(struct task_struct *t);
> -void sched_mm_cid_fork(struct task_struct *t);
> -void sched_mm_cid_exit_signals(struct task_struct *t);
> -static inline int task_mm_cid(struct task_struct *t)
> -{
> -	return t->mm_cid;
> -}
> -#else
> -static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
> -static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
> -static inline void sched_mm_cid_fork(struct task_struct *t) { }
> -static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
> -static inline int task_mm_cid(struct task_struct *t)
> -{
> -	/*
> -	 * Use the processor id as a fall-back when the mm cid feature is
> -	 * disabled. This provides functional per-cpu data structure accesses
> -	 * in user-space, althrough it won't provide the memory usage benefits.
> -	 */
> -	return raw_smp_processor_id();
> -}
> -#endif
> -
>   #ifdef CONFIG_MMU
>   extern bool can_do_mlock(void);
>   #else
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2309,4 +2309,30 @@ static __always_inline void alloc_tag_re
>   #define alloc_tag_restore(_tag, _old)		do {} while (0)
>   #endif
>   
> +/* Avoids recursive inclusion hell */
> +#ifdef CONFIG_SCHED_MM_CID
> +void sched_mm_cid_before_execve(struct task_struct *t);
> +void sched_mm_cid_after_execve(struct task_struct *t);
> +void sched_mm_cid_fork(struct task_struct *t);
> +void sched_mm_cid_exit_signals(struct task_struct *t);
> +static inline int task_mm_cid(struct task_struct *t)
> +{
> +	return t->mm_cid;
> +}
> +#else
> +static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
> +static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
> +static inline void sched_mm_cid_fork(struct task_struct *t) { }
> +static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
> +static inline int task_mm_cid(struct task_struct *t)
> +{
> +	/*
> +	 * Use the processor id as a fall-back when the mm cid feature is
> +	 * disabled. This provides functional per-cpu data structure accesses
> +	 * in user-space, althrough it won't provide the memory usage benefits.
> +	 */
> +	return task_cpu(t);
> +}
> +#endif
> +
>   #endif
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 14/37] rseq: Cache CPU ID and MM CID values
  2025-08-23 16:39 ` [patch V2 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
@ 2025-08-25 18:19   ` Mathieu Desnoyers
  2025-09-02 13:48     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:19 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> In preparation for rewriting RSEQ exit to user space handling provide
> storage to cache the CPU ID and MM CID values which were written to user
> space. That prepares for a quick check, which avoids the update when
> nothing changed.

What should we do about the numa node_id field ?

On pretty much all architectures except powerpc (AFAIK) it's invariant
for a given topology, i.e. it can be derived from cpu_id.

On powerpc, we could perhaps reset the cached cpu_id to ~0U for
each thread to trigger an update ? Or just don't care about this ?
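
A minimal sketch of the "reset the cached cpu_id" option (assumption,
not in the posted series), following the ~0ULL convention this patch
already uses in rseq_fork()/rseq_execve():

static inline void rseq_invalidate_cached_ids(struct task_struct *t)
{
	/* Can never match a real cpu_id/mm_cid pair, forcing a rewrite */
	t->rseq_ids.cpu_cid = ~0ULL;
}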

Thanks,

Mathieu

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq.h        |    3 +++
>   include/linux/rseq_types.h  |   19 +++++++++++++++++++
>   include/linux/sched.h       |    1 +
>   include/trace/events/rseq.h |    4 ++--
>   kernel/rseq.c               |    4 ++++
>   5 files changed, 29 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -64,11 +64,13 @@ static inline void rseq_fork(struct task
>   		t->rseq = NULL;
>   		t->rseq_len = 0;
>   		t->rseq_sig = 0;
> +		t->rseq_ids.cpu_cid = ~0ULL;
>   		t->rseq_event.all = 0;
>   	} else {
>   		t->rseq = current->rseq;
>   		t->rseq_len = current->rseq_len;
>   		t->rseq_sig = current->rseq_sig;
> +		t->rseq_ids.cpu_cid = ~0ULL;
>   		t->rseq_event = current->rseq_event;
>   	}
>   }
> @@ -78,6 +80,7 @@ static inline void rseq_execve(struct ta
>   	t->rseq = NULL;
>   	t->rseq_len = 0;
>   	t->rseq_sig = 0;
> +	t->rseq_ids.cpu_cid = ~0ULL;
>   	t->rseq_event.all = 0;
>   }
>   
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -27,4 +27,23 @@ struct rseq_event {
>   	};
>   };
>   
> +/*
> + * struct rseq_ids - Cache for ids, which need to be updated
> + * @cpu_cid:	Compound of @cpu_id and @mm_cid to make the
> + *		compiler emit a single compare on 64-bit
> + * @cpu_id:	The CPU ID which was written last to user space
> + * @mm_cid:	The MM CID which was written last to user space
> + *
> + * @cpu_id and @mm_cid are updated when the data is written to user space.
> + */
> +struct rseq_ids {
> +	union {
> +		u64		cpu_cid;
> +		struct {
> +			u32	cpu_id;
> +			u32	mm_cid;
> +		};
> +	};
> +};
> +
>   #endif
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1406,6 +1406,7 @@ struct task_struct {
>   	u32				rseq_len;
>   	u32				rseq_sig;
>   	struct rseq_event		rseq_event;
> +	struct rseq_ids			rseq_ids;
>   # ifdef CONFIG_DEBUG_RSEQ
>   	/*
>   	 * This is a place holder to save a copy of the rseq fields for
> --- a/include/trace/events/rseq.h
> +++ b/include/trace/events/rseq.h
> @@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
>   	),
>   
>   	TP_fast_assign(
> -		__entry->cpu_id = raw_smp_processor_id();
> +		__entry->cpu_id = t->rseq_ids.cpu_id;
>   		__entry->node_id = cpu_to_node(__entry->cpu_id);
> -		__entry->mm_cid = task_mm_cid(t);
> +		__entry->mm_cid = t->rseq_ids.mm_cid;
>   	),
>   
>   	TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struc
>   	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
>   	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
>   
> +	/* Cache the user space values */
> +	t->rseq_ids.cpu_id = cpu_id;
> +	t->rseq_ids.mm_cid = mm_cid;
> +
>   	/*
>   	 * Additional feature fields added after ORIG_RSEQ_SIZE
>   	 * need to be conditionally updated only if
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 15/37] rseq: Record interrupt from user space
  2025-08-23 16:39 ` [patch V2 15/37] rseq: Record interrupt from user space Thomas Gleixner
@ 2025-08-25 18:29   ` Mathieu Desnoyers
  2025-09-02 13:54     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:29 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> For RSEQ the only relevant reason to inspect and eventually fixup (abort)
> user space critical sections is when user space was interrupted and the
> task was scheduled out.
> 
> If the user to kernel entry was from a syscall no fixup is required. If
> user space invokes a syscall from a critical section it can keep the
> pieces as documented.
> 
> This is only supported on architectures, which utilize the generic entry

no comma between "architectures" and "which".

> code. If your architecture does not use it, bad luck.
> 

Should we eventually add a "depends on GENERIC_IRQ_ENTRY" to RSEQ then ?

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/irq-entry-common.h |    3 ++-
>   include/linux/rseq.h             |   16 +++++++++++-----
>   include/linux/rseq_entry.h       |   18 ++++++++++++++++++
>   include/linux/rseq_types.h       |    2 ++
>   4 files changed, 33 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -4,7 +4,7 @@
>   
>   #include <linux/context_tracking.h>
>   #include <linux/kmsan.h>
> -#include <linux/rseq.h>
> +#include <linux/rseq_entry.h>
>   #include <linux/static_call_types.h>
>   #include <linux/syscalls.h>
>   #include <linux/tick.h>
> @@ -281,6 +281,7 @@ static __always_inline void exit_to_user
>   static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
>   {
>   	enter_from_user_mode(regs);
> +	rseq_note_user_irq_entry();

As long as this also covers the following scenarios, I'm OK with this:

- trap/exception from an rseq critical section,
- NMI over an rseq critical section.

Thanks,

Mathieu

>   }
>   
>   /**
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -31,11 +31,17 @@ static inline void rseq_sched_switch_eve
>   
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
> -	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
> -		if (WARN_ON_ONCE(current->rseq_event.has_rseq &&
> -				 current->rseq_event.events))
> -			current->rseq_event.events = 0;
> -	}
> +	struct rseq_event *ev = &current->rseq_event;
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
> +		WARN_ON_ONCE(ev->sched_switch);
> +
> +	/*
> +	 * Ensure that event (especially user_irq) is cleared when the
> +	 * interrupt did not result in a schedule and therefore the
> +	 * rseq processing did not clear it.
> +	 */
> +	ev->events = 0;
>   }
>   
>   /*
> --- /dev/null
> +++ b/include/linux/rseq_entry.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_RSEQ_ENTRY_H
> +#define _LINUX_RSEQ_ENTRY_H
> +
> +#ifdef CONFIG_RSEQ
> +#include <linux/rseq.h>
> +
> +static __always_inline void rseq_note_user_irq_entry(void)
> +{
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> +		current->rseq_event.user_irq = true;
> +}
> +
> +#else /* CONFIG_RSEQ */
> +static inline void rseq_note_user_irq_entry(void) { }
> +#endif /* !CONFIG_RSEQ */
> +
> +#endif /* _LINUX_RSEQ_ENTRY_H */
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -9,6 +9,7 @@
>    * @all:		Compound to initialize and clear the data efficiently
>    * @events:		Compund to access events with a single load/store
>    * @sched_switch:	True if the task was scheduled out
> + * @user_irq:		True on interrupt entry from user mode
>    * @has_rseq:		True if the task has a rseq pointer installed
>    */
>   struct rseq_event {
> @@ -19,6 +20,7 @@ struct rseq_event {
>   				u16		events;
>   				struct {
>   					u8	sched_switch;
> +					u8	user_irq;
>   				};
>   			};
>   
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code
  2025-08-23 16:39 ` [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
@ 2025-08-25 18:32   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:32 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
> fast path, so that the header can be safely included by code which defines
> actual trace points.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq_entry.h |   30 ++++++++++++++++++++++++++++++
>   kernel/rseq.c              |   17 +++++++++++++++++
>   2 files changed, 47 insertions(+)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -5,6 +5,36 @@
>   #ifdef CONFIG_RSEQ
>   #include <linux/rseq.h>
>   
> +#include <linux/tracepoint-defs.h>
> +
> +#ifdef CONFIG_TRACEPOINTS
> +DECLARE_TRACEPOINT(rseq_update);
> +DECLARE_TRACEPOINT(rseq_ip_fixup);
> +void __rseq_trace_update(struct task_struct *t);
> +void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
> +			   unsigned long offset, unsigned long abort_ip);
> +
> +static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
> +{
> +	if (tracepoint_enabled(rseq_update)) {
> +		if (ids)

Does it work if you do:

if (tracepoint_enabled(rseq_update) && ids) {

or is there some macro magic preventing this ?
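
For clarity, the combined form being asked about, assuming that
tracepoint_enabled() expands to a plain static-branch test expression
(as in tracepoint-defs.h) and therefore composes with &&:

static inline void rseq_trace_update_alt(struct task_struct *t, struct rseq_ids *ids)
{
	if (tracepoint_enabled(rseq_update) && ids)
		__rseq_trace_update(t);
}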

Otherwise:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> +			__rseq_trace_update(t);
> +	}
> +}
> +
> +static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
> +				       unsigned long offset, unsigned long abort_ip)
> +{
> +	if (tracepoint_enabled(rseq_ip_fixup))
> +		__rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
> +}
> +
> +#else /* CONFIG_TRACEPOINT */
> +static inline void rseq_trace_update(struct task_struct *t) { }
> +static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
> +				       unsigned long offset, unsigned long abort_ip) { }
> +#endif /* !CONFIG_TRACEPOINT */
> +
>   static __always_inline void rseq_note_user_irq_entry(void)
>   {
>   	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -91,6 +91,23 @@
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
>   
> +#ifdef CONFIG_TRACEPOINTS
> +/*
> + * Out of line, so the actual update functions can be in a header to be
> + * inlined into the exit to user code.
> + */
> +void __rseq_trace_update(struct task_struct *t)
> +{
> +	trace_rseq_update(t);
> +}
> +
> +void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
> +			   unsigned long offset, unsigned long abort_ip)
> +{
> +	trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
> +}
> +#endif /* CONFIG_TRACEPOINTS */
> +
>   #ifdef CONFIG_DEBUG_RSEQ
>   static struct rseq *rseq_kernel_fields(struct task_struct *t)
>   {
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 17/37] rseq: Expose lightweight statistics in debugfs
  2025-08-23 16:39 ` [patch V2 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
@ 2025-08-25 18:34   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:34 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Analyzing the call frequency without actually using tracing is helpful for
> analysis of this infrastructure. The overhead is minimal as it just
> increments a per CPU counter associated to each operation.
> 
> The debugfs readout provides a racy sum of all counters.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq.h       |   16 ---------
>   include/linux/rseq_entry.h |   49 +++++++++++++++++++++++++++
>   init/Kconfig               |   12 ++++++
>   kernel/rseq.c              |   79 +++++++++++++++++++++++++++++++++++++++++----
>   4 files changed, 133 insertions(+), 23 deletions(-)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -29,21 +29,6 @@ static inline void rseq_sched_switch_eve
>   	}
>   }
>   
> -static __always_inline void rseq_exit_to_user_mode(void)
> -{
> -	struct rseq_event *ev = &current->rseq_event;
> -
> -	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
> -		WARN_ON_ONCE(ev->sched_switch);
> -
> -	/*
> -	 * Ensure that event (especially user_irq) is cleared when the
> -	 * interrupt did not result in a schedule and therefore the
> -	 * rseq processing did not clear it.
> -	 */
> -	ev->events = 0;
> -}
> -
>   /*
>    * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
>    * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
> @@ -97,7 +82,6 @@ static inline void rseq_sched_switch_eve
>   static inline void rseq_virt_userspace_exit(void) { }
>   static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
>   static inline void rseq_execve(struct task_struct *t) { }
> -static inline void rseq_exit_to_user_mode(void) { }
>   #endif  /* !CONFIG_RSEQ */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -2,6 +2,37 @@
>   #ifndef _LINUX_RSEQ_ENTRY_H
>   #define _LINUX_RSEQ_ENTRY_H
>   
> +/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
> +#ifdef CONFIG_RSEQ_STATS
> +#include <linux/percpu.h>
> +
> +struct rseq_stats {
> +	unsigned long	exit;
> +	unsigned long	signal;
> +	unsigned long	slowpath;
> +	unsigned long	ids;
> +	unsigned long	cs;
> +	unsigned long	clear;
> +	unsigned long	fixup;
> +};
> +
> +DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
> +
> +/*
> + * Slow path has interrupts and preemption enabled, but the fast path
> + * runs with interrupts disabled so there is no point in having the
> + * preemption checks implied in __this_cpu_inc() for every operation.
> + */
> +#ifdef RSEQ_BUILD_SLOW_PATH
> +#define rseq_stat_inc(which)	this_cpu_inc((which))
> +#else
> +#define rseq_stat_inc(which)	raw_cpu_inc((which))
> +#endif
> +
> +#else /* CONFIG_RSEQ_STATS */
> +#define rseq_stat_inc(x)	do { } while (0)
> +#endif /* !CONFIG_RSEQ_STATS */
> +
>   #ifdef CONFIG_RSEQ
>   #include <linux/rseq.h>
>   
> @@ -41,8 +72,26 @@ static __always_inline void rseq_note_us
>   		current->rseq_event.user_irq = true;
>   }
>   
> +static __always_inline void rseq_exit_to_user_mode(void)
> +{
> +	struct rseq_event *ev = &current->rseq_event;
> +
> +	rseq_stat_inc(rseq_stats.exit);
> +
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
> +		WARN_ON_ONCE(ev->sched_switch);
> +
> +	/*
> +	 * Ensure that event (especially user_irq) is cleared when the
> +	 * interrupt did not result in a schedule and therefore the
> +	 * rseq processing did not clear it.
> +	 */
> +	ev->events = 0;
> +}
> +
>   #else /* CONFIG_RSEQ */
>   static inline void rseq_note_user_irq_entry(void) { }
> +static inline void rseq_exit_to_user_mode(void) { }
>   #endif /* !CONFIG_RSEQ */
>   
>   #endif /* _LINUX_RSEQ_ENTRY_H */
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1883,6 +1883,18 @@ config RSEQ
>   
>   	  If unsure, say Y.
>   
> +config RSEQ_STATS
> +	default n
> +	bool "Enable lightweight statistics of restartable sequences" if EXPERT
> +	depends on RSEQ && DEBUG_FS
> +	help
> +	  Enable lightweight counters which expose information about the
> +	  frequency of RSEQ operations via debugfs. Mostly interesting for
> +	  kernel debugging or performance analysis. While lightweight it's
> +	  still adding code into the user/kernel mode transitions.
> +
> +	  If unsure, say N.
> +
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -67,12 +67,16 @@
>    *   F1. <failure>
>    */
>   
> +/* Required to select the proper per_cpu ops for rseq_stats_inc() */
> +#define RSEQ_BUILD_SLOW_PATH
> +
> +#include <linux/debugfs.h>
> +#include <linux/ratelimit.h>
> +#include <linux/rseq_entry.h>
>   #include <linux/sched.h>
> -#include <linux/uaccess.h>
>   #include <linux/syscalls.h>
> -#include <linux/rseq.h>
> +#include <linux/uaccess.h>
>   #include <linux/types.h>
> -#include <linux/ratelimit.h>
>   #include <asm/ptrace.h>
>   
>   #define CREATE_TRACE_POINTS
> @@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long
>   }
>   #endif /* CONFIG_TRACEPOINTS */
>   
> +#ifdef CONFIG_RSEQ_STATS
> +DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
> +
> +static int rseq_debug_show(struct seq_file *m, void *p)
> +{
> +	struct rseq_stats stats = { };
> +	unsigned int cpu;
> +
> +	for_each_possible_cpu(cpu) {
> +		stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
> +		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
> +		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
> +		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
> +		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
> +		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
> +		stats.fixup	+= data_race(per_cpu(rseq_stats.fixup, cpu));
> +	}
> +
> +	seq_printf(m, "exit:   %16lu\n", stats.exit);
> +	seq_printf(m, "signal: %16lu\n", stats.signal);
> +	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
> +	seq_printf(m, "ids:    %16lu\n", stats.ids);
> +	seq_printf(m, "cs:     %16lu\n", stats.cs);
> +	seq_printf(m, "clear:  %16lu\n", stats.clear);
> +	seq_printf(m, "fixup:  %16lu\n", stats.fixup);
> +	return 0;
> +}
> +
> +static int rseq_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, rseq_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations dfs_ops = {
> +	.open		= rseq_debug_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int __init rseq_debugfs_init(void)
> +{
> +	struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
> +
> +	debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
> +	return 0;
> +}
> +__initcall(rseq_debugfs_init);
> +#endif /* CONFIG_RSEQ_STATS */
> +
>   #ifdef CONFIG_DEBUG_RSEQ
>   static struct rseq *rseq_kernel_fields(struct task_struct *t)
>   {
> @@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struc
>   	u32 node_id = cpu_to_node(cpu_id);
>   	u32 mm_cid = task_mm_cid(t);
>   
> -	/*
> -	 * Validate read-only rseq fields.
> -	 */
> +	rseq_stat_inc(rseq_stats.ids);
> +
> +	/* Validate read-only rseq fields on debug kernels */
>   	if (rseq_validate_ro_fields(t))
>   		goto efault;
>   	WARN_ON_ONCE((int) mm_cid < 0);
> +
>   	if (!user_write_access_begin(rseq, t->rseq_len))
>   		goto efault;
>   
> @@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs
>   	struct rseq_cs rseq_cs;
>   	int ret;
>   
> +	rseq_stat_inc(rseq_stats.cs);
> +
>   	ret = rseq_get_rseq_cs(t, &rseq_cs);
>   	if (ret)
>   		return ret;
> @@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs
>   	 * If not nested over a rseq critical section, restart is useless.
>   	 * Clear the rseq_cs pointer and return.
>   	 */
> -	if (!in_rseq_cs(ip, &rseq_cs))
> +	if (!in_rseq_cs(ip, &rseq_cs)) {
> +		rseq_stat_inc(rseq_stats.clear);
>   		return clear_rseq_cs(t->rseq);
> +	}
>   	ret = rseq_check_flags(t, rseq_cs.flags);
>   	if (ret < 0)
>   		return ret;
> @@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs
>   	ret = clear_rseq_cs(t->rseq);
>   	if (ret)
>   		return ret;
> +	rseq_stat_inc(rseq_stats.fixup);
>   	trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
>   			    rseq_cs.abort_ip);
>   	instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
> @@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct
>   	if (unlikely(t->flags & PF_EXITING))
>   		return;
>   
> +	if (ksig)
> +		rseq_stat_inc(rseq_stats.signal);
> +	else
> +		rseq_stat_inc(rseq_stats.slowpath);
> +
>   	/*
>   	 * Read and clear the event pending bit first. If the task
>   	 * was not preempted or migrated or a signal is on the way,
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 18/37] rseq: Provide static branch for runtime debugging
  2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
@ 2025-08-25 18:36   ` Mathieu Desnoyers
  2025-08-25 20:30   ` Michael Jeanson
  1 sibling, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 18:36 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Config based debug is rarely turned on and is not available easily when
> things go wrong.
> 
> Provide a static branch to allow permanent integration of debug mechanisms
> along with the usual toggles in Kconfig, command line and debugfs.
> 
> Requested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   Documentation/admin-guide/kernel-parameters.txt |    4 +
>   include/linux/rseq_entry.h                      |    3
>   init/Kconfig                                    |   14 ++++
>   kernel/rseq.c                                   |   73 ++++++++++++++++++++++--
>   4 files changed, 90 insertions(+), 4 deletions(-)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6443,6 +6443,10 @@
>   			Memory area to be used by remote processor image,
>   			managed by CMA.
>   
> +	rseq_debug=	[KNL] Enable or disable restartable sequence
> +			debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
> +			Format: <bool>
> +
>   	rt_group_sched=	[KNL] Enable or disable SCHED_RR/FIFO group scheduling
>   			when CONFIG_RT_GROUP_SCHED=y. Defaults to
>   			!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
>   #endif /* !CONFIG_RSEQ_STATS */
>   
>   #ifdef CONFIG_RSEQ
> +#include <linux/jump_label.h>
>   #include <linux/rseq.h>
>   
>   #include <linux/tracepoint-defs.h>
> @@ -66,6 +67,8 @@ static inline void rseq_trace_ip_fixup(u
>   				       unsigned long offset, unsigned long abort_ip) { }
>   #endif /* !CONFIG_TRACEPOINT */
>   
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
> +
>   static __always_inline void rseq_note_user_irq_entry(void)
>   {
>   	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1893,10 +1893,24 @@ config RSEQ_STATS
>   
>   	  If unsure, say N.
>   
> +config RSEQ_DEBUG_DEFAULT_ENABLE
> +	default n
> +	bool "Enable restartable sequences debug mode by default" if EXPERT
> +	depends on RSEQ
> +	help
> +	  This enables the static branch for debug mode of restartable
> +	  sequences.
> +
> +	  This also can be controlled on the kernel command line via the
> +	  command line parameter "rseq_debug=0/1" and through debugfs.
> +
> +	  If unsure, say N.
> +
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
>   	depends on RSEQ && DEBUG_KERNEL
> +	select RSEQ_DEBUG_DEFAULT_ENABLE
>   	help
>   	  Enable extra debugging checks for the rseq system call.
>   
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -95,6 +95,27 @@
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
>   
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
> +
> +static inline void rseq_control_debug(bool on)
> +{
> +	if (on)
> +		static_branch_enable(&rseq_debug_enabled);
> +	else
> +		static_branch_disable(&rseq_debug_enabled);
> +}
> +
> +static int __init rseq_setup_debug(char *str)
> +{
> +	bool on;
> +
> +	if (kstrtobool(str, &on))
> +		return -EINVAL;
> +	rseq_control_debug(on);
> +	return 0;
> +}
> +__setup("rseq_debug=", rseq_setup_debug);
> +
>   #ifdef CONFIG_TRACEPOINTS
>   /*
>    * Out of line, so the actual update functions can be in a header to be
> @@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long
>   }
>   #endif /* CONFIG_TRACEPOINTS */
>   
> +#ifdef CONFIG_DEBUG_FS
>   #ifdef CONFIG_RSEQ_STATS
>   DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
>   
> -static int rseq_debug_show(struct seq_file *m, void *p)
> +static int rseq_stats_show(struct seq_file *m, void *p)
>   {
>   	struct rseq_stats stats = { };
>   	unsigned int cpu;
> @@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi
>   	return 0;
>   }
>   
> +static int rseq_stats_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, rseq_stats_show, inode->i_private);
> +}
> +
> +static const struct file_operations stat_ops = {
> +	.open		= rseq_stats_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int __init rseq_stats_init(struct dentry *root_dir)
> +{
> +	debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
> +	return 0;
> +}
> +#else
> +static inline void rseq_stats_init(struct dentry *root_dir) { }
> +#endif /* CONFIG_RSEQ_STATS */
> +
> +static int rseq_debug_show(struct seq_file *m, void *p)
> +{
> +	bool on = static_branch_unlikely(&rseq_debug_enabled);
> +
> +	seq_printf(m, "%d\n", on);
> +	return 0;
> +}
> +
> +static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
> +			    size_t count, loff_t *ppos)
> +{
> +	bool on;
> +
> +	if (kstrtobool_from_user(ubuf, count, &on))
> +		return -EINVAL;
> +
> +	rseq_control_debug(on);
> +	return count;
> +}
> +
>   static int rseq_debug_open(struct inode *inode, struct file *file)
>   {
>   	return single_open(file, rseq_debug_show, inode->i_private);
>   }
>   
> -static const struct file_operations dfs_ops = {
> +static const struct file_operations debug_ops = {
>   	.open		= rseq_debug_open,
>   	.read		= seq_read,
> +	.write		= rseq_debug_write,
>   	.llseek		= seq_lseek,
>   	.release	= single_release,
>   };
> @@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void
>   {
>   	struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
>   
> -	debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
> +	debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
> +	rseq_stats_init(root_dir);
>   	return 0;
>   }
>   __initcall(rseq_debugfs_init);
> -#endif /* CONFIG_RSEQ_STATS */
> +#endif /* CONFIG_DEBUG_FS */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
>   static struct rseq *rseq_kernel_fields(struct task_struct *t)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 19/37] rseq: Provide and use rseq_update_user_cs()
  2025-08-23 16:39 ` [patch V2 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
@ 2025-08-25 19:16   ` Mathieu Desnoyers
  2025-09-02 15:19     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 19:16 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Provide a straightforward implementation to check for and eventually
> clear/fixup critical sections in user space.
> 
> The non-debug version does not any sanity checks and aims for efficiency.

"does not any sanity checks" -> "does no sanity checks"

> 
> The only attack vector which needs to be reliably prevented is an abort IP
> in the kernel address space. That would cause at least x86 to
> return to kernel space via IRET. Instead of a check, just mask the address
> and be done with it.
> 
> The magic signature check along with it's obscure "possible attack" printk

its

> is just voodoo security. If an attacker manages to manipulate the abort_ip
> member in the critical section descriptor, then it can equally manipulate
> any other indirection in the application.

I disagree with this claim. What we are trying to prevent is not an
attacker manipulating an existing rseq_cs descriptor, but rather an
attacker crafting their own descriptor and using it to bypass ROP protections.


> If user space truly cares about
> the security of the critical section descriptors, then it should set them
> up once and map the descriptor memory read only.

AFAIR, the attack pattern we are trying to tackle here is:

The attacker has write access to some memory (e.g. stack or heap) and
uses this area to craft a custom rseq_cs descriptor. By storing the address
of this home-made descriptor to rseq->rseq_cs, the attacker can set abort_ip
to e.g. glibc system(3) and easily call any library function through an
aborting rseq critical section, thus bypassing ROP prevention mechanisms.

Requiring the signature prior to the abort ip target prevents using rseq
to bypass ROP prevention, because those ROP gadget targets don't have
the signature.
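
To make the pattern concrete, here is a minimal user space sketch of the
scenario (illustrative only: how the registered struct rseq is located is
left out, the layout is the plain UAPI one with rseq_cs as a __u64, and
the addresses are made up):

#include <linux/rseq.h>		/* UAPI struct rseq, struct rseq_cs */
#include <stdint.h>
#include <stdlib.h>

static struct rseq_cs evil;	/* attacker-writable memory, e.g. heap */

static void craft(struct rseq *rs, uintptr_t ip_in_target_code)
{
	evil.version		= 0;
	evil.flags		= 0;
	evil.start_ip		= ip_in_target_code;	/* make the IP look "inside" */
	evil.post_commit_offset	= 128;
	evil.abort_ip		= (uintptr_t)&system;	/* arbitrary call target */

	/* Arm it: point the registered rseq area at the home-made descriptor */
	rs->rseq_cs = (uintptr_t)&evil;
}

On the next abort the kernel would redirect the instruction pointer to
abort_ip. The signature check compares the 32 bits at abort_ip - 4 against
the signature registered via rseq(2); system() is not preceded by that
signature, so the crafted descriptor is rejected instead of turning rseq
into a generic "jump anywhere" primitive.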

> There is no justification
> for voodoo security in the kernel fast path to encourage user space to be
> careless under a completely non-sensical "security" claim.
> 
> If the section descriptors are invalid then the resulting misbehaviour of
> the user space application is not the kernel's problem.
> 
> The kernel provides a run-time switchable debug slow path, which implements
> the full zoo of checks (except the silly attack message) including
> termination of the task when one of the gazillion conditions is not met.
> 
> Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
> handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
> be replaced and removed in a subsequent step.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq_entry.h |  194 ++++++++++++++++++++++++++++++++++++
>   include/linux/rseq_types.h |   11 +-
>   kernel/rseq.c              |  238 +++++++++++++--------------------------------
>   3 files changed, 273 insertions(+), 170 deletions(-)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
>   #ifdef CONFIG_RSEQ
>   #include <linux/jump_label.h>
>   #include <linux/rseq.h>
> +#include <linux/uaccess.h>
>   
>   #include <linux/tracepoint-defs.h>
>   
> @@ -69,12 +70,205 @@ static inline void rseq_trace_ip_fixup(u
>   
>   DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
>   
> +#ifdef RSEQ_BUILD_SLOW_PATH
> +#define rseq_inline
> +#else
> +#define rseq_inline __always_inline
> +#endif
> +
> +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> +
>   static __always_inline void rseq_note_user_irq_entry(void)
>   {
>   	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
>   		current->rseq_event.user_irq = true;
>   }
>   
> +/*
> + * Check whether there is a valid critical section and whether the
> + * instruction pointer in @regs is inside the critical section.
> + *
> + *  - If the critical section is invalid, terminate the task.
> + *
> + *  - If valid and the instruction pointer is inside, set it to the abort IP
> + *
> + *  - If valid and the instruction pointer is outside, clear the critical
> + *    section address.
> + *
> + * Returns true, if the section was valid and either fixup or clear was
> + * done, false otherwise.
> + *
> + * In the failure case task::rseq_event::fatal is set when a invalid
> + * section was found. It's clear when the failure was an unresolved page
> + * fault.
> + *
> + * If inlined into the exit to user path with interrupts disabled, the
> + * caller has to protect against page faults with pagefault_disable().
> + *
> + * In preemptible task context this would be counterproductive as the page
> + * faults could not be fully resolved. As a consequence unresolved page
> + * faults in task context are fatal too.
> + */
> +
> +#ifdef RSEQ_BUILD_SLOW_PATH
> +/*
> + * The debug version is put out of line, but kept here so the code stays
> + * together.
> + *
> + * @csaddr has already been checked by the caller to be in user space
> + */
> +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
> +{
> +	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
> +	u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
> +	unsigned long ip = instruction_pointer(regs);
> +	u64 __user *uc_head = (u64 __user *) ucs;
> +	u32 usig, __user *uc_sig;
> +
> +	if (!user_rw_masked_begin(ucs))
> +		return false;
> +
> +	/*
> +	 * Evaluate the user pile and exit if one of the conditions is not
> +	 * fulfilled.
> +	 */
> +	unsafe_get_user(start_ip, &ucs->start_ip, fail);
> +	if (unlikely(start_ip >= tasksize))
> +		goto die;
> +	/* If outside, just clear the critical section. */
> +	if (ip < start_ip)
> +		goto clear;
> +
> +	unsafe_get_user(offset, &ucs->post_commit_offset, fail);
> +	cs_end = start_ip + offset;
> +	/* Check for overflow and wraparound */
> +	if (unlikely(cs_end >= tasksize || cs_end < start_ip))
> +		goto die;
> +
> +	/* If not inside, clear it. */
> +	if (ip >= cs_end)
> +		goto clear;
> +
> +	unsafe_get_user(abort_ip, &ucs->abort_ip, fail);
> +	/* Ensure it's "valid" */
> +	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
> +		goto die;
> +	/* Validate that the abort IP is not in the critical section */
> +	if (unlikely(abort_ip - start_ip < offset))
> +		goto die;
> +
> +	/*
> +	 * Check version and flags for 0. No point in emitting deprecated
> +	 * warnings before dying. That could be done in the slow path
> +	 * eventually, but *shrug*.
> +	 */
> +	unsafe_get_user(head, uc_head, fail);
> +	if (unlikely(head))
> +		goto die;
> +
> +	/* abort_ip - 4 is >= 0. See abort_ip check above */
> +	uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
> +	unsafe_get_user(usig, uc_sig, fail);
> +	if (unlikely(usig != t->rseq_sig))
> +		goto die;
> +
> +	/* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +		/* If not in interrupt from user context, let it die */
> +		if (unlikely(!t->rseq_event.user_irq))
> +			goto die;
> +	}
> +
> +	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
> +	user_access_end();
> +
> +	instruction_pointer_set(regs, (unsigned long)abort_ip);
> +
> +	rseq_stat_inc(rseq_stats.fixup);
> +	rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
> +	return true;
> +clear:
> +	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
> +	user_access_end();
> +	rseq_stat_inc(rseq_stats.clear);
> +	return true;
> +die:
> +	t->rseq_event.fatal = true;
> +fail:
> +	user_access_end();
> +	return false;
> +}
> +#endif /* RSEQ_BUILD_SLOW_PATH */
> +
> +/*
> + * This only ensures that abort_ip is in the user address space by masking it.
> + * No other sanity checks are done here, that's what the debug code is for.

I wonder if we should consider adding a runtime config knob, perhaps
a kernel boot argument, a sysctl, or a per-process prctl, to allow
enforcing rseq ROP protection in security-focused use-cases?

Thanks,

Mathieu

> + */
> +static rseq_inline bool
> +rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
> +{
> +	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
> +	unsigned long ip = instruction_pointer(regs);
> +	u64 start_ip, abort_ip, offset;
> +
> +	rseq_stat_inc(rseq_stats.cs);
> +
> +	if (unlikely(csaddr >= TASK_SIZE)) {
> +		t->rseq_event.fatal = true;
> +		return false;
> +	}
> +
> +	if (static_branch_unlikely(&rseq_debug_enabled))
> +		return rseq_debug_update_user_cs(t, regs, csaddr);
> +
> +	if (!user_rw_masked_begin(ucs))
> +		return false;
> +
> +	unsafe_get_user(start_ip, &ucs->start_ip, fail);
> +	unsafe_get_user(offset, &ucs->post_commit_offset, fail);
> +	unsafe_get_user(abort_ip, &ucs->abort_ip, fail);
> +
> +	/*
> +	 * No sanity checks. If user space screwed it up, it can
> +	 * keep the pieces. That's what debug code is for.
> +	 *
> +	 * If outside, just clear the critical section.
> +	 */
> +	if (ip - start_ip >= offset)
> +		goto clear;
> +
> +	/*
> +	 * Force it to be in user space as x86 IRET would happily return to
> +	 * the kernel. Can't use TASK_SIZE as a mask because that's not
> +	 * necessarily a power of two. Just make sure it's in the user
> +	 * address space. Let the pagefault handler sort it out.
> +	 *
> +	 * Use LONG_MAX and not LLONG_MAX to keep it correct for 32 and 64
> +	 * bit architectures.
> +	 */
> +	abort_ip &= (u64)LONG_MAX;
> +
> +	/* Invalidate the critical section */
> +	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
> +	user_access_end();
> +
> +	/* Update the instruction pointer */
> +	instruction_pointer_set(regs, (unsigned long)abort_ip);
> +
> +	rseq_stat_inc(rseq_stats.fixup);
> +	rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
> +	return true;
> +clear:
> +	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
> +	user_access_end();
> +	rseq_stat_inc(rseq_stats.clear);
> +	return true;
> +
> +fail:
> +	user_access_end();
> +	return false;
> +}
> +
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
>   	struct rseq_event *ev = &current->rseq_event;
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -11,10 +11,12 @@
>    * @sched_switch:	True if the task was scheduled out
>    * @user_irq:		True on interrupt entry from user mode
>    * @has_rseq:		True if the task has a rseq pointer installed
> + * @error:		Compound error code for the slow path to analyze
> + * @fatal:		User space data corrupted or invalid
>    */
>   struct rseq_event {
>   	union {
> -		u32				all;
> +		u64				all;
>   		struct {
>   			union {
>   				u16		events;
> @@ -25,6 +27,13 @@ struct rseq_event {
>   			};
>   
>   			u8			has_rseq;
> +			u8			__pad;
> +			union {
> +				u16		error;
> +				struct {
> +					u8	fatal;
> +				};
> +			};
>   		};
>   	};
>   };
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -382,175 +382,15 @@ static int rseq_reset_rseq_cpu_node_id(s
>   	return -EFAULT;
>   }
>   
> -/*
> - * Get the user-space pointer value stored in the 'rseq_cs' field.
> - */
> -static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
> -{
> -	if (!rseq_cs)
> -		return -EFAULT;
> -
> -#ifdef CONFIG_64BIT
> -	if (get_user(*rseq_cs, &rseq->rseq_cs))
> -		return -EFAULT;
> -#else
> -	if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
> -		return -EFAULT;
> -#endif
> -
> -	return 0;
> -}
> -
> -/*
> - * If the rseq_cs field of 'struct rseq' contains a valid pointer to
> - * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
> - */
> -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> -{
> -	struct rseq_cs __user *urseq_cs;
> -	u64 ptr;
> -	u32 __user *usig;
> -	u32 sig;
> -	int ret;
> -
> -	ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
> -	if (ret)
> -		return ret;
> -
> -	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
> -	if (!ptr) {
> -		memset(rseq_cs, 0, sizeof(*rseq_cs));
> -		return 0;
> -	}
> -	/* Check that the pointer value fits in the user-space process space. */
> -	if (ptr >= TASK_SIZE)
> -		return -EINVAL;
> -	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
> -	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
> -		return -EFAULT;
> -
> -	if (rseq_cs->start_ip >= TASK_SIZE ||
> -	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
> -	    rseq_cs->abort_ip >= TASK_SIZE ||
> -	    rseq_cs->version > 0)
> -		return -EINVAL;
> -	/* Check for overflow. */
> -	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
> -		return -EINVAL;
> -	/* Ensure that abort_ip is not in the critical section. */
> -	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
> -		return -EINVAL;
> -
> -	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
> -	ret = get_user(sig, usig);
> -	if (ret)
> -		return ret;
> -
> -	if (current->rseq_sig != sig) {
> -		printk_ratelimited(KERN_WARNING
> -			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> -			sig, current->rseq_sig, current->pid, usig);
> -		return -EINVAL;
> -	}
> -	return 0;
> -}
> -
> -static bool rseq_warn_flags(const char *str, u32 flags)
> +static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
>   {
> -	u32 test_flags;
> +	u64 csaddr;
>   
> -	if (!flags)
> +	if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs))
>   		return false;
> -	test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
> -	if (test_flags)
> -		pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
> -	test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
> -	if (test_flags)
> -		pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
> -	return true;
> -}
> -
> -static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
> -{
> -	u32 flags;
> -	int ret;
> -
> -	if (rseq_warn_flags("rseq_cs", cs_flags))
> -		return -EINVAL;
> -
> -	/* Get thread flags. */
> -	ret = get_user(flags, &t->rseq->flags);
> -	if (ret)
> -		return ret;
> -
> -	if (rseq_warn_flags("rseq", flags))
> -		return -EINVAL;
> -	return 0;
> -}
> -
> -static int clear_rseq_cs(struct rseq __user *rseq)
> -{
> -	/*
> -	 * The rseq_cs field is set to NULL on preemption or signal
> -	 * delivery on top of rseq assembly block, as well as on top
> -	 * of code outside of the rseq assembly block. This performs
> -	 * a lazy clear of the rseq_cs field.
> -	 *
> -	 * Set rseq_cs to NULL.
> -	 */
> -#ifdef CONFIG_64BIT
> -	return put_user(0UL, &rseq->rseq_cs);
> -#else
> -	if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
> -		return -EFAULT;
> -	return 0;
> -#endif
> -}
> -
> -/*
> - * Unsigned comparison will be true when ip >= start_ip, and when
> - * ip < start_ip + post_commit_offset.
> - */
> -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
> -{
> -	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
> -}
> -
> -static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
> -{
> -	unsigned long ip = instruction_pointer(regs);
> -	struct task_struct *t = current;
> -	struct rseq_cs rseq_cs;
> -	int ret;
> -
> -	rseq_stat_inc(rseq_stats.cs);
> -
> -	ret = rseq_get_rseq_cs(t, &rseq_cs);
> -	if (ret)
> -		return ret;
> -
> -	/*
> -	 * Handle potentially not being within a critical section.
> -	 * If not nested over a rseq critical section, restart is useless.
> -	 * Clear the rseq_cs pointer and return.
> -	 */
> -	if (!in_rseq_cs(ip, &rseq_cs)) {
> -		rseq_stat_inc(rseq_stats.clear);
> -		return clear_rseq_cs(t->rseq);
> -	}
> -	ret = rseq_check_flags(t, rseq_cs.flags);
> -	if (ret < 0)
> -		return ret;
> -	if (!abort)
> -		return 0;
> -	ret = clear_rseq_cs(t->rseq);
> -	if (ret)
> -		return ret;
> -	rseq_stat_inc(rseq_stats.fixup);
> -	trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
> -			    rseq_cs.abort_ip);
> -	instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
> -	return 0;
> +	if (likely(!csaddr))
> +		return true;
> +	return rseq_update_user_cs(t, regs, csaddr);
>   }
>   
>   /*
> @@ -567,8 +407,8 @@ static int rseq_ip_fixup(struct pt_regs
>   void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>   {
>   	struct task_struct *t = current;
> -	int ret, sig;
>   	bool event;
> +	int sig;
>   
>   	/*
>   	 * If invoked from hypervisors before entering the guest via
> @@ -618,8 +458,7 @@ void __rseq_handle_notify_resume(struct
>   	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
>   		return;
>   
> -	ret = rseq_ip_fixup(regs, event);
> -	if (unlikely(ret < 0))
> +	if (!rseq_handle_cs(t, regs))
>   		goto error;
>   
>   	if (unlikely(rseq_update_cpu_node_id(t)))
> @@ -632,6 +471,67 @@ void __rseq_handle_notify_resume(struct
>   }
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> +/*
> + * Unsigned comparison will be true when ip >= start_ip, and when
> + * ip < start_ip + post_commit_offset.
> + */
> +static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
> +{
> +	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
> +}
> +
> +/*
> + * If the rseq_cs field of 'struct rseq' contains a valid pointer to
> + * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
> + */
> +static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> +{
> +	struct rseq_cs __user *urseq_cs;
> +	u64 ptr;
> +	u32 __user *usig;
> +	u32 sig;
> +	int ret;
> +
> +	if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs))
> +		return -EFAULT;
> +
> +	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
> +	if (!ptr) {
> +		memset(rseq_cs, 0, sizeof(*rseq_cs));
> +		return 0;
> +	}
> +	/* Check that the pointer value fits in the user-space process space. */
> +	if (ptr >= TASK_SIZE)
> +		return -EINVAL;
> +	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
> +	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
> +		return -EFAULT;
> +
> +	if (rseq_cs->start_ip >= TASK_SIZE ||
> +	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
> +	    rseq_cs->abort_ip >= TASK_SIZE ||
> +	    rseq_cs->version > 0)
> +		return -EINVAL;
> +	/* Check for overflow. */
> +	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
> +		return -EINVAL;
> +	/* Ensure that abort_ip is not in the critical section. */
> +	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
> +		return -EINVAL;
> +
> +	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
> +	ret = get_user(sig, usig);
> +	if (ret)
> +		return ret;
> +
> +	if (current->rseq_sig != sig) {
> +		printk_ratelimited(KERN_WARNING
> +			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> +			sig, current->rseq_sig, current->pid, usig);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
>   
>   /*
>    * Terminate the process if a syscall is issued within a restartable
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 31/37] asm-generic: Provide generic TIF infrastructure
  2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
  2025-08-23 20:37   ` Arnd Bergmann
@ 2025-08-25 19:33   ` Mathieu Desnoyers
  1 sibling, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 19:33 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Arnd Bergmann, Peter Zijlstra, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Common TIF bits do not have to be defined by every architecture. They can
> be defined in a generic header.
> 
> That allows adding generic TIF bits without chasing a gazillion
> architecture headers, which is again an unjustified burden on anyone who
> works on generic infrastructure as it always needs a boat load of work to
> keep existing architecture code working when adding new stuff.
> 
> While it is not as horrible as the ignorance of the generic entry
> infrastructure, it is a welcome mechanism to make architecture people
> rethink their approach of just leaching generic improvements into
> architecture code and thereby making it accumulatingly harder to maintain
> and improve generic code. It's about time that this changea.

changes

> 
> Provide the infrastructure and split the TIF space in half, 16 generic and
> 16 architecture specific bits.
> 
> This could probably be extended by TIF_SINGLESTEP and BLOCKSTEP, but those
> are only used in architecture specific code. So leave them alone for now.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Arnd Bergmann <arnd@arndb.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   arch/Kconfig                          |    4 ++
>   include/asm-generic/thread_info_tif.h |   48 ++++++++++++++++++++++++++++++++++
>   2 files changed, 52 insertions(+)
> 
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1730,6 +1730,10 @@ config ARCH_VMLINUX_NEEDS_RELOCS
>   	  relocations preserved. This is used by some architectures to
>   	  construct bespoke relocation tables for KASLR.
>   
> +# Select if architecture uses the common generic TIF bits
> +config HAVE_GENERIC_TIF_BITS
> +       bool
> +
>   source "kernel/gcov/Kconfig"
>   
>   source "scripts/gcc-plugins/Kconfig"
> --- /dev/null
> +++ b/include/asm-generic/thread_info_tif.h
> @@ -0,0 +1,48 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_GENERIC_THREAD_INFO_TIF_H_
> +#define _ASM_GENERIC_THREAD_INFO_TIF_H_
> +
> +#include <vdso/bits.h>
> +
> +/* Bits 16-31 are reserved for architecture specific purposes */
> +
> +#define TIF_NOTIFY_RESUME	0	// callback before returning to user
> +#define _TIF_NOTIFY_RESUME	BIT(TIF_NOTIFY_RESUME)
> +
> +#define TIF_SIGPENDING		1	// signal pending
> +#define _TIF_SIGPENDING		BIT(TIF_SIGPENDING)
> +
> +#define TIF_NOTIFY_SIGNAL	2	// signal notifications exist
> +#define _TIF_NOTIFY_SIGNAL	BIT(TIF_NOTIFY_SIGNAL)
> +
> +#define TIF_MEMDIE		3	// is terminating due to OOM killer
> +#define _TIF_MEMDIE		BIT(TIF_MEMDIE)
> +
> +#define TIF_NEED_RESCHED	4	// rescheduling necessary
> +#define _TIF_NEED_RESCHED	BIT(TIF_NEED_RESCHED)
> +
> +#ifdef HAVE_TIF_NEED_RESCHED_LAZY
> +# define TIF_NEED_RESCHED_LAZY	5	// Lazy rescheduling needed
> +# define _TIF_NEED_RESCHED_LAZY	BIT(TIF_NEED_RESCHED_LAZY)
> +#endif
> +
> +#ifdef HAVE_TIF_POLLING_NRFLAG
> +# define TIF_POLLING_NRFLAG	6	// idle is polling for TIF_NEED_RESCHED
> +# define _TIF_POLLING_NRFLAG	BIT(TIF_POLLING_NRFLAG)
> +#endif
> +
> +#define TIF_USER_RETURN_NOTIFY	7	// notify kernel of userspace return
> +#define _TIF_USER_RETURN_NOTIFY	BIT(TIF_USER_RETURN_NOTIFY)
> +
> +#define TIF_UPROBE		8	// breakpointed or singlestepping
> +#define _TIF_UPROBE		BIT(TIF_UPROBE)
> +
> +#define TIF_PATCH_PENDING	9	// pending live patching update
> +#define _TIF_PATCH_PENDING	BIT(TIF_PATCH_PENDING)
> +
> +#ifdef HAVE_TIF_RESTORE_SIGMASK
> +# define TIF_RESTORE_SIGMASK	10	// Restore signal mask in do_signal() */
> +# define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
> +#endif
> +
> +#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 32/37] x86: Use generic TIF bits
  2025-08-23 16:40 ` [patch V2 32/37] x86: Use generic TIF bits Thomas Gleixner
@ 2025-08-25 19:34   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 19:34 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, x86, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> No point in defining generic items and the upcoming RSEQ optimizations are
> only available with this _and_ the generic entry infrastructure, which is
> already used by x86. So no further action required here.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: x86@kernel.org

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   arch/x86/Kconfig                   |    1
>   arch/x86/include/asm/thread_info.h |   74 +++++++++++++++----------------------
>   2 files changed, 31 insertions(+), 44 deletions(-)
> 
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -239,6 +239,7 @@ config X86
>   	select HAVE_EFFICIENT_UNALIGNED_ACCESS
>   	select HAVE_EISA			if X86_32
>   	select HAVE_EXIT_THREAD
> +	select HAVE_GENERIC_TIF_BITS
>   	select HAVE_GUP_FAST
>   	select HAVE_FENTRY			if X86_64 || DYNAMIC_FTRACE
>   	select HAVE_FTRACE_GRAPH_FUNC		if HAVE_FUNCTION_GRAPH_TRACER
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -80,56 +80,42 @@ struct thread_info {
>   #endif
>   
>   /*
> - * thread information flags
> - * - these are process state flags that various assembly files
> - *   may need to access
> + * Tell the generic TIF infrastructure which bits x86 supports
>    */
> -#define TIF_NOTIFY_RESUME	1	/* callback before returning to user */
> -#define TIF_SIGPENDING		2	/* signal pending */
> -#define TIF_NEED_RESCHED	3	/* rescheduling necessary */
> -#define TIF_NEED_RESCHED_LAZY	4	/* Lazy rescheduling needed */
> -#define TIF_SINGLESTEP		5	/* reenable singlestep on user return*/
> -#define TIF_SSBD		6	/* Speculative store bypass disable */
> -#define TIF_SPEC_IB		9	/* Indirect branch speculation mitigation */
> -#define TIF_SPEC_L1D_FLUSH	10	/* Flush L1D on mm switches (processes) */
> -#define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
> -#define TIF_UPROBE		12	/* breakpointed or singlestepping */
> -#define TIF_PATCH_PENDING	13	/* pending live patching update */
> -#define TIF_NEED_FPU_LOAD	14	/* load FPU on return to userspace */
> -#define TIF_NOCPUID		15	/* CPUID is not accessible in userland */
> -#define TIF_NOTSC		16	/* TSC is not accessible in userland */
> -#define TIF_NOTIFY_SIGNAL	17	/* signal notifications exist */
> -#define TIF_MEMDIE		20	/* is terminating due to OOM killer */
> -#define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
> +#define HAVE_TIF_NEED_RESCHED_LAZY
> +#define HAVE_TIF_POLLING_NRFLAG
> +#define HAVE_TIF_SINGLESTEP
> +
> +#include <asm-generic/thread_info_tif.h>
> +
> +/* Architecture specific TIF space starts at 16 */
> +#define TIF_SSBD		16	/* Speculative store bypass disable */
> +#define TIF_SPEC_IB		17	/* Indirect branch speculation mitigation */
> +#define TIF_SPEC_L1D_FLUSH	18	/* Flush L1D on mm switches (processes) */
> +#define TIF_NEED_FPU_LOAD	19	/* load FPU on return to userspace */
> +#define TIF_NOCPUID		20	/* CPUID is not accessible in userland */
> +#define TIF_NOTSC		21	/* TSC is not accessible in userland */
>   #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
>   #define TIF_SPEC_FORCE_UPDATE	23	/* Force speculation MSR update in context switch */
>   #define TIF_FORCED_TF		24	/* true if TF in eflags artificially */
> -#define TIF_BLOCKSTEP		25	/* set when we want DEBUGCTLMSR_BTF */
> +#define TIF_SINGLESTEP		25	/* reenable singlestep on user return*/
> +#define TIF_BLOCKSTEP		26	/* set when we want DEBUGCTLMSR_BTF */
>   #define TIF_LAZY_MMU_UPDATES	27	/* task is updating the mmu lazily */
> -#define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
> +#define TIF_ADDR32		28	/* 32-bit address space on 64 bits */
>   
> -#define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
> -#define _TIF_SIGPENDING		(1 << TIF_SIGPENDING)
> -#define _TIF_NEED_RESCHED	(1 << TIF_NEED_RESCHED)
> -#define _TIF_NEED_RESCHED_LAZY	(1 << TIF_NEED_RESCHED_LAZY)
> -#define _TIF_SINGLESTEP		(1 << TIF_SINGLESTEP)
> -#define _TIF_SSBD		(1 << TIF_SSBD)
> -#define _TIF_SPEC_IB		(1 << TIF_SPEC_IB)
> -#define _TIF_SPEC_L1D_FLUSH	(1 << TIF_SPEC_L1D_FLUSH)
> -#define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
> -#define _TIF_UPROBE		(1 << TIF_UPROBE)
> -#define _TIF_PATCH_PENDING	(1 << TIF_PATCH_PENDING)
> -#define _TIF_NEED_FPU_LOAD	(1 << TIF_NEED_FPU_LOAD)
> -#define _TIF_NOCPUID		(1 << TIF_NOCPUID)
> -#define _TIF_NOTSC		(1 << TIF_NOTSC)
> -#define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
> -#define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
> -#define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
> -#define _TIF_SPEC_FORCE_UPDATE	(1 << TIF_SPEC_FORCE_UPDATE)
> -#define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
> -#define _TIF_BLOCKSTEP		(1 << TIF_BLOCKSTEP)
> -#define _TIF_LAZY_MMU_UPDATES	(1 << TIF_LAZY_MMU_UPDATES)
> -#define _TIF_ADDR32		(1 << TIF_ADDR32)
> +#define _TIF_SSBD		BIT(TIF_SSBD)
> +#define _TIF_SPEC_IB		BIT(TIF_SPEC_IB)
> +#define _TIF_SPEC_L1D_FLUSH	BIT(TIF_SPEC_L1D_FLUSH)
> +#define _TIF_NEED_FPU_LOAD	BIT(TIF_NEED_FPU_LOAD)
> +#define _TIF_NOCPUID		BIT(TIF_NOCPUID)
> +#define _TIF_NOTSC		BIT(TIF_NOTSC)
> +#define _TIF_IO_BITMAP		BIT(TIF_IO_BITMAP)
> +#define _TIF_SPEC_FORCE_UPDATE	BIT(TIF_SPEC_FORCE_UPDATE)
> +#define _TIF_FORCED_TF		BIT(TIF_FORCED_TF)
> +#define _TIF_BLOCKSTEP		BIT(TIF_BLOCKSTEP)
> +#define _TIF_SINGLESTEP		BIT(TIF_SINGLESTEP)
> +#define _TIF_LAZY_MMU_UPDATES	BIT(TIF_LAZY_MMU_UPDATES)
> +#define _TIF_ADDR32		BIT(TIF_ADDR32)
>   
>   /* flags to check in __switch_to() */
>   #define _TIF_WORK_CTXSW_BASE					\
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported
  2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
@ 2025-08-25 19:39   ` Mathieu Desnoyers
  2025-08-25 20:02   ` Sean Christopherson
  1 sibling, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 19:39 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
> with the RSEQ fast path depending on it, but not really handling it.
> 
> Define a separate TIF_RSEQ in the generic TIF space and enable the full
> separation of fast and slow path for architectures which utilize that.
> 
> That avoids the hassle with invocations of resume_user_mode_work() from
> hypervisors, which clear TIF_NOTIFY_RESUME. It makes the re-evaluation
> required at the end of vcpu_run() a NOOP on architectures which utilize
> the generic TIF space and have a separate TIF_RSEQ.
> 
> The hypervisor TIF handling does not include the separate TIF_RSEQ as there
> is no point in doing so. The guest neither knows nor cares about the VMM
> host application's RSEQ state. That state is only relevant when the ioctl()
> returns to user space.
> 
> The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
> handling, but this only happens within exit_to_user_mode_loop(), so
> arguably the hypervisor ioctl() code is long done when this happens.
> 
> This allows further optimizations for blocking syscall heavy workloads in a
> subsequent step.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/asm-generic/thread_info_tif.h |    3 +++
>   include/linux/irq-entry-common.h      |    2 +-
>   include/linux/rseq.h                  |   13 ++++++++++---
>   include/linux/rseq_entry.h            |   23 +++++++++++++++++++----
>   include/linux/thread_info.h           |    5 +++++
>   5 files changed, 38 insertions(+), 8 deletions(-)
> 
> --- a/include/asm-generic/thread_info_tif.h
> +++ b/include/asm-generic/thread_info_tif.h
> @@ -45,4 +45,7 @@
>   # define _TIF_RESTORE_SIGMASK	BIT(TIF_RESTORE_SIGMASK)
>   #endif
>   
> +#define TIF_RSEQ		11	// Run RSEQ fast path
> +#define _TIF_RSEQ		BIT(TIF_RSEQ)
> +
>   #endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -30,7 +30,7 @@
>   #define EXIT_TO_USER_MODE_WORK						\
>   	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
>   	 _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |			\
> -	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL |			\
> +	 _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ |		\
>   	 ARCH_EXIT_TO_USER_MODE_WORK)
>   
>   /**
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -40,7 +40,7 @@ static inline void rseq_signal_deliver(s
>   
>   static inline void rseq_raise_notify_resume(struct task_struct *t)
>   {
> -	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +	set_tsk_thread_flag(t, TIF_RSEQ);
>   }
>   
>   /* Invoked from context switch to force evaluation on exit to user */
> @@ -122,7 +122,7 @@ static inline void rseq_force_update(voi
>    */
>   static inline void rseq_virt_userspace_exit(void)
>   {
> -	if (current->rseq_event.sched_switch)
> +	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
>   		rseq_raise_notify_resume(current);
>   }
>   
> @@ -147,9 +147,16 @@ static inline void rseq_fork(struct task
>   		/*
>   		 * If it has rseq, force it into the slow path right away
>   		 * because it is guaranteed to fault.
> +		 *
> +		 * Setting TIF_NOTIFY_RESUME is redundant but harmless for
> +		 * architectures which do not have a seperate TIF_RSEQ, but
> +		 * for those who do it's required to enforce the slow path
> +		 * as the scheduler sets only TIF_RSEQ.
>   		 */
> -		if (t->rseq_event.has_rseq)
> +		if (t->rseq_event.has_rseq) {
>   			t->rseq_event.slowpath = true;
> +			set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +		}
>   	}
>   }
>   
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -502,18 +502,33 @@ static __always_inline bool __rseq_exit_
>   	return true;
>   }
>   
> +#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
> +# define CHECK_TIF_RSEQ		_TIF_RSEQ
> +static __always_inline void clear_tif_rseq(void)
> +{
> +	clear_thread_flag(TIF_RSEQ);
> +}
> +#else
> +# define CHECK_TIF_RSEQ		0UL
> +static inline void clear_tif_rseq(void) { }
> +#endif
> +
>   static __always_inline unsigned long
>   rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
>   {
>   	/*
>   	 * Check if all work bits have been cleared before handling rseq.
> +	 *
> +	 * In case of a seperate TIF_RSEQ this checks for all other bits to
> +	 * be cleared and TIF_RSEQ to be set.
>   	 */
> -	if ((ti_work & mask) != 0)
> -		return ti_work;
> -
> -	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
> +	if ((ti_work & mask) != CHECK_TIF_RSEQ)
>   		return ti_work;
>   
> +	if (likely(!__rseq_exit_to_user_mode_restart(regs))) {
> +		clear_tif_rseq();
> +		return ti_work & ~CHECK_TIF_RSEQ;
> +	}
>   	return ti_work | _TIF_NOTIFY_RESUME;
>   }
>   
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -67,6 +67,11 @@ enum syscall_work_bit {
>   #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>   #endif
>   
> +#ifndef TIF_RSEQ
> +# define TIF_RSEQ	TIF_NOTIFY_RESUME
> +# define _TIF_RSEQ	_TIF_NOTIFY_RESUME
> +#endif
> +
>   #ifdef __KERNEL__
>   
>   #ifndef arch_set_restart_data
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit
  2025-08-23 16:40 ` [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner
@ 2025-08-25 19:43   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-25 19:43 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Further analysis of the exit path with the separate TIF_RSEQ showed that
> depending on the workload a significant amount of invocations of
> resume_user_mode_work() ends up with no other bit set than TIF_RSEQ.
> 
> On architectures with a separate TIF_RSEQ this can be distinguished and
> checked right at the beginning of the function before entering the loop.
> 
> The quick check is lightweight so it does not impose a massive penalty on
> non-RSEQ use cases. It just checks for the work being empty, except for
> TIF_RSEQ and jumps right into the handling fast path.
> 
> This is truly the only TIF bit there which can be optimized that way
> because the handling runs only when all the other work has been done. The
> optimization spares a full round trip through the other conditionals and an
> interrupt enable/disable pair. The generated code looks reasonable enough
> to justify this and the resulting numbers do so as well.
> 
> The main beneficiaries are blocking syscall heavy workloads, where the
> tasks often end up being scheduled on a different CPU or get a different MM
> CID, but have no other work to handle on return.
> 
> A futex benchmark showed up to 90% shortcut utilization and a measurable
> improvement in perf of ~1%. Non-scheduling workloads neither see an
> improvement nor degrade. A full kernel build shows about 15% shortcuts,
> but no measurable side effects in either direction.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq_entry.h |   14 ++++++++++++++
>   kernel/entry/common.c      |   13 +++++++++++--
>   kernel/rseq.c              |    2 ++
>   3 files changed, 27 insertions(+), 2 deletions(-)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -11,6 +11,7 @@ struct rseq_stats {
>   	unsigned long	signal;
>   	unsigned long	slowpath;
>   	unsigned long	fastpath;
> +	unsigned long	quicktif;
>   	unsigned long	ids;
>   	unsigned long	cs;
>   	unsigned long	clear;
> @@ -532,6 +533,14 @@ rseq_exit_to_user_mode_work(struct pt_re
>   	return ti_work | _TIF_NOTIFY_RESUME;
>   }
>   
> +static __always_inline bool
> +rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
> +{
> +	if (IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS))
> +		return (ti_work & mask) == CHECK_TIF_RSEQ;
> +	return false;
> +}
> +
>   #endif /* !CONFIG_GENERIC_ENTRY */
>   
>   static __always_inline void rseq_syscall_exit_to_user_mode(void)
> @@ -577,6 +586,11 @@ static inline unsigned long rseq_exit_to
>   {
>   	return ti_work;
>   }
> +
> +static inline bool rseq_exit_to_user_mode_early(unsigned long ti_work, const unsigned long mask)
> +{
> +	return false;
> +}
>   static inline void rseq_note_user_irq_entry(void) { }
>   static inline void rseq_syscall_exit_to_user_mode(void) { }
>   static inline void rseq_irqentry_exit_to_user_mode(void) { }
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -22,7 +22,14 @@ void __weak arch_do_signal_or_restart(st
>   	/*
>   	 * Before returning to user space ensure that all pending work
>   	 * items have been completed.
> +	 *
> +	 * Optimize for TIF_RSEQ being the only bit set.
>   	 */
> +	if (rseq_exit_to_user_mode_early(ti_work, EXIT_TO_USER_MODE_WORK)) {
> +		rseq_stat_inc(rseq_stats.quicktif);
> +		goto do_rseq;
> +	}
> +
>   	do {
>   		local_irq_enable_exit_to_user(ti_work);
>   
> @@ -56,10 +63,12 @@ void __weak arch_do_signal_or_restart(st
>   
>   		ti_work = read_thread_flags();
>   
> +	do_rseq:
>   		/*
>   		 * This returns the unmodified ti_work, when ti_work is not
> -		 * empty. In that case it waits for the next round to avoid
> -		 * multiple updates in case of rescheduling.
> +		 * empty (except for TIF_RSEQ). In that case it waits for
> +		 * the next round to avoid multiple updates in case of
> +		 * rescheduling.
>   		 *
>   		 * When it handles rseq it returns either with empty work
>   		 * on success or with TIF_NOTIFY_RESUME set on failure to
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -134,6 +134,7 @@ static int rseq_stats_show(struct seq_fi
>   		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
>   		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
>   		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
> +		stats.quicktif	+= data_race(per_cpu(rseq_stats.quicktif, cpu));
>   		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
>   		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
>   		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
> @@ -144,6 +145,7 @@ static int rseq_stats_show(struct seq_fi
>   	seq_printf(m, "signal: %16lu\n", stats.signal);
>   	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
>   	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
> +	seq_printf(m, "quickt: %16lu\n", stats.quicktif);
>   	seq_printf(m, "ids:    %16lu\n", stats.ids);
>   	seq_printf(m, "cs:     %16lu\n", stats.cs);
>   	seq_printf(m, "clear:  %16lu\n", stats.clear);
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported
  2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
  2025-08-25 19:39   ` Mathieu Desnoyers
@ 2025-08-25 20:02   ` Sean Christopherson
  2025-09-02 11:03     ` Thomas Gleixner
  1 sibling, 1 reply; 91+ messages in thread
From: Sean Christopherson @ 2025-08-25 20:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Wei Liu, Dexuan Cui,
	x86, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Sat, Aug 23, 2025, Thomas Gleixner wrote:
> @@ -122,7 +122,7 @@ static inline void rseq_force_update(voi
>   */
>  static inline void rseq_virt_userspace_exit(void)
>  {
> -	if (current->rseq_event.sched_switch)
> +	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)

Rather than pivot on CONFIG_HAVE_GENERIC_TIF_BITS, which makes the "why" quite
difficult to find/understand, what if this checks TIF_RSEQ == TIF_NOTIFY_RESUME?
That would also allow architectures to define TIF_RSEQ without switching to the
generic TIF bits implementation (though I don't know that we want to encourage
that?).

Updating the comment to explain what's going on would also be helpful, e.g.

diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 185a4875b261..9a8e238ae9d1 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -112,17 +112,17 @@ static inline void rseq_force_update(void)
 
 /*
  * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't provide a separate
+ * TIF_RSEQ flag. To avoid updating user space RSEQ in that case just to do it
+ * eventually again before returning to user space, __rseq_handle_slowpath()
+ * does nothing when invoked with NULL register state.
  *
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-       if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
+       if (TIF_RSEQ == TIF_NOTIFY_RESUME && current->rseq_event.sched_switch)
                rseq_raise_notify_resume(current);
 }
 
>  		rseq_raise_notify_resume(current);
>  }


* Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
  2025-08-25 17:54   ` Mathieu Desnoyers
@ 2025-08-25 20:24     ` Sean Christopherson
  2025-09-02 15:37       ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Sean Christopherson @ 2025-08-25 20:24 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, LKML, Jens Axboe, Paolo Bonzini, Wei Liu,
	Dexuan Cui, Peter Zijlstra, Paul E. McKenney, Boqun Feng, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25, 2025, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
> > Hypervisors invoke resume_user_mode_work() before entering the guest, which
> > clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> > space context available to them, so the rseq notify handler skips
> > inspecting the critical section, but updates the CPU/MM CID values
> > unconditionally so that the eventual pending rseq event is not lost on the
> > way to user space.
> > 
> > This is a pointless exercise as the task might be rescheduled before
> > actually returning to user space and it creates unnecessary work in the
> > vcpu_run() loops.
> 
> One question here: AFAIU, this removes the updates to the cpu_id_start,
> cpu_id, mm_cid, and node_id fields on exit to virt usermode. This means
> that while the virt guest is running in usermode, the host hypervisor
> process has stale rseq fields, until it eventually returns to the
> hypervisor's host userspace (from ioctl).
> 
> Considering the rseq uapi documentation, this should not matter.
> Each of those fields have this statement:
> 
> "This field should only be read by the thread which registered this data
> structure."
> 
> I can however think of use-cases for reading the rseq fields from other
> hypervisor threads to figure out information about thread placement.
> Doing so would however go against the documented uapi.
> 
> I'd rather ask whether anyone is misusing this uapi in that way before
> going ahead with the change, just to prevent surprises.
> 
> I'm OK with the re-trigger of rseq, as it does indeed appear to fix
> an issue, but I'm concerned about the ABI impact of skipping the
> rseq_update_cpu_node_id() on return to virt userspace.
> 
> Thoughts ?

I know the idea of exposing rseq to paravirtualized guests has been floated (more
than once), but I don't _think_ anyone has actually shipped anything of that 
nature.

> > @@ -49,6 +49,7 @@
> >   #include <linux/lockdep.h>
> >   #include <linux/kthread.h>
> >   #include <linux/suspend.h>
> > +#include <linux/rseq.h>
> >   #include <asm/processor.h>
> >   #include <asm/ioctl.h>
> > @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
> >   		r = kvm_arch_vcpu_ioctl_run(vcpu);
> >   		vcpu->wants_to_run = false;
> > +		rseq_virt_userspace_exit();

I don't love bleeding even more entry/rseq details into KVM.  Rather than optimize
KVM and then add TIF_RSEQ, what if we do the opposite?  I.e. add TIF_RSEQ to
XFER_TO_GUEST_MODE_WORK as part of "rseq: Switch to TIF_RSEQ if supported", and
then drop TIF_RSEQ from XFER_TO_GUEST_MODE_WORK in a new patch?

That should make it easier to revert the KVM/virt change if it turns out PV setups
are playing games with rseq, and it would give the stragglers (arm64 in particular)
some motivation to implement TIF_RSEQ and/or switch to generic TIF bits.
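
Something like the below as a sketch (assuming the current mainline
XFER_TO_GUEST_MODE_WORK definition in include/linux/entry-kvm.h and the
_TIF_RSEQ name introduced in this series), with the follow-up patch then
simply dropping _TIF_RSEQ from the mask again:

#define XFER_TO_GUEST_MODE_WORK						\
	(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
	 _TIF_NOTIFY_RESUME | _TIF_RSEQ | ARCH_XFER_TO_GUEST_MODE_WORK)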


* Re: [patch V2 18/37] rseq: Provide static branch for runtime debugging
  2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
  2025-08-25 18:36   ` Mathieu Desnoyers
@ 2025-08-25 20:30   ` Michael Jeanson
  2025-09-02 13:56     ` Thomas Gleixner
  1 sibling, 1 reply; 91+ messages in thread
From: Michael Jeanson @ 2025-08-25 20:30 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Config based debug is rarely turned on and is not available easily when
> things go wrong.
> 
> Provide a static branch to allow permanent integration of debug mechanisms
> along with the usual toggles in Kconfig, command line and debugfs.
> 
> Requested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   Documentation/admin-guide/kernel-parameters.txt |    4 +
>   include/linux/rseq_entry.h                      |    3
>   init/Kconfig                                    |   14 ++++
>   kernel/rseq.c                                   |   73 ++++++++++++++++++++++--
>   4 files changed, 90 insertions(+), 4 deletions(-)
> 
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -6443,6 +6443,10 @@
>   			Memory area to be used by remote processor image,
>   			managed by CMA.
>   
> +	rseq_debug=	[KNL] Enable or disable restartable sequence
> +			debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
> +			Format: <bool>
> +
>   	rt_group_sched=	[KNL] Enable or disable SCHED_RR/FIFO group scheduling
>   			when CONFIG_RT_GROUP_SCHED=y. Defaults to
>   			!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
>   #endif /* !CONFIG_RSEQ_STATS */
>   
>   #ifdef CONFIG_RSEQ
> +#include <linux/jump_label.h>
>   #include <linux/rseq.h>
>   
>   #include <linux/tracepoint-defs.h>
> @@ -66,6 +67,8 @@ static inline void rseq_trace_ip_fixup(u
>   				       unsigned long offset, unsigned long abort_ip) { }
>   #endif /* !CONFIG_TRACEPOINT */
>   
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
> +
>   static __always_inline void rseq_note_user_irq_entry(void)
>   {
>   	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1893,10 +1893,24 @@ config RSEQ_STATS
>   
>   	  If unsure, say N.
>   
> +config RSEQ_DEBUG_DEFAULT_ENABLE
> +	default n
> +	bool "Enable restartable sequences debug mode by default" if EXPERT
> +	depends on RSEQ
> +	help
> +	  This enables the static branch for debug mode of restartable
> +	  sequences.
> +
> +	  This also can be controlled on the kernel command line via the
> +	  command line parameter "rseq_debug=0/1" and through debugfs.
> +
> +	  If unsure, say N.
> +
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
>   	depends on RSEQ && DEBUG_KERNEL
> +	select RSEQ_DEBUG_DEFAULT_ENABLE
>   	help
>   	  Enable extra debugging checks for the rseq system call.
>   
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -95,6 +95,27 @@
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
>   				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
>   
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
> +
> +static inline void rseq_control_debug(bool on)
> +{
> +	if (on)
> +		static_branch_enable(&rseq_debug_enabled);
> +	else
> +		static_branch_disable(&rseq_debug_enabled);
> +}
> +
> +static int __init rseq_setup_debug(char *str)
> +{
> +	bool on;
> +
> +	if (kstrtobool(str, &on))
> +		return -EINVAL;
> +	rseq_control_debug(on);
> +	return 0;

Functions used by __setup() have to return '1' to signal that the
argument was handled; otherwise you get this in the kernel log:

kernel: Unknown kernel command line parameters "rseq_debug=1", will be passed to user space.
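
A minimal sketch of the fix (one possible shape; how to treat a malformed
value is a policy choice):

	static int __init rseq_setup_debug(char *str)
	{
		bool on;

		/* Returning 0 leaves an unparseable value to be reported
		 * and handed to init; other policies work as well. */
		if (kstrtobool(str, &on))
			return 0;
		rseq_control_debug(on);
		/* Non-zero tells the parser the argument was consumed */
		return 1;
	}
	__setup("rseq_debug=", rseq_setup_debug);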


> +}
> +__setup("rseq_debug=", rseq_setup_debug);
> +
>   #ifdef CONFIG_TRACEPOINTS
>   /*
>    * Out of line, so the actual update functions can be in a header to be
> @@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long
>   }
>   #endif /* CONFIG_TRACEPOINTS */
>   
> +#ifdef CONFIG_DEBUG_FS
>   #ifdef CONFIG_RSEQ_STATS
>   DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
>   
> -static int rseq_debug_show(struct seq_file *m, void *p)
> +static int rseq_stats_show(struct seq_file *m, void *p)
>   {
>   	struct rseq_stats stats = { };
>   	unsigned int cpu;
> @@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi
>   	return 0;
>   }
>   
> +static int rseq_stats_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, rseq_stats_show, inode->i_private);
> +}
> +
> +static const struct file_operations stat_ops = {
> +	.open		= rseq_stats_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +
> +static int __init rseq_stats_init(struct dentry *root_dir)
> +{
> +	debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
> +	return 0;
> +}
> +#else
> +static inline void rseq_stats_init(struct dentry *root_dir) { }
> +#endif /* CONFIG_RSEQ_STATS */
> +
> +static int rseq_debug_show(struct seq_file *m, void *p)
> +{
> +	bool on = static_branch_unlikely(&rseq_debug_enabled);
> +
> +	seq_printf(m, "%d\n", on);
> +	return 0;
> +}
> +
> +static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
> +			    size_t count, loff_t *ppos)
> +{
> +	bool on;
> +
> +	if (kstrtobool_from_user(ubuf, count, &on))
> +		return -EINVAL;
> +
> +	rseq_control_debug(on);
> +	return count;
> +}
> +
>   static int rseq_debug_open(struct inode *inode, struct file *file)
>   {
>   	return single_open(file, rseq_debug_show, inode->i_private);
>   }
>   
> -static const struct file_operations dfs_ops = {
> +static const struct file_operations debug_ops = {
>   	.open		= rseq_debug_open,
>   	.read		= seq_read,
> +	.write		= rseq_debug_write,
>   	.llseek		= seq_lseek,
>   	.release	= single_release,
>   };
> @@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void
>   {
>   	struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
>   
> -	debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
> +	debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
> +	rseq_stats_init(root_dir);
>   	return 0;
>   }
>   __initcall(rseq_debugfs_init);
> -#endif /* CONFIG_RSEQ_STATS */
> +#endif /* CONFIG_DEBUG_FS */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
>   static struct rseq *rseq_kernel_fields(struct task_struct *t)
> 
> 


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 20/37] rseq: Replace the debug crud
  2025-08-23 16:39 ` [patch V2 20/37] rseq: Replace the debug crud Thomas Gleixner
@ 2025-08-26 14:21   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 14:21 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Just utilize the new infrastructure and put the original one to rest.

I would recommend changing the patch subject to e.g.

"rseq: Replace the syscall debug code by the new infrastructure"

since I don't think the pre-existing debug code qualifies as "crud",
even though your proposed changes are certainly improvements.

Other than that:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/rseq.c |   80 ++++++++--------------------------------------------------
>   1 file changed, 12 insertions(+), 68 deletions(-)
> 
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -472,83 +472,27 @@ void __rseq_handle_notify_resume(struct
>   
>   #ifdef CONFIG_DEBUG_RSEQ
>   /*
> - * Unsigned comparison will be true when ip >= start_ip, and when
> - * ip < start_ip + post_commit_offset.
> - */
> -static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
> -{
> -	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
> -}
> -
> -/*
> - * If the rseq_cs field of 'struct rseq' contains a valid pointer to
> - * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
> - */
> -static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
> -{
> -	struct rseq_cs __user *urseq_cs;
> -	u64 ptr;
> -	u32 __user *usig;
> -	u32 sig;
> -	int ret;
> -
> -	if (get_user_masked_u64(&ptr, &t->rseq->rseq_cs))
> -		return -EFAULT;
> -
> -	/* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
> -	if (!ptr) {
> -		memset(rseq_cs, 0, sizeof(*rseq_cs));
> -		return 0;
> -	}
> -	/* Check that the pointer value fits in the user-space process space. */
> -	if (ptr >= TASK_SIZE)
> -		return -EINVAL;
> -	urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
> -	if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
> -		return -EFAULT;
> -
> -	if (rseq_cs->start_ip >= TASK_SIZE ||
> -	    rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
> -	    rseq_cs->abort_ip >= TASK_SIZE ||
> -	    rseq_cs->version > 0)
> -		return -EINVAL;
> -	/* Check for overflow. */
> -	if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
> -		return -EINVAL;
> -	/* Ensure that abort_ip is not in the critical section. */
> -	if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
> -		return -EINVAL;
> -
> -	usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
> -	ret = get_user(sig, usig);
> -	if (ret)
> -		return ret;
> -
> -	if (current->rseq_sig != sig) {
> -		printk_ratelimited(KERN_WARNING
> -			"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
> -			sig, current->rseq_sig, current->pid, usig);
> -		return -EINVAL;
> -	}
> -	return 0;
> -}
> -
> -/*
>    * Terminate the process if a syscall is issued within a restartable
>    * sequence.
>    */
>   void rseq_syscall(struct pt_regs *regs)
>   {
> -	unsigned long ip = instruction_pointer(regs);
>   	struct task_struct *t = current;
> -	struct rseq_cs rseq_cs;
> +	u64 csaddr;
>   
> -	if (!t->rseq)
> +	if (!t->rseq_event.has_rseq)
> +		return;
> +	if (get_user_masked_u64(&csaddr, &t->rseq->rseq_cs))
> +		goto fail;
> +	if (likely(!csaddr))
>   		return;
> -	if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> -		force_sig(SIGSEGV);
> +	if (unlikely(csaddr >= TASK_SIZE))
> +		goto fail;
> +	if (rseq_debug_update_user_cs(t, regs, csaddr))
> +		return;
> +fail:
> +	force_sig(SIGSEGV);
>   }
> -
>   #endif
>   
>   /*
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 21/37] rseq: Make exit debugging static branch based
  2025-08-23 16:39 ` [patch V2 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
@ 2025-08-26 14:23   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 14:23 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Disconnect it from the config switch and use the static debug branch. This
> is a temporary measure for validating the rework. At the end this check
> needs to be hidden behind lockdep as it has nothing to do with the other
> debug infrastructure, which mainly aids user space debugging by enabling a
> zoo of checks which terminate misbehaving tasks instead of letting them
> keep the hard to diagnose pieces.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/linux/rseq_entry.h |    2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -275,7 +275,7 @@ static __always_inline void rseq_exit_to
>   
>   	rseq_stat_inc(rseq_stats.exit);
>   
> -	if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
> +	if (static_branch_unlikely(&rseq_debug_enabled))
>   		WARN_ON_ONCE(ev->sched_switch);
>   
>   	/*
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
  2025-08-23 16:40 ` [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
@ 2025-08-26 14:28   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 14:28 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Make the syscall exit debug mechanism available via the static branch on
> architectures which utilize the generic entry code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/linux/entry-common.h |    2 +-
>   include/linux/rseq_entry.h   |    9 +++++++++
>   kernel/rseq.c                |   19 +++++++++++++------
>   3 files changed, 23 insertions(+), 7 deletions(-)
> 
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -146,7 +146,7 @@ static __always_inline void syscall_exit
>   			local_irq_enable();
>   	}
>   
> -	rseq_syscall(regs);
> +	rseq_debug_syscall_return(regs);
>   
>   	/*
>   	 * Do one-time syscall specific work. If these work items are
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -286,9 +286,18 @@ static __always_inline void rseq_exit_to
>   	ev->events = 0;
>   }
>   
> +void __rseq_debug_syscall_return(struct pt_regs *regs);
> +
> +static inline void rseq_debug_syscall_return(struct pt_regs *regs)
> +{
> +	if (static_branch_unlikely(&rseq_debug_enabled))
> +		__rseq_debug_syscall_return(regs);
> +}
> +
>   #else /* CONFIG_RSEQ */
>   static inline void rseq_note_user_irq_entry(void) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> +static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
>   #endif /* !CONFIG_RSEQ */
>   
>   #endif /* _LINUX_RSEQ_ENTRY_H */
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -470,12 +470,7 @@ void __rseq_handle_notify_resume(struct
>   	force_sigsegv(sig);
>   }
>   
> -#ifdef CONFIG_DEBUG_RSEQ
> -/*
> - * Terminate the process if a syscall is issued within a restartable
> - * sequence.
> - */
> -void rseq_syscall(struct pt_regs *regs)
> +void __rseq_debug_syscall_return(struct pt_regs *regs)
>   {
>   	struct task_struct *t = current;
>   	u64 csaddr;
> @@ -493,6 +488,18 @@ void rseq_syscall(struct pt_regs *regs)
>   fail:
>   	force_sig(SIGSEGV);
>   }
> +
> +#ifdef CONFIG_DEBUG_RSEQ
> +/*
> + * Kept around to keep GENERIC_ENTRY=n architectures supported.
> + *
> + * Terminate the process if a syscall is issued within a restartable
> + * sequence.
> + */
> +void rseq_syscall(struct pt_regs *regs)
> +{
> +	__rseq_debug_syscall_return(regs);
> +}
>   #endif
>   
>   /*
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 23/37] rseq: Provide and use rseq_set_uids()
  2025-08-23 16:40 ` [patch V2 23/37] rseq: Provide and use rseq_set_uids() Thomas Gleixner
@ 2025-08-26 14:52   ` Mathieu Desnoyers
  2025-09-02 14:08     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 14:52 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Provide a new and straightforward implementation to set the IDs (CPU ID,
> Node ID and MM CID), which can be later inlined into the fast path.
> 
> It does all operations in one user_rw_masked_begin() section and retrieves
> also the critical section member (rseq::cs_rseq) from user space to avoid
> another user..begin/end() pair. This is in preparation for optimizing the
> fast path to avoid extra work when not required.
> 
> Use it to replace the whole related zoo in rseq.c
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   fs/binfmt_elf.c            |    2
>   include/linux/rseq_entry.h |   95 ++++++++++++++++++++
>   include/linux/rseq_types.h |    2
>   include/linux/sched.h      |   10 --
>   kernel/rseq.c              |  208 ++++++---------------------------------------
>   5 files changed, 130 insertions(+), 187 deletions(-)
> 
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -46,7 +46,7 @@
>   #include <linux/cred.h>
>   #include <linux/dax.h>
>   #include <linux/uaccess.h>
> -#include <linux/rseq.h>
> +#include <uapi/linux/rseq.h>
>   #include <asm/param.h>
>   #include <asm/page.h>
>   
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -38,6 +38,8 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
>   #include <linux/rseq.h>
>   #include <linux/uaccess.h>
>   
> +#include <uapi/linux/rseq.h>
> +
>   #include <linux/tracepoint-defs.h>
>   
>   #ifdef CONFIG_TRACEPOINTS
> @@ -77,6 +79,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
>   #endif
>   
>   bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
> +bool rseq_debug_validate_uids(struct task_struct *t);
>   
>   static __always_inline void rseq_note_user_irq_entry(void)
>   {
> @@ -198,6 +201,44 @@ bool rseq_debug_update_user_cs(struct ta
>   	user_access_end();
>   	return false;
>   }
> +
> +/*
> + * On debug kernels validate that user space did not mess with it if
> + * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
> + * that case cpu_cid is ~0. See fork/execve.
> + */
> +bool rseq_debug_validate_uids(struct task_struct *t)

Typically "UIDs" are a well known term (user identity) associated with
getuid(2).

I understand that you're trying to use "uids" for "userspace IDs" here,
but I'm concerned about the TLA clash.

perhaps we should name this "rseq_debug_validate_user_fields" instead ?

> +{
> +	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
> +	struct rseq __user *rseq = t->rseq;
> +
> +	if (t->rseq_ids.cpu_cid == ~0)
> +		return true;
> +
> +	if (!user_read_masked_begin(rseq))
> +		return false;
> +
> +	unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
> +	if (cpu_id != t->rseq_ids.cpu_id)
> +		goto die;
> +	unsafe_get_user(uval, &rseq->cpu_id, efault);
> +	if (uval != cpu_id)
> +		goto die;
> +	unsafe_get_user(uval, &rseq->node_id, efault);
> +	if (uval != node_id)
> +		goto die;

AFAIU, when a task migrates across NUMA nodes, userspace will have a
stale value and this check will fail, thus killing the process. To fix
this you'd need to derive "node_id" from
cpu_to_node(t->rseq_ids.cpu_id).

But doing that will not work on powerpc, where the mapping between
node_id and cpu_id can change dynamically; AFAIU this can kill processes
even though userspace did not alter the node_id behind the kernel's
back.

The difference from the preexisting code is that this compares the
userspace node_id with the current node_id that comes from
cpu_to_node(task_cpu(t)), whereas the preexisting
rseq_validate_ro_fields() compares the userspace node_id with the prior
node_id copy we have in the kernel.
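
For illustration, a minimal sketch of the suggested adjustment
(hypothetical, and as noted it still would not cover powerpc's dynamic
node/cpu remapping):

	/* Compare against the node of the CPU ID previously written to user
	 * space instead of the CPU the task happens to be on right now */
	u32 cpu_id, uval, node_id = cpu_to_node(t->rseq_ids.cpu_id);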

> +	unsafe_get_user(uval, &rseq->mm_cid, efault);
> +	if (uval != t->rseq_ids.mm_cid)
> +		goto die;
> +	user_access_end();
> +	return true;
> +die:
> +	t->rseq_event.fatal = true;
> +efault:
> +	user_access_end();
> +	return false;
> +}
> +
>   #endif /* RSEQ_BUILD_SLOW_PATH */
>   
>   /*
> @@ -268,6 +309,60 @@ rseq_update_user_cs(struct task_struct *
>   	user_access_end();
>   	return false;
>   }
> +
> +/*
> + * Updates CPU ID, Node ID and MM CID and reads the critical section
> + * address, when @csaddr != NULL. This allows to put the ID update and the
> + * read under the same uaccess region to spare a seperate begin/end.

separate

> + *
> + * As this is either invoked from a C wrapper with @csaddr = NULL or from
> + * the fast path code with a valid pointer, a clever compiler should be
> + * able to optimize the read out. Spares a duplicate implementation.
> + *
> + * Returns true, if the operation was successful, false otherwise.
> + *
> + * In the failure case task::rseq_event::fatal is set when invalid data
> + * was found on debug kernels. It's clear when the failure was an unresolved page
> + * fault.
> + *
> + * If inlined into the exit to user path with interrupts disabled, the
> + * caller has to protect against page faults with pagefault_disable().
> + *
> + * In preemptible task context this would be counterproductive as the page
> + * faults could not be fully resolved. As a consequence unresolved page
> + * faults in task context are fatal too.
> + */
> +static rseq_inline
> +bool rseq_set_uids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
> +			      u32 node_id, u64 *csaddr)
> +{
> +	struct rseq __user *rseq = t->rseq;
> +
> +	if (static_branch_unlikely(&rseq_debug_enabled)) {
> +		if (!rseq_debug_validate_uids(t))
> +			return false;
> +	}
> +
> +	if (!user_rw_masked_begin(rseq))
> +		return false;
> +
> +	unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
> +	unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
> +	unsafe_put_user(node_id, &rseq->node_id, efault);
> +	unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
> +	if (csaddr)
> +		unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
> +	user_access_end();
> +
> +	/* Cache the new values */
> +	t->rseq_ids.cpu_cid = ids->cpu_cid;

I may be missing something, but I think we're missing updates to
t->rseq_ids.mm_cid and we may want to keep track of t->rseq_ids.node_id
as well.
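
For illustration, if rseq_ids really does track these fields separately
(i.e. cpu_cid is not a combined/union view of cpu_id and mm_cid), the
caching would presumably need to cover both, e.g.:

	/* Hypothetical: only applicable if the members are distinct */
	t->rseq_ids.cpu_id = ids->cpu_id;
	t->rseq_ids.mm_cid = ids->mm_cid;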

Thanks,

Mathieu

> +	rseq_stat_inc(rseq_stats.ids);
> +	rseq_trace_update(t, ids);
> +	return true;
> +efault:
> +	user_access_end();
> +	return false;
> +}
>   
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -3,6 +3,8 @@
>   #define _LINUX_RSEQ_TYPES_H
>   
>   #include <linux/types.h>
> +/* Forward declaration for the sched.h */
> +struct rseq;
>   
>   /*
>    * struct rseq_event - Storage for rseq related event management
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -42,7 +42,6 @@
>   #include <linux/posix-timers_types.h>
>   #include <linux/restart_block.h>
>   #include <linux/rseq_types.h>
> -#include <uapi/linux/rseq.h>
>   #include <linux/seqlock_types.h>
>   #include <linux/kcsan.h>
>   #include <linux/rv.h>
> @@ -1407,15 +1406,6 @@ struct task_struct {
>   	u32				rseq_sig;
>   	struct rseq_event		rseq_event;
>   	struct rseq_ids			rseq_ids;
> -# ifdef CONFIG_DEBUG_RSEQ
> -	/*
> -	 * This is a place holder to save a copy of the rseq fields for
> -	 * validation of read-only fields. The struct rseq has a
> -	 * variable-length array at the end, so it cannot be used
> -	 * directly. Reserve a size large enough for the known fields.
> -	 */
> -	char				rseq_fields[sizeof(struct rseq)];
> -# endif
>   #endif
>   
>   #ifdef CONFIG_SCHED_MM_CID
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -88,13 +88,6 @@
>   # define RSEQ_EVENT_GUARD	preempt
>   #endif
>   
> -/* The original rseq structure size (including padding) is 32 bytes. */
> -#define ORIG_RSEQ_SIZE		32
> -
> -#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
> -				  RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
> -				  RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
> -
>   DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
>   
>   static inline void rseq_control_debug(bool on)
> @@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void
>   __initcall(rseq_debugfs_init);
>   #endif /* CONFIG_DEBUG_FS */
>   
> -#ifdef CONFIG_DEBUG_RSEQ
> -static struct rseq *rseq_kernel_fields(struct task_struct *t)
> -{
> -	return (struct rseq *) t->rseq_fields;
> -}
> -
> -static int rseq_validate_ro_fields(struct task_struct *t)
> -{
> -	static DEFINE_RATELIMIT_STATE(_rs,
> -				      DEFAULT_RATELIMIT_INTERVAL,
> -				      DEFAULT_RATELIMIT_BURST);
> -	u32 cpu_id_start, cpu_id, node_id, mm_cid;
> -	struct rseq __user *rseq = t->rseq;
> -
> -	/*
> -	 * Validate fields which are required to be read-only by
> -	 * user-space.
> -	 */
> -	if (!user_read_access_begin(rseq, t->rseq_len))
> -		goto efault;
> -	unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
> -	unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
> -	unsafe_get_user(node_id, &rseq->node_id, efault_end);
> -	unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
> -	user_read_access_end();
> -
> -	if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
> -	    cpu_id != rseq_kernel_fields(t)->cpu_id ||
> -	    node_id != rseq_kernel_fields(t)->node_id ||
> -	    mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
> -
> -		pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
> -			"\tcpu_id_start: %u ?= %u\n"
> -			"\tcpu_id:       %u ?= %u\n"
> -			"\tnode_id:      %u ?= %u\n"
> -			"\tmm_cid:       %u ?= %u\n",
> -			t->pid, t->comm,
> -			cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
> -			cpu_id, rseq_kernel_fields(t)->cpu_id,
> -			node_id, rseq_kernel_fields(t)->node_id,
> -			mm_cid, rseq_kernel_fields(t)->mm_cid);
> -	}
> -
> -	/* For now, only print a console warning on mismatch. */
> -	return 0;
> -
> -efault_end:
> -	user_read_access_end();
> -efault:
> -	return -EFAULT;
> -}
> -
> -/*
> - * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
> - * state.
> - */
> -#define rseq_unsafe_put_user(t, value, field, error_label)		\
> -	do {								\
> -		unsafe_put_user(value, &t->rseq->field, error_label);	\
> -		rseq_kernel_fields(t)->field = value;			\
> -	} while (0)
> -
> -#else
> -static int rseq_validate_ro_fields(struct task_struct *t)
> -{
> -	return 0;
> -}
> -
> -#define rseq_unsafe_put_user(t, value, field, error_label)		\
> -	unsafe_put_user(value, &t->rseq->field, error_label)
> -#endif
> -
> -static int rseq_update_cpu_node_id(struct task_struct *t)
> -{
> -	struct rseq __user *rseq = t->rseq;
> -	u32 cpu_id = raw_smp_processor_id();
> -	u32 node_id = cpu_to_node(cpu_id);
> -	u32 mm_cid = task_mm_cid(t);
> -
> -	rseq_stat_inc(rseq_stats.ids);
> -
> -	/* Validate read-only rseq fields on debug kernels */
> -	if (rseq_validate_ro_fields(t))
> -		goto efault;
> -	WARN_ON_ONCE((int) mm_cid < 0);
> -
> -	if (!user_write_access_begin(rseq, t->rseq_len))
> -		goto efault;
> -
> -	rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
> -	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
> -	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
> -	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
> -
> -	/* Cache the user space values */
> -	t->rseq_ids.cpu_id = cpu_id;
> -	t->rseq_ids.mm_cid = mm_cid;
> -
> -	/*
> -	 * Additional feature fields added after ORIG_RSEQ_SIZE
> -	 * need to be conditionally updated only if
> -	 * t->rseq_len != ORIG_RSEQ_SIZE.
> -	 */
> -	user_write_access_end();
> -	trace_rseq_update(t);
> -	return 0;
> -
> -efault_end:
> -	user_write_access_end();
> -efault:
> -	return -EFAULT;
> -}
> -
> -static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
> +static bool rseq_set_uids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
>   {
> -	struct rseq __user *rseq = t->rseq;
> -	u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
> -	    mm_cid = 0;
> -
> -	/*
> -	 * Validate read-only rseq fields.
> -	 */
> -	if (rseq_validate_ro_fields(t))
> -		goto efault;
> -
> -	if (!user_write_access_begin(rseq, t->rseq_len))
> -		goto efault;
> -
> -	/*
> -	 * Reset all fields to their initial state.
> -	 *
> -	 * All fields have an initial state of 0 except cpu_id which is set to
> -	 * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
> -	 * unregistration can figure out that rseq needs to be registered
> -	 * again.
> -	 */
> -	rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
> -	rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
> -	rseq_unsafe_put_user(t, node_id, node_id, efault_end);
> -	rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
> -
> -	/*
> -	 * Additional feature fields added after ORIG_RSEQ_SIZE
> -	 * need to be conditionally reset only if
> -	 * t->rseq_len != ORIG_RSEQ_SIZE.
> -	 */
> -	user_write_access_end();
> -	return 0;
> -
> -efault_end:
> -	user_write_access_end();
> -efault:
> -	return -EFAULT;
> +	return rseq_set_uids_get_csaddr(t, ids, node_id, NULL);
>   }
>   
>   static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
> @@ -407,6 +250,8 @@ static bool rseq_handle_cs(struct task_s
>   void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
>   {
>   	struct task_struct *t = current;
> +	struct rseq_ids ids;
> +	u32 node_id;
>   	bool event;
>   	int sig;
>   
> @@ -453,6 +298,8 @@ void __rseq_handle_notify_resume(struct
>   	scoped_guard(RSEQ_EVENT_GUARD) {
>   		event = t->rseq_event.sched_switch;
>   		t->rseq_event.sched_switch = false;
> +		ids.cpu_id = task_cpu(t);
> +		ids.mm_cid = task_mm_cid(t);
>   	}
>   
>   	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
> @@ -461,7 +308,8 @@ void __rseq_handle_notify_resume(struct
>   	if (!rseq_handle_cs(t, regs))
>   		goto error;
>   
> -	if (unlikely(rseq_update_cpu_node_id(t)))
> +	node_id = cpu_to_node(ids.cpu_id);
> +	if (!rseq_set_uids(t, &ids, node_id))
>   		goto error;
>   	return;
>   
> @@ -502,13 +350,33 @@ void rseq_syscall(struct pt_regs *regs)
>   }
>   #endif
>   
> +static bool rseq_reset_ids(void)
> +{
> +	struct rseq_ids ids = {
> +		.cpu_id		= RSEQ_CPU_ID_UNINITIALIZED,
> +		.mm_cid		= 0,
> +	};
> +
> +	/*
> +	 * If this fails, terminate it because this leaves the kernel in
> +	 * stupid state as exit to user space will try to fixup the ids
> +	 * again.
> +	 */
> +	if (rseq_set_uids(current, &ids, 0))
> +		return true;
> +
> +	force_sig(SIGSEGV);
> +	return false;
> +}
> +
> +/* The original rseq structure size (including padding) is 32 bytes. */
> +#define ORIG_RSEQ_SIZE		32
> +
>   /*
>    * sys_rseq - setup restartable sequences for caller thread.
>    */
>   SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
>   {
> -	int ret;
> -
>   	if (flags & RSEQ_FLAG_UNREGISTER) {
>   		if (flags & ~RSEQ_FLAG_UNREGISTER)
>   			return -EINVAL;
> @@ -519,9 +387,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   			return -EINVAL;
>   		if (current->rseq_sig != sig)
>   			return -EPERM;
> -		ret = rseq_reset_rseq_cpu_node_id(current);
> -		if (ret)
> -			return ret;
> +		if (!rseq_reset_ids())
> +			return -EFAULT;
>   		current->rseq = NULL;
>   		current->rseq_sig = 0;
>   		current->rseq_len = 0;
> @@ -574,17 +441,6 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	if (put_user_masked_u64(0UL, &rseq->rseq_cs))
>   		return -EFAULT;
>   
> -#ifdef CONFIG_DEBUG_RSEQ
> -	/*
> -	 * Initialize the in-kernel rseq fields copy for validation of
> -	 * read-only fields.
> -	 */
> -	if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
> -	    get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
> -	    get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
> -	    get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
> -		return -EFAULT;
> -#endif
>   	/*
>   	 * Activate the registration by setting the rseq area address, length
>   	 * and signature in the task struct.
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 24/37] rseq: Seperate the signal delivery path
  2025-08-23 16:40 ` [patch V2 24/37] rseq: Seperate the signal delivery path Thomas Gleixner
@ 2025-08-26 15:08   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:08 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:

Patch title "rseq: Separate the signal delivery path"

(Seperate -> Separate)

> Completely seperate the signal delivery path from the notify handler as

seperate -> separate

> they have different semantics versus the event handling.
> 
> The signal delivery only needs to ensure that the interrupted user context
> was not in a critical section or the section is aborted before it switches
> to the signal frame context. The signal frame context does not have the
> original instruction pointer anymore, so that can't be handled on exit to
> user space.
> 
> No point in updating the CPU/CID ids as they might change again before the
> task returns to user space for real.
> 
> The fast path optimization, which checks for the 'entry from user via
> interrupt' condition is only available for architectures which use the
> generic entry code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq.h       |   21 ++++++++++++++++-----
>   include/linux/rseq_entry.h |   29 +++++++++++++++++++++++++++++
>   kernel/rseq.c              |   30 ++++++++++++++++++++++--------
>   3 files changed, 67 insertions(+), 13 deletions(-)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -5,22 +5,33 @@
>   #ifdef CONFIG_RSEQ
>   #include <linux/sched.h>
>   
> -void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
> +void __rseq_handle_notify_resume(struct pt_regs *regs);
>   
>   static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>   {
>   	if (current->rseq_event.has_rseq)
> -		__rseq_handle_notify_resume(NULL, regs);
> +		__rseq_handle_notify_resume(regs);
>   }
>   
> +void __rseq_signal_deliver(int sig, struct pt_regs *regs);
> +
> +/*
> + * Invoked from signal delivery to fixup based on the register context before
> + * switching to the signal delivery context.
> + */
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
>   {
> -	if (current->rseq_event.has_rseq) {
> -		current->rseq_event.sched_switch = true;
> -		__rseq_handle_notify_resume(ksig, regs);
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +		/* '&' is intentional to spare one conditional branch */
> +		if (current->rseq_event.has_rseq & current->rseq_event.user_irq)
> +			__rseq_signal_deliver(ksig->sig, regs);
> +	} else {
> +		if (current->rseq_event.has_rseq)
> +			__rseq_signal_deliver(ksig->sig, regs);
>   	}
>   }
>   
> +/* Raised from context switch and execve to force evaluation on exit to user */

Missing punctuation at the end of the comment.

>   static inline void rseq_sched_switch_event(struct task_struct *t)
>   {
>   	if (t->rseq_event.has_rseq) {
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -364,6 +364,35 @@ bool rseq_set_uids_get_csaddr(struct tas
>   	return false;
>   }
>   
> +/*
> + * Update user space with new IDs and conditionally check whether the task
> + * is in a critical section.
> + */
> +static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
> +					struct rseq_ids *ids, u32 node_id)

This patch introduces rseq_update_usr with no caller. Those come in
follow up patches. It would be good to say it up front in the commit
message if this is indeed the intended sequence of changes.

Thanks,

Mathieu


> +{
> +	u64 csaddr;
> +
> +	if (!rseq_set_uids_get_csaddr(t, ids, node_id, &csaddr))
> +		return false;
> +
> +	/*
> +	 * On architectures which utilize the generic entry code this
> +	 * allows to skip the critical section when the entry was not from
> +	 * a user space interrupt, unless debug mode is enabled.
> +	 */
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +		if (!static_branch_unlikely(&rseq_debug_enabled)) {
> +			if (likely(!t->rseq_event.user_irq))
> +				return true;
> +		}
> +	}
> +	if (likely(!csaddr))
> +		return true;
> +	/* Sigh, this really needs to do work */
> +	return rseq_update_user_cs(t, regs, csaddr);
> +}
> +
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
>   	struct rseq_event *ev = &current->rseq_event;
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -247,13 +247,12 @@ static bool rseq_handle_cs(struct task_s
>    * respect to other threads scheduled on the same CPU, and with respect
>    * to signal handlers.
>    */
> -void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
> +void __rseq_handle_notify_resume(struct pt_regs *regs)
>   {
>   	struct task_struct *t = current;
>   	struct rseq_ids ids;
>   	u32 node_id;
>   	bool event;
> -	int sig;
>   
>   	/*
>   	 * If invoked from hypervisors before entering the guest via
> @@ -272,10 +271,7 @@ void __rseq_handle_notify_resume(struct
>   	if (unlikely(t->flags & PF_EXITING))
>   		return;
>   
> -	if (ksig)
> -		rseq_stat_inc(rseq_stats.signal);
> -	else
> -		rseq_stat_inc(rseq_stats.slowpath);
> +	rseq_stat_inc(rseq_stats.slowpath);
>   
>   	/*
>   	 * Read and clear the event pending bit first. If the task
> @@ -314,8 +310,26 @@ void __rseq_handle_notify_resume(struct
>   	return;
>   
>   error:
> -	sig = ksig ? ksig->sig : 0;
> -	force_sigsegv(sig);
> +	force_sig(SIGSEGV);
> +}
> +
> +void __rseq_signal_deliver(int sig, struct pt_regs *regs)
> +{
> +	rseq_stat_inc(rseq_stats.signal);
> +	/*
> +	 * Don't update IDs, they are handled on exit to user if
> +	 * necessary. The important thing is to abort a critical section of
> +	 * the interrupted context as after this point the instruction
> +	 * pointer in @regs points to the signal handler.
> +	 */
> +	if (unlikely(!rseq_handle_cs(current, regs))) {
> +		/*
> +		 * Clear the errors just in case this might survive
> +		 * magically, but leave the rest intact.
> +		 */
> +		current->rseq_event.error = 0;
> +		force_sigsegv(sig);
> +	}
>   }
>   
>   void __rseq_debug_syscall_return(struct pt_regs *regs)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler
  2025-08-23 16:40 ` [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
@ 2025-08-26 15:12   ` Mathieu Desnoyers
  2025-09-02 17:32     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:12 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Replace the whole logic with the new implementation, which is shared with
> signal delivery and the upcoming exit fast path.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   kernel/rseq.c |   78 +++++++++++++++++++++++++---------------------------------
>   1 file changed, 34 insertions(+), 44 deletions(-)
> 
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -82,12 +82,6 @@
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/rseq.h>
>   
> -#ifdef CONFIG_MEMBARRIER
> -# define RSEQ_EVENT_GUARD	irq
> -#else
> -# define RSEQ_EVENT_GUARD	preempt
> -#endif
> -
>   DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
>   
>   static inline void rseq_control_debug(bool on)
> @@ -236,38 +230,15 @@ static bool rseq_handle_cs(struct task_s
>   	return rseq_update_user_cs(t, regs, csaddr);
>   }
>   
> -/*
> - * This resume handler must always be executed between any of:
> - * - preemption,
> - * - signal delivery,
> - * and return to user-space.
> - *
> - * This is how we can ensure that the entire rseq critical section
> - * will issue the commit instruction only if executed atomically with
> - * respect to other threads scheduled on the same CPU, and with respect
> - * to signal handlers.
> - */
> -void __rseq_handle_notify_resume(struct pt_regs *regs)
> +static void rseq_slowpath_update_usr(struct pt_regs *regs)
>   {
> +	/* Preserve rseq state and user_irq state for exit to user */
> +	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
>   	struct task_struct *t = current;
>   	struct rseq_ids ids;
>   	u32 node_id;
>   	bool event;
>   
> -	/*
> -	 * If invoked from hypervisors before entering the guest via
> -	 * resume_user_mode_work(), then @regs is a NULL pointer.
> -	 *
> -	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
> -	 * it before returning from the ioctl() to user space when
> -	 * rseq_event.sched_switch is set.
> -	 *
> -	 * So it's safe to ignore here instead of pointlessly updating it
> -	 * in the vcpu_run() loop.
> -	 */
> -	if (!regs)
> -		return;
> -
>   	if (unlikely(t->flags & PF_EXITING))
>   		return;
>   
> @@ -291,26 +262,45 @@ void __rseq_handle_notify_resume(struct
>   	 * with the result handed in to allow the detection of
>   	 * inconsistencies.
>   	 */
> -	scoped_guard(RSEQ_EVENT_GUARD) {
> -		event = t->rseq_event.sched_switch;
> -		t->rseq_event.sched_switch = false;
> +	scoped_guard(irq) {
>   		ids.cpu_id = task_cpu(t);
>   		ids.mm_cid = task_mm_cid(t);
> +		event = t->rseq_event.sched_switch;
> +		t->rseq_event.all &= evt_mask.all;
>   	}
>   
> -	if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
> +	if (!event)
>   		return;
>   
> -	if (!rseq_handle_cs(t, regs))
> -		goto error;
> -
>   	node_id = cpu_to_node(ids.cpu_id);
> -	if (!rseq_set_uids(t, &ids, node_id))
> -		goto error;
> -	return;
>   
> -error:
> -	force_sig(SIGSEGV);
> +	if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
> +		/*
> +		 * Clear the errors just in case this might survive magically, but
> +		 * leave the rest intact.
> +		 */
> +		t->rseq_event.error = 0;
> +		force_sig(SIGSEGV);
> +	}
> +}
> +
> +void __rseq_handle_notify_resume(struct pt_regs *regs)
> +{
> +	/*
> +	 * If invoked from hypervisors before entering the guest via
> +	 * resume_user_mode_work(), then @regs is a NULL pointer.
> +	 *
> +	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
> +	 * it before returning from the ioctl() to user space when
> +	 * rseq_event.sched_switch is set.
> +	 *
> +	 * So it's safe to ignore here instead of pointlessly updating it
> +	 * in the vcpu_run() loop.

I don't think any virt user should expect the userspace fields to be
updated on the host process while running in guest mode, but it's good
to clarify that we intend to change this user-visible behavior within
this series, to spare any unwelcome surprises.

Other than that:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Thanks,

Mathieu

> +	 */
> +	if (!regs)
> +		return;
> +
> +	rseq_slowpath_update_usr(regs);
>   }
>   
>   void __rseq_signal_deliver(int sig, struct pt_regs *regs)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 26/37] rseq: Optimize event setting
  2025-08-23 16:40 ` [patch V2 26/37] rseq: Optimize event setting Thomas Gleixner
@ 2025-08-26 15:26   ` Mathieu Desnoyers
  2025-09-02 14:17     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:26 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> After removing the various condition bits earlier it turns out that one
> extra information is needed to avoid setting event::sched_switch and
> TIF_NOTIFY_RESUME unconditionally on every context switch.
> 
> The update of the RSEQ user space memory is only required, when either
> 
>    the task was interrupted in user space and schedules
> 
> or
> 
>    the CPU or MM CID changes in schedule() independent of the entry mode
> 
> Right now only the interrupt from user information is available.
> 
> Add a event flag, which is set when the CPU or MM CID or both change.

We should figure out what to do about powerpc's dynamic NUMA node ID
to CPU mapping here.

> 
> Evaluate this event in the scheduler to decide whether the sched_switch
> event and the TIF bit need to be set.
> 
> It's an extra conditional in context_switch(), but the downside of
> unconditionally handling RSEQ after a context switch to user is way more
> significant. The utilized boolean logic minimizes this to a single
> conditional branch.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   fs/exec.c                  |    2 -
>   include/linux/rseq.h       |   81 +++++++++++++++++++++++++++++++++++++++++----
>   include/linux/rseq_types.h |   11 +++++-
>   kernel/rseq.c              |    2 -
>   kernel/sched/core.c        |    7 +++
>   kernel/sched/sched.h       |    5 ++
>   6 files changed, 95 insertions(+), 13 deletions(-)
> 
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
>   		force_fatal_sig(SIGSEGV);
>   
>   	sched_mm_cid_after_execve(current);
> -	rseq_sched_switch_event(current);
> +	rseq_force_update();
>   	current->in_execve = 0;
>   
>   	return retval;
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -9,7 +9,8 @@ void __rseq_handle_notify_resume(struct
>   
>   static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>   {
> -	if (current->rseq_event.has_rseq)
> +	/* '&' is intentional to spare one conditional branch */
> +	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
>   		__rseq_handle_notify_resume(regs);
>   }
>   
> @@ -31,12 +32,75 @@ static inline void rseq_signal_deliver(s
>   	}
>   }
>   
> -/* Raised from context switch and execve to force evaluation on exit to user */
> -static inline void rseq_sched_switch_event(struct task_struct *t)
> +static inline void rseq_raise_notify_resume(struct task_struct *t)
>   {
> -	if (t->rseq_event.has_rseq) {
> -		t->rseq_event.sched_switch = true;
> -		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +}
> +
> +/* Invoked from context switch to force evaluation on exit to user */
> +static __always_inline void rseq_sched_switch_event(struct task_struct *t)
> +{
> +	struct rseq_event *ev = &t->rseq_event;
> +
> +	if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> +		/*
> +		 * Avoid a boat load of conditionals by using simple logic
> +		 * to determine whether NOTIFY_RESUME needs to be raised.
> +		 *
> +		 * It's required when the CPU or MM CID has changed or
> +		 * the entry was from user space.
> +		 */
> +		bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
> +
> +		if (raise) {
> +			ev->sched_switch = true;
> +			rseq_raise_notify_resume(t);
> +		}
> +	} else {
> +		if (ev->has_rseq) {
> +			t->rseq_event.sched_switch = true;
> +			rseq_raise_notify_resume(t);
> +		}
> +	}
> +}
> +
> +/*
> + * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
> + * update.
> + *
> + * This does not raise TIF_NOTIFY_RESUME as that happens in
> + * rseq_sched_switch_event().
> + */
> +static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
> +{
> +	t->rseq_event.ids_changed = true;
> +}
> +
> +/*
> + * Invoked from switch_mm_cid() in context switch when the task gets a MM
> + * CID assigned.
> + *
> + * This does not raise TIF_NOTIFY_RESUME as that happens in
> + * rseq_sched_switch_event().
> + */
> +static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
> +{
> +	/*
> +	 * Requires a comparison as the switch_mm_cid() code does not
> +	 * provide a conditional for it readily. So avoid excessive updates
> +	 * when nothing changes.
> +	 */
> +	if (t->rseq_ids.mm_cid != cid)
> +		t->rseq_event.ids_changed = true;
> +}
> +
> +/* Enforce a full update after RSEQ registration and when execve() failed */
> +static inline void rseq_force_update(void)
> +{
> +	if (current->rseq_event.has_rseq) {
> +		current->rseq_event.ids_changed = true;
> +		current->rseq_event.sched_switch = true;
> +		rseq_raise_notify_resume(current);
>   	}
>   }
>   
> @@ -53,7 +117,7 @@ static inline void rseq_sched_switch_eve
>   static inline void rseq_virt_userspace_exit(void)
>   {
>   	if (current->rseq_event.sched_switch)
> -		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +		rseq_raise_notify_resume(current);
>   }
>   
>   /*
> @@ -90,6 +154,9 @@ static inline void rseq_execve(struct ta
>   static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
> +static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
> +static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
> +static inline void rseq_force_update(void) { }
>   static inline void rseq_virt_userspace_exit(void) { }
>   static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
>   static inline void rseq_execve(struct task_struct *t) { }
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -10,20 +10,27 @@ struct rseq;
>    * struct rseq_event - Storage for rseq related event management
>    * @all:		Compound to initialize and clear the data efficiently
>    * @events:		Compund to access events with a single load/store
> - * @sched_switch:	True if the task was scheduled out
> + * @sched_switch:	True if the task was scheduled and needs update on
> + *			exit to user
> + * @ids_changed:	Indicator that IDs need to be updated
>    * @user_irq:		True on interrupt entry from user mode
>    * @has_rseq:		True if the task has a rseq pointer installed
>    * @error:		Compound error code for the slow path to analyze
>    * @fatal:		User space data corrupted or invalid
> + *
> + * @sched_switch and @ids_changed must be adjacent and the combo must be
> + * 16bit aligned to allow a single store, when both are set at the same
> + * time in the scheduler.
>    */
>   struct rseq_event {
>   	union {
>   		u64				all;
>   		struct {
>   			union {
> -				u16		events;
> +				u32		events;
>   				struct {
>   					u8	sched_switch;
> +					u8	ids_changed;
>   					u8	user_irq;
>   				};
>   			};
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -459,7 +459,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
>   	 * are updated before returning to user-space.
>   	 */
>   	current->rseq_event.has_rseq = true;
> -	rseq_sched_switch_event(current);
> +	rseq_force_update();
>   
>   	return 0;
>   }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5150,7 +5150,6 @@ prepare_task_switch(struct rq *rq, struc
>   	kcov_prepare_switch(prev);
>   	sched_info_switch(rq, prev, next);
>   	perf_event_task_sched_out(prev, next);
> -	rseq_sched_switch_event(prev);
>   	fire_sched_out_preempt_notifiers(prev, next);
>   	kmap_local_sched_out();
>   	prepare_task(next);
> @@ -5348,6 +5347,12 @@ context_switch(struct rq *rq, struct tas
>   	/* switch_mm_cid() requires the memory barriers above. */
>   	switch_mm_cid(rq, prev, next);
>   
> +	/*
> +	 * Tell rseq that the task was scheduled in. Must be after
> +	 * switch_mm_cid() to get the TIF flag set.
> +	 */
> +	rseq_sched_switch_event(next);
> +
>   	prepare_lock_switch(rq, next, rf);
>   
>   	/* Here we just switch the register state and the stack. */
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2181,6 +2181,7 @@ static inline void __set_task_cpu(struct
>   	smp_wmb();
>   	WRITE_ONCE(task_thread_info(p)->cpu, cpu);
>   	p->wake_cpu = cpu;
> +	rseq_sched_set_task_cpu(p, cpu);

The combination of the patch
"rseq: Simplify the event notification" and this one
ends up moving those three rseq_migrate() calls into __set_task_cpu():

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..695c23939345 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
                 if (p->sched_class->migrate_task_rq)
                         p->sched_class->migrate_task_rq(p, new_cpu);
                 p->se.nr_migrations++;
-               rseq_migrate(p);
                 sched_mm_cid_migrate_from(p);
                 perf_event_task_migrate(p);
         }
@@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
                 p->sched_task_group = tg;
         }
  #endif
-       rseq_migrate(p);
         /*
          * We're setting the CPU for the first time, we don't migrate,
          * so use __set_task_cpu().
@@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct *p)
          * as we're not fully set-up yet.
          */
         p->recent_used_cpu = task_cpu(p);
-       rseq_migrate(p);
         __set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
         rq = __task_rq_lock(p, &rf);
         update_rq_clock(rq);

AFAIR those were placed in the callers to benefit from the conditional
in set_task_cpu():

         if (task_cpu(p) != new_cpu) {

perhaps it's not a big deal, but I think it's relevant to point it out.
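
For illustration, one way to keep that filtering without touching the
callers would be to mirror rseq_sched_set_task_mm_cid() and compare
against the last CPU ID written to user space (a hypothetical sketch, not
what the series does):

	static __always_inline void
	rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
	{
		/* Only flag an ID update when the CPU actually differs from
		 * the value last cached for the rseq user space update */
		if (t->rseq_ids.cpu_id != cpu)
			t->rseq_event.ids_changed = true;
	}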

Thanks,

Mathieu

>   #endif /* CONFIG_SMP */
>   }
>   
> @@ -3778,8 +3779,10 @@ static inline void switch_mm_cid(struct
>   		mm_cid_put_lazy(prev);
>   		prev->mm_cid = -1;
>   	}
> -	if (next->mm_cid_active)
> +	if (next->mm_cid_active) {
>   		next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
> +		rseq_sched_set_task_mm_cid(next, next->mm_cid);
> +	}
>   }
>   
>   #else /* !CONFIG_SCHED_MM_CID: */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [patch V2 27/37] rseq: Implement fast path for exit to user
  2025-08-23 16:40 ` [patch V2 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
@ 2025-08-26 15:33   ` Mathieu Desnoyers
  2025-09-02 18:31     ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:33 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Implement the actual logic for handling RSEQ updates in a fast path after
> handling the TIF work and at the point where the task is actually returning
> to user space.
> 
> This is the right point to do that because at this point the CPU and the MM
> CID are stable and can no longer change due to yet another reschedule.
> That happens when the task is handling it via TIF_NOTIFY_RESUME in
> resume_user_mode_work(), which is invoked from the exit to user mode work
> loop.
> 
> The function is invoked after the TIF work is handled and runs with
> interrupts disabled, which means it cannot resolve page faults. It
> therefore disables page faults and in case the access to the user space
> memory faults, it:
> 
>    - notes the fail in the event struct
>    - raises TIF_NOTIFY_RESUME
>    - returns false to the caller
> 
> The caller has to go back to the TIF work, which runs with interrupts
> enabled and therefore can resolve the page faults. This happens mostly on
> fork() when the memory is marked COW. That will be optimized by setting the
> failure flag and raising TIF_NOTIFY_RESUME right on fork to avoid the
> otherwise unavoidable round trip.
> 
> If the user memory inspection finds invalid data, the function returns
> false as well and sets the fatal flag in the event struct along with
> TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
> and terminate the task with SIGSEGV as documented.
> 
> The initial decision to invoke any of this is based on two flags in the
> event struct: @has_rseq and @sched_switch. The decision is in pseudo ASM:
> 
>        load	tsk::event::has_rseq
>        and	tsk::event::sched_switch
>        jnz	inspect_user_space
>        mov	$0, tsk::event::events
>        ...
>        leave
> 
> So for the common case where the task was not scheduled out, this really
> boils down to four instructions before going out if the compiler is not
> completely stupid (and yes, some of them are).
> 
> If the condition is true, then it checks, whether CPU ID or MM CID have
> changed. If so, then the CPU/MM IDs have to be updated and are thereby
> cached for the next round. The update unconditionally retrieves the user
> space critical section address to spare another user*begin/end() pair.  If
> that's not zero and tsk::event::user_irq is set, then the critical section
> is analyzed and acted upon. If either zero or the entry came via syscall
> the critical section analysis is skipped.
> 
> If the comparison is false then the critical section has to be analyzed
> because the event flag is then only true when entry from user was by
> interrupt.
> 
> This is provided without the actual hookup to let reviewers focus on the
> implementation details. The hookup happens in the next step.
> 
> Note: As with quite some other optimizations this depends on the generic
> entry infrastructure and is not enabled to be sucked into random
> architecture implementations.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/rseq_entry.h |  137 ++++++++++++++++++++++++++++++++++++++++++++-
>   include/linux/rseq_types.h |    3
>   kernel/rseq.c              |    2
>   3 files changed, 139 insertions(+), 3 deletions(-)
> 
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -10,6 +10,7 @@ struct rseq_stats {
>   	unsigned long	exit;
>   	unsigned long	signal;
>   	unsigned long	slowpath;
> +	unsigned long	fastpath;
>   	unsigned long	ids;
>   	unsigned long	cs;
>   	unsigned long	clear;
> @@ -204,8 +205,8 @@ bool rseq_debug_update_user_cs(struct ta
>   
>   /*
>    * On debug kernels validate that user space did not mess with it if
> - * DEBUG_RSEQ is enabled, but don't on the first exit to user space. In
> - * that case cpu_cid is ~0. See fork/execve.
> + * debugging is enabled, but don't do that on the first exit to user
> + * space. In that case cpu_cid is ~0. See fork/execve.
>    */
>   bool rseq_debug_validate_uids(struct task_struct *t)
>   {
> @@ -393,6 +394,131 @@ static rseq_inline bool rseq_update_usr(
>   	return rseq_update_user_cs(t, regs, csaddr);
>   }
>   
> +/*
> + * If you want to use this then convert your architecture to the generic
> + * entry code. I'm tired of building workarounds for people who can't be
> + * bothered to make the maintenance of generic infrastructure less
> + * burdensome. Just sucking everything into the architecture code and
> + * thereby making others chase the horrible hacks and keep them working is
> + * neither acceptable nor sustainable.
> + */
> +#ifdef CONFIG_GENERIC_ENTRY
> +
> +/*
> + * This is inlined into the exit path because:
> + *
> + * 1) It's a one time comparison in the fast path when there is no event to
> + *    handle
> + *
> + * 2) The access to the user space rseq memory (TLS) is unlikely to fault
> + *    so the straight inline operation is:
> + *
> + *	- Four 32-bit stores only if CPU ID/ MM CID need to be updated
> + *	- One 64-bit load to retrieve the critical section address
> + *
> + * 3) In the unlikely case that the critical section address is != NULL:
> + *
> + *     - One 64-bit load to retrieve the start IP
> + *     - One 64-bit load to retrieve the offset for calculating the end
> + *     - One 64-bit load to retrieve the abort IP
> + *     - One store to clear the critical section address
> + *
> + * The non-debug case implements only the minimal required checking and
> + * protection against a rogue abort IP in kernel space, which would be
> + * exploitable at least on x86. Any fallout from invalid critical section
> + * descriptors is a user space problem. The debug case provides the full
> + * set of checks and terminates the task if a condition is not met.
> + *
> + * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
> + * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
> + * slow path there will handle the fail.
> + */
> +static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
> +{
> +	struct task_struct *t = current;
> +
> +	/*
> +	 * If the task neither went through schedule nor got the flag enforced
> +	 * by the rseq syscall or execve, then there is nothing to do here.
> +	 *
> +	 * CPU ID and MM CID can only change when going through a context
> +	 * switch.
> +	 *
> +	 * This can only be done when rseq_event::has_rseq is true.
> +	 * rseq_sched_switch_event() sets rseq_event::sched_switch to true
> +	 * unconditionally to avoid a load of rseq_event::has_rseq in the context
> +	 * switch path.
> +	 *
> +	 * This check uses a '&' and not a '&&' to force the compiler to do
> +	 * an actual AND operation instead of two separate conditionals.
> +	 *
> +	 * A sane compiler requires four instructions for the nothing to do
> +	 * case including clearing the events, but your mileage might vary.

See my earlier comments about:

- Handling of dynamic numa node id to cpu mapping reconfiguration on
   powerpc.

- Validation of the abort handler signature on production kernels.

Thanks,

Mathieu


> +	 */
> +	if (likely(!(t->rseq_event.sched_switch & t->rseq_event.has_rseq)))
> +		goto done;
> +
> +	rseq_stat_inc(rseq_stats.fastpath);
> +
> +	pagefault_disable();
> +
> +	if (likely(!t->rseq_event.ids_changed)) {
> +		/*
> +		 * If the IDs have not changed, rseq_event::user_irq must be true.
> +		 * See rseq_sched_switch_event().
> +		 */
> +		u64 csaddr;
> +
> +		if (unlikely(get_user_masked_u64(&csaddr, &t->rseq->rseq_cs)))
> +			goto fail;
> +
> +		if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
> +			if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
> +				goto fail;
> +		}
> +	} else {
> +		struct rseq_ids ids = {
> +			.cpu_id = task_cpu(t),
> +			.mm_cid = task_mm_cid(t),
> +		};
> +		u32 node_id = cpu_to_node(ids.cpu_id);
> +
> +		if (unlikely(!rseq_update_usr(t, regs, &ids, node_id)))
> +			goto fail;
> +	}
> +
> +	pagefault_enable();
> +
> +done:
> +	/* Clear state so next entry starts from a clean slate */
> +	t->rseq_event.events = 0;
> +	return false;
> +
> +fail:
> +	pagefault_enable();
> +	/* Force it into the slow path. Don't clear the state! */
> +	t->rseq_event.slowpath = true;
> +	set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +	return true;
> +}
> +
> +static __always_inline unsigned long
> +rseq_exit_to_user_mode_work(struct pt_regs *regs, unsigned long ti_work, const unsigned long mask)
> +{
> +	/*
> +	 * Check if all work bits have been cleared before handling rseq.
> +	 */
> +	if ((ti_work & mask) != 0)
> +		return ti_work;
> +
> +	if (likely(!__rseq_exit_to_user_mode_restart(regs)))
> +		return ti_work;
> +
> +	return ti_work | _TIF_NOTIFY_RESUME;
> +}
> +
> +#endif /* !CONFIG_GENERIC_ENTRY */
> +
>   static __always_inline void rseq_exit_to_user_mode(void)
>   {
>   	struct rseq_event *ev = &current->rseq_event;
> @@ -417,8 +543,13 @@ static inline void rseq_debug_syscall_re
>   	if (static_branch_unlikely(&rseq_debug_enabled))
>   		__rseq_debug_syscall_return(regs);
>   }
> -
>   #else /* CONFIG_RSEQ */
> +static inline unsigned long rseq_exit_to_user_mode_work(struct pt_regs *regs,
> +							unsigned long ti_work,
> +							const unsigned long mask)
> +{
> +	return ti_work;
> +}
>   static inline void rseq_note_user_irq_entry(void) { }
>   static inline void rseq_exit_to_user_mode(void) { }
>   static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -17,6 +17,8 @@ struct rseq;
>    * @has_rseq:		True if the task has a rseq pointer installed
>    * @error:		Compound error code for the slow path to analyze
>    * @fatal:		User space data corrupted or invalid
> + * @slowpath:		Indicator that slow path processing via TIF_NOTIFY_RESUME
> + *			is required
>    *
>    * @sched_switch and @ids_changed must be adjacent and the combo must be
>    * 16bit aligned to allow a single store, when both are set at the same
> @@ -41,6 +43,7 @@ struct rseq_event {
>   				u16		error;
>   				struct {
>   					u8	fatal;
> +					u8	slowpath;
>   				};
>   			};
>   		};
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
>   		stats.exit	+= data_race(per_cpu(rseq_stats.exit, cpu));
>   		stats.signal	+= data_race(per_cpu(rseq_stats.signal, cpu));
>   		stats.slowpath	+= data_race(per_cpu(rseq_stats.slowpath, cpu));
> +		stats.fastpath	+= data_race(per_cpu(rseq_stats.fastpath, cpu));
>   		stats.ids	+= data_race(per_cpu(rseq_stats.ids, cpu));
>   		stats.cs	+= data_race(per_cpu(rseq_stats.cs, cpu));
>   		stats.clear	+= data_race(per_cpu(rseq_stats.clear, cpu));
> @@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
>   	seq_printf(m, "exit:   %16lu\n", stats.exit);
>   	seq_printf(m, "signal: %16lu\n", stats.signal);
>   	seq_printf(m, "slowp:  %16lu\n", stats.slowpath);
> +	seq_printf(m, "fastp:  %16lu\n", stats.fastpath);
>   	seq_printf(m, "ids:    %16lu\n", stats.ids);
>   	seq_printf(m, "cs:     %16lu\n", stats.cs);
>   	seq_printf(m, "clear:  %16lu\n", stats.clear);
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user
  2025-08-23 16:40 ` [patch V2 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
@ 2025-08-26 15:40   ` Mathieu Desnoyers
  2025-08-27 13:45     ` Mathieu Desnoyers
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:40 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Now that all bits and pieces are in place, hook the RSEQ handling fast path
> function into exit_to_user_mode_prepare() after the TIF work bits have been
> handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
> and the caller needs to take another turn through the TIF handling slow
> path.
> 
> This only works for architectures which use the generic entry code.
> Architectures that still have their own incomplete hacks are not supported
> and won't be.
> 
> This results in the following improvements:
> 
>    Kernel build	       Before		  After		      Reduction
> 		
>    exit to user         80692981		  80514451
>    signal checks:          32581		       121	       99%
>    slowpath runs:        1201408   1.49%	       198 0.00%      100%
>    fastpath runs:           	  	    675941 0.84%       N/A
>    id updates:           1233989   1.53%	     50541 0.06%       96%
>    cs checks:            1125366   1.39%	         0 0.00%      100%
>      cs cleared:         1125366      100%	 0            100%
>      cs fixup:                 0        0%	 0
> 
>    RSEQ selftests      Before		  After		      Reduction
> 
>    exit to user:       386281778		  387373750
>    signal checks:       35661203		          0           100%
>    slowpath runs:      140542396 36.38%	        100  0.00%    100%
>    fastpath runs:           	  	    9509789  2.51%     N/A
>    id updates:         176203599 45.62%	    9087994  2.35%     95%
>    cs checks:          175587856 45.46%	    4728394  1.22%     98%
>      cs cleared:       172359544   98.16%    1319307   27.90%   99%
>      cs fixup:           3228312    1.84%    3409087   72.10%
> 
> The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
> to user invocations, they are relative to the actual 'cs check'
> invocations.
> 
> While some of this could have been avoided in the original code, like the
> obvious clearing of CS when it's already clear, the main problem of going
> through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
> notify handler is invoked more than once before going out to user
> space. Doing this once when everything has stabilized is the only solution
> to avoid this.
> 
> The initial attempt to completely decouple it from the TIF work turned out
> to be suboptimal for workloads, which do a lot of quick and short system
> calls. Even if the fast path decision is only 4 instructions (including a
> conditional branch), this adds up quickly and becomes measurable when the
> rate for actually having to handle rseq is in the low single digit
> percentage range of user/kernel transitions.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
>   include/linux/irq-entry-common.h |    7 ++-----
>   include/linux/resume_user_mode.h |    2 +-
>   include/linux/rseq.h             |   24 ++++++++++++++++++------
>   include/linux/rseq_entry.h       |    2 +-
>   init/Kconfig                     |    2 +-
>   kernel/entry/common.c            |   17 ++++++++++++++---
>   kernel/rseq.c                    |    8 ++++++--
>   7 files changed, 43 insertions(+), 19 deletions(-)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
>    */
>   void arch_do_signal_or_restart(struct pt_regs *regs);
>   
> -/**
> - * exit_to_user_mode_loop - do any pending work before leaving to user space
> - */
> -unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> -				     unsigned long ti_work);
> +/* Handle pending TIF work */
> +unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
>   
>   /**
>    * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> --- a/include/linux/resume_user_mode.h
> +++ b/include/linux/resume_user_mode.h
> @@ -59,7 +59,7 @@ static inline void resume_user_mode_work
>   	mem_cgroup_handle_over_high(GFP_KERNEL);
>   	blkcg_maybe_throttle_current();
>   
> -	rseq_handle_notify_resume(regs);
> +	rseq_handle_slowpath(regs);
>   }
>   
>   #endif /* LINUX_RESUME_USER_MODE_H */
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -5,13 +5,19 @@
>   #ifdef CONFIG_RSEQ
>   #include <linux/sched.h>
>   
> -void __rseq_handle_notify_resume(struct pt_regs *regs);
> +void __rseq_handle_slowpath(struct pt_regs *regs);
>   
> -static inline void rseq_handle_notify_resume(struct pt_regs *regs)
> +/* Invoked from resume_user_mode_work() */
> +static inline void rseq_handle_slowpath(struct pt_regs *regs)
>   {
> -	/* '&' is intentional to spare one conditional branch */
> -	if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
> -		__rseq_handle_notify_resume(regs);
> +	if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
> +		if (current->rseq_event.slowpath)
> +			__rseq_handle_slowpath(regs);
> +	} else {
> +		/* '&' is intentional to spare one conditional branch */
> +		if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
> +			__rseq_handle_slowpath(regs);
> +	}
>   }
>   
>   void __rseq_signal_deliver(int sig, struct pt_regs *regs);
> @@ -138,6 +144,12 @@ static inline void rseq_fork(struct task
>   		t->rseq_sig = current->rseq_sig;
>   		t->rseq_ids.cpu_cid = ~0ULL;
>   		t->rseq_event = current->rseq_event;
> +		/*
> +		 * If it has rseq, force it into the slow path right away
> +		 * because it is guaranteed to fault.
> +		 */
> +		if (t->rseq_event.has_rseq)
> +			t->rseq_event.slowpath = true;
>   	}
>   }
>   
> @@ -151,7 +163,7 @@ static inline void rseq_execve(struct ta
>   }
>   
>   #else /* CONFIG_RSEQ */
> -static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
> +static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
>   static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -433,7 +433,7 @@ static rseq_inline bool rseq_update_usr(
>    * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
>    * slow path there will handle the fail.
>    */
> -static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
> +static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
>   {
>   	struct task_struct *t = current;
>   
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>   config DEBUG_RSEQ
>   	default n
>   	bool "Enable debugging of rseq() system call" if EXPERT
> -	depends on RSEQ && DEBUG_KERNEL
> +	depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
>   	select RSEQ_DEBUG_DEFAULT_ENABLE
>   	help
>   	  Enable extra debugging checks for the rseq system call.
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -23,8 +23,7 @@ void __weak arch_do_signal_or_restart(st
>   	 * Before returning to user space ensure that all pending work
>   	 * items have been completed.
>   	 */
> -	while (ti_work & EXIT_TO_USER_MODE_WORK) {
> -
> +	do {
>   		local_irq_enable_exit_to_user(ti_work);
>   
>   		if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> @@ -56,7 +55,19 @@ void __weak arch_do_signal_or_restart(st
>   		tick_nohz_user_enter_prepare();
>   
>   		ti_work = read_thread_flags();
> -	}
> +
> +		/*
> +		 * This returns the unmodified ti_work, when ti_work is not
> +		 * empty. In that case it waits for the next round to avoid
> +		 * multiple updates in case of rescheduling.
> +		 *
> +		 * When it handles rseq it returns either with empty work
> +		 * on success or with TIF_NOTIFY_RESUME set on failure to
> +		 * kick the handling into the slow path.
> +		 */
> +		ti_work = rseq_exit_to_user_mode_work(regs, ti_work, EXIT_TO_USER_MODE_WORK);
> +
> +	} while (ti_work & EXIT_TO_USER_MODE_WORK);
>   
>   	/* Return the latest work state for arch_exit_to_user_mode() */
>   	return ti_work;
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
>   
>   static void rseq_slowpath_update_usr(struct pt_regs *regs)
>   {
> -	/* Preserve rseq state and user_irq state for exit to user */
> +	/*
> +	 * Preserve rseq state and user_irq state. The generic entry code
> +	 * clears user_irq on the way out; the non-generic entry
> +	 * architectures do not have user_irq.
> +	 */
>   	const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
>   	struct task_struct *t = current;
>   	struct rseq_ids ids;
> @@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
>   	}
>   }
>   
> -void __rseq_handle_notify_resume(struct pt_regs *regs)
> +void __rseq_handle_slowpath(struct pt_regs *regs)
>   {
>   	/*
>   	 * If invoked from hypervisors before entering the guest via
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 29/37] entry: Split up exit_to_user_mode_prepare()
  2025-08-23 16:40 ` [patch V2 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
@ 2025-08-26 15:41   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:41 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
> there is extra rseq work, which is only required for in the interrupt exit
> case.
> 
> Split up the function and provide wrappers for syscalls and interrupts,
> which allows to seperate the rseq exit work in the next step.

separate

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/entry-common.h     |    2 -
>   include/linux/irq-entry-common.h |   42 ++++++++++++++++++++++++++++++++++-----
>   2 files changed, 38 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -156,7 +156,7 @@ static __always_inline void syscall_exit
>   	if (unlikely(work & SYSCALL_WORK_EXIT))
>   		syscall_exit_work(regs, work);
>   	local_irq_disable_exit_to_user();
> -	exit_to_user_mode_prepare(regs);
> +	syscall_exit_to_user_mode_prepare(regs);
>   }
>   
>   /**
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt
>   unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
>   
>   /**
> - * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> + * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
>    * @regs:	Pointer to pt_regs on entry stack
>    *
>    * 1) check that interrupts are disabled
> @@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(str
>    * 3) call exit_to_user_mode_loop() if any flags from
>    *    EXIT_TO_USER_MODE_WORK are set
>    * 4) check that interrupts are still disabled
> + *
> + * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
>    */
> -static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
> +static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
>   	unsigned long ti_work;
>   
> @@ -224,15 +226,45 @@ static __always_inline void exit_to_user
>   		ti_work = exit_to_user_mode_loop(regs, ti_work);
>   
>   	arch_exit_to_user_mode_prepare(regs, ti_work);
> +}
>   
> -	rseq_exit_to_user_mode();
> -
> +static __always_inline void __exit_to_user_mode_validate(void)
> +{
>   	/* Ensure that kernel state is sane for a return to userspace */
>   	kmap_assert_nomap();
>   	lockdep_assert_irqs_disabled();
>   	lockdep_sys_exit();
>   }
>   
> +
> +/**
> + * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> + * @regs:	Pointer to pt_regs on entry stack
> + *
> + * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for

separate

> + * syscalls and interrupts.
> + */
> +static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
> +{
> +	__exit_to_user_mode_prepare(regs);
> +	rseq_exit_to_user_mode();
> +	__exit_to_user_mode_validate();
> +}
> +
> +/**
> + * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
> + * @regs:	Pointer to pt_regs on entry stack
> + *
> + * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for

separate

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> + * syscalls and interrupts.
> + */
> +static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
> +{
> +	__exit_to_user_mode_prepare(regs);
> +	rseq_exit_to_user_mode();
> +	__exit_to_user_mode_validate();
> +}
> +
>   /**
>    * exit_to_user_mode - Fixup state when exiting to user mode
>    *
> @@ -297,7 +329,7 @@ static __always_inline void irqentry_ent
>   static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
>   {
>   	instrumentation_begin();
> -	exit_to_user_mode_prepare(regs);
> +	irqentry_exit_to_user_mode_prepare(regs);
>   	instrumentation_end();
>   	exit_to_user_mode();
>   }
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode()
  2025-08-23 16:40 ` [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
@ 2025-08-26 15:45   ` Mathieu Desnoyers
  0 siblings, 0 replies; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-26 15:45 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-23 12:40, Thomas Gleixner wrote:
> Seperate the interrupt and syscall exit handling. Syscall exit does not

Separate

> require clearing the user_irq bit as it can't be set. On interrupt exit it
> can be set when the interrupt did not result in a scheduling event and
> therefore the return path did not invoke the TIF work handling, which would
> have cleared it.
> 
> The debug check for the event state is also not really required even when
> debug mode is enabled via the static key. Debug mode largely aids user
> space by enabling a larger number of validation checks, which cause a
> segfault when a malformed critical section is detected. In production mode
> the critical section handling takes the content mostly as is and lets user
> space keep the pieces when it screwed up.
> 
> On kernel changes in that area the state check is useful, but that can be
> done when lockdep is enabled, which is anyway a required test scenario for
> fundamental changes.

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   include/linux/irq-entry-common.h |    4 ++--
>   include/linux/rseq_entry.h       |   21 +++++++++++++++++----
>   2 files changed, 19 insertions(+), 6 deletions(-)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -247,7 +247,7 @@ static __always_inline void __exit_to_us
>   static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
>   	__exit_to_user_mode_prepare(regs);
> -	rseq_exit_to_user_mode();
> +	rseq_syscall_exit_to_user_mode();
>   	__exit_to_user_mode_validate();
>   }
>   
> @@ -261,7 +261,7 @@ static __always_inline void syscall_exit
>   static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
>   {
>   	__exit_to_user_mode_prepare(regs);
> -	rseq_exit_to_user_mode();
> +	rseq_irqentry_exit_to_user_mode();
>   	__exit_to_user_mode_validate();
>   }
>   
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -519,19 +519,31 @@ rseq_exit_to_user_mode_work(struct pt_re
>   
>   #endif /* !CONFIG_GENERIC_ENTRY */
>   
> -static __always_inline void rseq_exit_to_user_mode(void)
> +static __always_inline void rseq_syscall_exit_to_user_mode(void)
>   {
>   	struct rseq_event *ev = &current->rseq_event;
>   
>   	rseq_stat_inc(rseq_stats.exit);
>   
> -	if (static_branch_unlikely(&rseq_debug_enabled))
> +	/* Needed to remove the store for the !lockdep case */
> +	if (IS_ENABLED(CONFIG_LOCKDEP)) {
>   		WARN_ON_ONCE(ev->sched_switch);
> +		ev->events = 0;
> +	}
> +}
> +
> +static __always_inline void rseq_irqentry_exit_to_user_mode(void)
> +{
> +	struct rseq_event *ev = &current->rseq_event;
> +
> +	rseq_stat_inc(rseq_stats.exit);
> +
> +	lockdep_assert_once(!ev->sched_switch);
>   
>   	/*
>   	 * Ensure that event (especially user_irq) is cleared when the
>   	 * interrupt did not result in a schedule and therefore the
> -	 * rseq processing did not clear it.
> +	 * rseq processing could not clear it.
>   	 */
>   	ev->events = 0;
>   }
> @@ -551,7 +563,8 @@ static inline unsigned long rseq_exit_to
>   	return ti_work;
>   }
>   static inline void rseq_note_user_irq_entry(void) { }
> -static inline void rseq_exit_to_user_mode(void) { }
> +static inline void rseq_syscall_exit_to_user_mode(void) { }
> +static inline void rseq_irqentry_exit_to_user_mode(void) { }
>   static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
>   #endif /* !CONFIG_RSEQ */
>   
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user
  2025-08-26 15:40   ` Mathieu Desnoyers
@ 2025-08-27 13:45     ` Mathieu Desnoyers
  2025-09-02 18:36       ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Mathieu Desnoyers @ 2025-08-27 13:45 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On 2025-08-26 11:40, Mathieu Desnoyers wrote:
> On 2025-08-23 12:40, Thomas Gleixner wrote:
>> Now that all bits and pieces are in place, hook the RSEQ handling fast 
>> path
>> function into exit_to_user_mode_prepare() after the TIF work bits have 
>> been
>> handled. In case of fast path failure, TIF_NOTIFY_RESUME has been raised
>> and the caller needs to take another turn through the TIF handling slow
>> path.
>>
>> This only works for architectures which use the generic entry code.
>> Architectures that still have their own incomplete hacks are not
>> supported
>> and won't be.
>>
>> This results in the following improvements:
>>
>>    Kernel build           Before          After              Reduction
>>
>>    exit to user         80692981          80514451
>>    signal checks:          32581               121           99%
>>    slowpath runs:        1201408   1.49%           198 0.00%      100%
>>    fastpath runs:                         675941 0.84%       N/A
>>    id updates:           1233989   1.53%         50541 0.06%       96%
>>    cs checks:            1125366   1.39%             0 0.00%      100%
>>      cs cleared:         1125366      100%     0            100%
>>      cs fixup:                 0        0%     0
>>
>>    RSEQ selftests      Before          After              Reduction
>>
>>    exit to user:       386281778          387373750
>>    signal checks:       35661203                  0           100%
>>    slowpath runs:      140542396 36.38%            100  0.00%    100%
>>    fastpath runs:                         9509789  2.51%     N/A
>>    id updates:         176203599 45.62%        9087994  2.35%     95%
>>    cs checks:          175587856 45.46%        4728394  1.22%     98%
>>      cs cleared:       172359544   98.16%    1319307   27.90%   99%
>>      cs fixup:           3228312    1.84%    3409087   72.10%

By the way, you should really not be using the entire rseq selftests
as a representative workload for profiling the kernel rseq implementation.

Those selftests include "loop injection", "yield injection", "kill
injection" and "sleep injection" within the relevant userspace code
paths, which really increase the likelihood of hitting stuff like
"cs fixup" compared to anything that comes close to a realistic
use-case. This is really useful for testing correctness, but not
for profiling. For instance, the "loop injection" introduces busy
loops within rseq critical sections to significantly increase the
likelihood of hitting a cs fixup.

Those specific selftests are really just "stress-tests" that don't
represent any relevant workload.

The rseq selftests that are more relevant for the type of profiling
you are trying to do here are the "param_test_benchmark". Those
entirely compile-out the injection code and focus on the performance
of rseq fast-path under heavy use. This is already more representative
of a semi-realistic "super-heavy" rseq use workload (you could see it
as a rseq worse-case use upper bound).

I suspect that using this for profiling, you will find out that
optimizing the "cs fixup" code path is not relevant.

The following script runs the "benchmark" tests, which are more relevant
for profiling:

diff --git a/tools/testing/selftests/rseq/Makefile b/tools/testing/selftests/rseq/Makefile
index 0d0a5fae5954..30339183f8a2 100644
--- a/tools/testing/selftests/rseq/Makefile
+++ b/tools/testing/selftests/rseq/Makefile
@@ -21,7 +21,7 @@ TEST_GEN_PROGS = basic_test basic_percpu_ops_test basic_percpu_ops_mm_cid_test p
  
  TEST_GEN_PROGS_EXTENDED = librseq.so
  
-TEST_PROGS = run_param_test.sh run_syscall_errors_test.sh
+TEST_PROGS = run_param_test.sh run_param_test_benchmark.sh run_syscall_errors_test.sh
  
  TEST_FILES := settings
  
diff --git a/tools/testing/selftests/rseq/run_param_test_benchmark.sh b/tools/testing/selftests/rseq/run_param_test_benchmark.sh
new file mode 100755
index 000000000000..17b3dfcfcdd4
--- /dev/null
+++ b/tools/testing/selftests/rseq/run_param_test_benchmark.sh
@@ -0,0 +1,49 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0+ or MIT
+
+NR_CPUS=`grep '^processor' /proc/cpuinfo | wc -l`
+
+EXTRA_ARGS=${@}
+
+OLDIFS="$IFS"
+IFS=$'\n'
+TEST_LIST=(
+	"-T s"
+	"-T l"
+	"-T b"
+	"-T b -M"
+	"-T m"
+	"-T m -M"
+	"-T i"
+	"-T r"
+)
+
+TEST_NAME=(
+	"spinlock"
+	"list"
+	"buffer"
+	"buffer with barrier"
+	"memcpy"
+	"memcpy with barrier"
+	"increment"
+	"membarrier"
+)
+IFS="$OLDIFS"
+
+REPS=10000000
+NR_THREADS=$((6*${NR_CPUS}))
+
+function do_tests()
+{
+	local i=0
+	while [ "$i" -lt "${#TEST_LIST[@]}" ]; do
+		echo "Running benchmark test ${TEST_NAME[$i]}"
+		./param_test_benchmark ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+
+		echo "Running mm_cid benchmark test ${TEST_NAME[$i]}"
+		./param_test_mm_cid_benchmark ${TEST_LIST[$i]} -r ${REPS} -t ${NR_THREADS} ${@} ${EXTRA_ARGS} || exit 1
+		let "i++"
+	done
+}
+
+do_tests

Thanks,

Mathieu

>>
>> The 'cs cleared' and 'cs fixup' percentages are not relative to the exit
>> to user invocations, they are relative to the actual 'cs check'
>> invocations.
>>
>> While some of this could have been avoided in the original code, like the
>> obvious clearing of CS when it's already clear, the main problem of going
>> through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
>> notify handler is invoked more than once before going out to user
>> space. Doing this once when everything has stabilized is the only 
>> solution
>> to avoid this.
>>
>> The initial attempt to completely decouple it from the TIF work turned 
>> out
>> to be suboptimal for workloads, which do a lot of quick and short system
>> calls. Even if the fast path decision is only 4 instructions (including a
>> conditional branch), this adds up quickly and becomes measurable when the
>> rate for actually having to handle rseq is in the low single digit
>> percentage range of user/kernel transitions.
>>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 
>> ---
>>   include/linux/irq-entry-common.h |    7 ++-----
>>   include/linux/resume_user_mode.h |    2 +-
>>   include/linux/rseq.h             |   24 ++++++++++++++++++------
>>   include/linux/rseq_entry.h       |    2 +-
>>   init/Kconfig                     |    2 +-
>>   kernel/entry/common.c            |   17 ++++++++++++++---
>>   kernel/rseq.c                    |    8 ++++++--
>>   7 files changed, 43 insertions(+), 19 deletions(-)
>>
>> --- a/include/linux/irq-entry-common.h
>> +++ b/include/linux/irq-entry-common.h
>> @@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
>>    */
>>   void arch_do_signal_or_restart(struct pt_regs *regs);
>> -/**
>> - * exit_to_user_mode_loop - do any pending work before leaving to 
>> user space
>> - */
>> -unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>> -                     unsigned long ti_work);
>> +/* Handle pending TIF work */
>> +unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned 
>> long ti_work);
>>   /**
>>    * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if 
>> required
>> --- a/include/linux/resume_user_mode.h
>> +++ b/include/linux/resume_user_mode.h
>> @@ -59,7 +59,7 @@ static inline void resume_user_mode_work
>>       mem_cgroup_handle_over_high(GFP_KERNEL);
>>       blkcg_maybe_throttle_current();
>> -    rseq_handle_notify_resume(regs);
>> +    rseq_handle_slowpath(regs);
>>   }
>>   #endif /* LINUX_RESUME_USER_MODE_H */
>> --- a/include/linux/rseq.h
>> +++ b/include/linux/rseq.h
>> @@ -5,13 +5,19 @@
>>   #ifdef CONFIG_RSEQ
>>   #include <linux/sched.h>
>> -void __rseq_handle_notify_resume(struct pt_regs *regs);
>> +void __rseq_handle_slowpath(struct pt_regs *regs);
>> -static inline void rseq_handle_notify_resume(struct pt_regs *regs)
>> +/* Invoked from resume_user_mode_work() */
>> +static inline void rseq_handle_slowpath(struct pt_regs *regs)
>>   {
>> -    /* '&' is intentional to spare one conditional branch */
>> -    if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
>> -        __rseq_handle_notify_resume(regs);
>> +    if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
>> +        if (current->rseq_event.slowpath)
>> +            __rseq_handle_slowpath(regs);
>> +    } else {
>> +        /* '&' is intentional to spare one conditional branch */
>> +        if (current->rseq_event.sched_switch & current- 
>> >rseq_event.has_rseq)
>> +            __rseq_handle_slowpath(regs);
>> +    }
>>   }
>>   void __rseq_signal_deliver(int sig, struct pt_regs *regs);
>> @@ -138,6 +144,12 @@ static inline void rseq_fork(struct task
>>           t->rseq_sig = current->rseq_sig;
>>           t->rseq_ids.cpu_cid = ~0ULL;
>>           t->rseq_event = current->rseq_event;
>> +        /*
>> +         * If it has rseq, force it into the slow path right away
>> +         * because it is guaranteed to fault.
>> +         */
>> +        if (t->rseq_event.has_rseq)
>> +            t->rseq_event.slowpath = true;
>>       }
>>   }
>> @@ -151,7 +163,7 @@ static inline void rseq_execve(struct ta
>>   }
>>   #else /* CONFIG_RSEQ */
>> -static inline void rseq_handle_notify_resume(struct ksignal *ksig, 
>> struct pt_regs *regs) { }
>> +static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
>>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct 
>> pt_regs *regs) { }
>>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
>>   static inline void rseq_sched_set_task_cpu(struct task_struct *t, 
>> unsigned int cpu) { }
>> --- a/include/linux/rseq_entry.h
>> +++ b/include/linux/rseq_entry.h
>> @@ -433,7 +433,7 @@ static rseq_inline bool rseq_update_usr(
>>    * tells the caller to loop back into exit_to_user_mode_loop(). The 
>> rseq
>>    * slow path there will handle the fail.
>>    */
>> -static __always_inline bool rseq_exit_to_user_mode_restart(struct 
>> pt_regs *regs)
>> +static __always_inline bool __rseq_exit_to_user_mode_restart(struct 
>> pt_regs *regs)
>>   {
>>       struct task_struct *t = current;
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1911,7 +1911,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
>>   config DEBUG_RSEQ
>>       default n
>>       bool "Enable debugging of rseq() system call" if EXPERT
>> -    depends on RSEQ && DEBUG_KERNEL
>> +    depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
>>       select RSEQ_DEBUG_DEFAULT_ENABLE
>>       help
>>         Enable extra debugging checks for the rseq system call.
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -23,8 +23,7 @@ void __weak arch_do_signal_or_restart(st
>>        * Before returning to user space ensure that all pending work
>>        * items have been completed.
>>        */
>> -    while (ti_work & EXIT_TO_USER_MODE_WORK) {
>> -
>> +    do {
>>           local_irq_enable_exit_to_user(ti_work);
>>           if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
>> @@ -56,7 +55,19 @@ void __weak arch_do_signal_or_restart(st
>>           tick_nohz_user_enter_prepare();
>>           ti_work = read_thread_flags();
>> -    }
>> +
>> +        /*
>> +         * This returns the unmodified ti_work, when ti_work is not
>> +         * empty. In that case it waits for the next round to avoid
>> +         * multiple updates in case of rescheduling.
>> +         *
>> +         * When it handles rseq it returns either with empty work
>> +         * on success or with TIF_NOTIFY_RESUME set on failure to
>> +         * kick the handling into the slow path.
>> +         */
>> +        ti_work = rseq_exit_to_user_mode_work(regs, ti_work, 
>> EXIT_TO_USER_MODE_WORK);
>> +
>> +    } while (ti_work & EXIT_TO_USER_MODE_WORK);
>>       /* Return the latest work state for arch_exit_to_user_mode() */
>>       return ti_work;
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -234,7 +234,11 @@ static bool rseq_handle_cs(struct task_s
>>   static void rseq_slowpath_update_usr(struct pt_regs *regs)
>>   {
>> -    /* Preserve rseq state and user_irq state for exit to user */
>> +    /*
>> +     * Preserve rseq state and user_irq state. The generic entry code
>> +     * clears user_irq on the way out; the non-generic entry
>> +     * architectures do not have user_irq.
>> +     */
>>       const struct rseq_event evt_mask = { .has_rseq = true, .user_irq 
>> = true, };
>>       struct task_struct *t = current;
>>       struct rseq_ids ids;
>> @@ -286,7 +290,7 @@ static void rseq_slowpath_update_usr(str
>>       }
>>   }
>> -void __rseq_handle_notify_resume(struct pt_regs *regs)
>> +void __rseq_handle_slowpath(struct pt_regs *regs)
>>   {
>>       /*
>>        * If invoked from hypervisors before entering the guest via
>>
> 
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported
  2025-08-25 20:02   ` Sean Christopherson
@ 2025-09-02 11:03     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 11:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: LKML, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Wei Liu, Dexuan Cui,
	x86, Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 13:02, Sean Christopherson wrote:
> On Sat, Aug 23, 2025, Thomas Gleixner wrote:
>> @@ -122,7 +122,7 @@ static inline void rseq_force_update(voi
>>   */
>>  static inline void rseq_virt_userspace_exit(void)
>>  {
>> -	if (current->rseq_event.sched_switch)
>> +	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
>
> Rather than pivot on CONFIG_HAVE_GENERIC_TIF_BITS, which makes the "why" quite
> difficult to find/understand, what if this checks TIF_RSEQ == TIF_NOTIFY_RESUME?
> That would also allow architectures to define TIF_RSEQ without switching to the
> generic TIF bits implementation (though I don't know that we want to encourage
> that?).

Did you read the cover letter?

Consolidating on common infrastructure is the goal here. Stop
proliferating the architecture specific hackery, which has zero value
and justification. If people want to harvest the core improvements, then
they should get their act together and mop up their architecture
code. If they can't be bothered, so be it.

I'm happy to add a comment which explains that.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 06/37] rseq: Simplify the event notification
  2025-08-25 17:36   ` Mathieu Desnoyers
@ 2025-09-02 13:39     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:39 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 13:36, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
>> flags") the bits in task::rseq_event_mask are meaningless and just extra
>> work in terms of setting them individually.
>> 
>> Aside of that the only relevant point where an event has to be raised is
>> context switch. Neither the CPU nor MM CID can change without going through
>> a context switch.
>
> Note: we may want to include the numa node id field as well in this
> list of fields.

What for? The node to CPU relationship is not magically changing, so you
can't have a situation where the task stays on the same CPU and suddenly
runs on a different node.

>> -	unsigned long rseq_event_mask;
>> +	bool				rseq_event_pending;
>
> AFAIU, this rseq_event_pending field is now concurrently set from:
>
> - rseq_signal_deliver (without any preempt nor irqoff guard)
> - rseq_sched_switch_event (with preemption disabled)
>
> Is it safe to concurrently store to a "bool" field within a structure
> without any protection against concurrent stores ? Typically I've used
> an integer field just to be on the safe side in that kind of situation.
>
> AFAIR, a bool type needs to be at least 1 byte. Do all architectures
> supported by Linux have a single byte store instruction, or can we end
> up incorrectly storing to other nearby fields ? (for instance, DEC
> Alpha ?)

All architectures which support RSEQ do and I really don't care about
ALPHA, which has other problems than that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending
  2025-08-25 18:02   ` Mathieu Desnoyers
@ 2025-09-02 13:41     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:41 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 14:02, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> There is no need to update these values unconditionally if there is no
>> event pending.
>
> I agree with this change.
>
> On a related note, I wonder if arch/powerpc/mm/numa.c:
> find_and_update_cpu_nid() should set the rseq_event pending bool to true
> for each thread in the system ?

What for? That's the hotplug path, which establishes the CPU to node
relationship for newly added CPUs and their [hyper]threads _before_
those CPUs are able to run anything.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 09/37] rseq: Introduce struct rseq_event
  2025-08-25 18:11   ` Mathieu Desnoyers
@ 2025-09-02 13:45     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:45 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 14:11, Mathieu Desnoyers wrote:
>> + * @sched_switch:	True if the task was scheduled out
>> + * @has_rseq:		True if the task has a rseq pointer installed
>> + */
>> +struct rseq_event {
>> +	union {
>> +		u32				all;
>> +		struct {
>> +			union {
>> +				u16		events;
>> +				struct {
>> +					u8	sched_switch;
>> +				};
>
> Is alpha still supported, or can we assume bytewise loads/stores ?

Alpha is on life support, but that does not mean we have to cater for it
in new features.

> Are those events meant to each consume 1 byte (which limits us to 2
> events for a 2-byte "events"/4-byte "all"), or is the plan to update
> them with bitwise or/~ and ?

No. Bitwise or/and is creating horrible ASM code and needs serialization
in the worst case.

There is no need for tons of events. See the changes further down the
series.
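
To illustrate the point, a minimal sketch (the struct and field names here
are made up for illustration, they are not the ones used in the series): a
byte-wide event flag can be set with a plain store, while packing events
into a shared bitmask forces a read-modify-write, which has to become
atomic as soon as another context can set a different bit concurrently.

struct events_bytes {
	unsigned char	sched_switch;	/* one event per byte */
	unsigned char	ids_changed;
};

struct events_mask {
	unsigned long	mask;		/* all events share one word */
};

static void set_switch_byte(struct events_bytes *e)
{
	/* Single byte store, cannot clobber the other event */
	e->sched_switch = 1;
}

static void set_switch_bit(struct events_mask *e)
{
	/*
	 * Read-modify-write of the shared word. If another context can
	 * set a different bit concurrently, this must become an atomic
	 * operation to avoid losing updates.
	 */
	e->mask |= 1UL;
}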

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 14/37] rseq: Cache CPU ID and MM CID values
  2025-08-25 18:19   ` Mathieu Desnoyers
@ 2025-09-02 13:48     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:48 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 14:19, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> In preparation for rewriting RSEQ exit to user space handling provide
>> storage to cache the CPU ID and MM CID values which were written to user
>> space. That prepares for a quick check, which avoids the update when
>> nothing changed.
>
> What should we do about the numa node_id field ?
>
> On pretty much all arch except powerpc (AFAIK) it's invariant for
> the topology, so derived from cpu_id.
>
> On powerpc, we could perhaps reset the cached cpu_id to ~0U for
> each thread to trigger an update ? Or just don't care about this ?

It's invariant on powerPC as well after the CPU was [hot]added to the
kernel.

Otherwise any usage of cpu_to_node() would be broken on powerPC, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 15/37] rseq: Record interrupt from user space
  2025-08-25 18:29   ` Mathieu Desnoyers
@ 2025-09-02 13:54     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:54 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 14:29, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> code. If your architecture does not use it, bad luck.
>> 
>
> Should we eventually add a "depends on GENERIC_IRQ_ENTRY" to RSEQ then ?

I wish we could, but that'd break MIPS, POWER and ARM*...

>> @@ -281,6 +281,7 @@ static __always_inline void exit_to_user
>>   static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
>>   {
>>   	enter_from_user_mode(regs);
>> +	rseq_note_user_irq_entry();
>
> As long as this also covers the following scenarios I'm ok with this:
>
> - trap/exception from an rseq critical section,

It does. Traps and exceptions go through that entry path.

> - NMI over an rseq critical section.

That's irrelevant as NMIs are not going through the regular exit to user
path and therefore can't reschedule. If they trigger something which
requires a reschedule they raise IRQ work, which then goes through the
regular irqentry/exit path. 

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 18/37] rseq: Provide static branch for runtime debugging
  2025-08-25 20:30   ` Michael Jeanson
@ 2025-09-02 13:56     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 13:56 UTC (permalink / raw)
  To: Michael Jeanson, LKML
  Cc: Jens Axboe, Peter Zijlstra, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
	Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

On Mon, Aug 25 2025 at 16:30, Michael Jeanson wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> +static int __init rseq_setup_debug(char *str)
>> +{
>> +	bool on;
>> +
>> +	if (kstrtobool(str, &on))
>> +		return -EINVAL;
>> +	rseq_control_debug(on);
>> +	return 0;
>
> Functions used by __setup() have to return '1' to signal that the 
> argument was handled, otherwise you get this in the kernel log:
>
> kernel: Unknown kernel command line parameters "rseq_debug=1", will be 
> passed to user space.

Duh, yes.
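
For completeness, a sketch of the fix (the __setup() registration line and
the exact parameter string are assumptions based on the log message quoted
above, not taken from the actual patch):

static int __init rseq_setup_debug(char *str)
{
	bool on;

	if (!kstrtobool(str, &on))
		rseq_control_debug(on);
	/* __setup() handlers must return 1 to mark the parameter as handled */
	return 1;
}
__setup("rseq_debug=", rseq_setup_debug);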

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 23/37] rseq: Provide and use rseq_set_uids()
  2025-08-26 14:52   ` Mathieu Desnoyers
@ 2025-09-02 14:08     ` Thomas Gleixner
  2025-09-02 16:33       ` Thomas Gleixner
  0 siblings, 1 reply; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 14:08 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Tue, Aug 26 2025 at 10:52, Mathieu Desnoyers wrote:
>> +{
>> +	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
>> +	struct rseq __user *rseq = t->rseq;
>> +
>> +	if (t->rseq_ids.cpu_cid == ~0)
>> +		return true;
>> +
>> +	if (!user_read_masked_begin(rseq))
>> +		return false;
>> +
>> +	unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
>> +	if (cpu_id != t->rseq_ids.cpu_id)
>> +		goto die;
>> +	unsafe_get_user(uval, &rseq->cpu_id, efault);
>> +	if (uval != cpu_id)
>> +		goto die;
>> +	unsafe_get_user(uval, &rseq->node_id, efault);
>> +	if (uval != node_id)
>> +		goto die;
>
> AFAIU, when a task migrates across NUMA nodes, userspace will have a
> stale value and this check will fail, thus killing the process. To fix
> this you'd need to derive "node_id" from
> cpu_to_node(t->rseq_ids.cpu_id).

Good catch.

> But doing that will not work on powerpc, where the mapping between
> node_id and cpu_id can change dynamically, AFAIU this can kill processes
> even though userspace did not alter the node_id behind the kernel's
> back.

Still not an issue. You might need to reread the related PPC code :)
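
For illustration, a minimal sketch of the check with the correction Mathieu
suggests (names follow the quoted hunk; the actual fix may look different):
the expected node is derived from the cached CPU id, i.e. the one which was
last written to user space, so a legitimate cross-node migration between
two updates is not mistaken for user space tampering.

static bool rseq_node_id_matches(struct task_struct *t, u32 user_node_id)
{
	/* rseq_ids::cpu_id is the CPU id last written to user space */
	return user_node_id == cpu_to_node(t->rseq_ids.cpu_id);
}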

>> +
>> +	/* Cache the new values */
>> +	t->rseq_ids.cpu_cid = ids->cpu_cid;
>
> I may be missing something, but I think we're missing updates to
> t->rseq_ids.mm_cid and we may want to keep track of t->rseq_ids.node_id
> as well.

Oops. I'm sure I had that mm_cid caching, but somehow dropped it. And
again, no need to keep track of the node id. It's stable vs. CPU ID.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 26/37] rseq: Optimize event setting
  2025-08-26 15:26   ` Mathieu Desnoyers
@ 2025-09-02 14:17     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 14:17 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Tue, Aug 26 2025 at 11:26, Mathieu Desnoyers wrote:
>> Add a event flag, which is set when the CPU or MM CID or both change.
>
> We should figure out what to do for powerpc's dynamic numa node id
> to cpu mapping here.

:)

> The combination of patch
> "rseq: Simplify the event notification" and this
> ends up moving those three rseq_migrate events to __set_task_cpu:
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index be00629f0ba4..695c23939345 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p, unsigned 
> int new_cpu)
>                  if (p->sched_class->migrate_task_rq)
>                          p->sched_class->migrate_task_rq(p, new_cpu);
>                  p->se.nr_migrations++;
> -               rseq_migrate(p);
>                  sched_mm_cid_migrate_from(p);
>                  perf_event_task_migrate(p);
>          }
> @@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct *p, 
> struct kernel_clone_args *kargs)
>                  p->sched_task_group = tg;
>          }
>   #endif
> -       rseq_migrate(p);
>          /*
>           * We're setting the CPU for the first time, we don't migrate,
>           * so use __set_task_cpu().
> @@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct *p)
>           * as we're not fully set-up yet.
>           */
>          p->recent_used_cpu = task_cpu(p);
> -       rseq_migrate(p);
>          __set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
>          rq = __task_rq_lock(p, &rf);
>          update_rq_clock(rq);
>
> AFAIR those were placed in the callers to benefit from the conditional
> in set_task_cpu():
>
>          if (task_cpu(p) != new_cpu) {
>
> perhaps it's not a big deal, but I think it's relevant to point it out.

They ended up setting the event all over the place. The only relevant
point which matters is __set_task_cpu().

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 19/37] rseq: Provide and use rseq_update_user_cs()
  2025-08-25 19:16   ` Mathieu Desnoyers
@ 2025-09-02 15:19     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 15:19 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 15:16, Mathieu Desnoyers wrote:
> On 2025-08-23 12:39, Thomas Gleixner wrote:
>> If user space truly cares about
>> the security of the critical section descriptors, then it should set them
>> up once and map the descriptor memory read only.
>
> AFAIR, the attack pattern we are trying to tackle here is:

  ^^^^^ - so I'm not the only one who struggles to find some explanation
          for that in code, change logs etc. :)

> The attacker has write access to some memory (e.g. stack or heap) and
> uses his area to craft a custom rseq_cs descriptor. Using this home-made
> descriptor and storing to rseq->rseq_cs, it can set an abort_ip to e.g.
> glibc system(3) and easily call any library function through an aborting
> rseq critical section, thus bypassing ROP prevention mechanisms.
>
> Requiring the signature prior to the abort ip target prevents using rseq
> to bypass ROP prevention, because those ROP gadget targets don't have
> the signature.

Fair enough. Let me see how to integrate this properly along with a big
fat comment explaining what it actually does.
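
For reference, a rough sketch of what such a check amounts to (the helper
name and its placement are illustrative, not the final patch): the 32-bit
signature registered via sys_rseq() has to be found directly in front of
the abort handler, otherwise a crafted descriptor could point abort_ip at
an arbitrary gadget.

static bool rseq_abort_ip_signed(struct task_struct *t, unsigned long abort_ip)
{
	u32 usig;

	/* The signature is expected in the four bytes preceding abort_ip */
	if (get_user(usig, (u32 __user *)(abort_ip - sizeof(u32))))
		return false;

	return usig == t->rseq_sig;
}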

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
  2025-08-25 20:24     ` Sean Christopherson
@ 2025-09-02 15:37       ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 15:37 UTC (permalink / raw)
  To: Sean Christopherson, Mathieu Desnoyers
  Cc: LKML, Jens Axboe, Paolo Bonzini, Wei Liu, Dexuan Cui,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, x86, Arnd Bergmann,
	Heiko Carstens, Christian Borntraeger, Sven Schnelle, Huacai Chen,
	Paul Walmsley, Palmer Dabbelt

On Mon, Aug 25 2025 at 13:24, Sean Christopherson wrote:
> On Mon, Aug 25, 2025, Mathieu Desnoyers wrote:
>> > @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
>> >   		r = kvm_arch_vcpu_ioctl_run(vcpu);
>> >   		vcpu->wants_to_run = false;
>> > +		rseq_virt_userspace_exit();
>
> I don't love bleeding even more entry/rseq details into KVM.

Neither do I.

> Rather than optimize KVM and then add TIF_RSEQ, what if we do the
> opposite?

I'm not optimizing KVM. I'm simplifying the RSEQ parts to ignore
TIF_NOTIFY_RESUME when invoked with @regs == NULL.

> I.e. add TIF_RSEQ to XFER_TO_GUEST_MODE_WORK as part of "rseq: Switch
> to TIF_RSEQ if supported", and then drop TIF_RSEQ from
> XFER_TO_GUEST_MODE_WORK in a new patch?

The problem is that I have to keep all the architectures which

    - do not use the generic entry code
    - therefore can't be switched trivially over to the TIF_RSEQ scheme
    - have RSEQ support enabled

alive and working.
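
For completeness, the mask in question is the generic entry one; the
suggestion amounts to roughly this shape (sketch only, with the new
_TIF_RSEQ bit assumed; the series avoids adding it there and retriggers
on ioctl exit instead):

	/* Sketch of include/linux/entry-kvm.h with TIF_RSEQ included */
	#define XFER_TO_GUEST_MODE_WORK						\
		(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |	\
		 _TIF_NOTIFY_RESUME | _TIF_RSEQ | ARCH_XFER_TO_GUEST_MODE_WORK)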

> That should make it easier to revert the KVM/virt change if it turns
> out PV setups are playing games with rseq,

I can't find a hint of such insanity in the kernel, so *shrug*.

If there is out-of-tree code which plays games with the vCPU's user
space thread::TLS::rseq, then it rightfully breaks. The update, which
happens today, is just a coincidence and a kernel-internal implementation
detail.

> and it would give the stragglers (arm64 in particular) some
> motivation to implement TIF_RSEQ and/or switch to generic TIF bits.

There is enough motivation in this series to do so :)

Thanks,

        tglx


* Re: [patch V2 23/37] rseq: Provide and use rseq_set_uids()
  2025-09-02 14:08     ` Thomas Gleixner
@ 2025-09-02 16:33       ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 16:33 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Tue, Sep 02 2025 at 16:08, Thomas Gleixner wrote:
> On Tue, Aug 26 2025 at 10:52, Mathieu Desnoyers wrote:
>>> +
>>> +	/* Cache the new values */
>>> +	t->rseq_ids.cpu_cid = ids->cpu_cid;
>>
>> I may be missing something, but I think we're missing updates to
>> t->rseq_ids.mm_cid and we may want to keep track of t->rseq_ids.node_id
>> as well.
>
> Oops. I'm sure I had that mm_cid caching, but somehow dropped it. And
> again, no need to keep track of the node id. It's stable vs. CPU ID.

Correcting myself. You are missing that this caches the compound value
and not only the CPU id.
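
For clarity, the shape I have in mind is roughly this (illustrative
layout, not necessarily the exact one in the patch):

	struct rseq_ids {
		union {
			u64	cpu_cid;	/* compound value, one store/compare */
			struct {
				u32	cpu_id;
				u32	mm_cid;
			};
		};
	};

so caching rseq_ids::cpu_cid caches the CPU id and the MM CID in one go.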


* Re: [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler
  2025-08-26 15:12   ` Mathieu Desnoyers
@ 2025-09-02 17:32     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 17:32 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Tue, Aug 26 2025 at 11:12, Mathieu Desnoyers wrote:
> On 2025-08-23 12:40, Thomas Gleixner wrote:
>> +void __rseq_handle_notify_resume(struct pt_regs *regs)
>> +{
>> +	/*
>> +	 * If invoked from hypervisors before entering the guest via
>> +	 * resume_user_mode_work(), then @regs is a NULL pointer.
>> +	 *
>> +	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
>> +	 * it before returning from the ioctl() to user space when
>> +	 * rseq_event.sched_switch is set.
>> +	 *
>> +	 * So it's safe to ignore here instead of pointlessly updating it
>> +	 * in the vcpu_run() loop.
>
> I don't think any virt user should expect the userspace fields to be
> updated on the host process while running in guest mode, but it's good
> to clarify that we intend to change this user-visible behavior within
> this series, to spare any unwelcome surprise.

Actually it is not really a user-visible change.

TLS::rseq is thread local and any update to it only becomes visible to
user space once the vCPU thread actually returns to user space. Arguably
no guest has legitimate access to the host's vCPU thread's TLS.

You might argue that GDB might look at the thread's TLS::rseq while the
task runs in the vCPU's guest mode. But that's completely irrelevant because
once a task enters the kernel the RSEQ CPU/NODE/MM ids have no meaning
anymore. They are only valid as long as the task runs in user space.
When a task hits a breakpoint GDB can only look at the state _before_
that, and that's all it can see when it looks at the TLS of a thread
which voluntarily went into the kernel via the KVM ioctl.

That update is truly a kernel internal implementation detail and it got
introduced way _after_ the initial RSEQ implementation.

Before 5.9 KVM ignored most of the pending TIF work including
TIF_NOTIFY_RESUME. Once that got fixed it turned out that handling the
other TIF_NOTIFY_RESUME work could result in losing an RSEQ update. To
cure that the rseq handler got pulled into that TIF_NOTIFY_RESUME
demultiplexing function and gained that NULL pointer check inside to
exclude the critical section check.
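
Condensed, the shape which resulted back then is roughly this (sketch
from memory, not the verbatim kernel code):

	void __rseq_handle_notify_resume(struct pt_regs *regs)
	{
		/*
		 * No user registers means the hypervisor ioctl path: there
		 * is no return IP to check a critical section against, so
		 * only the id updates are performed.
		 */
		if (regs && rseq_ip_fixup(regs) < 0)
			goto error;
		if (rseq_update_cpu_node_id(current))
			goto error;
		return;
	error:
		force_sig(SIGSEGV);
	}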

In hindsight RSEQ should have used a separate TIF bit right from the
beginning, but that's water under the bridge...

Thanks,

        tglx




* Re: [patch V2 27/37] rseq: Implement fast path for exit to user
  2025-08-26 15:33   ` Mathieu Desnoyers
@ 2025-09-02 18:31     ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 18:31 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Tue, Aug 26 2025 at 11:33, Mathieu Desnoyers wrote:
> On 2025-08-23 12:40, Thomas Gleixner wrote:
>> +	 * A sane compiler requires four instructions for the nothing to do
>> +	 * case including clearing the events, but your milage might vary.
>
> See my earlier comments about:
>
> - Handling of dynamic numa node id to cpu mapping reconfiguration on
>    powerpc.

Still not relevant :)

> - Validation of the abort handler signature on production kernels.

Brought it back already.


* Re: [patch V2 28/37] rseq: Switch to fast path processing on exit to user
  2025-08-27 13:45     ` Mathieu Desnoyers
@ 2025-09-02 18:36       ` Thomas Gleixner
  0 siblings, 0 replies; 91+ messages in thread
From: Thomas Gleixner @ 2025-09-02 18:36 UTC (permalink / raw)
  To: Mathieu Desnoyers, LKML
  Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
	Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
	Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt

On Wed, Aug 27 2025 at 09:45, Mathieu Desnoyers wrote:
> On 2025-08-26 11:40, Mathieu Desnoyers wrote:
> >>>    RSEQ selftests         Before               After            Reduction
> >>>
> >>>    exit to user:       386281778           387373750
> >>>    signal checks:       35661203                   0                 100%
> >>>    slowpath runs:      140542396  36.38%          100   0.00%        100%
> >>>    fastpath runs:                            9509789   2.51%         N/A
> >>>    id updates:         176203599  45.62%     9087994   2.35%          95%
> >>>    cs checks:          175587856  45.46%     4728394   1.22%          98%
> >>>      cs cleared:       172359544  98.16%     1319307  27.90%          99%
> >>>      cs fixup:           3228312   1.84%     3409087  72.10%
>
> By the way, you should really not be using the entire rseq selftests
> as a representative workload for profiling the kernel rseq implementation.
>
> Those selftests include "loop injection", "yield injection", "kill
> injection" and "sleep injection" within the relevant userspace code
> paths, which really increase the likelihood of hitting stuff like
> "cs fixup" compared to anything that comes close to a realistic
> use-case. This is really useful for testing correctness, but not
> for profiling. For instance, the "loop injection" introduces busy
> loops within rseq critical sections to significantly increase the
> likelihood of hitting a cs fixup.
>
> Those specific selftests are really just "stress-tests" that don't
> represent any relevant workload.

True, they still tell how much useless work the kernel was doing, no?



end of thread, other threads:[~2025-09-02 18:36 UTC | newest]

Thread overview: 91+ messages
2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-08-25 15:39   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 02/37] rseq: Condense the inline stubs Thomas Gleixner
2025-08-25 15:40   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 03/37] resq: Move algorithm comment to top Thomas Gleixner
2025-08-25 15:41   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-08-25 15:43   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 05/37] rseq: Simplify registration Thomas Gleixner
2025-08-25 15:44   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 06/37] rseq: Simplify the event notification Thomas Gleixner
2025-08-25 17:36   ` Mathieu Desnoyers
2025-09-02 13:39     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-08-25 17:54   ` Mathieu Desnoyers
2025-08-25 20:24     ` Sean Christopherson
2025-09-02 15:37       ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-08-25 18:02   ` Mathieu Desnoyers
2025-09-02 13:41     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
2025-08-25 18:11   ` Mathieu Desnoyers
2025-09-02 13:45     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 10/37] entry: Cleanup header Thomas Gleixner
2025-08-25 18:13   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-08-23 16:39 ` [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-08-23 16:39 ` [patch V2 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-08-25 18:14   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-08-25 18:19   ` Mathieu Desnoyers
2025-09-02 13:48     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 15/37] rseq: Record interrupt from user space Thomas Gleixner
2025-08-25 18:29   ` Mathieu Desnoyers
2025-09-02 13:54     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-08-25 18:32   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-08-25 18:34   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-08-25 18:36   ` Mathieu Desnoyers
2025-08-25 20:30   ` Michael Jeanson
2025-09-02 13:56     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-08-25 19:16   ` Mathieu Desnoyers
2025-09-02 15:19     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 20/37] rseq: Replace the debug crud Thomas Gleixner
2025-08-26 14:21   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
2025-08-26 14:23   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-08-26 14:28   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 23/37] rseq: Provide and use rseq_set_uids() Thomas Gleixner
2025-08-26 14:52   ` Mathieu Desnoyers
2025-09-02 14:08     ` Thomas Gleixner
2025-09-02 16:33       ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 24/37] rseq: Seperate the signal delivery path Thomas Gleixner
2025-08-26 15:08   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-08-26 15:12   ` Mathieu Desnoyers
2025-09-02 17:32     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 26/37] rseq: Optimize event setting Thomas Gleixner
2025-08-26 15:26   ` Mathieu Desnoyers
2025-09-02 14:17     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
2025-08-26 15:33   ` Mathieu Desnoyers
2025-09-02 18:31     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
2025-08-26 15:40   ` Mathieu Desnoyers
2025-08-27 13:45     ` Mathieu Desnoyers
2025-09-02 18:36       ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-08-26 15:41   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-08-26 15:45   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
2025-08-23 20:37   ` Arnd Bergmann
2025-08-25 19:33   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 32/37] x86: Use generic TIF bits Thomas Gleixner
2025-08-25 19:34   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 33/37] s390: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 34/37] loongarch: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 35/37] riscv: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-08-25 19:39   ` Mathieu Desnoyers
2025-08-25 20:02   ` Sean Christopherson
2025-09-02 11:03     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner
2025-08-25 19:43   ` Mathieu Desnoyers
2025-08-25 15:10 ` [patch V2 00/37] rseq: Optimize exit to user space Mathieu Desnoyers
