* [patch V2 00/37] rseq: Optimize exit to user space
From: Thomas Gleixner @ 2025-08-23 16:39 UTC
To: LKML
Cc: Jens Axboe, Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney,
Boqun Feng, Paolo Bonzini, Sean Christopherson, Wei Liu,
Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
Palmer Dabbelt, linux-arch, Thomas Bogendoerfer, Michael Ellerman,
Jonas Bonn
This is a follow-up to the initial series, which made a very basic attempt
to sanitize the RSEQ handling in the kernel:
https://lore.kernel.org/all/20250813155941.014821755@linutronix.de
Further analysis turned up more than these initial problems:
1) task::rseq_event_mask is still a bit-field even though the ABI flags
it was meant to support have been deprecated and functionally disabled
three years ago, which makes it pointless (a sketch of the obvious
simplification follows after #3).
2) task::rseq_event_mask accumulates bits unless a critical section is
discovered in the user space rseq memory. This results in pointless
invocations of the rseq user space exit handler even if nothing has
changed. As a matter of correctness these bits have to be clear when
exiting to user space and therefore pristine when coming back into the
kernel. Aside from correctness, this also avoids pointless evaluation
of the user space memory, which is a performance benefit.
3) The evaluation of critical sections does not differentiate between
syscall and interrupt/exception exits. The current implementation
silently tolerates and fixes up critical sections which invoked a
syscall unless CONFIG_DEBUG_RSEQ is enabled.
That's just wrong. If user space does that on a production kernel it
can keep the pieces. The kernel is not there to proliferate mindless
user space programming and let everyone pay the performance penalty.
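To make #1/#2 concrete, here is a minimal, self-contained sketch
(illustrative names only, not the data structures used in this series) of
the difference between an accumulating event mask and a single pending
flag which is consumed exactly once on exit:

#include <stdbool.h>

/* Illustrative stand-in for the rseq relevant part of task_struct */
struct rseq_task_state {
	bool event_pending;	/* replaces the accumulating bitmask */
};

/* Scheduler side: record that something rseq relevant happened */
static inline void rseq_note_event(struct rseq_task_state *s)
{
	s->event_pending = true;
}

/* Exit to user space: consume the event exactly once */
static inline bool rseq_test_and_clear_event(struct rseq_task_state *s)
{
	bool pending = s->event_pending;

	s->event_pending = false;
	return pending;
}

A single boolean is sufficient because the ABI flags which needed the
individual bits are gone, and clearing it on every exit keeps the state
pristine for the next kernel entry.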
Additional findings:
4) The decision to raise the work for exit is more than suboptimal.
Basically every context switch does so if the task has rseq, which is
nowadays likely as glibc makes use of it if available.
The consequence is that a lot of exits have to process RSEQ just for
nothing. The only reasons to actually do the work are:
  - the task was interrupted in user space and schedules, or
  - the CPU or MM CID changes in schedule(), independent of the entry
    mode
That obviously reduces the invocation space significantly (a rough
sketch follows below).
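As a rough sketch of the reduced trigger condition (stand-alone C with
made-up names, not the actual scheduler code):

#include <stdbool.h>

/* Illustrative per-schedule event record, not the series' data structure */
struct rseq_sched_event {
	bool user_irq_entry;		/* entered from user space via irq/exception */
	unsigned int cpu_id, prev_cpu_id;
	int mm_cid, prev_mm_cid;
};

/* Only these cases require rseq work on the way out to user space */
static inline bool rseq_needs_exit_work(const struct rseq_sched_event *e)
{
	/* An interrupted user space task might sit in a critical section */
	if (e->user_irq_entry)
		return true;
	/* Otherwise only a changed CPU or MM CID matters */
	return e->cpu_id != e->prev_cpu_id || e->mm_cid != e->prev_mm_cid;
}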
5) Signal handling does the RSEQ update unconditionally.
That's wrong as the only reason to do so is when the task was
interrupted in user space independent of a schedule event.
The only important task in that case is to handle the critical section
because after switching to the signal frame the original return IP is
no longer available.
The CPU/MM CID values do not need to be updated at that point as they
can change again before the signal delivery goes out to user space.
Again, if the task was in a critical section and issued a syscall then
it can keep the pieces as that's a violation of the ABI contract.
6) CPU and MM CID are updated unconditionally
That's again a pointless exercise when they didn't change. In that case
the only action required is to check the critical section, if and only
if the entry came via an interrupt.
The update can obviously be avoided by caching the values written to user
space and skipping that path if they haven't changed (see the sketch
below).
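A sketch of the caching idea, with illustrative names (the series uses
its own data structures):

#include <stdbool.h>

/* Values as last written to the user space struct rseq (illustrative) */
struct rseq_ids {
	unsigned int cpu_id;
	unsigned int node_id;
	int mm_cid;
};

/* Skip the whole user space update when nothing changed */
static inline bool rseq_ids_changed(const struct rseq_ids *written,
				    const struct rseq_ids *cur)
{
	return cur->cpu_id != written->cpu_id ||
	       cur->node_id != written->node_id ||
	       cur->mm_cid != written->mm_cid;
}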
7) The TIF_NOTIFY_RESUME mechanism is a horrorshow
TIF_NOTIFY_RESUME is a multiplexing TIF bit and needs to invoke the
world and some more. Depending on workloads this can be set by
task_work, security, block and memory management code, all unrelated to
RSEQ, and quite a few of them are likely to cause a reschedule.
But most of them are low frequency.
So doing this work in the loop unconditionally is just waste. The
correct point to do this is at the end of that loop once all other bits
have been processed, because that's the point where the task is
actually going out to user space.
8) #7 caused another subtle work-for-nothing issue
io_uring and hypervisors invoke resume_user_mode_work() with a NULL
pointer for pt_regs, which causes the RSEQ code to skip the critical
section check, but update the CPU ID/MM CID values unconditionally.
For io_uring this invocation is irrelevant because the IO workers can
never go out to user space and therefore do not have RSEQ memory in
the first place. So it's a non-problem in the existing code as
task::rseq is NULL in that case.
Hypervisors are a different story. They need to drain task_work and
other pending items, which are multiplexed by TIF_NOTIFY_RESUME,
before entering guest mode.
The invocation of resume_user_mode_work() clears TIF_NOTIFY_RESUME,
which means if rseq would ignore that case then it could miss a CPU
or MM CID update on the way back to user space.
The handling of that is just a horrible and mindless hack as the event
might be re-raised between the time the ioctl() enters guest mode and
the actual exit to user space.
So the obvious thing is to ignore the regs=NULL call and let the
offending hypervisor code check, when returning from the ioctl(),
whether the event bit is set, and re-raise the notification (a
conceptual sketch follows below).
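Conceptually, the re-raise on return from the ioctl() amounts to the
following sketch; the flags are stand-ins for the real TIF_NOTIFY_RESUME
bit and the rseq event state:

#include <stdbool.h>

/* Stand-ins for the involved task state, purely for illustration */
struct vcpu_task_state {
	bool rseq_event_pending;	/* event recorded while in guest mode */
	bool tif_notify_resume;		/* stand-in for TIF_NOTIFY_RESUME */
};

/*
 * Invoked by the hypervisor ioctl() path after leaving guest mode: if an
 * rseq event was recorded, re-raise the notification so that the regular
 * exit to user space path handles it.
 */
static inline void rseq_virt_exit_to_user(struct vcpu_task_state *t)
{
	if (t->rseq_event_pending)
		t->tif_notify_resume = true;
}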
9) Code efficiency
RSEQ aims to improve performance for user space, but it completely
ignores the fact that this needs to be implemented in a way which
does not significantly impact the performance of the kernel.
So far this did not pop up because only a few people used it, but that
has changed now that glibc uses it widely.
It's not so efficiOS^H^Hent as advertised:
For a full kernel rebuild:
exit to user: 50106428
signal checks: 24703
slowpath runs: 943785 1.88%
id updates: 968488 1.93%
cs checks: 888768 1.77%
cs cleared: 888768 100.00%
cs fixup: 0 0.00%
The cs_cleared/fixup numbers are relative to cs_checks, which means
the kernel does pointless clears even when the field was already clear,
which is the case with glibc: glibc is so far only interested in the
CPU/MM CID values. And no, it's not only a store, it's the whole dance
of Spectre-v1 mitigation plus user access enable/disable (see the
sketch below).
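For reference, even a "simple" ID update to the user space struct rseq
looks roughly like this kernel-style sketch (not the code from this
series, error handling trimmed); user_access_begin() hides the
access_ok()/speculation barrier part plus the user access enable, and
user_access_end() the disable:

/* Kernel-style sketch, not the code from this series */
static int rseq_write_ids(struct rseq __user *urseq, u32 cpu_id, u32 mm_cid)
{
	/* access_ok() check, speculation barrier, user access enable */
	if (!user_access_begin(urseq, sizeof(*urseq)))
		return -EFAULT;
	unsafe_put_user(cpu_id, &urseq->cpu_id_start, efault);
	unsafe_put_user(cpu_id, &urseq->cpu_id, efault);
	unsafe_put_user(mm_cid, &urseq->mm_cid, efault);
	user_access_end();
	return 0;
efault:
	user_access_end();
	return -EFAULT;
}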
I really have to ask why all those people who are deeply caring about
performance and have utilized this stuff for quite some time have not
noticed what a performance sh*tshow this inflicts on the kernel.
Seriously?
Aside from that, did anyone ever look at the resulting assembly code?
While the user space counterpart is carefully crafted in hand-written
assembly for efficiency, the kernel implementation is a complete and
utter nightmare.
C allows writing very efficient code, but the fact that the compiler
requires help from the programmer to turn it into actually efficient
assembly is not a totally new finding.
The absence of properly designed data structures, which would allow
building efficient and comprehensible code, is amazing.
Slapping random storage members into struct task_struct, implementing
a conditional maze around them and then hoping that the compiler will
create sensible code out of it, is wishful thinking at best.
I completely understand that all the sanity checks and loops and hoops
were required to get this actually working. But I'm seriously grumpy
about the fact, that once the PoC code was shoved into the kernel, the
'performance work' was completed and just the next fancy 'performance'
features were piled on top of it.
Once this is addressed the above picture changes completely:
exit to user: 50071399
signal checks: 182
slowpath runs: 8149 0.02%
fastpath runs: 763366 1.52%
id updates: 78419 0.16%
cs checks: 0 0.00%
cs cleared: 0
cs fixup: 0
And according to perf that very same kernel build consistently gets
faster from:
92.500894648 seconds time elapsed (upstream)
to
91.693778124 seconds time elapsed
Not a lot but not in the noise either, i.e. >= 1%
For a 1e9 gettid() loop this results in going from:
49.782703878 seconds time elapsed (upstream)
to
49.327360851 seconds time elapsed
Not a lot either, but consistently >=1%, like the above.
For actual rseq critical section usage this makes even more of a
difference. Aside from avoiding pointless work in the kernel, it also
does not abort critical sections when not necessary, which improves
user space performance too. The kernel selftests magically complete
six seconds faster, which is again not a lot compared to a total run
time of ~540s. But those tests do not really reflect real workloads;
they emphasize stress testing the kernel implementation, which is
perfectly fine. But still, the reduced amount of pointless kernel work
is what makes the difference:
Upstream:
exit to user: 736568321
signal checks: 62002182
slowpath runs: 358362121 57.07%
fastpath runs: 0 0.00%
id updates: 358362129 57.07%
cs checks: 358362110 57.07%
cs cleared: 347611246 97.01%
cs fixup: 10750863 2.99%
Upstream + simple obvious fixes:
exit to user: 736462043
signal checks: 61902154
slowpath runs: 223394362 30.33%
fastpath runs: 0 0.00%
id updates: 285296516 38.74%
cs checks: 284430371 38.62%
cs cleared: 277110571 97.43%
cs fixup: 7319800 2.57%
Fully patched:
exit to user: 736477630
signal checks: 0
slowpath runs: 1367 0.00%
fastpath runs: 114230240 15.51%
id updates: 106498492 14.46%
cs checks: 7665511 1.04%
cs cleared: 2092386 27.30%
cs fixup: 5573118 72.70%
Perf confirms:
46,242,994,988,432 cycles
8,390,996,405,969 instructions
1949.923762000 seconds user
16111.947776000 seconds sys
versus:
44,910,362,743,328 cycles -3 %
8,305,138,767,914 instructions -1 %
1985.342453000 seconds user +2 %
15653.639218000 seconds sys -3 %
Running the same with only looking at the kernel counts:
1) Upstream:
39,530,448,181,974 cycles:khH
2,872,985,410,904 instructions:khH
640,968,094,639 branches:khH
2) Upstream + simple obvious fixes:
-> #1
39,005,607,758,104 cycles:khH -1.5 %
2,853,676,629,923 instructions:khH -0.8 %
635,538,256,875 branches:khH -1.0 %
3) Fully patched:
-> #2 -> #1
38,786,867,052,870 cycles:khH -0.6 % -2.3 %
2,784,773,121,491 instructions:khH -2.5 % -3.1 %
618,677,047,160 branches:khH -2.7 % -3.6 %
Looking at the kernel build that way:
1) Upstream:
474,191,665,063 cycles:khH
164,176,072,767 instructions:khH
33,037,122,843 branches:khH
2) Upstream + simple obvious fixes:
-> #1
473,423,488,450 cycles:khH ~0.0 %
162,749,873,914 instructions:khH -1.0 %
32,264,397,605 branches:khH -2.5 %
3) Fully patched:
-> #2 -> #1
468,658,348,098 cycles:khH -1.1 % -1.2 %
160,941,028,283 instructions:khH -1.2 % -2.0 %
31,893,296,267 branches:khH -1.5 % -3.7 %
That's pretty much in line with the 2-3% regressions observed by Jens.
TBH, I absolutely do not care about the way performance is evaluated
by the entities who pushed for this, but I very much care about the
general impact of it.
It seems to me the performance evaluation by the $corp power users of
RSEQ stopped at the point where a significant improvement in their
user space was observed. It's interesting that the resulting overhead
in the kernel seems not to be relevant for the overall performance
evaluation. That's left to others to mop up.
The tragedy of the commons in action.
TBH, what surprised me most when looking into this in detail was the
large amount of low-hanging fruit which was sitting there in plain sight.
That said, this series addresses the overall problems by:
1) Limiting the RSEQ work to the actual conditions where it is
required. The full benefit is only available for architectures using
the generic entry infrastructure. All others get at least the basic
improvements.
2) Re-implementing the whole user space handling based on proper data
structures and by actually looking at the impact it creates in the
fast path.
3) Moving the actual handling of RSEQ out to the latest point in the exit
path, where possible. This is fully inlined into the fast path to keep
the impact confined.
The initial attempt to make it completely independent of TIF bits and
just handle it with a quick check unconditionally on exit to user
space turned out to be not feasible. On workloads which are doing a
lot of quick syscalls the extra four instructions add up
significantly.
So instead I ended up doing it at the end of the exit-to-user TIF
work loop, once all other TIF bits have been processed. At this
point interrupts are disabled and there is no way that the state
can change before the task goes out to user space for real
(schematic sketch below).
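Schematically (heavily simplified, not the literal generic entry code,
helper names are illustrative):

/*
 * Heavily simplified sketch of the exit-to-user TIF work loop, only to
 * show the placement. Helper names are illustrative.
 */
static void exit_to_user_mode_sketch(struct pt_regs *regs, unsigned long ti_work)
{
	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		local_irq_enable();
		/* signals, need_resched, notify_resume, ... */
		ti_work = handle_ti_work(regs, ti_work);
		local_irq_disable();
	}
	/*
	 * All other TIF work is done and interrupts stay disabled until
	 * the task actually returns to user space: handle rseq exactly once.
	 */
	rseq_exit_to_user_mode(regs);
}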
Versus the limitations of #1 and #3:
I wasted several days of my so copious time to figure out how to not
break all the architectures, which still insist on benefiting from core
code improvements by pulling everything and the world into their
architecture specific hackery.
It's been more than five years since the generic entry code
infrastructure was introduced, for the very reason of lowering the
burden for core code developers and maintainers and sharing the common
functionality across the architecture zoo.
Aside from the initial x86 move, which started this effort, only three
architectures actually made the effort to utilize it. Two of them were
new ones, which were asked to use it right away.
The only pre-existing one which has converted over since then is S390,
and I'm truly grateful that they significantly improved the generic
infrastructure in the process.
On ARM[64] there are at least serious efforts underway to move their
code over.
Does everybody else think that core code improvements come for free and
the architecture specific hackery does not put any burden on others?
Here is the hall of fame as far as RSEQ goes:
arch/mips/Kconfig: select HAVE_RSEQ
arch/openrisc/Kconfig: select HAVE_RSEQ
arch/powerpc/Kconfig: select HAVE_RSEQ
Two of them are barely maintained and shouldn't have RSEQ in the first
place....
I was very forthcoming in the past to accommodate that and went out of
my way to enable stuff for everyone, but I'm drawing a line now.
All extra improvements which are enabled by #1/#3 depend hard on the
generic infrastructure.
I know that it's quite some effort to move an architecture over, but
it's a one time effort and investment into the future. This 'my
architecture is special for no reason' mindset is not sustainable and
just pushes the burden on others. There is zero justification for this.
Not converging on common infrastructure is not only a burden for the
core people, it's also a missed opportunity for the architectures to
lower their own burden of chasing core improvements and implementing
them each with a different set of bugs.
This is not the first time this happens. There are enough other examples
where it took ages to consolidate on common code. This just accumulates
technical debt and needless complexity, which everyone suffers from.
I have happily converted the four architectures which use the generic
entry code to utilize a shared generic TIF bit header, so that adding
the TIF_RSEQ bit becomes a two-line change and all four get the
benefit immediately. That was more consistent than just adding the bits
for each of them, and it makes further maintenance of core
infrastructure simpler for all sides (illustrative sketch below). See?
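Illustratively, and not necessarily matching the actual header in this
series, the shared header arrangement looks like:

/* include/asm-generic/thread_info_tif.h -- illustrative excerpt only */
#define TIF_NOTIFY_RESUME	0	/* callback before returning to user */
#define TIF_SIGPENDING		1	/* signal pending */
#define TIF_NEED_RESCHED	2	/* rescheduling necessary */
#define TIF_RSEQ		3	/* rseq needs attention */
#define TIF_ARCH_BASE		8	/* first bit available to architectures */

/* arch/<arch>/include/asm/thread_info.h */
#include <asm-generic/thread_info_tif.h>
#define TIF_MY_ARCH_SPECIAL	(TIF_ARCH_BASE + 0)	/* arch specific bits follow */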
That said, as with the first version, these patches have a pile of dependencies:
The series depends on the separately posted rseq bugfix:
https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
and the uaccess generic helper series:
https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
and a related futex fix in
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
The combination of all of them and some other related fixes (rseq
selftests) is available here:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base
For your convenience all of it is also available as a conglomerate from
git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
The diffstat looks large, but a lot of that is due to extensive comments
and the extra hackery to accommodate random architecture code.
I have not yet gotten around to testing this on anything other than x86.
Help with that would be truly appreciated.
Thanks,
tglx
"Additional problems are the offspring of poor solutions." - Mark Twain
---
Documentation/admin-guide/kernel-parameters.txt | 4
arch/Kconfig | 4
arch/loongarch/Kconfig | 1
arch/loongarch/include/asm/thread_info.h | 76 +-
arch/riscv/Kconfig | 1
arch/riscv/include/asm/thread_info.h | 29 -
arch/s390/Kconfig | 1
arch/s390/include/asm/thread_info.h | 44 -
arch/x86/Kconfig | 1
arch/x86/entry/syscall_32.c | 3
arch/x86/include/asm/thread_info.h | 74 +-
b/include/asm-generic/thread_info_tif.h | 51 ++
b/include/linux/rseq_entry.h | 601 +++++++++++++++++++++++
b/include/linux/rseq_types.h | 72 ++
drivers/hv/mshv_root_main.c | 2
fs/binfmt_elf.c | 2
fs/exec.c | 2
include/linux/entry-common.h | 38 -
include/linux/irq-entry-common.h | 68 ++
include/linux/mm.h | 25
include/linux/resume_user_mode.h | 2
include/linux/rseq.h | 216 +++++---
include/linux/sched.h | 50 +
include/linux/thread_info.h | 5
include/trace/events/rseq.h | 4
include/uapi/linux/rseq.h | 21
init/Kconfig | 28 +
kernel/entry/common.c | 37 -
kernel/entry/syscall-common.c | 8
kernel/rseq.c | 610 ++++++++++--------------
kernel/sched/core.c | 10
kernel/sched/membarrier.c | 8
kernel/sched/sched.h | 5
virt/kvm/kvm_main.c | 3
34 files changed, 1406 insertions(+), 700 deletions(-)
* Re: [patch V2 00/37] rseq: Optimize exit to user space
From: Mathieu Desnoyers @ 2025-08-25 15:10 UTC
To: Thomas Gleixner, LKML
Cc: Jens Axboe, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
Paolo Bonzini, Sean Christopherson, Wei Liu, Dexuan Cui, x86,
Arnd Bergmann, Heiko Carstens, Christian Borntraeger,
Sven Schnelle, Huacai Chen, Paul Walmsley, Palmer Dabbelt,
linux-arch, Thomas Bogendoerfer, Michael Ellerman, Jonas Bonn,
Florian Weimer
On 2025-08-23 12:39, Thomas Gleixner wrote:
> This is a follow up on the initial series, which did a very basic attempt
> to sanitize the RSEQ handling in the kernel:
>
> https://lore.kernel.org/all/20250813155941.014821755@linutronix.de
>
> Further analysis turned up more than these initial problems:
Thanks Thomas for looking into this. Sorry for the delayed reply,
I was on vacation.
>
> 1) task::rseq_event_mask is a pointless bit-field despite the fact that
> the ABI flags it was meant to support have been deprecated and
> functionally disabled three years ago.
Yes, this should be converted to a simple boolean now.
>
> 2) task::rseq_event_mask is accumulating bits unless there is a critical
> section discovered in the user space rseq memory. This results in
> pointless invocations of the rseq user space exit handler even if
> there had nothing changed. As a matter of correctness these bits have
> to be clear when exiting to user space and therefore pristine when
> coming back into the kernel. Aside of correctness, this also avoids
> pointless evaluation of the user space memory, which is a performance
> benefit.
Thanks for catching this, that's indeed not the intended behavior.
>
> 3) The evaluation of critical sections does not differentiate between
> syscall and interrupt/exception exits. The current implementation
> silently tolerates and fixes up critical sections which invoked a
> syscall unless CONFIG_DEBUG_RSEQ is enabled.
>
> That's just wrong. If user space does that on a production kernel it
> can keep the pieces. The kernel is not there to proliferate mindless
> user space programming and letting everyone pay the performance
> penalty.
Agreed. There is no point in supporting a userspace behavior on
production kernels that is prevented on debug kernels, especially
if this adds overhead to production kernels.
>
> Additional findings:
>
> 4) The decision to raise the work for exit is more than suboptimal.
Terminology-wise, just making sure we are on the same page: here
you are talking about "exit to usermode", and *not* the exit(2) system
call. I'm clarifying because I know mm people care about fork/clone/exit
syscall overhead as well.
> Basically every context switch does so if the task has rseq, which is
> nowadays likely as glibc makes use of it if available.
Correct.
>
> The consequence is that a lot of exits have to process RSEQ just for
> nothing. The only reasons to do so are:
>
> the task was interrupted in user space and schedules
interrupted or takes a trap/exception (I will assume you consider traps
and exceptions as interrupt classes within this discussion).
>
> or
>
> the CPU or MM CID changes in schedule() independent of the entry
> mode
or the numa node id, which is typically tied to the CPU number, except
on powerpc AFAIR where numa node id to cpu mapping can be reconfigured
dynamically.
>
> That reduces the invocation space obviously significantly.
Yes.
>
> 5) Signal handling does the RSEQ update unconditionally.
>
> That's wrong as the only reason to do so is when the task was
> interrupted in user space independent of a schedule event.
>
> The only important task in that case is to handle the critical section
> because after switching to the signal frame the original return IP is
> not longer available.
>
> The CPU/MM CID values do not need to be updated at that point as they
> can change again before the signal delivery goes out to user space.
>
> Again, if the task was in a critical section and issued a syscall then
> it can keep the pieces as that's a violation of the ABI contract.
The key thing here is that the cpu/mm cid/numa node id fields are
updated before returning to the userspace signal handler, and that an
rseq critical section within a signal handler works.
From your description here we can indeed do less work on signal
delivery and still meet those requirements. That's good.
>
> 6) CPU and MM CID are updated unconditionally
>
> That's again a pointless exercise when they didn't change. Then the
> only action required is to check the critical section if and only if
> the entry came via an interrupt.
>
> That can obviously be avoided by caching the values written to user
> space and avoiding that path if they haven't changed
This is a good performance improvement, I agree. Note that this
will likely break Google's tcmalloc hack of re-using the cpu_id_start
field as a way to get notified about preemption. But they were warned
not to do that, and it breaks the documented userspace ABI contract,
so they will need to adapt. This situation is documented here:
commit 7d5265ffcd8b ("rseq: Validate read-only fields under DEBUG_RSEQ config")
>
> 7) The TIF_NOTIFY_RESUME mechanism is a horrorshow
>
> TIF_NOTIFY_RESUME is a multiplexing TIF bit and needs to invoke the
> world and some more. Depending on workloads this can be set by
> task_work, security, block and memory management. All unrelated to
> RSEQ and quite some of them are likely to cause a reschedule.
> But most of them are low frequency.
>
> So doing this work in the loop unconditionally is just waste. The
> correct point to do this is at the end of that loop once all other bits
> have been processed, because that's the point where the task is
> actually going out to user space.
Note that the rseq work can trigger a page fault and a reschedule, which
makes me wonder whether moving this work out of the work loop is OK.
I'll need to have a closer look at the relevant patches to understand
this better.
I suspect that the underlying problem here is not that the notify resume
adds too much overhead, but rather that we're setting the
TIF_NOTIFY_RESUME bit way more often than is good for us.
Initially I focused on minimizing the scheduler overhead, at the
expense of doing more work than strictly needed on exit to usermode.
But of course reality is not as simple, and we need to find the right
balance.
If your goal is to have fewer calls to the resume notifier triggered
by rseq, we could perhaps change the way rseq_set_notify_resume()
is implemented to make it conditional on one of (sketched below):
- migration,
- preemption AND the current->rseq->rseq_cs userspace value is set OR
  would require a page fault to be read,
- preemption AND mm_cid changes.
I'm not sure what to do about the powerpc numa node id reconfiguration
fringe use-case though.
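Expressed as a condition, with every helper name made up for
illustration, the proposal would be roughly:

/* Sketch of the proposed narrowing; every helper here is illustrative */
static inline bool rseq_want_notify_resume(struct task_struct *t,
					   bool migrated, bool preempted)
{
	if (migrated)
		return true;
	if (!preempted)
		return false;
	/* rseq_cs set (or unreadable without a fault), or mm_cid changed */
	return rseq_cs_set_or_unreadable(t) || mm_cid_changed(t);
}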
>
> 8) #7 caused another subtle work for nothing issue
>
> IO/URING and hypervisors invoke resume_user_mode_work() with a NULL
> pointer for pt_regs, which causes the RSEQ code to ignore the critical
> section check, but updating the CPU ID/ MM CID values unconditionally.
>
> For IO/URING this invocation is irrelevant because the IO workers can
> never go out to user space and therefore do not have RSEQ memory in
> the first place. So it's a non problem in the existing code as
> task::rseq is NULL in that case.
OK
>
> Hypervisors are a different story. They need to drain task_work and
> other pending items, which are multiplexed by TIF_NOTIFY_RESUME,
> before entering guest mode.
>
> The invocation of resume_user_mode_work() clears TIF_NOTIFY_RESUME,
> which means if rseq would ignore that case then it could miss a CPU
> or MM CID update on the way back to user space.
>
> The handling of that is just a horrible and mindless hack as the event
> might be re-raised between the time the ioctl() enters guest mode and
> the actual exit to user space.
>
> So the obvious thing is to ignore the regs=NULL call and let the
> offending hypervisor calls check when returning from the ioctl()
> whether the event bit is set and re-raise the notification again.
Is this a useless-work issue or a correctness issue?
Also, AFAIU, your proposed approach will ensure that the rseq fields
are up to date when returning to userspace from the ioctl in the host
process, but there are no guarantees about having up to date fields
when running the guest VM. I'm fine with this, but I think it needs to
be clearly spelled out.
>
> 9) Code efficiency
>
> RSEQ aims to improve performance for user space, but it completely
> ignores the fact, that this needs to be implemented in a way which
> does not impact the performance of the kernel significantly.
>
> So far this did not pop up as just a few people used it, but that has
> changed because glibc started to use it widely.
The history of incremental rseq upstreaming into the kernel and then glibc can
help understand how this situation came to be:
Linux commit d7822b1e24f2 ("rseq: Introduce restartable sequences system call")
(2018)
This includes microbenchmarks of specific use-cases, which clearly
stress neither the scheduler nor exit to usermode.
It includes jemalloc memory allocator (Facebook) latency benchmarks,
which depend heavily on the userspace fast-paths.
It includes hackbench results, which cover the scheduling and return
to usermode overhead. Given that there was no libc integration
back then, the hackbench results do not include any rseq-triggered
resume notifier work.
Based on the rseq use at that point in time, the overhead of
rseq was acceptable for it to be merged.
glibc commit 95e114a0919d ("nptl: Add rseq registration")
(2021)
Florian Weimer added this feature to glibc. At this point
the overhead you observe now should have become visible.
This integration and performance testing was done by the
glibc maintainers and distribution vendors who packaged this
libc.
[...]
> That said, this series addresses the overall problems by:
>
> 1) Limiting the RSEQ work to the actual conditions where it is
> required. The full benefit is only available for architectures using
> the generic entry infrastructure. All others get at least the basic
> improvements.
>
> 2) Re-implementing the whole user space handling based on proper data
> structures and by actually looking at the impact it creates in the
> fast path.
>
> 3) Moving the actual handling of RSEQ out to the latest point in the exit
> path, where possible. This is fully inlined into the fast path to keep
> the impact confined.
>
> The initial attempt to make it completely independent of TIF bits and
> just handle it with a quick check unconditionally on exit to user
> space turned out to be not feasible. On workloads which are doing a
> lot of quick syscalls the extra four instructions add up
> significantly.
>
> So instead I ended up doing it at the end of the exit to user TIF
> work loop once when all other TIF bits have been processed. At this
> point interrupts are disabled and there is no way that the state
> can change before the task goes out to user space for real.
I'll need to review what happens in case rseq needs to take a page fault
after your proposed changes. More discussion to come around the specific
patch in the series.
>
> Versus the limitations of #1 and #3:
>
> I wasted several days of my so copious time to figure out how to not
> break all the architectures, which still insist on benefiting from core
> code improvements by pulling everything and the world into their
> architecture specific hackery.
>
> It's more than five years now that the generic entry code infrastructure
> has been introduced for the very reason to lower the burden for core
> code developers and maintainers and to share the common functionality
> across the architecture zoo.
>
> Aside of the initial x86 move, which started this effort, there are only
> three architectures who actually made the effort to utilize this. Two of
> them were new ones, which were asked to use it right away.
>
> The only existing one, which converted over since then is S390 and I'm
> truly grateful that they improved the generic infrastructure in that
> process significantly.
>
> On ARM[64] there are at least serious efforts underway to move their
> code over.
>
> Does everybody else think that core code improvements come for free and
> the architecture specific hackery does not put any burden on others?
>
> Here is the hall of fame as far as RSEQ goes:
>
> arch/mips/Kconfig: select HAVE_RSEQ
> arch/openrisc/Kconfig: select HAVE_RSEQ
> arch/powerpc/Kconfig: select HAVE_RSEQ
>
> Two of them are barely maintained and shouldn't have RSEQ in the first
> place....
>
> While I was very forthcoming in the past to accomodate for that and went
> out of my way to enable stuff for everyone, but I'm drawing a line now.
>
> All extra improvements which are enabled by #1/#3 depend hard on the
> generic infrastructure.
>
> I know that it's quite some effort to move an architecture over, but
> it's a one time effort and investment into the future. This 'my
> architecture is special for no reason' mindset is not sustainable and
> just pushes the burden on others. There is zero justification for this.
>
> Not converging on common infrastructure is not only a burden for the
> core people, it's also a missed opportunity for the architectures to
> lower their own burden of chasing core improvements and implementing
> them each with a different set of bugs.
>
> This is not the first time this happens. There are enough other examples
> where it took ages to consolidate on common code. This just accumulates
> technical debt and needless complexity, which everyone suffers from.
>
> I have happily converted the four architectures, which use the generic
> entry code over, to utilize a shared generic TIF bit header so that
> adding the TIF_RSEQ bit becomes a two line change and all four get the
> benefit immediately. That was more consequent than just adding the bits
> for each of them and it makes further maintainence of core
> infrastructure simpler for all sides. See?
Can we make RSEQ depend on CONFIG_GENERIC_ENTRY (if that is the correct
option), as sketched below?
This would force additional architectures to move to the generic entry
infrastructure if they want to benefit from rseq.
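In Kconfig terms that would amount to something along these lines (a
sketch only; the existing RSEQ entry already depends on HAVE_RSEQ):

config RSEQ
	bool "Enable rseq() system call" if EXPERT
	default y
	depends on HAVE_RSEQ && GENERIC_ENTRY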
I'll start looking into the series now.
Thanks,
Mathieu
>
>
> That said, as for the first version these patches have a pile of dependencies:
>
> The series depends on the separately posted rseq bugfix:
>
> https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
>
> and the uaccess generic helper series:
>
> https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
>
> and a related futex fix in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
>
> The combination of all of them and some other related fixes (rseq
> selftests) are available here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base
>
> For your convenience all of it is also available as a conglomerate from
> git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
>
> The diffstat looks large, but a lot of that is due to extensive comments
> and the extra hackery to accommodate for random architecture code.
>
> I did not yet come around to test this on anything else than x86. Help with
> that would be truly appreciated.
>
> Thanks,
>
> tglx
>
> "Additional problems are the offspring of poor solutions." - Mark Twain
>
> ---
> Documentation/admin-guide/kernel-parameters.txt | 4
> arch/Kconfig | 4
> arch/loongarch/Kconfig | 1
> arch/loongarch/include/asm/thread_info.h | 76 +-
> arch/riscv/Kconfig | 1
> arch/riscv/include/asm/thread_info.h | 29 -
> arch/s390/Kconfig | 1
> arch/s390/include/asm/thread_info.h | 44 -
> arch/x86/Kconfig | 1
> arch/x86/entry/syscall_32.c | 3
> arch/x86/include/asm/thread_info.h | 74 +-
> b/include/asm-generic/thread_info_tif.h | 51 ++
> b/include/linux/rseq_entry.h | 601 +++++++++++++++++++++++
> b/include/linux/rseq_types.h | 72 ++
> drivers/hv/mshv_root_main.c | 2
> fs/binfmt_elf.c | 2
> fs/exec.c | 2
> include/linux/entry-common.h | 38 -
> include/linux/irq-entry-common.h | 68 ++
> include/linux/mm.h | 25
> include/linux/resume_user_mode.h | 2
> include/linux/rseq.h | 216 +++++---
> include/linux/sched.h | 50 +
> include/linux/thread_info.h | 5
> include/trace/events/rseq.h | 4
> include/uapi/linux/rseq.h | 21
> init/Kconfig | 28 +
> kernel/entry/common.c | 37 -
> kernel/entry/syscall-common.c | 8
> kernel/rseq.c | 610 ++++++++++--------------
> kernel/sched/core.c | 10
> kernel/sched/membarrier.c | 8
> kernel/sched/sched.h | 5
> virt/kvm/kvm_main.c | 3
> 34 files changed, 1406 insertions(+), 700 deletions(-)
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com