* [patch V3 00/37] rseq: Optimize exit to user space
From: Thomas Gleixner @ 2025-09-04 22:20 UTC
  To: LKML
  Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Paolo Bonzini, Sean Christopherson,
	Wei Liu, Dexuan Cui, x86, Arnd Bergmann, Heiko Carstens,
	Christian Borntraeger, Sven Schnelle, Huacai Chen, Paul Walmsley,
	Palmer Dabbelt

This is a follow-up to the V2 series, which can be found here:

   https://lore.kernel.org/all/20250823161326.635281786@linutronix.de

The V2 posting contains a detailed list of the addressed problems. TLDR:

    - A significant amount of pointless RSEQ operations on exit to user
      space, which people have reported as having a measurable impact
      since glibc switched to using RSEQ

    - Suboptimal hotpath handling both in the scheduler and on exit to user
      space.

This series addresses these issues by:

  1) Limiting the RSEQ work to the actual conditions where it is
     required. The full benefit is only available for architectures using
     the generic entry infrastructure. All others get at least the basic
     improvements.

  2) Re-implementing the whole user space handling based on proper data
     structures and by actually looking at the impact it creates in the
     fast path.

  3) Moving the actual handling of RSEQ out to the latest point in the exit
     path, where possible. This is fully inlined into the fast path to keep
     the impact confined.
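
To illustrate point 1, here is a minimal user-space sketch of the idea of
doing RSEQ work only when an event is pending. It is not kernel code. The
field names sched_switch and ids_changed follow the kernel-doc of struct
rseq_event in the series; the layout, the helper names and the counter are
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical user-space model, not the kernel's struct rseq_event.
 * The union lets both flags be tested with a single load and cleared
 * with a single store.
 */
struct event_model {
	union {
		uint16_t events;		/* both flags as one value */
		struct {
			uint8_t sched_switch;	/* task was scheduled */
			uint8_t ids_changed;	/* IDs need to be updated */
		};
	};
};

/* Counts how often the expensive update would have run (illustration only) */
static unsigned int slow_updates;

/* Model of the scheduler noting a context switch */
static void model_sched_switch(struct event_model *ev, bool ids_changed)
{
	assert(ev);
	ev->sched_switch = 1;
	if (ids_changed)
		ev->ids_changed = 1;
}

/* Model of exit to user space: returns true if the update ran */
static bool model_exit_to_user(struct event_model *ev)
{
	assert(ev);
	if (!ev->events)	/* fast path: no event, no work */
		return false;

	slow_updates++;		/* stands in for the real update work */
	ev->events = 0;		/* clear all flags at once */
	return true;
}
```

In this model an exit without a preceding model_sched_switch() call costs
one load and one branch, which is the property the series aims for in the
real exit path.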

Changes vs. V2:

  - Bring back the ROP protection - Mathieu

  - Document the guest-visible change when host TLS is mapped into the guest - Sean

  - Document the TIF_RSEQ optimization for virt - Sean

  - Fix the __setup() return value - Michael

  - Add the missing include in HV - 0-day

  - Rename *uids to *ids - Mathieu

  - Spelling and grammar fixes in comments and change logs - Mathieu

  - Picked up tags where appropriate
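
The restored ROP protection requires that the abort IP lies in user space
and that the four bytes preceding it contain the registered signature. The
following user-space sketch models only that check. The struct, the helper
name and the flat byte array standing in for the user address space are
hypothetical; the kernel does this with unsafe_get_user() inside a user
access section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a task's user address space */
struct user_mem_model {
	const uint8_t *base;	/* bytes of the modelled address space */
	uint64_t size;		/* plays the role of TASK_SIZE */
};

/*
 * Returns true if @abort_ip is inside the modelled address space and the
 * four bytes in front of it match @sig.
 */
static bool abort_ip_is_valid(const struct user_mem_model *mem,
			      uint64_t abort_ip, uint32_t sig)
{
	uint32_t found;

	assert(mem && mem->base);

	/* Out of range, or too low to have a signature in front of it */
	if (abort_ip >= mem->size || abort_ip < sizeof(found))
		return false;

	/* Read the four bytes preceding the abort IP */
	memcpy(&found, mem->base + (abort_ip - sizeof(found)), sizeof(found));
	return found == sig;
}
```

An abort IP pointing at an arbitrary library function or gadget fails this
check unless the signature happens to sit directly in front of it, which
is the point of the protection.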

Delta patch to V2 is below.

As with the previous version, these patches have a pile of dependencies:

The series depends on the separately posted rseq bugfix:

   https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/

and the uaccess generic helper series:

   https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/

and a related futex fix in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent

The combination of all of them and some other related fixes (rseq
selftests) is available here:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/base

For your convenience all of it is also available as a conglomerate from
git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

Thanks,

	tglx
---
 Documentation/admin-guide/kernel-parameters.txt |    4 
 arch/Kconfig                                    |    4 
 arch/loongarch/Kconfig                          |    1 
 arch/loongarch/include/asm/thread_info.h        |   76 +-
 arch/riscv/Kconfig                              |    1 
 arch/riscv/include/asm/thread_info.h            |   31 -
 arch/s390/Kconfig                               |    1 
 arch/s390/include/asm/thread_info.h             |   44 -
 arch/x86/Kconfig                                |    1 
 arch/x86/entry/syscall_32.c                     |    3 
 arch/x86/include/asm/thread_info.h              |   76 +-
 drivers/hv/mshv_root_main.c                     |    3 
 fs/binfmt_elf.c                                 |    2 
 fs/exec.c                                       |    2 
 include/asm-generic/thread_info_tif.h           |   51 +
 include/linux/entry-common.h                    |   38 -
 include/linux/irq-entry-common.h                |   68 ++
 include/linux/mm.h                              |   25 
 include/linux/resume_user_mode.h                |    2 
 include/linux/rseq.h                            |  223 +++++---
 include/linux/rseq_entry.h                      |  621 ++++++++++++++++++++++++
 include/linux/rseq_types.h                      |   72 ++
 include/linux/sched.h                           |   50 +
 include/linux/thread_info.h                     |    5 
 include/trace/events/rseq.h                     |    4 
 include/uapi/linux/rseq.h                       |   21 
 init/Kconfig                                    |   28 +
 kernel/entry/common.c                           |   37 -
 kernel/entry/syscall-common.c                   |    8 
 kernel/rseq.c                                   |  604 +++++++++--------------
 kernel/sched/core.c                             |   10 
 kernel/sched/membarrier.c                       |    8 
 kernel/sched/sched.h                            |    5 
 virt/kvm/kvm_main.c                             |    3 
 34 files changed, 1433 insertions(+), 699 deletions(-)
---
Delta to V2:

--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -28,6 +28,7 @@
 #include <linux/crash_dump.h>
 #include <linux/panic_notifier.h>
 #include <linux/vmalloc.h>
+#include <linux/rseq.h>
 
 #include "mshv_eventfd.h"
 #include "mshv.h"
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -241,7 +241,7 @@ static __always_inline void __exit_to_us
  * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
  *
- * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
  * syscalls and interrupts.
  */
 static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
@@ -255,7 +255,7 @@ static __always_inline void syscall_exit
  * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
  *
- * Wrapper around __exit_to_user_mode_prepare() to seperate the exit work for
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
  * syscalls and interrupts.
  */
 static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -112,17 +112,24 @@ static inline void rseq_force_update(voi
 
 /*
  * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
  *
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, __rseq_handle_slowpath() does
+ * nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
  */
 static inline void rseq_virt_userspace_exit(void)
 {
-	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) && current->rseq_event.sched_switch)
+	/*
+	 * The generic optimization for deferring RSEQ updates until the next
+	 * exit relies on having a dedicated TIF_RSEQ.
+	 */
+	if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+	    current->rseq_event.sched_switch)
 		rseq_raise_notify_resume(current);
 }
 
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -53,10 +53,8 @@ void __rseq_trace_ip_fixup(unsigned long
 
 static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
 {
-	if (tracepoint_enabled(rseq_update)) {
-		if (ids)
-			__rseq_trace_update(t);
-	}
+	if (tracepoint_enabled(rseq_update) && ids)
+		__rseq_trace_update(t);
 }
 
 static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
@@ -81,7 +79,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
 #endif
 
 bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
-bool rseq_debug_validate_uids(struct task_struct *t);
+bool rseq_debug_validate_ids(struct task_struct *t);
 
 static __always_inline void rseq_note_user_irq_entry(void)
 {
@@ -209,14 +207,20 @@ bool rseq_debug_update_user_cs(struct ta
  * debugging is enabled, but don't do that on the first exit to user
  * space. In that case cpu_cid is ~0. See fork/execve.
  */
-bool rseq_debug_validate_uids(struct task_struct *t)
+bool rseq_debug_validate_ids(struct task_struct *t)
 {
-	u32 cpu_id, uval, node_id = cpu_to_node(task_cpu(t));
 	struct rseq __user *rseq = t->rseq;
+	u32 cpu_id, uval, node_id;
 
 	if (t->rseq_ids.cpu_cid == ~0)
 		return true;
 
+	/*
+	 * Look it up outside of the user access section as cpu_to_node()
+	 * can end up in debug code.
+	 */
+	node_id = cpu_to_node(t->rseq_ids.cpu_id);
+
 	if (!user_read_masked_begin(rseq))
 		return false;
 
@@ -252,11 +256,13 @@ rseq_update_user_cs(struct task_struct *
 {
 	struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
 	unsigned long ip = instruction_pointer(regs);
+	unsigned long tasksize = TASK_SIZE;
 	u64 start_ip, abort_ip, offset;
+	u32 usig, __user *uc_sig;
 
 	rseq_stat_inc(rseq_stats.cs);
 
-	if (unlikely(csaddr >= TASK_SIZE)) {
+	if (unlikely(csaddr >= tasksize)) {
 		t->rseq_event.fatal = true;
 		return false;
 	}
@@ -281,15 +287,28 @@ rseq_update_user_cs(struct task_struct *
 		goto clear;
 
 	/*
-	 * Force it to be in user space as x86 IRET would happily return to
-	 * the kernel. Can't use TASK_SIZE as a mask because that's not
-	 * necessarily a power of two. Just make sure it's in the user
-	 * address space. Let the pagefault handler sort it out.
+	 * Two requirements for @abort_ip:
+	 *   - Must be in user space as x86 IRET would happily return to
+	 *     the kernel.
+	 *   - The four bytes preceding the instruction at @abort_ip must
+	 *     contain the signature.
+	 *
+	 * The latter protects against the following attack vector:
 	 *
-	 * Use LONG_MAX and not LLONG_MAX to keep it correct for 32 and 64
-	 * bit architectures.
+	 * An attacker with limited abilities to write, creates a critical
+	 * section descriptor, sets the abort IP to a library function or
+	 * some other ROP gadget and stores the address of the descriptor
+	 * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+	 * protection.
 	 */
-	abort_ip &= (u64)LONG_MAX;
+	if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+		goto die;
+
+	/* The address is guaranteed to be >= 0 and < TASK_SIZE */
+	uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+	unsafe_get_user(usig, uc_sig, fail);
+	if (unlikely(usig != t->rseq_sig))
+		goto die;
 
 	/* Invalidate the critical section */
 	unsafe_put_user(0ULL, &t->rseq->rseq_cs, fail);
@@ -306,7 +325,8 @@ rseq_update_user_cs(struct task_struct *
 	user_access_end();
 	rseq_stat_inc(rseq_stats.clear);
 	return true;
-
+die:
+	t->rseq_event.fatal = true;
 fail:
 	user_access_end();
 	return false;
@@ -335,13 +355,13 @@ rseq_update_user_cs(struct task_struct *
  * faults in task context are fatal too.
  */
 static rseq_inline
-bool rseq_set_uids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
-			      u32 node_id, u64 *csaddr)
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+			     u32 node_id, u64 *csaddr)
 {
 	struct rseq __user *rseq = t->rseq;
 
 	if (static_branch_unlikely(&rseq_debug_enabled)) {
-		if (!rseq_debug_validate_uids(t))
+		if (!rseq_debug_validate_ids(t))
 			return false;
 	}
 
@@ -375,7 +395,7 @@ static rseq_inline bool rseq_update_usr(
 {
 	u64 csaddr;
 
-	if (!rseq_set_uids_get_csaddr(t, ids, node_id, &csaddr))
+	if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
 		return false;
 
 	/*
@@ -507,6 +527,7 @@ static __always_inline bool __rseq_exit_
 # define CHECK_TIF_RSEQ		_TIF_RSEQ
 static __always_inline void clear_tif_rseq(void)
 {
+	static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
 	clear_thread_flag(TIF_RSEQ);
 }
 #else
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -3,13 +3,13 @@
 #define _LINUX_RSEQ_TYPES_H
 
 #include <linux/types.h>
-/* Forward declaration for the sched.h */
+/* Forward declaration for sched.h */
 struct rseq;
 
 /*
  * struct rseq_event - Storage for rseq related event management
  * @all:		Compound to initialize and clear the data efficiently
- * @events:		Compund to access events with a single load/store
+ * @events:		Compound to access events with a single load/store
  * @sched_switch:	True if the task was scheduled and needs update on
  *			exit to user
  * @ids_changed:	Indicator that IDs need to be updated
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -99,7 +99,7 @@ static int __init rseq_setup_debug(char
 	if (kstrtobool(str, &on))
 		return -EINVAL;
 	rseq_control_debug(on);
-	return 0;
+	return 1;
 }
 __setup("rseq_debug=", rseq_setup_debug);
 
@@ -218,9 +218,9 @@ static int __init rseq_debugfs_init(void
 __initcall(rseq_debugfs_init);
 #endif /* CONFIG_DEBUG_FS */
 
-static bool rseq_set_uids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
 {
-	return rseq_set_uids_get_csaddr(t, ids, node_id, NULL);
+	return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
 }
 
 static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -374,7 +374,7 @@ static bool rseq_reset_ids(void)
 	 * stupid state as exit to user space will try to fixup the ids
 	 * again.
 	 */
-	if (rseq_set_uids(current, &ids, 0))
+	if (rseq_set_ids(current, &ids, 0))
 		return true;
 
 	force_sig(SIGSEGV);



Thread overview: 40+ messages
2025-09-04 22:20 [patch V3 00/37] rseq: Optimize exit to user space Thomas Gleixner
2025-09-04 22:20 ` [patch V3 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-09-04 22:20 ` [patch V3 02/37] rseq: Condense the inline stubs Thomas Gleixner
2025-09-04 22:20 ` [patch V3 03/37] rseq: Move algorithm comment to top Thomas Gleixner
2025-09-04 22:20 ` [patch V3 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-09-04 22:20 ` [patch V3 05/37] rseq: Simplify registration Thomas Gleixner
2025-09-04 22:20 ` [patch V3 06/37] rseq: Simplify the event notification Thomas Gleixner
2025-09-04 22:20 ` [patch V3 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-09-04 22:20 ` [patch V3 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-09-04 22:20 ` [patch V3 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
2025-09-04 22:20 ` [patch V3 10/37] entry: Cleanup header Thomas Gleixner
2025-09-04 22:20 ` [patch V3 11/37] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-09-04 22:20 ` [patch V3 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-09-04 22:20 ` [patch V3 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-09-04 22:21 ` [patch V3 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-09-04 22:21 ` [patch V3 15/37] rseq: Record interrupt from user space Thomas Gleixner
2025-09-04 22:21 ` [patch V3 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-09-04 22:37   ` Mathieu Desnoyers
2025-09-04 22:40   ` [patch V3 RESEND 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-09-04 22:21 ` [patch V3 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-09-04 22:21 ` [patch V3 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-09-04 22:21 ` [patch V3 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-09-04 22:21 ` [patch V3 20/37] rseq: Replace the original debug implementation Thomas Gleixner
2025-09-04 22:21 ` [patch V3 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
2025-09-04 22:21 ` [patch V3 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-09-04 22:21 ` [patch V3 23/37] rseq: Provide and use rseq_set_ids() Thomas Gleixner
2025-09-04 22:21 ` [patch V3 24/37] rseq: Separate the signal delivery path Thomas Gleixner
2025-09-04 22:21 ` [patch V3 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-09-04 22:21 ` [patch V3 26/37] rseq: Optimize event setting Thomas Gleixner
2025-09-04 22:21 ` [patch V3 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
2025-09-04 22:21 ` [patch V3 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
2025-09-04 22:21 ` [patch V3 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-09-04 22:21 ` [patch V3 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-09-04 22:21 ` [patch V3 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
2025-09-04 22:21 ` [patch V3 32/37] x86: Use generic TIF bits Thomas Gleixner
2025-09-04 22:21 ` [patch V3 33/37] s390: " Thomas Gleixner
2025-09-04 22:21 ` [patch V3 34/37] loongarch: " Thomas Gleixner
2025-09-04 22:21 ` [patch V3 35/37] riscv: " Thomas Gleixner
2025-09-04 22:21 ` [patch V3 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-09-04 22:21 ` [patch V3 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner