linux-kernel.vger.kernel.org archive mirror
* [patch 00/11] rseq: Optimize exit to user space
@ 2025-08-13 16:29 Thomas Gleixner
  2025-08-13 16:29 ` [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
                   ` (12 more replies)
  0 siblings, 13 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

With the more widespread usage of rseq in glibc, rseq is no longer a niche
use case for special applications.

While working on a sane implementation of an rseq based time slice extension
mechanism, I noticed several shortcomings of the current rseq code:

  1) task::rseq_event_mask is a pointless bitfield: the ABI flags it was
     meant to support were deprecated and functionally disabled three
     years ago.

  2) task::rseq_event_mask accumulates bits unless a critical section is
     discovered in the user space rseq memory. This results in pointless
     invocations of the rseq user space exit handler even if nothing has
     changed. As a matter of correctness these bits have to be clear when
     exiting to user space and therefore pristine when coming back into
     the kernel. Aside from correctness, this also avoids pointless
     evaluation of the user space memory, which is a performance benefit.

  3) The evaluation of critical sections does not differentiate between
     syscall and interrupt/exception exits. The current implementation
     silently fixes up critical sections which invoked a syscall unless
     CONFIG_DEBUG_RSEQ is enabled.

     That's just wrong. If user space does that on a production kernel it
     can keep the pieces. The kernel is not there to proliferate mindless
     user space programming and to let everyone pay the performance
     penalty.

This series addresses these issues and on top of that converts parts of the
user space access over to the new masked access model, which lowers the
overhead of Spectre-V1 mitigations significantly on architectures which
support it (x86 as of today). This is especially noticeable in the access to
the rseq_cs field in struct rseq, which is the first quick check to figure
out whether a critical section is installed or not.
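
For illustration, a rough, self-contained sketch of the masking idea in
plain C (a hypothetical mask_user_ptr() helper, not the actual kernel
implementation from the uaccess helper series referenced below): instead of
a speculation barrier after the range check, the pointer itself is clamped
so that a mispredicted check cannot be steered to an attacker-chosen kernel
address.

  #include <assert.h>
  #include <stdint.h>

  static inline const void *mask_user_ptr(const void *ptr, uintptr_t user_max)
  {
          uintptr_t addr = (uintptr_t)ptr;
          /*
           * All ones when addr > user_max, zero otherwise. A real
           * implementation computes this without a branch (e.g. via the
           * carry flag); the sketch only shows the data flow.
           */
          uintptr_t mask = (addr > user_max) ? ~(uintptr_t)0 : 0;

          /* Out-of-range pointers collapse to a fixed all-ones address */
          return (const void *)(addr | mask);
  }

  int main(void)
  {
          const uintptr_t user_max = ((uintptr_t)1 << 47) - 1; /* example limit */

          /* In-range pointers pass through unchanged ... */
          assert(mask_user_ptr((void *)0x1000, user_max) == (void *)0x1000);
          /* ... out-of-range pointers can no longer select a kernel address */
          assert(mask_user_ptr((void *)(user_max + 1), user_max) == (void *)~(uintptr_t)0);
          return 0;
  }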

It survives the kernel's rseq selftests, but I did not run any performance
tests against the rseq benchmark because I have no idea how to use its
gazillion undocumented command line parameters. I leave that to people who
are so familiar with them that they assume everyone else is too :)

The performance gain on regular workloads is clearly measurable and the
consistent event flag state now allows building the time slice extension
mechanism on top. The first POC I implemented:

   https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/

suffered badly from the stale event mask bits, and the cleaned up version
brought a whopping 25+% performance gain.

The series depends on the separately posted rseq bugfix:

   https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/

and the uaccess generic helper series:

   https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/

and a related futex fix in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent

For your convenience it is also available as a conglomerate from git:

    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf

Thanks,

	tglx
---
 drivers/hv/mshv_common.c         |    2 
 include/linux/entry-common.h     |   12 +--
 include/linux/irq-entry-common.h |   31 ++++++--
 include/linux/resume_user_mode.h |   40 +++++++---
 include/linux/rseq.h             |  115 +++++++++----------------------
 include/linux/sched.h            |   10 +-
 include/uapi/linux/rseq.h        |   21 +----
 io_uring/io_uring.h              |    2 
 kernel/entry/common.c            |    9 +-
 kernel/entry/kvm.c               |    2 
 kernel/rseq.c                    |  142 +++++++++++++++++++++++++--------------
 kernel/sched/core.c              |    8 +-
 kernel/sched/membarrier.c        |    8 +-
 13 files changed, 213 insertions(+), 189 deletions(-)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume()
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-20 14:23   ` Mathieu Desnoyers
  2025-08-13 16:29 ` [patch 02/11] rseq: Condense the inline stubs Thomas Gleixner
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

The RSEQ critical section mechanism only clears the event mask when a
critical section is registered; otherwise the mask is stale and keeps
collecting bits.

That means that once a critical section is installed, the first invocation
of that code with TIF_NOTIFY_RESUME set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.

This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite a bit of other
infrastructure. That means every invocation of __rseq_notify_resume()
unconditionally goes through the heavy lifting of user space access and
consistency checks even if there is no reason to do so.

Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
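
As a stand-alone caricature of that behaviour (plain C, illustration only;
the two variables stand in for task::rseq_event_mask and the cost of the
user space access):

  #include <assert.h>
  #include <stdbool.h>

  static bool rseq_event_stale;   /* stands in for task::rseq_event_mask */
  static int user_access_cost;    /* user space access and consistency checks */

  /*
   * Old behaviour: the heavy lifting happens unconditionally and the event
   * state is only consumed when a critical section is actually found.
   */
  static void old_notify_resume(bool cs_installed)
  {
          user_access_cost++;
          if (cs_installed)
                  rseq_event_stale = false;
  }

  int main(void)
  {
          rseq_event_stale = true;        /* left over from an old preemption */

          /*
           * Some other TIF_NOTIFY_RESUME user raises the bit: rseq pays the
           * full price and the stale state survives, ready to abort the
           * next installed critical section for no reason.
           */
          old_notify_resume(false);
          assert(user_access_cost == 1 && rseq_event_stale);
          return 0;
  }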

Avoid this by reading and clearing the event mask before doing the user
space critical section access with preemption disabled, which ensures that
the read and clear operation is CPU local atomic versus scheduling.

This is correct because, after re-enabling preemption, any relevant event
will set the bit again and raise TIF_NOTIFY_RESUME, which makes the user
space exit code take another round of TIF bit clearing.

If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.

Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.

While at it, reword the convoluted comment explaining why the pt_regs
pointer can be NULL under certain circumstances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/irq-entry-common.h |    7 ++--
 include/linux/rseq.h             |   10 +++++
 kernel/rseq.c                    |   66 ++++++++++++++++++++++++++-------------
 3 files changed, 58 insertions(+), 25 deletions(-)

--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
 #ifndef __LINUX_IRQENTRYCOMMON_H
 #define __LINUX_IRQENTRYCOMMON_H
 
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
 #include <linux/static_call_types.h>
 #include <linux/syscalls.h>
-#include <linux/context_tracking.h>
 #include <linux/tick.h>
-#include <linux/kmsan.h>
 #include <linux/unwind_deferred.h>
 
 #include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
+	rseq_exit_to_user_mode();
+
 	/* Ensure that kernel state is sane for a return to userspace */
 	kmap_assert_nomap();
 	lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
 	rseq_set_notify_resume(t);
 }
 
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+			current->rseq_event_mask = 0;
+	}
+}
+
 /*
  * If parent process has a registered restartable sequences area, the
  * child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
 static inline void rseq_execve(struct task_struct *t)
 {
 }
-
+static inline void rseq_exit_to_user_mode(void) { }
 #endif
 
 #ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
 	return true;
 }
 
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
 {
-	u32 flags, event_mask;
+	u32 flags;
 	int ret;
 
 	if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
 
 	if (rseq_warn_flags("rseq", flags))
 		return -EINVAL;
-
-	/*
-	 * Load and clear event mask atomically with respect to
-	 * scheduler preemption and membarrier IPIs.
-	 */
-	scoped_guard(RSEQ_EVENT_GUARD) {
-		event_mask = t->rseq_event_mask;
-		t->rseq_event_mask = 0;
-	}
-
-	return !!event_mask;
+	return 0;
 }
 
 static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
 	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
 }
 
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
 	 */
 	if (!in_rseq_cs(ip, &rseq_cs))
 		return clear_rseq_cs(t->rseq);
-	ret = rseq_need_restart(t, rseq_cs.flags);
-	if (ret <= 0)
+	ret = rseq_check_flags(t, rseq_cs.flags);
+	if (ret < 0)
 		return ret;
+	if (!abort)
+		return 0;
 	ret = clear_rseq_cs(t->rseq);
 	if (ret)
 		return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
 		return;
 
 	/*
-	 * regs is NULL if and only if the caller is in a syscall path.  Skip
-	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
-	 * kill a misbehaving userspace on debug kernels.
+	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
+	 * pointer, so fixup cannot be done. If the syscall which led to
+	 * this invocation was invoked inside a critical section, then it
+	 * will either end up in this code again or a possible violation of
+	 * a syscall inside a critical region can only be detected by the
+	 * debug code in rseq_syscall() in a debug enabled kernel.
 	 */
 	if (regs) {
-		ret = rseq_ip_fixup(regs);
-		if (unlikely(ret < 0))
-			goto error;
+		/*
+		 * Read and clear the event mask first. If the task was not
+		 * preempted or migrated or a signal is on the way, there
+		 * is no point in doing any of the heavy lifting here on
+		 * production kernels. In that case TIF_NOTIFY_RESUME was
+		 * raised by some other functionality.
+		 *
+		 * This is correct because the read/clear operation is
+		 * guarded against scheduler preemption, which makes it CPU
+		 * local atomic. If the task is preempted right after
+		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
+		 * again and this function is invoked another time _before_
+		 * the task is able to return to user mode.
+		 *
+		 * On a debug kernel, invoke the fixup code unconditionally
+		 * with the result handed in to allow the detection of
+		 * inconsistencies.
+		 */
+		u32 event_mask;
+
+		scoped_guard(RSEQ_EVENT_GUARD) {
+			event_mask = t->rseq_event_mask;
+			t->rseq_event_mask = 0;
+		}
+
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+			ret = rseq_ip_fixup(regs, !!event_mask);
+			if (unlikely(ret < 0))
+				goto error;
+		}
 	}
 	if (unlikely(rseq_update_cpu_node_id(t)))
 		goto error;


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 02/11] rseq: Condense the inline stubs
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
  2025-08-13 16:29 ` [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-20 14:24   ` Mathieu Desnoyers
  2025-08-13 16:29 ` [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit() Thomas Gleixner
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Scrolling over tons of pointless

{
}

lines to find the actual code is annoying at best.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq.h |   47 ++++++++++++-----------------------------------
 1 file changed, 12 insertions(+), 35 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
 	t->rseq_event_mask = 0;
 }
 
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif  /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
-
 void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit()
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
  2025-08-13 16:29 ` [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
  2025-08-13 16:29 ` [patch 02/11] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-20 14:25   ` Mathieu Desnoyers
  2025-08-13 16:29 ` [patch 04/11] rseq: Replace the pointless event mask bit fiddling Thomas Gleixner
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

rseq_syscall() is a debug function, which is invoked before the syscall
exit work is handled. Name it so it's clear what it does.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/entry-common.h |    2 +-
 include/linux/rseq.h         |    4 ++--
 kernel/rseq.c                |    5 +++--
 3 files changed, 6 insertions(+), 5 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -162,7 +162,7 @@ static __always_inline void syscall_exit
 			local_irq_enable();
 	}
 
-	rseq_syscall(regs);
+	rseq_debug_syscall_exit(regs);
 
 	/*
 	 * Do one-time syscall specific work. If these work items are
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -113,9 +113,9 @@ static inline void rseq_exit_to_user_mod
 #endif  /* !CONFIG_RSEQ */
 
 #ifdef CONFIG_DEBUG_RSEQ
-void rseq_syscall(struct pt_regs *regs);
+void rseq_debug_syscall_exit(struct pt_regs *regs);
 #else /* CONFIG_DEBUG_RSEQ */
-static inline void rseq_syscall(struct pt_regs *regs) { }
+static inline void rseq_debug_syscall_exit(struct pt_regs *regs) { }
 #endif /* !CONFIG_DEBUG_RSEQ */
 
 #endif /* _LINUX_RSEQ_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -427,7 +427,8 @@ void __rseq_handle_notify_resume(struct
 	 * this invocation was invoked inside a critical section, then it
 	 * will either end up in this code again or a possible violation of
 	 * a syscall inside a critical region can only be detected by the
-	 * debug code in rseq_syscall() in a debug enabled kernel.
+	 * debug code in rseq_debug_syscall_exit() in a debug enabled
+	 * kernel.
 	 */
 	if (regs) {
 		/*
@@ -476,7 +477,7 @@ void __rseq_handle_notify_resume(struct
  * Terminate the process if a syscall is issued within a restartable
  * sequence.
  */
-void rseq_syscall(struct pt_regs *regs)
+void rseq_debug_syscall_exit(struct pt_regs *regs)
 {
 	unsigned long ip = instruction_pointer(regs);
 	struct task_struct *t = current;


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 04/11] rseq: Replace the pointless event mask bit fiddling
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (2 preceding siblings ...)
  2025-08-13 16:29 ` [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit() Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 16:29 ` [patch 05/11] rseq: Optimize the signal delivery path Thomas Gleixner
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless, and setting them
individually is just extra work.

Collapse them all into a single boolean, which simplifies the code a lot.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq.h      |   59 +++++++++++-----------------------------------
 include/linux/sched.h     |   10 +++----
 include/uapi/linux/rseq.h |   21 +++++-----------
 kernel/rseq.c             |   26 ++++++++++++--------
 kernel/sched/core.c       |    8 +++---
 kernel/sched/membarrier.c |    8 +++---
 6 files changed, 51 insertions(+), 81 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,28 +7,6 @@
 #include <linux/preempt.h>
 #include <linux/sched.h>
 
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD	irq
-#else
-# define RSEQ_EVENT_GUARD	preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
-	RSEQ_EVENT_PREEMPT_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
-	RSEQ_EVENT_SIGNAL_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
-	RSEQ_EVENT_MIGRATE_BIT	= RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
-	RSEQ_EVENT_PREEMPT	= (1U << RSEQ_EVENT_PREEMPT_BIT),
-	RSEQ_EVENT_SIGNAL	= (1U << RSEQ_EVENT_SIGNAL_BIT),
-	RSEQ_EVENT_MIGRATE	= (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
 static inline void rseq_set_notify_resume(struct task_struct *t)
 {
 	if (t->rseq)
@@ -47,30 +25,25 @@ static inline void rseq_handle_notify_re
 static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
-	scoped_guard(RSEQ_EVENT_GUARD)
-		__set_bit(RSEQ_EVENT_SIGNAL_BIT, &current->rseq_event_mask);
-	rseq_handle_notify_resume(ksig, regs);
-}
-
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
-{
-	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
+	if (current->rseq) {
+		current->rseq_event_pending = true;
+		__rseq_handle_notify_resume(ksig, regs);
+	}
 }
 
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
+static inline void rseq_notify_event(struct task_struct *t)
 {
-	__set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
-	rseq_set_notify_resume(t);
+	if (t->rseq) {
+		t->rseq_event_pending = true;
+		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+	}
 }
 
 static __always_inline void rseq_exit_to_user_mode(void)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
-		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
-			current->rseq_event_mask = 0;
+		if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+			current->rseq_event_pending = false;
 	}
 }
 
@@ -84,12 +57,12 @@ static inline void rseq_fork(struct task
 		t->rseq = NULL;
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
-		t->rseq_event_mask = 0;
+		t->rseq_event_pending = false;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_event_pending = current->rseq_event_pending;
 	}
 }
 
@@ -98,17 +71,15 @@ static inline void rseq_execve(struct ta
 	t->rseq = NULL;
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
-	t->rseq_event_mask = 0;
+	t->rseq_event_pending = false;
 }
 
 #else /* CONFIG_RSEQ */
 static inline void rseq_set_notify_resume(struct task_struct *t) { }
 static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
 static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_notify_event(struct task_struct *t) { }
 static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
-static inline void rseq_execve(struct task_struct *t) { }
 static inline void rseq_exit_to_user_mode(void) { }
 #endif  /* !CONFIG_RSEQ */
 
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1401,14 +1401,14 @@ struct task_struct {
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_RSEQ
-	struct rseq __user *rseq;
-	u32 rseq_len;
-	u32 rseq_sig;
+	struct rseq __user		*rseq;
+	u32				rseq_len;
+	u32				rseq_sig;
 	/*
-	 * RmW on rseq_event_mask must be performed atomically
+	 * RmW on rseq_event_pending must be performed atomically
 	 * with respect to preemption.
 	 */
-	unsigned long rseq_event_mask;
+	bool				rseq_event_pending;
 # ifdef CONFIG_DEBUG_RSEQ
 	/*
 	 * This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
 	/*
 	 * Restartable sequences flags field.
 	 *
-	 * This field should only be updated by the thread which
-	 * registered this data structure. Read by the kernel.
-	 * Mainly used for single-stepping through rseq critical sections
-	 * with debuggers.
-	 *
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
-	 *     Inhibit instruction sequence block restart on preemption
-	 *     for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
-	 *     Inhibit instruction sequence block restart on signal
-	 *     delivery for this thread.
-	 * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
-	 *     Inhibit instruction sequence block restart on migration for
-	 *     this thread.
+	 * This field was initially intended to allow event masking for
+	 * single-stepping through rseq critical sections with debuggers.
+	 * The kernel does not support this anymore and the relevant bits
+	 * are checked for being always false:
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+	 *	- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
 	 */
 	__u32 flags;
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -19,6 +19,12 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/rseq.h>
 
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD	irq
+#else
+# define RSEQ_EVENT_GUARD	preempt
+#endif
+
 /* The original rseq structure size (including padding) is 32 bytes. */
 #define ORIG_RSEQ_SIZE		32
 
@@ -432,11 +438,11 @@ void __rseq_handle_notify_resume(struct
 	 */
 	if (regs) {
 		/*
-		 * Read and clear the event mask first. If the task was not
-		 * preempted or migrated or a signal is on the way, there
-		 * is no point in doing any of the heavy lifting here on
-		 * production kernels. In that case TIF_NOTIFY_RESUME was
-		 * raised by some other functionality.
+		 * Read and clear the event pending bit first. If the task
+		 * was not preempted or migrated or a signal is on the way,
+		 * there is no point in doing any of the heavy lifting here
+		 * on production kernels. In that case TIF_NOTIFY_RESUME
+		 * was raised by some other functionality.
 		 *
 		 * This is correct because the read/clear operation is
 		 * guarded against scheduler preemption, which makes it CPU
@@ -449,15 +455,15 @@ void __rseq_handle_notify_resume(struct
 		 * with the result handed in to allow the detection of
 		 * inconsistencies.
 		 */
-		u32 event_mask;
+		bool event;
 
 		scoped_guard(RSEQ_EVENT_GUARD) {
-			event_mask = t->rseq_event_mask;
-			t->rseq_event_mask = 0;
+			event = t->rseq_event_pending;
+			t->rseq_event_pending = false;
 		}
 
-		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
-			ret = rseq_ip_fixup(regs, !!event_mask);
+		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+			ret = rseq_ip_fixup(regs, event);
 			if (unlikely(ret < 0))
 				goto error;
 		}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,7 @@ void set_task_cpu(struct task_struct *p,
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
-		rseq_migrate(p);
+		rseq_notify_event(p);
 		sched_mm_cid_migrate_from(p);
 		perf_event_task_migrate(p);
 	}
@@ -4795,7 +4795,7 @@ int sched_cgroup_fork(struct task_struct
 		p->sched_task_group = tg;
 	}
 #endif
-	rseq_migrate(p);
+	rseq_notify_event(p);
 	/*
 	 * We're setting the CPU for the first time, we don't migrate,
 	 * so use __set_task_cpu().
@@ -4859,7 +4859,7 @@ void wake_up_new_task(struct task_struct
 	 * as we're not fully set-up yet.
 	 */
 	p->recent_used_cpu = task_cpu(p);
-	rseq_migrate(p);
+	rseq_notify_event(p);
 	__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
 	rq = __task_rq_lock(p, &rf);
 	update_rq_clock(rq);
@@ -5153,7 +5153,7 @@ prepare_task_switch(struct rq *rq, struc
 	kcov_prepare_switch(prev);
 	sched_info_switch(rq, prev, next);
 	perf_event_task_sched_out(prev, next);
-	rseq_preempt(prev);
+	rseq_notify_event(prev);
 	fire_sched_out_preempt_notifiers(prev, next);
 	kmap_local_sched_out();
 	prepare_task(next);
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
 	 * is negligible.
 	 */
 	smp_mb();
-	rseq_preempt(current);
+	rseq_notify_event(current);
 }
 
 static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(
 		 * membarrier, we will end up with some thread in the mm
 		 * running without a core sync.
 		 *
-		 * For RSEQ, don't rseq_preempt() the caller.  User code
-		 * is not supposed to issue syscalls at all from inside an
-		 * rseq critical section.
+		 * For RSEQ, don't invoke rseq_notify_event() on the
+		 * caller.  User code is not supposed to issue syscalls at
+		 * all from inside an rseq critical section.
 		 */
 		if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
 			preempt_disable();


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 05/11] rseq: Optimize the signal delivery path
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (3 preceding siblings ...)
  2025-08-13 16:29 ` [patch 04/11] rseq: Replace the pointless event mask bit fiddling Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 16:29 ` [patch 06/11] rseq: Optimize exit to user space further Thomas Gleixner
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Now that the individual event mask bits are gone, there is no point in
setting the event flag before invoking rseq_handle_notify_resume(). The fact
that the signal pointer is not NULL indicates that there is an event.

Simply drop the setting of the event bit and fold the event into the
evaluation in __rseq_handle_notify_resume() when the signal pointer is
non-NULL.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |    9 +++------
 kernel/rseq.c                    |    7 ++++++-
 3 files changed, 10 insertions(+), 8 deletions(-)

--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(NULL, regs);
+	rseq_handle_notify_resume(regs);
 }
 
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -15,20 +15,17 @@ static inline void rseq_set_notify_resum
 
 void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
 
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
-					     struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
 	if (current->rseq)
-		__rseq_handle_notify_resume(ksig, regs);
+		__rseq_handle_notify_resume(NULL, regs);
 }
 
 static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
-	if (current->rseq) {
-		current->rseq_event_pending = true;
+	if (current->rseq)
 		__rseq_handle_notify_resume(ksig, regs);
-	}
 }
 
 static inline void rseq_notify_event(struct task_struct *t)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -451,6 +451,11 @@ void __rseq_handle_notify_resume(struct
 		 * again and this function is invoked another time _before_
 		 * the task is able to return to user mode.
 		 *
+		 * If directly invoked from the signal delivery path, @ksig
+		 * is not NULL and @regs are valid. The pending bit is not
+		 * set by the caller as it can easily be folded in during
+		 * the evaluation when @ksig != NULL.
+		 *
 		 * On a debug kernel, invoke the fixup code unconditionally
 		 * with the result handed in to allow the detection of
 		 * inconsistencies.
@@ -458,7 +463,7 @@ void __rseq_handle_notify_resume(struct
 		bool event;
 
 		scoped_guard(RSEQ_EVENT_GUARD) {
-			event = t->rseq_event_pending;
+			event = t->rseq_event_pending || !!ksig;
 			t->rseq_event_pending = false;
 		}
 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 06/11] rseq: Optimize exit to user space further
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (4 preceding siblings ...)
  2025-08-13 16:29 ` [patch 05/11] rseq: Optimize the signal delivery path Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 16:29 ` [patch 07/11] entry: Cleanup header Thomas Gleixner
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Now that the event pending bit management is consistent, the invocation of
__rseq_handle_notify_resume() can be avoided if the event pending bit is
not set.

This is correct because of the following order:

  1)  if (TIF_NOTIFY_RESUME)
  2)    clear(TIF_NOTIFY_RESUME);
        smp_mb__after_atomic();
  3)    if (event_pending)
  4)      __rseq_handle_notify_resume()
  5)        guard()
  6)          work = check_and_clear_pending();

Any new event which hits between #1 and #2 will be visible in #3.

Any new event which hits after #2 will either be visible in #3 and therefore
consumed in #6, or missed in #3. The latter is not a problem as the new
event also re-raises TIF_NOTIFY_RESUME, which causes the calling exit loop
to take another round.

The quick check #3 optimizes for the common case, where event_pending is
false. Ignore the quick check when CONFIG_DEBUG_RSEQ is enabled to widen
the test coverage.
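
The ordering can be modeled stand-alone with C11 atomics (illustration only;
notify_resume and event_pending stand in for TIF_NOTIFY_RESUME and
tsk::rseq_event_pending, the fence for smp_mb__after_atomic() and the atomic
exchange for the guarded read/clear in #5/#6):

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdbool.h>

  static atomic_bool notify_resume;
  static atomic_bool event_pending;
  static int handled;

  static void raise_event(void)   /* preemption, migration, membarrier IPI */
  {
          atomic_store(&event_pending, true);
          atomic_store(&notify_resume, true);
  }

  static void exit_to_user_loop(void)
  {
          while (atomic_load(&notify_resume)) {                   /* #1 */
                  atomic_store(&notify_resume, false);            /* #2 */
                  atomic_thread_fence(memory_order_seq_cst);
                  if (atomic_exchange(&event_pending, false))     /* #3, #5, #6 */
                          handled++;                              /* #4 */
          }
          /*
           * An event raised concurrently with #3 is either consumed above
           * or has re-set notify_resume, which forces another iteration.
           */
  }

  int main(void)
  {
          raise_event();
          exit_to_user_loop();
          assert(handled == 1 && !atomic_load(&notify_resume));
          return 0;
  }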

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/rseq.h |    8 +++++---
 kernel/rseq.c        |   17 +++++++++++++----
 2 files changed, 18 insertions(+), 7 deletions(-)

--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -17,7 +17,7 @@ void __rseq_handle_notify_resume(struct
 
 static inline void rseq_handle_notify_resume(struct pt_regs *regs)
 {
-	if (current->rseq)
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || READ_ONCE(current->rseq_event_pending))
 		__rseq_handle_notify_resume(NULL, regs);
 }
 
@@ -30,8 +30,10 @@ static inline void rseq_signal_deliver(s
 
 static inline void rseq_notify_event(struct task_struct *t)
 {
+	lockdep_assert_irqs_disabled();
+
 	if (t->rseq) {
-		t->rseq_event_pending = true;
+		WRITE_ONCE(t->rseq_event_pending, true);
 		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 	}
 }
@@ -59,7 +61,7 @@ static inline void rseq_fork(struct task
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
-		t->rseq_event_pending = current->rseq_event_pending;
+		t->rseq_event_pending = READ_ONCE(current->rseq_event_pending);
 	}
 }
 
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -524,9 +524,17 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 		ret = rseq_reset_rseq_cpu_node_id(current);
 		if (ret)
 			return ret;
-		current->rseq = NULL;
-		current->rseq_sig = 0;
-		current->rseq_len = 0;
+
+		/*
+		 * Ensure consistency of tsk::rseq and tsk::rseq_event_pending
+		 * vs. the scheduler and the RSEQ IPIs.
+		 */
+		scoped_guard(RSEQ_EVENT_GUARD) {
+			current->rseq = NULL;
+			current->rseq_sig = 0;
+			current->rseq_len = 0;
+			current->rseq_event_pending = false;
+		}
 		return 0;
 	}
 
@@ -601,7 +609,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
 	 * registered, ensure the cpu_id_start and cpu_id fields
 	 * are updated before returning to user-space.
 	 */
-	rseq_set_notify_resume(current);
+	scoped_guard(irq)
+		rseq_notify_event(current);
 
 	return 0;
 }


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 07/11] entry: Cleanup header
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (5 preceding siblings ...)
  2025-08-13 16:29 ` [patch 06/11] rseq: Optimize exit to user space further Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 17:09   ` Giorgi Tchankvetadze
  2025-08-13 16:29 ` [patch 08/11] entry: Distinguish between syscall and interrupt exit Thomas Gleixner
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Cleanup the include ordering, kernel-doc and other trivialities before
making further changes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/entry-common.h     |    8 ++++----
 include/linux/irq-entry-common.h |    2 ++
 2 files changed, 6 insertions(+), 4 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
 #define __LINUX_ENTRYCOMMON_H
 
 #include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
 #include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
 #include <linux/seccomp.h>
 #include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
 
 #include <asm/entry-common.h>
 #include <asm/syscall.h>
@@ -37,6 +37,7 @@
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
 				 SYSCALL_WORK_SYSCALL_USER_DISPATCH |	\
 				 ARCH_SYSCALL_WORK_ENTER)
+
 #define SYSCALL_WORK_EXIT	(SYSCALL_WORK_SYSCALL_TRACEPOINT |	\
 				 SYSCALL_WORK_SYSCALL_TRACE |		\
 				 SYSCALL_WORK_SYSCALL_AUDIT |		\
@@ -61,8 +62,7 @@
  */
 void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
 
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
-			 unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
 
 /**
  * syscall_enter_from_user_mode_work - Check and handle work before invoking
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
 
 /**
  * enter_from_user_mode - Establish state when coming from user mode
+ * @regs:	Pointer to currents pt_regs
  *
  * Syscall/interrupt entry disables interrupts, but user mode is traced as
  * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
  * Conditional reschedule with additional sanity checks.
  */
 void raw_irqentry_exit_cond_resched(void);
+
 #ifdef CONFIG_PREEMPT_DYNAMIC
 #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
 #define irqentry_exit_cond_resched_dynamic_enabled	raw_irqentry_exit_cond_resched


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 08/11] entry: Distinguish between syscall and interrupt exit
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (6 preceding siblings ...)
  2025-08-13 16:29 ` [patch 07/11] entry: Cleanup header Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 16:29 ` [patch 09/11] entry: Provide exit_to_user_notify_resume() Thomas Gleixner
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

The upcoming time slice extension mechanism needs to know whether
enter_from_user_mode() is invoked from a syscall or from an interrupt
because time slice extensions are only granted on exit to user more from an
interrupt.

Add a function argument and provide wrappers so the call sites don't end up
with incomprehensible true/false arguments.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 include/linux/entry-common.h     |    2 +-
 include/linux/irq-entry-common.h |   22 +++++++++++++++-------
 kernel/entry/common.c            |    7 ++++---
 3 files changed, 20 insertions(+), 11 deletions(-)

--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -172,7 +172,7 @@ static __always_inline void syscall_exit
 	if (unlikely(work & SYSCALL_WORK_EXIT))
 		syscall_exit_work(regs, work);
 	local_irq_disable_exit_to_user();
-	exit_to_user_mode_prepare(regs);
+	syscall_exit_to_user_mode_prepare(regs);
 }
 
 /**
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,15 +197,13 @@ static __always_inline void arch_exit_to
  */
 void arch_do_signal_or_restart(struct pt_regs *regs);
 
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-				     unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work, bool from_irq);
 
 /**
  * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
  * @regs:	Pointer to pt_regs on entry stack
+ * @from_irq:	Exiting to user space from an interrupt
  *
  * 1) check that interrupts are disabled
  * 2) call tick_nohz_user_enter_prepare()
@@ -213,7 +211,7 @@ unsigned long exit_to_user_mode_loop(str
  *    EXIT_TO_USER_MODE_WORK are set
  * 4) check that interrupts are still disabled
  */
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs, bool from_irq)
 {
 	unsigned long ti_work;
 
@@ -224,7 +222,7 @@ static __always_inline void exit_to_user
 
 	ti_work = read_thread_flags();
 	if (unlikely(ti_work & EXIT_TO_USER_MODE_WORK))
-		ti_work = exit_to_user_mode_loop(regs, ti_work);
+		ti_work = exit_to_user_mode_loop(regs, ti_work, from_irq);
 
 	arch_exit_to_user_mode_prepare(regs, ti_work);
 
@@ -236,6 +234,16 @@ static __always_inline void exit_to_user
 	lockdep_sys_exit();
 }
 
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	exit_to_user_mode_prepare(regs, false);
+}
+
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+	exit_to_user_mode_prepare(regs, true);
+}
+
 /**
  * exit_to_user_mode - Fixup state when exiting to user mode
  *
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -15,9 +15,10 @@ void __weak arch_do_signal_or_restart(st
  * exit_to_user_mode_loop - do any pending work before leaving to user space
  * @regs:	Pointer to pt_regs on entry stack
  * @ti_work:	TIF work flags as read by the caller
+ * @from_irq:	Exiting to user space from an interrupt
  */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
-						     unsigned long ti_work)
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work,
+						     bool from_irq)
 {
 	/*
 	 * Before returning to user space ensure that all pending work
@@ -70,7 +71,7 @@ noinstr void irqentry_enter_from_user_mo
 noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
 {
 	instrumentation_begin();
-	exit_to_user_mode_prepare(regs);
+	irqentry_exit_to_user_mode_prepare(regs);
 	instrumentation_end();
 	exit_to_user_mode();
 }


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 09/11] entry: Provide exit_to_user_notify_resume()
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (7 preceding siblings ...)
  2025-08-13 16:29 ` [patch 08/11] entry: Distinguish between syscall and interrupt exit Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 16:29 ` [patch 10/11] rseq: Skip fixup when returning from a syscall Thomas Gleixner
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Wei Liu, Jens Axboe, Peter Zijlstra,
	Mathieu Desnoyers, Paul E. McKenney, Boqun Feng

The TIF_NOTIFY_RESUME handler of restartable sequences is, like all other
functionality multiplexed on that bit, invoked unconditionally whenever
TIF_NOTIFY_RESUME is set, for whatever reason.

The invocation is already conditional on the rseq_event_pending bit being
set, but there is further room for improvement.

The actual invocation cannot be avoided when the event bit is set, but the
heavy lifting of accessing user space can be avoided when the exit to user
mode loop is entered from a syscall, unless it's a debug kernel. There is no
way for the RSEQ code to distinguish that case yet.

That's trivial for all architectures which use the generic entry code, but
for all others it's non-trivial work, which is beyond the scope of this
series. Architectures which want to benefit should finally convert their
code over to the generic entry code.

To prepare for that optimization rename resume_user_mode_work() to
exit_to_user_notify_resume() and add a @from_irq argument to it, which can
be supplied by the caller.

Let the generic entry code and all non-entry code users like hypervisors
and IO-URING use this new function and supply the correct information.

Any NOTIFY_RESUME work which evaluates this new argument has to make the
evaluation dependent on CONFIG_GENERIC_ENTRY because otherwise there is no
guarantee that the value is correct at all.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 drivers/hv/mshv_common.c         |    2 +-
 include/linux/resume_user_mode.h |   38 +++++++++++++++++++++++++++-----------
 io_uring/io_uring.h              |    2 +-
 kernel/entry/common.c            |    2 +-
 kernel/entry/kvm.c               |    2 +-
 5 files changed, 31 insertions(+), 15 deletions(-)

--- a/drivers/hv/mshv_common.c
+++ b/drivers/hv/mshv_common.c
@@ -155,7 +155,7 @@ int mshv_do_pre_guest_mode_work(ulong th
 		schedule();
 
 	if (th_flags & _TIF_NOTIFY_RESUME)
-		resume_user_mode_work(NULL);
+		exit_to_user_notify_resume(NULL, false);
 
 	return 0;
 }
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -24,21 +24,22 @@ static inline void set_notify_resume(str
 		kick_process(task);
 }
 
-
 /**
- * resume_user_mode_work - Perform work before returning to user mode
- * @regs:		user-mode registers of @current task
+ * exit_to_user_notify_resume - Perform work before returning to user mode
+ * @regs:	user-mode registers of @current task
+ * @from_irq:	If true this is a return from interrupt, if false it's
+ *		a syscall return.
  *
- * This is called when %TIF_NOTIFY_RESUME has been set.  Now we are
- * about to return to user mode, and the user state in @regs can be
- * inspected or adjusted.  The caller in arch code has cleared
- * %TIF_NOTIFY_RESUME before the call.  If the flag gets set again
- * asynchronously, this will be called again before we return to
- * user mode.
+ * This is called when %TIF_NOTIFY_RESUME has been set to handle the exit
+ * to user work, which is multiplexed under this TIF bit. The bit is
+ * cleared and the pending work is processed. If the flag gets set again
+ * before exiting to user space, the caller will invoke this again.
  *
- * Called without locks.
+ * Any work invoked here, which wants to make decisions on @from_irq, must
+ * make these decisions dependent on CONFIG_GENERIC_ENTRY to retain the
+ * historical behaviour of resume_user_mode_work().
  */
-static inline void resume_user_mode_work(struct pt_regs *regs)
+static inline void exit_to_user_notify_resume(struct pt_regs *regs, bool from_irq)
 {
 	clear_thread_flag(TIF_NOTIFY_RESUME);
 	/*
@@ -62,4 +63,19 @@ static inline void resume_user_mode_work
 	rseq_handle_notify_resume(regs);
 }
 
+#ifndef CONFIG_GENERIC_ENTRY
+/**
+ * resume_user_mode_work - Perform work before returning to user mode
+ * @regs:		user-mode registers of @current task
+ *
+ * This is a wrapper around exit_to_user_notify_resume() for the existing
+ * call sites in architecture code, which do not use the generic entry
+ * code.
+ */
+static inline void resume_user_mode_work(struct pt_regs *regs)
+{
+	exit_to_user_notify_resume(regs, false);
+}
+#endif
+
 #endif /* LINUX_RESUME_USER_MODE_H */
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -365,7 +365,7 @@ static inline int io_run_task_work(void)
 	if (current->flags & PF_IO_WORKER) {
 		if (test_thread_flag(TIF_NOTIFY_RESUME)) {
 			__set_current_state(TASK_RUNNING);
-			resume_user_mode_work(NULL);
+			exit_to_user_notify_resume(NULL, false);
 		}
 		if (current->io_uring) {
 			unsigned int count = 0;
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -41,7 +41,7 @@ void __weak arch_do_signal_or_restart(st
 			arch_do_signal_or_restart(regs);
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
-			resume_user_mode_work(regs);
+			exit_to_user_notify_resume(regs, from_irq);
 
 		/* Architecture specific TIF work */
 		arch_exit_to_user_mode_work(regs, ti_work);
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -17,7 +17,7 @@ static int xfer_to_guest_mode_work(struc
 			schedule();
 
 		if (ti_work & _TIF_NOTIFY_RESUME)
-			resume_user_mode_work(NULL);
+			exit_to_user_notify_resume(NULL, false);
 
 		ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work);
 		if (ret)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 10/11] rseq: Skip fixup when returning from a syscall
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (8 preceding siblings ...)
  2025-08-13 16:29 ` [patch 09/11] entry: Provide exit_to_user_notify_resume() Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-14  8:54   ` Peter Zijlstra
  2025-08-13 16:29 ` [patch 11/11] rseq: Convert to masked user access where applicable Thomas Gleixner
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

The TIF_NOTIFY_RESUME handler of restartable sequences is, like all other
functionality multiplexed on that bit, invoked unconditionally whenever
TIF_NOTIFY_RESUME is set, for whatever reason.

The invocation is already conditional on the rseq_event_pending bit being
set, but there is further room for improvement.

The heavy lifting of the critical section fixup can be completely avoided
when the exit to user mode loop is entered from a syscall, unless it's a
debug kernel. So far there was no way for the RSEQ code to distinguish that
case.

On architectures which enable CONFIG_GENERIC_ENTRY, the information is now
available through a function argument to exit_to_user_notify_resume(), which
tells whether the invocation comes from a syscall return or an interrupt
return.

Let the RSEQ code utilize this 'from_irq' argument when

    - CONFIG_GENERIC_ENTRY is enabled
    - CONFIG_DEBUG_RSEQ is disabled

and skip the critical section fixup when the invocation comes from a
syscall return. The update of the CPU and node ID has to happen in both
cases, so the out of line call always has to happen when an event is
pending, whether it's a syscall return or not.

This changes the current behaviour, which just blindly fixes up the
critical section unconditionally in the syscall case. But that's a user
space problem when it invokes a syscall from within a critical section and
expects it to work. That code was clearly never tested on a debug kernel
and user space can keep the pieces.
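
The resulting behaviour can be summarized as a small stand-alone truth
function (plain C model, not kernel code; debug and generic_entry stand in
for CONFIG_DEBUG_RSEQ and CONFIG_GENERIC_ENTRY):

  #include <assert.h>
  #include <stdbool.h>

  /* True when the critical section fixup runs for the given exit */
  static bool cs_fixup_runs(bool event, bool from_irq, bool ksig,
                            bool debug, bool generic_entry)
  {
          bool ignore = !debug && generic_entry && !from_irq && !ksig;

          /* The CPU/node ID update happens either way; only the fixup is skipped */
          return event && !ignore;
  }

  int main(void)
  {
          /* Interrupt return with a pending event: fixup runs */
          assert(cs_fixup_runs(true, true, false, false, true));
          /* Syscall return on a production generic-entry kernel: skipped */
          assert(!cs_fixup_runs(true, false, false, false, true));
          /* The same syscall return on a debug kernel: fixup still runs */
          assert(cs_fixup_runs(true, false, false, true, true));
          /* Signal delivery is never skipped */
          assert(cs_fixup_runs(true, false, true, false, true));
          return 0;
  }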

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 include/linux/resume_user_mode.h |    2 +-
 include/linux/rseq.h             |   12 ++++++------
 kernel/rseq.c                    |   22 +++++++++++++++++++++-
 3 files changed, 28 insertions(+), 8 deletions(-)

--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -60,7 +60,7 @@ static inline void exit_to_user_notify_r
 	mem_cgroup_handle_over_high(GFP_KERNEL);
 	blkcg_maybe_throttle_current();
 
-	rseq_handle_notify_resume(regs);
+	rseq_handle_notify_resume(regs, from_irq);
 }
 
 #ifndef CONFIG_GENERIC_ENTRY
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -13,19 +13,19 @@ static inline void rseq_set_notify_resum
 		set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
 }
 
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs,
+				 bool from_irq);
 
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs, bool from_irq)
 {
 	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || READ_ONCE(current->rseq_event_pending))
-		__rseq_handle_notify_resume(NULL, regs);
+		__rseq_handle_notify_resume(NULL, regs, from_irq);
 }
 
-static inline void rseq_signal_deliver(struct ksignal *ksig,
-				       struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
 {
 	if (current->rseq)
-		__rseq_handle_notify_resume(ksig, regs);
+		__rseq_handle_notify_resume(ksig, regs, false);
 }
 
 static inline void rseq_notify_event(struct task_struct *t)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -408,6 +408,22 @@ static int rseq_ip_fixup(struct pt_regs
 	return 0;
 }
 
+static inline bool rseq_ignore_event(bool from_irq, bool ksig)
+{
+	/*
+	 * On architectures which do not select CONFIG_GENERIC_ENTRY,
+	 * @from_irq is not usable.
+	 */
+	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || !IS_ENABLED(CONFIG_GENERIC_ENTRY))
+		return false;
+
+	/*
+	 * Avoid the heavy lifting when this is a return from syscall,
+	 * i.e. not from interrupt and not from signal delivery.
+	 */
+	return !from_irq && !ksig;
+}
+
 /*
  * This resume handler must always be executed between any of:
  * - preemption,
@@ -419,7 +435,8 @@ static int rseq_ip_fixup(struct pt_regs
  * respect to other threads scheduled on the same CPU, and with respect
  * to signal handlers.
  */
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs,
+				 bool from_irq)
 {
 	struct task_struct *t = current;
 	int ret, sig;
@@ -467,6 +484,9 @@ void __rseq_handle_notify_resume(struct
 			t->rseq_event_pending = false;
 		}
 
+		if (rseq_ignore_event(from_irq, !!ksig))
+			event = false;
+
 		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
 			ret = rseq_ip_fixup(regs, event);
 			if (unlikely(ret < 0))


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [patch 11/11] rseq: Convert to masked user access where applicable
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (9 preceding siblings ...)
  2025-08-13 16:29 ` [patch 10/11] rseq: Skip fixup when returning from a syscall Thomas Gleixner
@ 2025-08-13 16:29 ` Thomas Gleixner
  2025-08-13 17:45 ` [patch 00/11] rseq: Optimize exit to user space Jens Axboe
  2025-08-20 14:10 ` Mathieu Desnoyers
  12 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 16:29 UTC (permalink / raw)
  To: LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Masked user access optimizes the Spectre-V1 speculation barrier on
architectures which support it. As rseq_handle_notify_resume() is frequently
invoked, the access to the critical section pointer in the rseq ABI is a hot
path which is worth optimizing.

Also replace the clearing of that pointer with the optimized variant.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
---
 kernel/rseq.c |   23 +++--------------------
 1 file changed, 3 insertions(+), 20 deletions(-)

--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -245,20 +245,9 @@ static int rseq_reset_rseq_cpu_node_id(s
 /*
  * Get the user-space pointer value stored in the 'rseq_cs' field.
  */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
+static inline int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
 {
-	if (!rseq_cs)
-		return -EFAULT;
-
-#ifdef CONFIG_64BIT
-	if (get_user(*rseq_cs, &rseq->rseq_cs))
-		return -EFAULT;
-#else
-	if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
-		return -EFAULT;
-#endif
-
-	return 0;
+	return get_user_masked_u64(rseq_cs, &rseq->rseq_cs);
 }
 
 /*
@@ -358,13 +347,7 @@ static int clear_rseq_cs(struct rseq __u
 	 *
 	 * Set rseq_cs to NULL.
 	 */
-#ifdef CONFIG_64BIT
-	return put_user(0UL, &rseq->rseq_cs);
-#else
-	if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
-		return -EFAULT;
-	return 0;
-#endif
+	return put_user_masked_u64(0UL, &rseq->rseq_cs);
 }
 
 /*


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 07/11] entry: Cleanup header
  2025-08-13 16:29 ` [patch 07/11] entry: Cleanup header Thomas Gleixner
@ 2025-08-13 17:09   ` Giorgi Tchankvetadze
  2025-08-13 21:30     ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Giorgi Tchankvetadze @ 2025-08-13 17:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

Typo in kernel-doc:
In irq-entry-common.h the doc line reads Pointer to currents pt_regs —
should be Pointer to current's pt_regs (or Pointer to current->pt_regs
depending on local style).


 Stray blank/trailing whitespace:
In entry-common.h there is an extra blank line inserted after the
ARCH_SYSCALL_WORK_ENTER) macro continuation — remove the stray blank
to avoid style/noise.

On Wed, Aug 13, 2025 at 8:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Cleanup the include ordering, kernel-doc and other trivialities before
> making further changes.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  include/linux/entry-common.h     |    8 ++++----
>  include/linux/irq-entry-common.h |    2 ++
>  2 files changed, 6 insertions(+), 4 deletions(-)
>
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -3,11 +3,11 @@
>  #define __LINUX_ENTRYCOMMON_H
>
>  #include <linux/irq-entry-common.h>
> +#include <linux/livepatch.h>
>  #include <linux/ptrace.h>
> +#include <linux/resume_user_mode.h>
>  #include <linux/seccomp.h>
>  #include <linux/sched.h>
> -#include <linux/livepatch.h>
> -#include <linux/resume_user_mode.h>
>
>  #include <asm/entry-common.h>
>  #include <asm/syscall.h>
> @@ -37,6 +37,7 @@
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
>                                  ARCH_SYSCALL_WORK_ENTER)
> +
>  #define SYSCALL_WORK_EXIT      (SYSCALL_WORK_SYSCALL_TRACEPOINT |      \
>                                  SYSCALL_WORK_SYSCALL_TRACE |           \
>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
> @@ -61,8 +62,7 @@
>   */
>  void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
>
> -long syscall_trace_enter(struct pt_regs *regs, long syscall,
> -                        unsigned long work);
> +long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
>
>  /**
>   * syscall_enter_from_user_mode_work - Check and handle work before invoking
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
>
>  /**
>   * enter_from_user_mode - Establish state when coming from user mode
> + * @regs:      Pointer to currents pt_regs
>   *
>   * Syscall/interrupt entry disables interrupts, but user mode is traced as
>   * interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
> @@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
>   * Conditional reschedule with additional sanity checks.
>   */
>  void raw_irqentry_exit_cond_resched(void);
> +
>  #ifdef CONFIG_PREEMPT_DYNAMIC
>  #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>  #define irqentry_exit_cond_resched_dynamic_enabled     raw_irqentry_exit_cond_resched
>
>

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (10 preceding siblings ...)
  2025-08-13 16:29 ` [patch 11/11] rseq: Convert to masked user access where applicable Thomas Gleixner
@ 2025-08-13 17:45 ` Jens Axboe
  2025-08-13 21:32   ` Thomas Gleixner
  2025-08-20 14:10 ` Mathieu Desnoyers
  12 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2025-08-13 17:45 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu

On 8/13/25 10:29 AM, Thomas Gleixner wrote:
> With the more wide spread usage of rseq in glibc, rseq is not longer a
> niche use case for special applications.
> 
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
> 
>   1) task::rseq_event_mask is a pointless bitfield despite the fact that
>      the ABI flags it was meant to support have been deprecated and
>      functionally disabled three years ago.
> 
>   2) task::rseq_event_mask is accumulating bits unless there is a critical
>      section discovered in the user space rseq memory. This results in
>      pointless invocations of the rseq user space exit handler even if
>      there had nothing changed. As a matter of correctness these bits have
>      to be clear when exiting to user space and therefore pristine when
>      coming back into the kernel. Aside of correctness, this also avoids
>      pointless evaluation of the user space memory, which is a performance
>      benefit.
> 
>   3) The evaluation of critical sections does not differentiate between
>      syscall and interrupt/exception exits. The current implementation
>      silently fixes up critical sections which invoked a syscall unless
>      CONFIG_DEBUG_RSEQ is enabled.
> 
>      That's just wrong. If user space does that on a production kernel it
>      can keep the pieces. The kernel is not there to proliferate mindless
>      user space programming and letting everyone pay the performance
>      penalty.
> 
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.
> 
> It survives the kernels rseq selftests, but I did not any performance tests
> vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
> 
> The performance gain on regular workloads is clearly measurable and the
> consistent event flag state allows now to build the time slice extension
> mechanism on top. The first POC I implemented:
> 
>    https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
> 
> suffered badly from the stale eventmask bits and the cleaned up version
> brought a whopping 25+% performance gain.

Thanks for doing this work, it's been on my list to take a look at rseq
as it's quite the pig currently and enabled by default (with what I
assume is from a newer libc).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 07/11] entry: Cleanup header
  2025-08-13 17:09   ` Giorgi Tchankvetadze
@ 2025-08-13 21:30     ` Thomas Gleixner
  0 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 21:30 UTC (permalink / raw)
  To: Giorgi Tchankvetadze
  Cc: LKML, Michael Jeanson, Peter Zijlstra, Mathieu Desnoyers,
	Paul E. McKenney, Boqun Feng, Wei Liu, Jens Axboe

On Wed, Aug 13 2025 at 21:09, Giorgi Tchankvetadze wrote:

Please do not top-post and reply inline so there is context:

       https://people.kernel.org/tglx/notes-about-netiquette

> Typo in kernel-doc:
> In irq-entry-common.h the doc line reads Pointer to currents pt_regs —
> should be Pointer to current's pt_regs (or Pointer to current->pt_regs
> depending on local style).

Good point.

>  Stray blank/trailing whitespace:
> In entry-common.h there is an extra blank line inserted after the
> ARCH_SYSCALL_WORK_ENTER) macro continuation — remove the stray blank
> to avoid style/noise.

No.

>> @@ -37,6 +37,7 @@
>>                                  SYSCALL_WORK_SYSCALL_AUDIT |           \
>>                                  SYSCALL_WORK_SYSCALL_USER_DISPATCH |   \
>>                                  ARCH_SYSCALL_WORK_ENTER)

There is no continuation of the SYSCALL_WORK_ENTER macro.

>> +

And this is an empty new line to visually seperate SYSCALL_WORK_ENTER
from SYSCALL_WORK_EXIT.

>>  #define SYSCALL_WORK_EXIT      (SYSCALL_WORK_SYSCALL_TRACEPOINT |      \

No?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 17:45 ` [patch 00/11] rseq: Optimize exit to user space Jens Axboe
@ 2025-08-13 21:32   ` Thomas Gleixner
  2025-08-13 21:36     ` Jens Axboe
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 21:32 UTC (permalink / raw)
  To: Jens Axboe, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu

On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote:
> On 8/13/25 10:29 AM, Thomas Gleixner wrote:
>> 
>> suffered badly from the stale eventmask bits and the cleaned up version
>> brought a whopping 25+% performance gain.
>
> Thanks for doing this work, it's been on my list to take a look at rseq
> as it's quite the pig currently and enabled by default (with what I
> assume is from a newer libc).

Yes, recent glibc uses it.

Could you give it a test ride to see whether this makes a difference in
your environment?

Thanks

        tglx


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 21:32   ` Thomas Gleixner
@ 2025-08-13 21:36     ` Jens Axboe
  2025-08-13 22:08       ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Jens Axboe @ 2025-08-13 21:36 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu

On 8/13/25 3:32 PM, Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 11:45, Jens Axboe wrote:
>> On 8/13/25 10:29 AM, Thomas Gleixner wrote:
>>>
>>> suffered badly from the stale eventmask bits and the cleaned up version
>>> brought a whopping 25+% performance gain.
>>
>> Thanks for doing this work, it's been on my list to take a look at rseq
>> as it's quite the pig currently and enabled by default (with what I
>> assume is from a newer libc).
> 
> Yes, recent glibc uses it.
> 
> Could you give it a test ride to see whether this makes a difference in
> your environment?

Yep, I'll give a spin. Won't be before Monday as I'm out of town thu/fri
this week, just as a heads-up.

If I recall correctly, I'd see rseq at 2-3% of overall CPU time last I
tested. I'll try and do some pre/post numbers for some of the stuff I
usually run.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 21:36     ` Jens Axboe
@ 2025-08-13 22:08       ` Thomas Gleixner
  2025-08-17 21:23         ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-13 22:08 UTC (permalink / raw)
  To: Jens Axboe, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu

On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>> Could you give it a test ride to see whether this makes a difference in
>> your environment?
>
> Yep, I'll give a spin.

Appreciated.

> If I recall correctly, I'd see rseq at 2-3% of overall CPU time last I

I'm not surprised.

> tested. I'll try and do some pre/post numbers for some of the stuff I
> usually run.

Cool. Thanks

      tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 10/11] rseq: Skip fixup when returning from a syscall
  2025-08-13 16:29 ` [patch 10/11] rseq: Skip fixup when returning from a syscall Thomas Gleixner
@ 2025-08-14  8:54   ` Peter Zijlstra
  2025-08-14 13:24     ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2025-08-14  8:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Michael Jeanson, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Wei Liu, Jens Axboe

On Wed, Aug 13, 2025 at 06:29:37PM +0200, Thomas Gleixner wrote:

> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -408,6 +408,22 @@ static int rseq_ip_fixup(struct pt_regs
>  	return 0;
>  }
>  
> +static inline bool rseq_ignore_event(bool from_irq, bool ksig)
> +{
> +	/*
> +	 * On architectures which do not select CONFIG_GENERIC_ENTRY
> +	 * @from_irq is not usable.
> +	 */
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || !IS_ENABLED(CONFIG_GENERIC_ENTRY))
> +		return false;
> +
> +	/*
> +	 * Avoid the heavy lifting when this is a return from syscall,
> +	 * i.e. not from interrupt and not from signal delivery.
> +	 */
> +	return !from_irq && !ksig;
> +}
> +
>  /*
>   * This resume handler must always be executed between any of:
>   * - preemption,

> @@ -467,6 +484,9 @@ void __rseq_handle_notify_resume(struct
>  			t->rseq_event_pending = false;
>  		}
>  
> +		if (rseq_ignore_event(from_irq, !!ksig))
> +			event = false;
> +
>  		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
>  			ret = rseq_ip_fixup(regs, event);
>  			if (unlikely(ret < 0))
> 

You now have a double check for CONFIG_DEBUG_RSEQ.

Since the value of @event is immaterial when DEBUG_RSEQ, might as well
remove it from rseq_ignore_event(), right?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 10/11] rseq: Skip fixup when returning from a syscall
  2025-08-14  8:54   ` Peter Zijlstra
@ 2025-08-14 13:24     ` Thomas Gleixner
  0 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-14 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Michael Jeanson, Mathieu Desnoyers, Paul E. McKenney,
	Boqun Feng, Wei Liu, Jens Axboe

On Thu, Aug 14 2025 at 10:54, Peter Zijlstra wrote:
> On Wed, Aug 13, 2025 at 06:29:37PM +0200, Thomas Gleixner wrote:
>
>> --- a/kernel/rseq.c
>> +++ b/kernel/rseq.c
>> @@ -408,6 +408,22 @@ static int rseq_ip_fixup(struct pt_regs
>>  	return 0;
>>  }
>>  
>> +static inline bool rseq_ignore_event(bool from_irq, bool ksig)
>> +{
>> +	/*
>> +	 * On architectures which do not select CONFIG_GENERIC_ENTRY
>> +	 * @from_irq is not usable.
>> +	 */
>> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || !IS_ENABLED(CONFIG_GENERIC_ENTRY))
>> +		return false;
>> +
>> +	/*
>> +	 * Avoid the heavy lifting when this is a return from syscall,
>> +	 * i.e. not from interrupt and not from signal delivery.
>> +	 */
>> +	return !from_irq && !ksig;
>> +}
>> +
>>  /*
>>   * This resume handler must always be executed between any of:
>>   * - preemption,
>
>> @@ -467,6 +484,9 @@ void __rseq_handle_notify_resume(struct
>>  			t->rseq_event_pending = false;
>>  		}
>>  
>> +		if (rseq_ignore_event(from_irq, !!ksig))
>> +			event = false;
>> +
>>  		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
>>  			ret = rseq_ip_fixup(regs, event);
>>  			if (unlikely(ret < 0))
>> 
>
> You now have a double check for CONFIG_DEBUG_RSEQ.
>
> Since the value of @event is immaterial when DEBUG_RSEQ, might as well
> remove it from rseq_ignore_event(), right?

Not really. The debug case wants the event preserved even if it's !from_irq.

Yes, it's not pretty, but I wanted to preserve the debug behaviour as
much as possible.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 22:08       ` Thomas Gleixner
@ 2025-08-17 21:23         ` Thomas Gleixner
  2025-08-18 14:00           ` BUG: rseq selftests and librseq vs. glibc fail Thomas Gleixner
                             ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-17 21:23 UTC (permalink / raw)
  To: Jens Axboe, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu

On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote:
> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
>> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>>> Could you give it a test ride to see whether this makes a difference in
>>> your environment?
>>
>> Yep, I'll give a spin.
>
> Appreciated.

Please do not use the git branch I had in the cover letter. I did some
more analysis of this and it's even worse than I thought. Use

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip

instead.

I've rewritten the whole pile by now and made it a real fast path
without the TIF_NOTIFY horror show, unless the fast path, which runs
_after_ the TIF work loop, faults. So far that happens once for each
fork(), as the child has to fault in the copy of the user space rseq region.

There are lightweight per CPU stats for the various events, which can be
accessed via

      /sys/kernel/debug/rseq/stat

which exposes a racy sum of them. Here is the output after a reboot and
a full kernel recompile

exit:           85703905    // Total invocations
signal:            34635    // Invocations from signal delivery
slowp:               134    // Slow path via TIF_NOTIFY_RESUME
ids:               70052    // Updates of CPU and MM CID
cs:                    0    // Critical section analysis
clear:                 0    // Clearing of critical section
fixup:                 0    // Fixup of critical section (abort)

Before the rewrite this took more than a million of ID updates and
critical section evaluations even when completely pointless.

So on any syscall or interrupt heavy workload this should be clearly
visible as a difference in the profile.

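For example, a crude way to attribute the counters to a particular workload
is to snapshot the file around it (racy, but good enough for comparison):

  # cat /sys/kernel/debug/rseq/stat > /tmp/rseq.before
  # <run the workload>
  # cat /sys/kernel/debug/rseq/stat > /tmp/rseq.after
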
I'm still not happy about the exit to user fast path decision as it's
two conditionals instead of one, but all attempts to make that decision
lightweight somewhere else turned out to make stuff worse, as I just
burdened other fast path operations, i.e. the scheduler, with pointless
conditionals.

I'll think about that more, but nevertheless this is way better than the
current horror show.

I also have no real good plan yet how to gradually convert this over,
but I'm way too tired to think about that now.

It survives the selftest suite after I wasted a day figuring out why
the selftests reliably segfault on a machine which has Debian Trixie
installed. The fix is in the branch.

Michael, can you please run your librseq tests against that too? They
have the same segfault problem as the kernel and they lack a run script,
so I couldn't be bothered to test against them. See commit 2bff3a0e5998
in that branch. I'll send out a patch with a proper change log later.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 43+ messages in thread

* BUG: rseq selftests and librseq vs. glibc fail
  2025-08-17 21:23         ` Thomas Gleixner
@ 2025-08-18 14:00           ` Thomas Gleixner
  2025-08-18 14:15             ` Florian Weimer
  2025-08-18 17:38           ` [patch 00/11] rseq: Optimize exit to user space Michael Jeanson
  2025-08-20 14:27           ` Mathieu Desnoyers
  2 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-18 14:00 UTC (permalink / raw)
  To: Jens Axboe, LKML
  Cc: Michael Jeanson, Mathieu Desnoyers, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Sean Christopherson,
	Florian Weimer, Samuel Thibault

On Sun, Aug 17 2025 at 23:23, Thomas Gleixner wrote:
> It survives the self test suite after I wasted a day to figure out why
> the selftests reliably segfault on a machine which has debian trixie
> installed. The fix is in the branch.

That's glibc 2.41 FWIW. glibc 2.36 from Debian 12 does not have this
problem.

The fix unfortunately only works with a dynamically linked libc; a
statically linked libc fails. The fix is basically a revert of

   3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked
                  against glibc 2.35+")

which introduced these weak libc symbols to make static libc linking work.

I have no idea why this creates havoc, but in GDB I saw that libc
manages to overwrite the TLS of the pthread at some place; I gave up
decoding it further. If no pthread is created it just works. Removing
this weak muck makes it work too.

It's trivial to reproduce. All it needs is to have in the source:

__weak ptrdiff_t __rseq_offset;

w/o even being referenced and creating a pthread. Reproducer below.

TBH, this interface is a horrible hack. libc should expose a proper
function which queries whether it has registered rseq and returns the
parameters in a struct. But well...

Build reproducer with:

# gcc -O2 -Wall -o t test.c
# ./t
Segmentation fault

# gcc -O2 -Wall -o t test.c -static
# ./t
#

Removing the weak __rseq_offset makes the dynamic case work too.

Yours sufficiently grumpy

      tglx

----
#define _GNU_SOURCE
#include <pthread.h>
#include <stddef.h>

#define __weak  __attribute__((__weak__))
__weak ptrdiff_t __rseq_offset;

static void *foo(void *arg)
{
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t t;

	pthread_create(&t, NULL, foo, NULL);
	pthread_join(t, NULL);
	return 0;
}


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 14:00           ` BUG: rseq selftests and librseq vs. glibc fail Thomas Gleixner
@ 2025-08-18 14:15             ` Florian Weimer
  2025-08-18 17:13               ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Florian Weimer @ 2025-08-18 14:15 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, LKML, Michael Jeanson, Mathieu Desnoyers,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Wei Liu,
	Sean Christopherson, Samuel Thibault

* Thomas Gleixner:

> On Sun, Aug 17 2025 at 23:23, Thomas Gleixner wrote:
>> It survives the self test suite after I wasted a day to figure out why
>> the selftests reliably segfault on a machine which has debian trixie
>> installed. The fix is in the branch.
>
> That's glibc 2.41 FWIW. glibc 2.36 from Debian 12 does not have this
> problem.
>
> The fix unfortunately only works with a dynamically linked libc,
> statically linked libc fails. The fix is basically a revert of
>
>    3bcbc20942db ("selftests/rseq: Play nice with binaries statically linked
>                   against glibc 2.35+")
>
> which introduced these weak libc symbols to make static libc linking work.
>
> I have no idea why this creates havoc, but in GDB I saw that libc
> manages to overwrite the TLS of the pthread at some place, but I gave up
> decoding it further. If no pthread is created it just works. Removing
> this weak muck makes it work too.
>
> It's trivial to reproduce. All it needs is to have in the source:
>
> __weak ptrdiff_t __rseq_offset;
>
> w/o even being referenced and creating a pthread. Reproducer below.

Well, that's sort of expected.  You can't define glibc symbols that are
not intended for interposition and expect things to work.  It's kind of
like writing:

int _rtld_global;

That's going to fail rather spectacularly, too.  We make an exception for
symbols that are not reserved (you can build in ISO C mode and define
open, close, etc., at least as long as you link to glibc only).  But
__rseq_offset is a reserved name, so that is not applicable here.

The real change here is GCC changing from -fcommon (which made a lot of
these things work in the past) to -fno-common.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 14:15             ` Florian Weimer
@ 2025-08-18 17:13               ` Thomas Gleixner
  2025-08-18 19:33                 ` Florian Weimer
  2025-08-29 18:44                 ` Prakash Sangappa
  0 siblings, 2 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-18 17:13 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Jens Axboe, LKML, Michael Jeanson, Mathieu Desnoyers,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Wei Liu,
	Sean Christopherson, Samuel Thibault

On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
> * Thomas Gleixner:
>> It's trivial to reproduce. All it needs is to have in the source:
>>
>> __weak ptrdiff_t __rseq_offset;
>>
>> w/o even being referenced and creating a pthread. Reproducer below.
>
> Well, that's sort of expected.  You can't define glibc symbols that are
> not intended for interposition and expect things to work.  It's kind of
> like writing:
>
> int _rtld_global;
>
> That's going to fail rather spectaculary, too.  We make an exception for
> symbols that are not reserved (you can build in ISO C mode and define
> open, close, etc., at least as long as you link to glibc only).  But
> __rseq_offset is a reserved name, so that is not applicable here.
>
> The real change here is GCC changing from -fcommon (which made a lot of
> these things work in the past) to -fno-common.

Thanks for the explanation!

So the only way to make this actually work is to revert that commit and
the folks who want to link that statically need to come up with:

#ifdef _BUILD_STATICALLY
extern ....

#else
        ptr = dlsym(...);
#endif

or something daft like that. A proper function interface would avoid all
that nonsense, but we can't have nice things or can we?

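Purely hypothetical sketch of what such an interface could look like (these
names do not exist in glibc, they are made up for illustration):

#include <stdbool.h>
#include <stddef.h>

struct rseq_libc_params {
	ptrdiff_t	offset;	/* offset of struct rseq from the thread pointer */
	unsigned int	size;
	unsigned int	flags;
};

/* Returns true and fills @params if libc has registered rseq. */
bool rseq_libc_get_params(struct rseq_libc_params *params);
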
Thanks,

        tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-17 21:23         ` Thomas Gleixner
  2025-08-18 14:00           ` BUG: rseq selftests and librseq vs. glibc fail Thomas Gleixner
@ 2025-08-18 17:38           ` Michael Jeanson
  2025-08-18 20:21             ` Thomas Gleixner
  2025-08-20 14:27           ` Mathieu Desnoyers
  2 siblings, 1 reply; 43+ messages in thread
From: Michael Jeanson @ 2025-08-18 17:38 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu

On 2025-08-17 17:23, Thomas Gleixner wrote:
> Michael, can you please run your librseq tests against that too? They
> have the same segfault problem as the kernel and they lack a run script,
> so I couldn't be bothered to test against them. See commit 2bff3a0e5998
> in that branch. I'll send out a patch with a proper change log later.

I ran the librseq test suite on the new branch on a Debian Trixie amd64 
system and it succeeds, here are the rseq stats before and after.

Before:

exit:             746809
signal:                3
slowp:                99
ids:                1053
cs:                    0
clear:                 0
fixup:                 0

After:

exit:          229294046
signal:               11
slowp:              4570
ids:              615950
cs:              2493682
clear:            194637
fixup:           2299044


And I also ran the same test suite in a 32bit chroot on the same system 
which also succeeds with the following rseq stats.

Before:

exit:             717945
signal:                1
slowp:               102
ids:                1039
cs:                    0
clear:                 0
fixup:                 0

After:

exit:          201051038
signal:                9
slowp:              4909
ids:              793551
cs:                28887
clear:             12871
fixup:             16016


If you want to run the librseq tests on your system, just do the regular 
autotools dance and then run 'make check'.

Regards,

Michael

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 17:13               ` Thomas Gleixner
@ 2025-08-18 19:33                 ` Florian Weimer
  2025-08-18 19:46                   ` Sean Christopherson
  2025-08-29 18:44                 ` Prakash Sangappa
  1 sibling, 1 reply; 43+ messages in thread
From: Florian Weimer @ 2025-08-18 19:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jens Axboe, LKML, Michael Jeanson, Mathieu Desnoyers,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Wei Liu,
	Sean Christopherson, Samuel Thibault

* Thomas Gleixner:

> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
>> * Thomas Gleixner:
>>> It's trivial to reproduce. All it needs is to have in the source:
>>>
>>> __weak ptrdiff_t __rseq_offset;
>>>
>>> w/o even being referenced and creating a pthread. Reproducer below.
>>
>> Well, that's sort of expected.  You can't define glibc symbols that are
>> not intended for interposition and expect things to work.  It's kind of
>> like writing:
>>
>> int _rtld_global;
>>
>> That's going to fail rather spectacularly, too.  We make an exception for
>> symbols that are not reserved (you can build in ISO C mode and define
>> open, close, etc., at least as long as you link to glibc only).  But
>> __rseq_offset is a reserved name, so that is not applicable here.
>>
>> The real change here is GCC changing from -fcommon (which made a lot of
>> these things work in the past) to -fno-common.
>
> Thanks for the explanation!
>
> So the only way to make this actually work is to revert that commit and
> the folks who want to link that statically need to come up with:
>
> #ifdef _BUILD_STATICALLY
> extern ....
>
> #else
>         ptr = dlsym(...);
> #endif
>
> or something daft like that. A proper function interface would avoid all
> that nonsense, but we can't have nice things or can we?

I don't understand why a function would be different.  Well, a function
*declaration* would be implicitly extern, in a way a variable
declaration is not (without -fcommon).  Maybe it's just about the
missing extern keyword?

You could add the extern keyword and check &__rseq_offset for NULL if
you want to probe for the availability of the symbol?  Or use:

#if __has_include(<sys/rseq.h>)
#include <sys/rseq.h>
/* Code that depends on glibc's rseq support goes here. */
#endif

Thanks,
Florian


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 19:33                 ` Florian Weimer
@ 2025-08-18 19:46                   ` Sean Christopherson
  2025-08-18 19:55                     ` Florian Weimer
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-08-18 19:46 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thomas Gleixner, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Samuel Thibault

On Mon, Aug 18, 2025, Florian Weimer wrote:
> * Thomas Gleixner:
> 
> > On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
> >> * Thomas Gleixner:
> >>> It's trivial to reproduce. All it needs is to have in the source:
> >>>
> >>> __weak ptrdiff_t __rseq_offset;
> >>>
> >>> w/o even being referenced and creating a pthread. Reproducer below.
> >>
> >> Well, that's sort of expected.  You can't define glibc symbols that are
> >> not intended for interposition and expect things to work.  It's kind of
> >> like writing:
> >>
> >> int _rtld_global;
> >>
> >> That's going to fail rather spectacularly, too.  We make an exception for
> >> symbols that are not reserved (you can build in ISO C mode and define
> >> open, close, etc., at least as long as you link to glibc only).  But
> >> __rseq_offset is a reserved name, so that is not applicable here.
> >>
> >> The real change here is GCC changing from -fcommon (which made a lot of
> >> these things work in the past) to -fno-common.
> >
> > Thanks for the explanation!
> >
> > So the only way to make this actually work is to revert that commit and
> > the folks who want to link that statically need to come up with:
> >
> > #ifdef _BUILD_STATICALLY
> > extern ....
> >
> > #else
> >         ptr = dlsym(...);
> > #endif
> >
> > or something daft like that. A proper function interface would avoid all
> > that nonsense, but we can't have nice things or can we?
> 
> I don't understand why a function would be different.  Well, a function
> *declaration* would be implicitly extern, in a way a variable
> declaration is not (without -fcommon).  Maybe it's just about the
> missing extern keyword?
> 
> You could add the extern keyword and check &__rseq_offset for NULL if
> >> you want to probe for the availability of the symbol?

That will fail to link if the glibc version doesn't support rseq in any capacity,
which is why I added the __weak crud.

>Or use:
> 
> #if __has_include(<sys/rseq.h>)
> #include <sys/rseq.h>
> /* Code that depends on glibc's rseq support goes here. */

FWIW, the code in question doesn't depend on rseq per se, rather the problem is
that attempting to register a restartable sequence fails if glibc has already
"claimed" rseq.

What about something horrific like this?  Or if __has_include(<sys/rseq.h>) is
preferable to checking the glibc version, go with that.  The idea with checking
LIBC_RSEQ_STATIC_LINK is to give folks a way to force static linking if their
libc registers rseq, but doesn't satisfy the glibc version checks for whatever
reason.

diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 663a9cef1952..1a88352fcff3 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -36,17 +36,18 @@
 #include "../kselftest.h"
 #include "rseq.h"
 
-/*
- * Define weak versions to play nice with binaries that are statically linked
- * against a libc that doesn't support registering its own rseq.
- */
-__weak ptrdiff_t __rseq_offset;
-__weak unsigned int __rseq_size;
-__weak unsigned int __rseq_flags;
+#if defined(LIBC_RSEQ_STATIC_LINK) || __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 35)
+extern ptrdiff_t __rseq_offset;
+extern unsigned int __rseq_size;
+extern unsigned int __rseq_flags;
+#define GLIBC_RSEQ_PTR(x) &x
+#else
+#define GLIBC_RSEQ_PTR(x) NULL
+#endif
 
-static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset;
-static const unsigned int *libc_rseq_size_p = &__rseq_size;
-static const unsigned int *libc_rseq_flags_p = &__rseq_flags;
+static const ptrdiff_t *libc_rseq_offset_p = GLIBC_RSEQ_PTR(__rseq_offset);
+static const unsigned int *libc_rseq_size_p = GLIBC_RSEQ_PTR(__rseq_size);
+static const unsigned int *libc_rseq_flags_p = GLIBC_RSEQ_PTR(__rseq_flags);
 
 /* Offset from the thread pointer to the rseq area. */
 ptrdiff_t rseq_offset;
@@ -209,7 +210,7 @@ void rseq_init(void)
         * libc not having registered a restartable sequence.  Try to find the
         * symbols if that's the case.
         */
-       if (!*libc_rseq_size_p) {
+       if (!libc_rseq_size_p || !*libc_rseq_size_p) {
                libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset");
                libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size");
                libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags");

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 19:46                   ` Sean Christopherson
@ 2025-08-18 19:55                     ` Florian Weimer
  2025-08-18 20:27                       ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: Florian Weimer @ 2025-08-18 19:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Samuel Thibault

* Sean Christopherson:

> On Mon, Aug 18, 2025, Florian Weimer wrote:
>> You could add the extern keyword and check &__rseq_offset for NULL if
>> you want to probe for the availability of the symbol?
>
> That will fail to link if the glibc version doesn't support rseq in
> any capacity, which is why I added the __weak crud.

You need both (extern and weak) to get a weak symbol reference instead
of a weak symbol definition.  You still need to check &__rseq_offset, of
course.

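For illustration, a standalone variant of that pattern (not the selftest
patch; the point is that a weak *reference* resolves to address 0 when libc
does not provide the symbols, instead of clobbering libc's own copy):

#include <stddef.h>
#include <stdio.h>

#define __weak	__attribute__((__weak__))

extern __weak ptrdiff_t __rseq_offset;
extern __weak unsigned int __rseq_size;

int main(void)
{
	/* With a weak reference, &sym is NULL when libc lacks the symbol. */
	if (&__rseq_offset && &__rseq_size)
		printf("libc rseq symbols found, __rseq_size=%u\n", __rseq_size);
	else
		printf("libc has no rseq symbols\n");
	return 0;
}
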
>>Or use:
>> 
>> #if __has_include(<sys/rseq.h>)
>> #include <sys/rseq.h>
>> /* Code that depends on glibc's rseq support goes here. */
>
> FWIW, the code in question doesn't depend on rseq per se, rather the problem is
> that attempting to register a restartable sequence fails if glibc has already
> "claimed" rseq.

You can set GLIBC_TUNABLES=glibc.pthread.rseq=0 as an environment
variable to prevent glibc from registering it.  For a test that's
probably okay?  It won't help with other libcs that might use rseq
eventually.

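E.g. (the binary name is just an example):

  $ GLIBC_TUNABLES=glibc.pthread.rseq=0 ./rseq_test
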
> What about something horrific like this?  Or if __has_include(<sys/rseq.h>) is
> preferable to checking the glibc version, go with that.  The idea with checking
> LIBC_RSEQ_STATIC_LINK is to give folks a way to force static linking if their
> libc registers rseq, but doesn't satisfy the glibc version checks for whatever
> reason.
>
> diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
> index 663a9cef1952..1a88352fcff3 100644
> --- a/tools/testing/selftests/rseq/rseq.c
> +++ b/tools/testing/selftests/rseq/rseq.c
> @@ -36,17 +36,18 @@
>  #include "../kselftest.h"
>  #include "rseq.h"
>  
> -/*
> - * Define weak versions to play nice with binaries that are statically linked
> - * against a libc that doesn't support registering its own rseq.
> - */
> -__weak ptrdiff_t __rseq_offset;
> -__weak unsigned int __rseq_size;
> -__weak unsigned int __rseq_flags;
> +#if defined(LIBC_RSEQ_STATIC_LINK) || __GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 35)
> +extern ptrdiff_t __rseq_offset;
> +extern unsigned int __rseq_size;
> +extern unsigned int __rseq_flags;
> +#define GLIBC_RSEQ_PTR(x) &x
> +#else
> +#define GLIBC_RSEQ_PTR(x) NULL
> +#endif

That doesn't work for a couple of distributions nowadays that are
nominally based on glibc 2.34.  The AArch64 performance improvement for
sched_getcpu was just too sweet to give up, so these distributions have
__rseq_offset@@GLIBC_2.35, too.  We added these symbols specifically so
that applications that need to interact with rseq can do so safely, even
though glibc claimed it for the sched_getcpu acceleration.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-18 17:38           ` [patch 00/11] rseq: Optimize exit to user space Michael Jeanson
@ 2025-08-18 20:21             ` Thomas Gleixner
  2025-08-18 21:29               ` Michael Jeanson
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-18 20:21 UTC (permalink / raw)
  To: Michael Jeanson, Jens Axboe, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu

On Mon, Aug 18 2025 at 13:38, Michael Jeanson wrote:

> On 2025-08-17 17:23, Thomas Gleixner wrote:
>> Michael, can you please run your librseq tests against that too? They
>> have the same segfault problem as the kernel and they lack a run script,
>> so I couldn't be bothered to test against them. See commit 2bff3a0e5998
>> in that branch. I'll send out a patch with a proper change log later.
>
> I ran the librseq test suite on the new branch on a Debian Trixie amd64 
> system and it succeeds, here are the rseq stats before and after.

Thanks!

> Before:
>
> exit:             746809
> signal:                3
> slowp:                99
> ids:                1053
> cs:                    0
> clear:                 0
> fixup:                 0
>
> After:
>
> exit:          229294046
> signal:               11
> slowp:              4570
> ids:              615950
> cs:              2493682
> clear:            194637
> fixup:           2299044

That looks about right. Can you reset the branch to

     commit 85b61b265635 ("rseq: Expose stats")

which is just adding primitive stats on top of the current mainline
code, and provide numbers for that too?

That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
mappable to the final ones, but that should give some interesting
insight.

> If you want to run the librseq tests on your system, just do the regular 
> autotools dance and then run 'make check'.

Might be useful to put such instructions into README.md, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 19:55                     ` Florian Weimer
@ 2025-08-18 20:27                       ` Sean Christopherson
  2025-08-18 23:54                         ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-08-18 20:27 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thomas Gleixner, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Samuel Thibault

On Mon, Aug 18, 2025, Florian Weimer wrote:
> * Sean Christopherson:
> 
> > On Mon, Aug 18, 2025, Florian Weimer wrote:
> >> You could add the extern keyword and check &__rseq_offset for NULL if
> >> you want to probe for the availability of the signal?
> >
> > That will fail to link if the glibc version doesn't support rseq in
> > any capacity, which is why I added the __weak crud.
> 
> You need both (extern and weak) to get a weak symbol reference instead
> of a weak symbol definition.  You still need to check &__rseq_offset, of
> course.

Ooh, you're saying add "extern" to the existing __weak symbol, not replace it.
Huh, TIL weak symbol references are a thing.

This works with static and dynamic linking, with and without an rseq-aware glibc.

Thomas, does this fix the problem you were seeing?

diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 663a9cef1952..d17ded120d48 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -40,9 +40,9 @@
  * Define weak versions to play nice with binaries that are statically linked
  * against a libc that doesn't support registering its own rseq.
  */
-__weak ptrdiff_t __rseq_offset;
-__weak unsigned int __rseq_size;
-__weak unsigned int __rseq_flags;
+extern __weak ptrdiff_t __rseq_offset;
+extern __weak unsigned int __rseq_size;
+extern __weak unsigned int __rseq_flags;
 
 static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset;
 static const unsigned int *libc_rseq_size_p = &__rseq_size;
@@ -209,7 +209,7 @@ void rseq_init(void)
         * libc not having registered a restartable sequence.  Try to find the
         * symbols if that's the case.
         */
-       if (!*libc_rseq_size_p) {
+       if (!libc_rseq_offset_p || !*libc_rseq_size_p) {
                libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset");
                libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size");
                libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags");

> >>Or use:
> >> 
> >> #if __has_include(<sys/rseq.h>)
> >> #include <sys/rseq.h>
> >> /* Code that depends on glibc's rseq support goes here. */
> >
> > FWIW, the code in question doesn't depend on rseq per se, rather the problem is
> > that attempting to register a restartable sequence fails if glibc has already
> > "claimed" rseq.
> 
> You can set GLIBC_TUNABLES=glibc.pthread.rseq=0 as an environment
> variable to prevent glibc from registering it.  For a test that's
> probably okay?  It won't help with other libcs that might use rseq
> eventually.

I vaguely recall exploring that option when trying to get static linking working.
I think I shied away from it because I wanted to have something that Just Worked
for all users, and didn't like the idea of hardcoding that into the KVM selftests
makefile.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-18 20:21             ` Thomas Gleixner
@ 2025-08-18 21:29               ` Michael Jeanson
  2025-08-18 23:43                 ` Thomas Gleixner
  0 siblings, 1 reply; 43+ messages in thread
From: Michael Jeanson @ 2025-08-18 21:29 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu

On 2025-08-18 16:21, Thomas Gleixner wrote:
> That looks about right. Can you reset the branch to
> 
>       commit 85b61b265635 ("rseq: Expose stats")
> 
> which is just adding primitive stats on top of the current mainline
> code, and provide numbers for that too?
> 
> That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
> mappable to the final ones, but that should give some interesting
> insight.

For amd64 kernel and userspace.

Before:

notify:             12467
fixup:              12467
cpuid:              12467


After:

notify:         123669528
fixup:          123669528
cpuid:          123669528


For amd64 kernel, i386 userspace.

Before:

notify:             12857
fixup:              12857
cpuid:              12857


After:

notify:         120621210
fixup:          120621210
cpuid:          120621210


> Might be useful to put such instructions into README.md, no?

Will do.

Regards,

Michael

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-18 21:29               ` Michael Jeanson
@ 2025-08-18 23:43                 ` Thomas Gleixner
  0 siblings, 0 replies; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-18 23:43 UTC (permalink / raw)
  To: Michael Jeanson, Jens Axboe, LKML
  Cc: Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu

On Mon, Aug 18 2025 at 17:29, Michael Jeanson wrote:

> On 2025-08-18 16:21, Thomas Gleixner wrote:
>> That looks about right. Can you reset the branch to
>> 
>>       commit 85b61b265635 ("rseq: Expose stats")
>> 
>> which is just adding primitive stats on top of the current mainline
>> code, and provide numbers for that too?
>> 
>> That gives you 'notify: , cpuid:, fixup:' numbers, which are not 1:1
>> mappable to the final ones, but that should give some interesting
>> insight.
>
> For amd64 kernel and userspace.
>
> Before:
>
> notify:             12467
> fixup:              12467
> cpuid:              12467
>
>
> After:
>
> notify:         123669528
> fixup:          123669528
> cpuid:          123669528

That's insane compared to this:

> exit:          229294046
> signal:               11
> slowp:              4570
> ids:              615950
> cs:              2493682
> clear:            194637
> fixup:           2299044

You can assume that the number of exits (to user) is roughly the same,
i.e. ~230M, so ~124M (> 50%) take the TIF_NOTIFY dance on the way out,
most of them for no good reason.

While with the rework only ~4.5K go into the NOTIFY slow path, ~2.5M
(~1%) do the critical section evaluation and ~600k (~0.3%) update CPU/MM
CID.

No surprise that Jens is seeing this in his profiles...

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 20:27                       ` Sean Christopherson
@ 2025-08-18 23:54                         ` Thomas Gleixner
  2025-08-19  0:28                           ` Sean Christopherson
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Gleixner @ 2025-08-18 23:54 UTC (permalink / raw)
  To: Sean Christopherson, Florian Weimer
  Cc: Jens Axboe, LKML, Michael Jeanson, Mathieu Desnoyers,
	Peter Zijlstra, Paul E. McKenney, Boqun Feng, Wei Liu,
	Samuel Thibault

On Mon, Aug 18 2025 at 13:27, Sean Christopherson wrote:
> On Mon, Aug 18, 2025, Florian Weimer wrote:
>> You need both (extern and weak) to get a weak symbol reference instead
>> of a weak symbol definition.  You still need to check &__rseq_offset, of
>> course.
>
> Ooh, you're saying add "extern" to the existing __weak symbol, not replace it.
> Huh, TIL weak symbol references are a thing.
>
> This works with static and dynamic linking, with and without an rseq-aware glibc.
>
> Thomas, does this fix the problem you were seeing?
>
> diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
> index 663a9cef1952..d17ded120d48 100644
> --- a/tools/testing/selftests/rseq/rseq.c
> +++ b/tools/testing/selftests/rseq/rseq.c
> @@ -40,9 +40,9 @@
>   * Define weak versions to play nice with binaries that are statically linked
>   * against a libc that doesn't support registering its own rseq.
>   */
> -__weak ptrdiff_t __rseq_offset;
> -__weak unsigned int __rseq_size;
> -__weak unsigned int __rseq_flags;
> +extern __weak ptrdiff_t __rseq_offset;
> +extern __weak unsigned int __rseq_size;
> +extern __weak unsigned int __rseq_flags;
>  
>  static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset;
>  static const unsigned int *libc_rseq_size_p = &__rseq_size;
> @@ -209,7 +209,7 @@ void rseq_init(void)
>          * libc not having registered a restartable sequence.  Try to find the
>          * symbols if that's the case.
>          */
> -       if (!*libc_rseq_size_p) {
> +       if (!libc_rseq_offset_p || !*libc_rseq_size_p) {

If I make that:

+       if (!*libc_rseq_offset_p || !*libc_rseq_size_p) {

then it makes sense and actually works. The pointer can hardly be NULL,
even when statically linked, no?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 23:54                         ` Thomas Gleixner
@ 2025-08-19  0:28                           ` Sean Christopherson
  2025-08-19  6:18                             ` Florian Weimer
  0 siblings, 1 reply; 43+ messages in thread
From: Sean Christopherson @ 2025-08-19  0:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Samuel Thibault

On Tue, Aug 19, 2025, Thomas Gleixner wrote:
> On Mon, Aug 18 2025 at 13:27, Sean Christopherson wrote:
> > On Mon, Aug 18, 2025, Florian Weimer wrote:
> >> You need both (extern and weak) to get a weak symbol reference instead
> >> of a weak symbol definition.  You still need to check &__rseq_offset, of
> >> course.
> >
> > Ooh, you're saying add "extern" to the existing __weak symbol, not replace it.
> > Huh, TIL weak symbol references are a thing.
> >
> > This works with static and dynamic linking, with and without an rseq-aware glibc.
> >
> > Thomas, does this fix the problem you were seeing?
> >
> > diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
> > index 663a9cef1952..d17ded120d48 100644
> > --- a/tools/testing/selftests/rseq/rseq.c
> > +++ b/tools/testing/selftests/rseq/rseq.c
> > @@ -40,9 +40,9 @@
> >   * Define weak versions to play nice with binaries that are statically linked
> >   * against a libc that doesn't support registering its own rseq.
> >   */
> > -__weak ptrdiff_t __rseq_offset;
> > -__weak unsigned int __rseq_size;
> > -__weak unsigned int __rseq_flags;
> > +extern __weak ptrdiff_t __rseq_offset;
> > +extern __weak unsigned int __rseq_size;
> > +extern __weak unsigned int __rseq_flags;
> >  
> >  static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset;
> >  static const unsigned int *libc_rseq_size_p = &__rseq_size;
> > @@ -209,7 +209,7 @@ void rseq_init(void)
> >          * libc not having registered a restartable sequence.  Try to find the
> >          * symbols if that's the case.
> >          */
> > -       if (!*libc_rseq_size_p) {
> > +       if (!libc_rseq_offset_p || !*libc_rseq_size_p) {

Doh, I meant to check libc_rseq_size_p for NULL, i.e.

	if (!libc_rseq_size_p || !*libc_rseq_size_p) {

> 
> If I make that:
> 
> +       if (!*libc_rseq_offset_p || !*libc_rseq_size_p) {
> 
> then it makes sense and actually works. The pointer can hardly be NULL,
> even when statically linked, no?

IIUC, it is indeed the pointers that are set to NULL/0, because for unresolved
symbols, the symbol itself, not its value, is set to '0'.  Which makes sense,
because if there is no symbol, then it can't have a value.

I.e. the address of the symbol is '0', and its value is undefined.

E.g. statically linking this against glibc without rseq support:

diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 663a9cef1952..959bdcb32e96 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -40,9 +40,9 @@
  * Define weak versions to play nice with binaries that are statically linked
  * against a libc that doesn't support registering its own rseq.
  */
-__weak ptrdiff_t __rseq_offset;
-__weak unsigned int __rseq_size;
-__weak unsigned int __rseq_flags;
+extern __weak ptrdiff_t __rseq_offset;
+extern __weak unsigned int __rseq_size;
+extern __weak unsigned int __rseq_flags;
 
 static const ptrdiff_t *libc_rseq_offset_p = &__rseq_offset;
 static const unsigned int *libc_rseq_size_p = &__rseq_size;
@@ -209,7 +209,12 @@ void rseq_init(void)
         * libc not having registered a restartable sequence.  Try to find the
         * symbols if that's the case.
         */
-       if (!*libc_rseq_size_p) {
+       printf("libc_rseq_offset_p = %lx (%lx), libc_rseq_size_p = %lx (%lx)\n",
+              (unsigned long)libc_rseq_offset_p, (unsigned long)libc_rseq_size_p,
+              (unsigned long)&__rseq_offset, (unsigned long)&__rseq_size);
+       printf("__rseq_size = %u\n", __rseq_size);
+
+       if (!libc_rseq_size_p || !*libc_rseq_size_p) {
                libc_rseq_offset_p = dlsym(RTLD_NEXT, "__rseq_offset");
                libc_rseq_size_p = dlsym(RTLD_NEXT, "__rseq_size");
                libc_rseq_flags_p = dlsym(RTLD_NEXT, "__rseq_flags");

Generates this output:

  $ ./rseq_test 
  libc_rseq_offset_p = 0 (0), libc_rseq_size_p = 0 (0)
  Segmentation fault

Because trying to dereference __rseq_size hits NULL/0.

^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-19  0:28                           ` Sean Christopherson
@ 2025-08-19  6:18                             ` Florian Weimer
  0 siblings, 0 replies; 43+ messages in thread
From: Florian Weimer @ 2025-08-19  6:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Thomas Gleixner, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Samuel Thibault

* Sean Christopherson:

>> If I make that:
>> 
>> +       if (!*libc_rseq_offset_p || !*libc_rseq_size_p) {
>> 
>> then it makes sense and actually works. The pointer can hardly be NULL,
>> even when statically linked, no?
>
> IIUC, it is indeed the pointers that are set to NULL/0, because for unresolved
> symbols, the symbol itself, not its value, is set to '0'.  Which makes sense,
> because if there is no symbol, then it can't have a value.

Right, that's how weak symbol references work.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
                   ` (11 preceding siblings ...)
  2025-08-13 17:45 ` [patch 00/11] rseq: Optimize exit to user space Jens Axboe
@ 2025-08-20 14:10 ` Mathieu Desnoyers
  12 siblings, 0 replies; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-20 14:10 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Jens Axboe

On 2025-08-13 12:29, Thomas Gleixner wrote:
> With the more wide spread usage of rseq in glibc, rseq is not longer a
> niche use case for special applications.
> 
> While working on a sane implementation of a rseq based time slice extension
> mechanism, I noticed several shortcomings of the current rseq code:
> 
>    1) task::rseq_event_mask is a pointless bitfield despite the fact that
>       the ABI flags it was meant to support have been deprecated and
>       functionally disabled three years ago.

Correct. We could replace this mask by a single flag, set when the
events happen, test-and-cleared on return to userspace. This would
lessen the synchronization requirements because bitops are not needed
anymore.

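A minimal sketch of that single-flag scheme (plain C, illustrative only;
the struct and helper names are made up and synchronization details are
omitted; the series already seems to use a boolean rseq_event_pending,
visible in the patch 10 hunk quoted earlier in this thread):

#include <stdbool.h>

struct task_rseq_state {
	bool event_pending;	/* replaces the rseq_event_mask bitfield */
};

/* Called from the scheduler/signal paths when an rseq event happens. */
static inline void rseq_set_event(struct task_rseq_state *t)
{
	t->event_pending = true;
}

/* Consumed exactly once on exit to user space. */
static inline bool rseq_test_and_clear_event(struct task_rseq_state *t)
{
	bool event = t->event_pending;

	t->event_pending = false;
	return event;
}
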
> 
>    2) task::rseq_event_mask is accumulating bits unless there is a critical
>       section discovered in the user space rseq memory. This results in
>       pointless invocations of the rseq user space exit handler even if
>       there had nothing changed. As a matter of correctness these bits have
>       to be clear when exiting to user space and therefore pristine when
>       coming back into the kernel. Aside of correctness, this also avoids
>       pointless evaluation of the user space memory, which is a performance
>       benefit.

Ouch. Yes. The intent here is indeed to clear those bits in all cases,
whether there is an rseq critical section ongoing or not. Good catch!

> 
>    3) The evaluation of critical sections does not differentiate between
>       syscall and interrupt/exception exits. The current implementation
>       silently fixes up critical sections which invoked a syscall unless
>       CONFIG_DEBUG_RSEQ is enabled.
> 
>       That's just wrong. If user space does that on a production kernel it
>       can keep the pieces. The kernel is not there to proliferate mindless
>       user space programming and letting everyone pay the performance
>       penalty.

Agreed. There is no need to fix up the rseq critical section; however, we
still need to update the other rseq fields (e.g. the current CPU number)
when returning from a system call.
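
Roughly, as a sketch only (the from_syscall condition is a placeholder
for however the entry code ends up distinguishing the two exit types;
rseq_update_cpu_node_id() is the existing helper):

	if (regs && !from_syscall) {
		/* Only interrupt/exception exits may abort a critical section */
		ret = rseq_ip_fixup(regs);
		if (unlikely(ret < 0))
			goto error;
	}
	/* The CPU/node id fields are refreshed in all cases */
	if (unlikely(rseq_update_cpu_node_id(t)))
		goto error;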

> 
> This series addresses these issues and on top converts parts of the user
> space access over to the new masked access model, which lowers the overhead
> of Spectre-V1 mitigations significantly on architectures which support it
> (x86 as of today). This is especially noticable in the access to the
> rseq_cs field in struct rseq, which is the first quick check to figure out
> whether a critical section is installed or not.

Nice!

I'll try to have a closer look at the series soon.

Thanks,

Mathieu

> 
> It survives the kernels rseq selftests, but I did not any performance tests
> vs. rseq because I have no idea how to use the gazillion of undocumented
> command line parameters of the benchmark. I leave that to people who are so
> familiar with them, that they assume everyone else is too :)
> 
> The performance gain on regular workloads is clearly measurable and the
> consistent event flag state allows now to build the time slice extension
> mechanism on top. The first POC I implemented:
> 
>     https://lore.kernel.org/lkml/87o6smb3a0.ffs@tglx/
> 
> suffered badly from the stale eventmask bits and the cleaned up version
> brought a whopping 25+% performance gain.
> 
> The series depends on the seperately posted rseq bugfix:
> 
>     https://lore.kernel.org/lkml/87o6sj6z95.ffs@tglx/
> 
> and the uaccess generic helper series:
> 
>     https://lore.kernel.org/lkml/20250813150610.521355442@linutronix.de/
> 
> and a related futex fix in
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/urgent
> 
> For your conveniance it is also available as a conglomorate from git:
> 
>      git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
> 
> Thanks,
> 
> 	tglx
> ---
>   drivers/hv/mshv_common.c         |    2
>   include/linux/entry-common.h     |   12 +--
>   include/linux/irq-entry-common.h |   31 ++++++--
>   include/linux/resume_user_mode.h |   40 +++++++---
>   include/linux/rseq.h             |  115 +++++++++----------------------
>   include/linux/sched.h            |   10 +-
>   include/uapi/linux/rseq.h        |   21 +----
>   io_uring/io_uring.h              |    2
>   kernel/entry/common.c            |    9 +-
>   kernel/entry/kvm.c               |    2
>   kernel/rseq.c                    |  142 +++++++++++++++++++++++++--------------
>   kernel/sched/core.c              |    8 +-
>   kernel/sched/membarrier.c        |    8 +-
>   13 files changed, 213 insertions(+), 189 deletions(-)
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume()
  2025-08-13 16:29 ` [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-08-20 14:23   ` Mathieu Desnoyers
  0 siblings, 0 replies; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-20 14:23 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Jens Axboe

On 2025-08-13 12:29, Thomas Gleixner wrote:
> The RSEQ critical section mechanism only clears the event mask when a
> critical section is registered, otherwise it is stale and collects
> bits.
> 
> That means once a critical section is installed the first invocation of
> that code when TIF_NOTIFY_RESUME is set will abort the critical section,
> even when the TIF bit was not raised by the rseq preempt/migrate/signal
> helpers.
> 
> This also has a performance implication because TIF_NOTIFY_RESUME is a
> multiplexing TIF bit, which is utilized by quite some infrastructure. That
> means every invocation of __rseq_notify_resume() goes unconditionally
> through the heavy lifting of user space access and consistency checks even
> if there is no reason to do so.
> 
> Keeping the stale event mask around when exiting to user space also
> prevents it from being utilized by the upcoming time slice extension
> mechanism.
> 
> Avoid this by reading and clearing the event mask before doing the user
> space critical section access with preemption disabled, which ensures that
> the read and clear operation is CPU local atomic versus scheduling.
> 
> This is correct as after re-enabling preemption any relevant event will set
> the bit again and raise TIF_NOTIFY_RESUME, which makes the user space exit
> code take another round of TIF bit clearing.
> 
> If the event mask was non-zero, invoke the slow path. On debug kernels the
> slow path is invoked unconditionally and the result of the event mask
> evaluation is handed in.
> 
> Add a exit path check after the TIF bit loop, which validates on debug
> kernels that the event mask is zero before exiting to user space.
> 
> While at it reword the convoluted comment why the pt_regs pointer can be
> NULL under certain circumstances.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   include/linux/irq-entry-common.h |    7 ++--
>   include/linux/rseq.h             |   10 +++++
>   kernel/rseq.c                    |   66 ++++++++++++++++++++++++++-------------
>   3 files changed, 58 insertions(+), 25 deletions(-)
> 
> --- a/include/linux/irq-entry-common.h
> +++ b/include/linux/irq-entry-common.h
> @@ -2,11 +2,12 @@
>   #ifndef __LINUX_IRQENTRYCOMMON_H
>   #define __LINUX_IRQENTRYCOMMON_H
>   
> +#include <linux/context_tracking.h>
> +#include <linux/kmsan.h>
> +#include <linux/rseq.h>
>   #include <linux/static_call_types.h>
>   #include <linux/syscalls.h>
> -#include <linux/context_tracking.h>
>   #include <linux/tick.h>
> -#include <linux/kmsan.h>
>   #include <linux/unwind_deferred.h>
>   
>   #include <asm/entry-common.h>
> @@ -226,6 +227,8 @@ static __always_inline void exit_to_user
>   
>   	arch_exit_to_user_mode_prepare(regs, ti_work);
>   
> +	rseq_exit_to_user_mode();
> +
>   	/* Ensure that kernel state is sane for a return to userspace */
>   	kmap_assert_nomap();
>   	lockdep_assert_irqs_disabled();
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
>   	rseq_set_notify_resume(t);
>   }
>   
> +static __always_inline void rseq_exit_to_user_mode(void)
> +{
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
> +		if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
> +			current->rseq_event_mask = 0;
> +	}
> +}
> +
>   /*
>    * If parent process has a registered restartable sequences area, the
>    * child inherits. Unregister rseq for a clone with CLONE_VM set.
> @@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
>   static inline void rseq_execve(struct task_struct *t)
>   {
>   }
> -
> +static inline void rseq_exit_to_user_mode(void) { }
>   #endif
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
>   	return true;
>   }
>   
> -static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
> +static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
>   {
> -	u32 flags, event_mask;
> +	u32 flags;
>   	int ret;
>   
>   	if (rseq_warn_flags("rseq_cs", cs_flags))
> @@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
>   
>   	if (rseq_warn_flags("rseq", flags))
>   		return -EINVAL;
> -
> -	/*
> -	 * Load and clear event mask atomically with respect to
> -	 * scheduler preemption and membarrier IPIs.
> -	 */
> -	scoped_guard(RSEQ_EVENT_GUARD) {
> -		event_mask = t->rseq_event_mask;
> -		t->rseq_event_mask = 0;
> -	}
> -
> -	return !!event_mask;
> +	return 0;
>   }
>   
>   static int clear_rseq_cs(struct rseq __user *rseq)
> @@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
>   	return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
>   }
>   
> -static int rseq_ip_fixup(struct pt_regs *regs)
> +static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
>   {
>   	unsigned long ip = instruction_pointer(regs);
>   	struct task_struct *t = current;
> @@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
>   	 */
>   	if (!in_rseq_cs(ip, &rseq_cs))
>   		return clear_rseq_cs(t->rseq);
> -	ret = rseq_need_restart(t, rseq_cs.flags);
> -	if (ret <= 0)
> +	ret = rseq_check_flags(t, rseq_cs.flags);
> +	if (ret < 0)
>   		return ret;
> +	if (!abort)

Sorry, I'm missing context. Which prior patch introduces this "abort"
parameter?

The rest looks good,

Thanks,

Mathieu

> +		return 0;
>   	ret = clear_rseq_cs(t->rseq);
>   	if (ret)
>   		return ret;
> @@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
>   		return;
>   
>   	/*
> -	 * regs is NULL if and only if the caller is in a syscall path.  Skip
> -	 * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
> -	 * kill a misbehaving userspace on debug kernels.
> +	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
> +	 * pointer, so fixup cannot be done. If the syscall which led to
> +	 * this invocation was invoked inside a critical section, then it
> +	 * will either end up in this code again or a possible violation of
> +	 * a syscall inside a critical region can only be detected by the
> +	 * debug code in rseq_syscall() in a debug enabled kernel.
>   	 */
>   	if (regs) {
> -		ret = rseq_ip_fixup(regs);
> -		if (unlikely(ret < 0))
> -			goto error;
> +		/*
> +		 * Read and clear the event mask first. If the task was not
> +		 * preempted or migrated or a signal is on the way, there
> +		 * is no point in doing any of the heavy lifting here on
> +		 * production kernels. In that case TIF_NOTIFY_RESUME was
> +		 * raised by some other functionality.
> +		 *
> +		 * This is correct because the read/clear operation is
> +		 * guarded against scheduler preemption, which makes it CPU
> +		 * local atomic. If the task is preempted right after
> +		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> +		 * again and this function is invoked another time _before_
> +		 * the task is able to return to user mode.
> +		 *
> +		 * On a debug kernel, invoke the fixup code unconditionally
> +		 * with the result handed in to allow the detection of
> +		 * inconsistencies.
> +		 */
> +		u32 event_mask;
> +
> +		scoped_guard(RSEQ_EVENT_GUARD) {
> +			event_mask = t->rseq_event_mask;
> +			t->rseq_event_mask = 0;
> +		}
> +
> +		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
> +			ret = rseq_ip_fixup(regs, !!event_mask);
> +			if (unlikely(ret < 0))
> +				goto error;
> +		}
>   	}
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 02/11] rseq: Condense the inline stubs
  2025-08-13 16:29 ` [patch 02/11] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-08-20 14:24   ` Mathieu Desnoyers
  0 siblings, 0 replies; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-20 14:24 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Jens Axboe

On 2025-08-13 12:29, Thomas Gleixner wrote:
> Scrolling over tons of pointless
> 
> {
> }
> 
> lines to find the actual code is annoying at best.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   include/linux/rseq.h |   47 ++++++++++++-----------------------------------
>   1 file changed, 12 insertions(+), 35 deletions(-)
> 
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
>   	t->rseq_event_mask = 0;
>   }
>   
> -#else
> -
> -static inline void rseq_set_notify_resume(struct task_struct *t)
> -{
> -}
> -static inline void rseq_handle_notify_resume(struct ksignal *ksig,
> -					     struct pt_regs *regs)
> -{
> -}
> -static inline void rseq_signal_deliver(struct ksignal *ksig,
> -				       struct pt_regs *regs)
> -{
> -}
> -static inline void rseq_preempt(struct task_struct *t)
> -{
> -}
> -static inline void rseq_migrate(struct task_struct *t)
> -{
> -}
> -static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
> -{
> -}
> -static inline void rseq_execve(struct task_struct *t)
> -{
> -}
> +#else /* CONFIG_RSEQ */
> +static inline void rseq_set_notify_resume(struct task_struct *t) { }
> +static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
> +static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
> +static inline void rseq_preempt(struct task_struct *t) { }
> +static inline void rseq_migrate(struct task_struct *t) { }
> +static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
> +static inline void rseq_execve(struct task_struct *t) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> -#endif
> +#endif  /* !CONFIG_RSEQ */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> -
>   void rseq_syscall(struct pt_regs *regs);
> -
> -#else
> -
> -static inline void rseq_syscall(struct pt_regs *regs)
> -{
> -}
> -
> -#endif
> +#else /* CONFIG_DEBUG_RSEQ */
> +static inline void rseq_syscall(struct pt_regs *regs) { }
> +#endif /* !CONFIG_DEBUG_RSEQ */
>   
>   #endif /* _LINUX_RSEQ_H */
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit()
  2025-08-13 16:29 ` [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit() Thomas Gleixner
@ 2025-08-20 14:25   ` Mathieu Desnoyers
  0 siblings, 0 replies; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-20 14:25 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Jens Axboe

On 2025-08-13 12:29, Thomas Gleixner wrote:
> rseq_syscall() is a debug function, which is invoked before the syscall
> exit work is handled. Name it so it's clear what it does.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: Boqun Feng <boqun.feng@gmail.com>
> ---
>   include/linux/entry-common.h |    2 +-
>   include/linux/rseq.h         |    4 ++--
>   kernel/rseq.c                |    5 +++--
>   3 files changed, 6 insertions(+), 5 deletions(-)
> 
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -162,7 +162,7 @@ static __always_inline void syscall_exit
>   			local_irq_enable();
>   	}
>   
> -	rseq_syscall(regs);
> +	rseq_debug_syscall_exit(regs);
>   
>   	/*
>   	 * Do one-time syscall specific work. If these work items are
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -113,9 +113,9 @@ static inline void rseq_exit_to_user_mod
>   #endif  /* !CONFIG_RSEQ */
>   
>   #ifdef CONFIG_DEBUG_RSEQ
> -void rseq_syscall(struct pt_regs *regs);
> +void rseq_debug_syscall_exit(struct pt_regs *regs);
>   #else /* CONFIG_DEBUG_RSEQ */
> -static inline void rseq_syscall(struct pt_regs *regs) { }
> +static inline void rseq_debug_syscall_exit(struct pt_regs *regs) { }
>   #endif /* !CONFIG_DEBUG_RSEQ */
>   
>   #endif /* _LINUX_RSEQ_H */
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -427,7 +427,8 @@ void __rseq_handle_notify_resume(struct
>   	 * this invocation was invoked inside a critical section, then it
>   	 * will either end up in this code again or a possible violation of
>   	 * a syscall inside a critical region can only be detected by the
> -	 * debug code in rseq_syscall() in a debug enabled kernel.
> +	 * debug code in rseq_debug_syscall_exit() in a debug enabled
> +	 * kernel.
>   	 */
>   	if (regs) {
>   		/*
> @@ -476,7 +477,7 @@ void __rseq_handle_notify_resume(struct
>    * Terminate the process if a syscall is issued within a restartable
>    * sequence.
>    */
> -void rseq_syscall(struct pt_regs *regs)
> +void rseq_debug_syscall_exit(struct pt_regs *regs)
>   {
>   	unsigned long ip = instruction_pointer(regs);
>   	struct task_struct *t = current;
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [patch 00/11] rseq: Optimize exit to user space
  2025-08-17 21:23         ` Thomas Gleixner
  2025-08-18 14:00           ` BUG: rseq selftests and librseq vs. glibc fail Thomas Gleixner
  2025-08-18 17:38           ` [patch 00/11] rseq: Optimize exit to user space Michael Jeanson
@ 2025-08-20 14:27           ` Mathieu Desnoyers
  2 siblings, 0 replies; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-20 14:27 UTC (permalink / raw)
  To: Thomas Gleixner, Jens Axboe, LKML
  Cc: Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu

On 2025-08-17 17:23, Thomas Gleixner wrote:
> On Thu, Aug 14 2025 at 00:08, Thomas Gleixner wrote:
>> On Wed, Aug 13 2025 at 15:36, Jens Axboe wrote:
>>> On 8/13/25 3:32 PM, Thomas Gleixner wrote:
>>>> Could you give it a test ride to see whether this makes a difference in
>>>> your environment?
>>>
>>> Yep, I'll give a spin.
>>
>> Appreciated.
> 
> Please do not use the git branch I had in the cover letter. I did some
> more analysis of this and it's even worse than I thought. Use
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/wip
> 
> instead.
> 
> I've rewritten the whole pile by now and made it a real fast path
> without the TIF_NOTIFY horror show, unless the fast path, which runs
> _after_ the TIF work loop faults. So far that happens once for each
> fork() as that has to fault in the copy of the user space rseq region.
> 

OK. I'll stop reviewing this version of the series and wait for an
updated version.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-18 17:13               ` Thomas Gleixner
  2025-08-18 19:33                 ` Florian Weimer
@ 2025-08-29 18:44                 ` Prakash Sangappa
  2025-08-29 18:50                   ` Mathieu Desnoyers
  1 sibling, 1 reply; 43+ messages in thread
From: Prakash Sangappa @ 2025-08-29 18:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Florian Weimer, Jens Axboe, LKML, Michael Jeanson,
	Mathieu Desnoyers, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Sean Christopherson, Samuel Thibault



> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
>> * Thomas Gleixner:
>>> It's trivial to reproduce. All it needs is to have in the source:
>>> 
>>> __weak ptrdiff_t __rseq_offset;
>>> 
>>> w/o even being referenced and creating a pthread. Reproducer below.
>> 
>> Well, that's sort of expected.  You can't define glibc symbols that are
>> not intended for interposition and expect things to work.  It's kind of
>> like writing:
>> 
>> int _rtld_global;
>> 
>> That's going to fail rather spectaculary, too.  We make an exception for
>> symbols that are not reserved (you can build in ISO C mode and define
>> open, close, etc., at least as long as you link to glibc only).  But
>> __rseq_offset is a reserved name, so that is not applicable here.
>> 
>> The real change here is GCC changing from -fcommon (which made a lot of
>> these things work in the past) to -fno-common.
> 
> Thanks for the explanation!
> 
> So the only way to make this actually work is to revert that commit and
> the folks who want to link that statically need to come up with:
> 
> #ifdef _BUILD_STATICALLY
> extern ....
> 
> #else
>        ptr = dlsym(...);
> #endif
> 
> or something daft like that. A proper function interface would avoid all
> that nonsense, but we can't have nice things or can we?


Could the rseq(2) syscall itself return the address of the already registered rseq structure?
Perhaps a new flag argument to rseq(2) could query the registered rseq address,
or the syscall could return the address of the already registered structure when it fails to register a new one.

The application can call it when the attempt to register an rseq structure fails.

-Prakash

> 
> Thanks,
> 
>        tglx
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-29 18:44                 ` Prakash Sangappa
@ 2025-08-29 18:50                   ` Mathieu Desnoyers
  2025-09-01 19:30                     ` Prakash Sangappa
  0 siblings, 1 reply; 43+ messages in thread
From: Mathieu Desnoyers @ 2025-08-29 18:50 UTC (permalink / raw)
  To: Prakash Sangappa, Thomas Gleixner
  Cc: Florian Weimer, Jens Axboe, LKML, Michael Jeanson, Peter Zijlstra,
	Paul E. McKenney, Boqun Feng, Wei Liu, Sean Christopherson,
	Samuel Thibault

On 2025-08-29 14:44, Prakash Sangappa wrote:
> 
> 
>> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
>>> * Thomas Gleixner:
>>>> It's trivial to reproduce. All it needs is to have in the source:
>>>>
>>>> __weak ptrdiff_t __rseq_offset;
>>>>
>>>> w/o even being referenced and creating a pthread. Reproducer below.
>>>
>>> Well, that's sort of expected.  You can't define glibc symbols that are
>>> not intended for interposition and expect things to work.  It's kind of
>>> like writing:
>>>
>>> int _rtld_global;
>>>
>>> That's going to fail rather spectaculary, too.  We make an exception for
>>> symbols that are not reserved (you can build in ISO C mode and define
>>> open, close, etc., at least as long as you link to glibc only).  But
>>> __rseq_offset is a reserved name, so that is not applicable here.
>>>
>>> The real change here is GCC changing from -fcommon (which made a lot of
>>> these things work in the past) to -fno-common.
>>
>> Thanks for the explanation!
>>
>> So the only way to make this actually work is to revert that commit and
>> the folks who want to link that statically need to come up with:
>>
>> #ifdef _BUILD_STATICALLY
>> extern ....
>>
>> #else
>>         ptr = dlsym(...);
>> #endif
>>
>> or something daft like that. A proper function interface would avoid all
>> that nonsense, but we can't have nice things or can we?
> 
> 
> Could the rseq(2) syscall itself return the already registered rseq structure address?
> Perhaps a new flag argument to the rseq(2) syscall to query the registered rseq address
> or return the address of the already registered rseq structure when it fails to register a new one.
> 
> Application can call it when the call to register a rseq structure fails.

There is a ptrace(2) request, PTRACE_GET_RSEQ_CONFIGURATION, which achieves
something similar. I don't know whether a dependency on ptrace would
be acceptable for that use case, though.
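
For reference, a minimal sketch of using that request (it assumes the
target task is already ptrace-attached and stopped; the local structure
mirrors struct ptrace_rseq_configuration from <linux/ptrace.h>):

#include <stdio.h>
#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

#ifndef PTRACE_GET_RSEQ_CONFIGURATION
# define PTRACE_GET_RSEQ_CONFIGURATION	0x420f
#endif

struct rseq_config {	/* mirrors struct ptrace_rseq_configuration */
	uint64_t rseq_abi_pointer;
	uint32_t rseq_abi_size;
	uint32_t signature;
	uint32_t flags;
	uint32_t pad;
};

static void dump_rseq_config(pid_t pid)
{
	struct rseq_config conf;

	/* addr carries the buffer size, data the buffer pointer */
	if (ptrace(PTRACE_GET_RSEQ_CONFIGURATION, pid,
		   (void *)sizeof(conf), &conf) < 0) {
		perror("PTRACE_GET_RSEQ_CONFIGURATION");
		return;
	}
	printf("rseq abi: %#llx size: %u sig: %#x flags: %#x\n",
	       (unsigned long long)conf.rseq_abi_pointer,
	       conf.rseq_abi_size, conf.signature, conf.flags);
}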

Thanks,

Mathieu



> 
> -Prakash
> 
>>
>> Thanks,
>>
>>         tglx
>>
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: BUG: rseq selftests and librseq vs. glibc fail
  2025-08-29 18:50                   ` Mathieu Desnoyers
@ 2025-09-01 19:30                     ` Prakash Sangappa
  0 siblings, 0 replies; 43+ messages in thread
From: Prakash Sangappa @ 2025-09-01 19:30 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Thomas Gleixner, Florian Weimer, Jens Axboe, LKML,
	Michael Jeanson, Peter Zijlstra, Paul E. McKenney, Boqun Feng,
	Wei Liu, Sean Christopherson, Samuel Thibault



> On Aug 29, 2025, at 11:50 AM, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> On 2025-08-29 14:44, Prakash Sangappa wrote:
>>> On Aug 18, 2025, at 10:13 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> On Mon, Aug 18 2025 at 16:15, Florian Weimer wrote:
>>>> * Thomas Gleixner:
>>>>> It's trivial to reproduce. All it needs is to have in the source:
>>>>> 
>>>>> __weak ptrdiff_t __rseq_offset;
>>>>> 
>>>>> w/o even being referenced and creating a pthread. Reproducer below.
>>>> 
>>>> Well, that's sort of expected.  You can't define glibc symbols that are
>>>> not intended for interposition and expect things to work.  It's kind of
>>>> like writing:
>>>> 
>>>> int _rtld_global;
>>>> 
>>>> That's going to fail rather spectaculary, too.  We make an exception for
>>>> symbols that are not reserved (you can build in ISO C mode and define
>>>> open, close, etc., at least as long as you link to glibc only).  But
>>>> __rseq_offset is a reserved name, so that is not applicable here.
>>>> 
>>>> The real change here is GCC changing from -fcommon (which made a lot of
>>>> these things work in the past) to -fno-common.
>>> 
>>> Thanks for the explanation!
>>> 
>>> So the only way to make this actually work is to revert that commit and
>>> the folks who want to link that statically need to come up with:
>>> 
>>> #ifdef _BUILD_STATICALLY
>>> extern ....
>>> 
>>> #else
>>>        ptr = dlsym(...);
>>> #endif
>>> 
>>> or something daft like that. A proper function interface would avoid all
>>> that nonsense, but we can't have nice things or can we?
>> Could the rseq(2) syscall itself return the already registered rseq structure address?
>> Perhaps a new flag argument to the rseq(2) syscall to query the registered rseq address
>> or return the address of the already registered rseq structure when it fails to register a new one.
>> Application can call it when the call to register a rseq structure fails.
> 
> There is a ptrace(2) PTRACE_GET_RSEQ_CONFIGURATION to achieve
> something similar. I don't know if a dependency on ptrace would
> be acceptable for that use-case though.

Can a thread call ptrace(PTRACE_GET_RSEQ_CONFIGURATION, ...) on itself?

Maybe something similar can be added to rseq(2).

Thanks,
-Prakash. 

> 
> Thanks,
> 
> Mathieu
> 
> 
> 
>> -Prakash
>>> 
>>> Thanks,
>>> 
>>>        tglx
>>> 
> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2025-09-01 19:31 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-13 16:29 [patch 00/11] rseq: Optimize exit to user space Thomas Gleixner
2025-08-13 16:29 ` [patch 01/11] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-08-20 14:23   ` Mathieu Desnoyers
2025-08-13 16:29 ` [patch 02/11] rseq: Condense the inline stubs Thomas Gleixner
2025-08-20 14:24   ` Mathieu Desnoyers
2025-08-13 16:29 ` [patch 03/11] rseq: Rename rseq_syscall() to rseq_debug_syscall_exit() Thomas Gleixner
2025-08-20 14:25   ` Mathieu Desnoyers
2025-08-13 16:29 ` [patch 04/11] rseq: Replace the pointless event mask bit fiddling Thomas Gleixner
2025-08-13 16:29 ` [patch 05/11] rseq: Optimize the signal delivery path Thomas Gleixner
2025-08-13 16:29 ` [patch 06/11] rseq: Optimize exit to user space further Thomas Gleixner
2025-08-13 16:29 ` [patch 07/11] entry: Cleanup header Thomas Gleixner
2025-08-13 17:09   ` Giorgi Tchankvetadze
2025-08-13 21:30     ` Thomas Gleixner
2025-08-13 16:29 ` [patch 08/11] entry: Distinguish between syscall and interrupt exit Thomas Gleixner
2025-08-13 16:29 ` [patch 09/11] entry: Provide exit_to_user_notify_resume() Thomas Gleixner
2025-08-13 16:29 ` [patch 10/11] rseq: Skip fixup when returning from a syscall Thomas Gleixner
2025-08-14  8:54   ` Peter Zijlstra
2025-08-14 13:24     ` Thomas Gleixner
2025-08-13 16:29 ` [patch 11/11] rseq: Convert to masked user access where applicable Thomas Gleixner
2025-08-13 17:45 ` [patch 00/11] rseq: Optimize exit to user space Jens Axboe
2025-08-13 21:32   ` Thomas Gleixner
2025-08-13 21:36     ` Jens Axboe
2025-08-13 22:08       ` Thomas Gleixner
2025-08-17 21:23         ` Thomas Gleixner
2025-08-18 14:00           ` BUG: rseq selftests and librseq vs. glibc fail Thomas Gleixner
2025-08-18 14:15             ` Florian Weimer
2025-08-18 17:13               ` Thomas Gleixner
2025-08-18 19:33                 ` Florian Weimer
2025-08-18 19:46                   ` Sean Christopherson
2025-08-18 19:55                     ` Florian Weimer
2025-08-18 20:27                       ` Sean Christopherson
2025-08-18 23:54                         ` Thomas Gleixner
2025-08-19  0:28                           ` Sean Christopherson
2025-08-19  6:18                             ` Florian Weimer
2025-08-29 18:44                 ` Prakash Sangappa
2025-08-29 18:50                   ` Mathieu Desnoyers
2025-09-01 19:30                     ` Prakash Sangappa
2025-08-18 17:38           ` [patch 00/11] rseq: Optimize exit to user space Michael Jeanson
2025-08-18 20:21             ` Thomas Gleixner
2025-08-18 21:29               ` Michael Jeanson
2025-08-18 23:43                 ` Thomas Gleixner
2025-08-20 14:27           ` Mathieu Desnoyers
2025-08-20 14:10 ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).