* [patch V6 00/31] rseq: Optimize exit to user space
@ 2025-10-27 8:44 Thomas Gleixner
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
` (31 more replies)
0 siblings, 32 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
This is a trivial follow up on the V5 series, which can be found here:
https://lore.kernel.org/all/20251022121836.019469732@linutronix.de
Changes vs. V5:
- Adjust to the uaccess changes
As for the previous version these patches have a dependency on the uaccess
scope series:
https://lore.kernel.org/20251027083700.573016505@linutronix.de
which is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git uaccess/scoped
For your convenience the combination of both is available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git rseq/perf
Thanks,
tglx
---
Thomas Gleixner (31):
rseq: Avoid pointless evaluation in __rseq_notify_resume()
rseq: Condense the inline stubs
rseq: Move algorithm comment to top
rseq: Remove the ksig argument from rseq_handle_notify_resume()
rseq: Simplify registration
rseq: Simplify the event notification
rseq, virt: Retrigger RSEQ after vcpu_run()
rseq: Avoid CPU/MM CID updates when no event pending
rseq: Introduce struct rseq_data
entry: Cleanup header
entry: Remove syscall_enter_from_user_mode_prepare()
entry: Inline irqentry_enter/exit_from/to_user_mode()
sched: Move MM CID related functions to sched.h
rseq: Cache CPU ID and MM CID values
rseq: Record interrupt from user space
rseq: Provide tracepoint wrappers for inline code
rseq: Expose lightweight statistics in debugfs
rseq: Provide static branch for runtime debugging
rseq: Provide and use rseq_update_user_cs()
rseq: Replace the original debug implementation
rseq: Make exit debugging static branch based
rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
rseq: Provide and use rseq_set_ids()
rseq: Separate the signal delivery path
rseq: Rework the TIF_NOTIFY handler
rseq: Optimize event setting
rseq: Implement fast path for exit to user
rseq: Switch to fast path processing on exit to user
entry: Split up exit_to_user_mode_prepare()
rseq: Split up rseq_exit_to_user_mode()
rseq: Switch to TIF_RSEQ if supported
Documentation/admin-guide/kernel-parameters.txt | 4
arch/x86/entry/syscall_32.c | 3
drivers/hv/mshv_root_main.c | 3
fs/binfmt_elf.c | 2
fs/exec.c | 2
include/asm-generic/thread_info_tif.h | 3
include/linux/entry-common.h | 38 -
include/linux/irq-entry-common.h | 68 ++
include/linux/mm.h | 25
include/linux/resume_user_mode.h | 2
include/linux/rseq.h | 228 +++++---
include/linux/rseq_entry.h | 592 +++++++++++++++++++++
include/linux/rseq_types.h | 93 +++
include/linux/sched.h | 48 +
include/linux/thread_info.h | 5
include/trace/events/rseq.h | 4
include/uapi/linux/rseq.h | 21
init/Kconfig | 28 -
kernel/entry/common.c | 39 -
kernel/entry/syscall-common.c | 8
kernel/ptrace.c | 6
kernel/rseq.c | 654 ++++++++++--------------
kernel/sched/core.c | 10
kernel/sched/membarrier.c | 8
kernel/sched/sched.h | 5
virt/kvm/kvm_main.c | 7
26 files changed, 1301 insertions(+), 605 deletions(-)
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
` (30 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
From: Thomas Gleixner <tglx@linutronix.de>
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
Add a exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it reword the convoluted comment why the pt_regs pointer can be
NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/irq-entry-common.h | 7 ++--
include/linux/rseq.h | 10 +++++
kernel/rseq.c | 66 ++++++++++++++++++++++++++-------------
3 files changed, 58 insertions(+), 25 deletions(-)
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
-#include <linux/context_tracking.h>
#include <linux/tick.h>
-#include <linux/kmsan.h>
#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
arch_exit_to_user_mode_prepare(regs, ti_work);
+ rseq_exit_to_user_mode();
+
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
rseq_set_notify_resume(t);
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+ current->rseq_event_mask = 0;
+ }
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
static inline void rseq_execve(struct task_struct *t)
{
}
-
+static inline void rseq_exit_to_user_mode(void) { }
#endif
#ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
return true;
}
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
{
- u32 flags, event_mask;
+ u32 flags;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
-
- /*
- * Load and clear event mask atomically with respect to
- * scheduler preemption and membarrier IPIs.
- */
- scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
- }
-
- return !!event_mask;
+ return 0;
}
static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
- ret = rseq_need_restart(t, rseq_cs.flags);
- if (ret <= 0)
+ ret = rseq_check_flags(t, rseq_cs.flags);
+ if (ret < 0)
return ret;
+ if (!abort)
+ return 0;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
return;
/*
- * regs is NULL if and only if the caller is in a syscall path. Skip
- * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
- * kill a misbehaving userspace on debug kernels.
+ * If invoked from hypervisors or IO-URING, then @regs is a NULL
+ * pointer, so fixup cannot be done. If the syscall which led to
+ * this invocation was invoked inside a critical section, then it
+ * will either end up in this code again or a possible violation of
+ * a syscall inside a critical region can only be detected by the
+ * debug code in rseq_syscall() in a debug enabled kernel.
*/
if (regs) {
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
- goto error;
+ /*
+ * Read and clear the event mask first. If the task was not
+ * preempted or migrated or a signal is on the way, there
+ * is no point in doing any of the heavy lifting here on
+ * production kernels. In that case TIF_NOTIFY_RESUME was
+ * raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
+ */
+ u32 event_mask;
+
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+ ret = rseq_ip_fixup(regs, !!event_mask);
+ if (unlikely(ret < 0))
+ goto error;
+ }
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 02/31] rseq: Condense the inline stubs
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
` (29 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
From: Thomas Gleixner <tglx@linutronix.de>
Scrolling over tons of pointless
{
}
lines to find the actual code is annoying at best.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/rseq.h | 47 ++++++++++++-----------------------------------
1 file changed, 12 insertions(+), 35 deletions(-)
---
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct ta
t->rseq_event_mask = 0;
}
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
-
void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
#endif /* _LINUX_RSEQ_H */
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 03/31] rseq: Move algorithm comment to top
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
` (28 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
kernel/rseq.c | 119 ++++++++++++++++++++++++++++------------------------------
1 file changed, 59 insertions(+), 60 deletions(-)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
*/
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ * init(rseq_cs)
+ * cpu = TLS->rseq::cpu_id_start
+ * [1] TLS->rseq::rseq_cs = rseq_cs
+ * [start_ip] ----------------------------
+ * [2] if (cpu != TLS->rseq::cpu_id)
+ * goto abort_ip;
+ * [3] <last_instruction_in_cs>
+ * [post_commit_ip] ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1. Userspace stores the address of the struct rseq_cs assembly
+ * block descriptor into the rseq_cs field of the registered
+ * struct rseq TLS area. This update is performed through a single
+ * store within the inline assembly instruction sequence.
+ * [start_ip]
+ *
+ * 2. Userspace tests to check whether the current cpu_id field match
+ * the cpu number loaded before start_ip, branching to abort_ip
+ * in case of a mismatch.
+ *
+ * If the sequence is preempted or interrupted by a signal
+ * at or after start_ip and before post_commit_ip, then the kernel
+ * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ * ip to abort_ip before returning to user-space, so the preempted
+ * execution resumes at abort_ip.
+ *
+ * 3. Userspace critical section final instruction before
+ * post_commit_ip is the commit. The critical section is
+ * self-terminating.
+ * [post_commit_ip]
+ *
+ * 4. <success>
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ * [abort_ip]
+ * F1. <failure>
+ */
+
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struc
unsafe_put_user(value, &t->rseq->field, error_label)
#endif
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- * init(rseq_cs)
- * cpu = TLS->rseq::cpu_id_start
- * [1] TLS->rseq::rseq_cs = rseq_cs
- * [start_ip] ----------------------------
- * [2] if (cpu != TLS->rseq::cpu_id)
- * goto abort_ip;
- * [3] <last_instruction_in_cs>
- * [post_commit_ip] ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1. Userspace stores the address of the struct rseq_cs assembly
- * block descriptor into the rseq_cs field of the registered
- * struct rseq TLS area. This update is performed through a single
- * store within the inline assembly instruction sequence.
- * [start_ip]
- *
- * 2. Userspace tests to check whether the current cpu_id field match
- * the cpu number loaded before start_ip, branching to abort_ip
- * in case of a mismatch.
- *
- * If the sequence is preempted or interrupted by a signal
- * at or after start_ip and before post_commit_ip, then the kernel
- * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- * ip to abort_ip before returning to user-space, so the preempted
- * execution resumes at abort_ip.
- *
- * 3. Userspace critical section final instruction before
- * post_commit_ip is the commit. The critical section is
- * self-terminating.
- * [post_commit_ip]
- *
- * 4. <success>
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- * [abort_ip]
- * F1. <failure>
- */
-
static int rseq_update_cpu_node_id(struct task_struct *t)
{
struct rseq __user *rseq = t->rseq;
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (2 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
` (27 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
There is no point for this being visible in the resume_to_user_mode()
handling.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 13 +++++++------
2 files changed, 8 insertions(+), 7 deletions(-)
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(NULL, regs);
+ rseq_handle_notify_resume(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resum
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq)
- __rseq_handle_notify_resume(ksig, regs);
+ __rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
struct pt_regs *regs)
{
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
- rseq_handle_notify_resume(ksig, regs);
+ if (current->rseq) {
+ scoped_guard(RSEQ_EVENT_GUARD)
+ __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ __rseq_handle_notify_resume(ksig, regs);
+ }
}
/* rseq_preempt() requires preemption to be disabled. */
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 05/31] rseq: Simplify registration
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (3 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
` (26 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
There is no point to read the critical section element in the newly
registered user space RSEQ struct first in order to clear it.
Just clear it and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
kernel/rseq.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
- int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
int ret;
- u64 rseq_cs;
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* avoid a potential segfault on return to user-space. The proper thing
* to do would have been to fail the registration but this would break
* older libcs that reuse the rseq area for new threads without
- * clearing the fields.
+ * clearing the fields. Don't bother reading it, just reset it.
*/
- if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
- return -EFAULT;
- if (rseq_cs && clear_rseq_cs(rseq))
+ if (put_user(0UL, &rseq->rseq_cs))
return -EFAULT;
#ifdef CONFIG_DEBUG_RSEQ
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 06/31] rseq: Simplify the event notification
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (4 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
` (25 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V2: Reduce it to the sched switch event.
---
fs/exec.c | 2 -
include/linux/rseq.h | 66 +++++++++-------------------------------------
include/linux/sched.h | 10 +++---
include/uapi/linux/rseq.h | 21 ++++----------
kernel/rseq.c | 28 +++++++++++--------
kernel/sched/core.c | 5 ---
kernel/sched/membarrier.c | 8 ++---
7 files changed, 48 insertions(+), 92 deletions(-)
---
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
current->in_execve = 0;
return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
#define _LINUX_RSEQ_H
#ifdef CONFIG_RSEQ
-
-#include <linux/preempt.h>
#include <linux/sched.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
- RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
- RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
- RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
- RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
- RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
- RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
- if (t->rseq)
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_re
__rseq_handle_notify_resume(NULL, regs);
}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
if (current->rseq) {
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ current->rseq_event_pending = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
{
- __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
+ if (t->rseq) {
+ t->rseq_event_pending = true;
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ }
}
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
- current->rseq_event_mask = 0;
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+ current->rseq_event_pending = false;
}
}
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
} else {
t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
- t->rseq_event_mask = current->rseq_event_mask;
+ t->rseq_event_pending = current->rseq_event_pending;
}
}
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct ta
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,14 +1407,14 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
+ struct rseq __user *rseq;
+ u32 rseq_len;
+ u32 rseq_sig;
/*
- * RmW on rseq_event_mask must be performed atomically
+ * RmW on rseq_event_pending must be performed atomically
* with respect to preemption.
*/
- unsigned long rseq_event_mask;
+ bool rseq_event_pending;
# ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
/*
* Restartable sequences flags field.
*
- * This field should only be updated by the thread which
- * registered this data structure. Read by the kernel.
- * Mainly used for single-stepping through rseq critical sections
- * with debuggers.
- *
- * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
- * Inhibit instruction sequence block restart on preemption
- * for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
- * Inhibit instruction sequence block restart on signal
- * delivery for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
- * Inhibit instruction sequence block restart on migration for
- * this thread.
+ * This field was initially intended to allow event masking for
+ * single-stepping through rseq critical sections with debuggers.
+ * The kernel does not support this anymore and the relevant bits
+ * are checked for being always false:
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
*/
__u32 flags;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD irq
+#else
+# define RSEQ_EVENT_GUARD preempt
+#endif
+
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct
*/
if (regs) {
/*
- * Read and clear the event mask first. If the task was not
- * preempted or migrated or a signal is on the way, there
- * is no point in doing any of the heavy lifting here on
- * production kernels. In that case TIF_NOTIFY_RESUME was
- * raised by some other functionality.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
*
* This is correct because the read/clear operation is
* guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct
* with the result handed in to allow the detection of
* inconsistencies.
*/
- u32 event_mask;
+ bool event;
scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
- ret = rseq_ip_fixup(regs, !!event_mask);
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
if (unlikely(ret < 0))
goto error;
}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
return 0;
}
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3329,7 +3329,6 @@ void set_task_cpu(struct task_struct *p,
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
- rseq_migrate(p);
sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}
@@ -4763,7 +4762,6 @@ int sched_cgroup_fork(struct task_struct
p->sched_task_group = tg;
}
#endif
- rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
@@ -4827,7 +4825,6 @@ void wake_up_new_task(struct task_struct
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -5121,7 +5118,7 @@ prepare_task_switch(struct rq *rq, struc
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_preempt(prev);
+ rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
* is negligible.
*/
smp_mb();
- rseq_preempt(current);
+ rseq_sched_switch_event(current);
}
static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
- * For RSEQ, don't rseq_preempt() the caller. User code
- * is not supposed to issue syscalls at all from inside an
- * rseq critical section.
+ * For RSEQ, don't invoke rseq_sched_switch_event() on the
+ * caller. User code is not supposed to issue syscalls at
+ * all from inside an rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
preempt_disable();
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (5 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-28 15:08 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
` (24 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that the eventual pending rseq event is not lost on the
way to user space.
This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.
It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().
This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.
Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Sean Christopherson <seanjc@google.com>
---
V5: Add a comment that this is temporary - Sean
V3: Add the missing rseq.h include for HV - 0-day
---
drivers/hv/mshv_root_main.c | 3 +
include/linux/rseq.h | 17 +++++++++
kernel/rseq.c | 76 +++++++++++++++++++++++---------------------
virt/kvm/kvm_main.c | 7 ++++
4 files changed, 67 insertions(+), 36 deletions(-)
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
#include <linux/crash_dump.h>
#include <linux/panic_notifier.h>
#include <linux/vmalloc.h>
+#include <linux/rseq.h>
#include "mshv_eventfd.h"
#include "mshv.h"
@@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_schedu
}
} while (!vp->run.flags.intercept_suspend);
+ rseq_virt_userspace_exit();
+
return ret;
}
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
}
/*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+ if (current->rseq_event_pending)
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
{
struct task_struct *t = current;
int ret, sig;
+ bool event;
+
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
if (unlikely(t->flags & PF_EXITING))
return;
/*
- * If invoked from hypervisors or IO-URING, then @regs is a NULL
- * pointer, so fixup cannot be done. If the syscall which led to
- * this invocation was invoked inside a critical section, then it
- * will either end up in this code again or a possible violation of
- * a syscall inside a critical region can only be detected by the
- * debug code in rseq_syscall() in a debug enabled kernel.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
*/
- if (regs) {
- /*
- * Read and clear the event pending bit first. If the task
- * was not preempted or migrated or a signal is on the way,
- * there is no point in doing any of the heavy lifting here
- * on production kernels. In that case TIF_NOTIFY_RESUME
- * was raised by some other functionality.
- *
- * This is correct because the read/clear operation is
- * guarded against scheduler preemption, which makes it CPU
- * local atomic. If the task is preempted right after
- * re-enabling preemption then TIF_NOTIFY_RESUME is set
- * again and this function is invoked another time _before_
- * the task is able to return to user mode.
- *
- * On a debug kernel, invoke the fixup code unconditionally
- * with the result handed in to allow the detection of
- * inconsistencies.
- */
- bool event;
-
- scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
- }
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
+ }
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
}
+
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
return;
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
#include <linux/lockdep.h>
#include <linux/kthread.h>
#include <linux/suspend.h>
+#include <linux/rseq.h>
#include <asm/processor.h>
#include <asm/ioctl.h>
@@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file *
r = kvm_arch_vcpu_ioctl_run(vcpu);
vcpu->wants_to_run = false;
+ /*
+ * FIXME: Remove this hack once all KVM architectures
+ * support the generic TIF bits, i.e. a dedicated TIF_RSEQ.
+ */
+ rseq_virt_userspace_exit();
+
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (6 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
` (23 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
There is no need to update these values unconditionally if there is no
event pending.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
kernel/rseq.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct
t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ return;
+
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 09/31] rseq: Introduce struct rseq_data
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (7 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
` (22 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
In preparation for a major rewrite of this code, provide a data structure
for rseq management.
Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.
Create a storage struct for event management as well and put the
sched_switch event and a indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.
The indicators are explicitly not a bit field. Bit fields generate abysmal
code.
The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.
The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.
This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V4: Move all rseq related data into a dedicated umbrella struct
---
include/linux/rseq.h | 48 +++++++++++++++-------------------
include/linux/rseq_types.h | 51 ++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 14 ++--------
kernel/ptrace.c | 6 ++--
kernel/rseq.c | 63 ++++++++++++++++++++++-----------------------
5 files changed, 110 insertions(+), 72 deletions(-)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq)
+ if (current->rseq.event.has_rseq)
__rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq) {
- current->rseq_event_pending = true;
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.sched_switch = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
static inline void rseq_sched_switch_event(struct task_struct *t)
{
- if (t->rseq) {
- t->rseq_event_pending = true;
+ if (t->rseq.event.has_rseq) {
+ t->rseq.event.sched_switch = true;
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}
}
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_eve
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
- current->rseq_event_pending = false;
+ if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
+ current->rseq.event.events))
+ current->rseq.event.events = 0;
}
}
@@ -49,35 +50,30 @@ static __always_inline void rseq_exit_to
*/
static inline void rseq_virt_userspace_exit(void)
{
- if (current->rseq_event_pending)
+ if (current->rseq.event.sched_switch)
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}
+static inline void rseq_reset(struct task_struct *t)
+{
+ memset(&t->rseq, 0, sizeof(t->rseq));
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ rseq_reset(t);
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
- } else {
+ if (clone_flags & CLONE_VM)
+ rseq_reset(t);
+ else
t->rseq = current->rseq;
- t->rseq_len = current->rseq_len;
- t->rseq_sig = current->rseq_sig;
- t->rseq_event_pending = current->rseq_event_pending;
- }
-}
-
-static inline void rseq_execve(struct task_struct *t)
-{
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_RSEQ
+struct rseq;
+
+/**
+ * struct rseq_event - Storage for rseq related event management
+ * @all: Compound to initialize and clear the data efficiently
+ * @events: Compound to access events with a single load/store
+ * @sched_switch: True if the task was scheduled out
+ * @has_rseq: True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+ union {
+ u32 all;
+ struct {
+ union {
+ u16 events;
+ struct {
+ u8 sched_switch;
+ };
+ };
+
+ u8 has_rseq;
+ };
+ };
+};
+
+/**
+ * struct rseq_data - Storage for all rseq related data
+ * @usrptr: Pointer to the registered user space RSEQ memory
+ * @len: Length of the RSEQ region
+ * @sig: Signature of critial section abort IPs
+ * @event: Storage for event management
+ */
+struct rseq_data {
+ struct rseq __user *usrptr;
+ u32 len;
+ u32 sig;
+ struct rseq_event event;
+};
+
+#else /* CONFIG_RSEQ */
+struct rseq_data { };
+#endif /* !CONFIG_RSEQ */
+
+#endif
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
#include <linux/task_io_accounting.h>
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
+#include <linux/rseq_types.h>
#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
@@ -1406,16 +1407,8 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
- /*
- * RmW on rseq_event_pending must be performed atomically
- * with respect to preemption.
- */
- bool rseq_event_pending;
-# ifdef CONFIG_DEBUG_RSEQ
+ struct rseq_data rseq;
+#ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
* validation of read-only fields. The struct rseq has a
@@ -1423,7 +1416,6 @@ struct task_struct {
* directly. Reserve a size large enough for the known fields.
*/
char rseq_fields[sizeof(struct rseq)];
-# endif
#endif
#ifdef CONFIG_SCHED_MM_CID
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuratio
unsigned long size, void __user *data)
{
struct ptrace_rseq_configuration conf = {
- .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
- .rseq_abi_size = task->rseq_len,
- .signature = task->rseq_sig,
+ .rseq_abi_pointer = (u64)(uintptr_t)task->rseq.usrptr,
+ .rseq_abi_size = task->rseq.len,
+ .signature = task->rseq.sig,
.flags = 0,
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -103,13 +103,13 @@ static int rseq_validate_ro_fields(struc
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
/*
* Validate fields which are required to be read-only by
* user-space.
*/
- if (!user_read_access_begin(rseq, t->rseq_len))
+ if (!user_read_access_begin(rseq, t->rseq.len))
goto efault;
unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
@@ -147,10 +147,10 @@ static int rseq_validate_ro_fields(struc
* Update an rseq field and its in-kernel copy in lock-step to keep a coherent
* state.
*/
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
+#define rseq_unsafe_put_user(t, value, field, error_label) \
+ do { \
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
+ rseq_kernel_fields(t)->field = value; \
} while (0)
#else
@@ -160,12 +160,12 @@ static int rseq_validate_ro_fields(struc
}
#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq->field, error_label)
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
#endif
static int rseq_update_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id = raw_smp_processor_id();
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
@@ -176,7 +176,7 @@ static int rseq_update_cpu_node_id(struc
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
@@ -201,7 +201,7 @@ static int rseq_update_cpu_node_id(struc
static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
mm_cid = 0;
@@ -211,7 +211,7 @@ static int rseq_reset_rseq_cpu_node_id(s
if (rseq_validate_ro_fields(t))
goto efault;
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
/*
@@ -272,7 +272,7 @@ static int rseq_get_rseq_cs(struct task_
u32 sig;
int ret;
- ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
+ ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
if (ret)
return ret;
@@ -305,10 +305,10 @@ static int rseq_get_rseq_cs(struct task_
if (ret)
return ret;
- if (current->rseq_sig != sig) {
+ if (current->rseq.sig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq_sig, current->pid, usig);
+ sig, current->rseq.sig, current->pid, usig);
return -EINVAL;
}
return 0;
@@ -338,7 +338,7 @@ static int rseq_check_flags(struct task_
return -EINVAL;
/* Get thread flags. */
- ret = get_user(flags, &t->rseq->flags);
+ ret = get_user(flags, &t->rseq.usrptr->flags);
if (ret)
return ret;
@@ -392,13 +392,13 @@ static int rseq_ip_fixup(struct pt_regs
* Clear the rseq_cs pointer and return.
*/
if (!in_rseq_cs(ip, &rseq_cs))
- return clear_rseq_cs(t->rseq);
+ return clear_rseq_cs(t->rseq.usrptr);
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
if (!abort)
return 0;
- ret = clear_rseq_cs(t->rseq);
+ ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct
* inconsistencies.
*/
scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
+ event = t->rseq.event.sched_switch;
+ t->rseq.event.sched_switch = false;
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -492,7 +492,7 @@ void rseq_syscall(struct pt_regs *regs)
struct task_struct *t = current;
struct rseq_cs rseq_cs;
- if (!t->rseq)
+ if (!t->rseq.usrptr)
return;
if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
@@ -511,33 +511,31 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
/* Unregister rseq for current thread. */
- if (current->rseq != rseq || !current->rseq)
+ if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
return -EINVAL;
- if (rseq_len != current->rseq_len)
+ if (rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
ret = rseq_reset_rseq_cpu_node_id(current);
if (ret)
return ret;
- current->rseq = NULL;
- current->rseq_sig = 0;
- current->rseq_len = 0;
+ rseq_reset(current);
return 0;
}
if (unlikely(flags))
return -EINVAL;
- if (current->rseq) {
+ if (current->rseq.usrptr) {
/*
* If rseq is already registered, check whether
* the provided address differs from the prior
* one.
*/
- if (current->rseq != rseq || rseq_len != current->rseq_len)
+ if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
/* Already registered. */
return -EBUSY;
@@ -586,15 +584,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
*/
- current->rseq = rseq;
- current->rseq_len = rseq_len;
- current->rseq_sig = sig;
+ current->rseq.usrptr = rseq;
+ current->rseq.len = rseq_len;
+ current->rseq.sig = sig;
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
+ current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
return 0;
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 10/31] entry: Cleanup header
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (8 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
` (21 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
From: Thomas Gleixner <tglx@linutronix.de>
Cleanup the include ordering, kernel-doc and other trivialities before
making further changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/entry-common.h | 8 ++++----
include/linux/irq-entry-common.h | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
---
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
#define __LINUX_ENTRYCOMMON_H
#include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
#include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -37,6 +37,7 @@
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
+
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
@@ -61,8 +62,7 @@
*/
void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
* syscall_enter_from_user_mode_work - Check and handle work before invoking
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_
/**
* enter_from_user_mode - Establish state when coming from user mode
+ * @regs: Pointer to currents pt_regs
*
* Syscall/interrupt entry disables interrupts, but user mode is traced as
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (9 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
` (20 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Open code the only user in the x86 syscall code and reduce the zoo of
functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
arch/x86/entry/syscall_32.c | 3 ++-
include/linux/entry-common.h | 26 +++++---------------------
kernel/entry/syscall-common.c | 8 --------
3 files changed, 7 insertions(+), 30 deletions(-)
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32
* fetch EBP before invoking any of the syscall entry work
* functions.
*/
- syscall_enter_from_user_mode_prepare(regs);
+ enter_from_user_mode(regs);
instrumentation_begin();
+ local_irq_enable();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -45,23 +45,6 @@
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
ARCH_SYSCALL_WORK_EXIT)
-/**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs: Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
@@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs
* @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after invoking enter_from_user_mode(), enabling interrupts and
+ * extra architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -108,8 +91,9 @@ static __always_inline long syscall_ente
* function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented.
*
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
+ * This is the combination of enter_from_user_mode() and
+ * syscall_enter_from_user_mode_work() to be used when there is no
+ * architecture specific work to be done between the two.
*
* Returns: The original or a modified syscall number. See
* syscall_enter_from_user_mode_work() for further explanation.
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs
return ret ? : syscall;
}
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
- instrumentation_begin();
- local_irq_enable();
- instrumentation_end();
-}
-
/*
* If SYSCALL_EMU is set, then the only reason to report is when
* SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (10 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
` (19 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
There is no point to have this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/irq-entry-common.h | 13 +++++++++++--
kernel/entry/common.c | 13 -------------
2 files changed, 11 insertions(+), 15 deletions(-)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -278,7 +278,10 @@ static __always_inline void exit_to_user
*
* The function establishes state (lockdep, RCU (context tracking), tracing)
*/
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+}
/**
* irqentry_exit_to_user_mode - Interrupt exit work
@@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struc
* Interrupt exit is not invoking #1 which is the syscall specific one time
* work.
*/
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+ instrumentation_begin();
+ exit_to_user_mode_prepare(regs);
+ instrumentation_end();
+ exit_to_user_mode();
+}
#ifndef irqentry_state
/**
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -62,19 +62,6 @@ void __weak arch_do_signal_or_restart(st
return ti_work;
}
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
-}
-
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
-{
- instrumentation_begin();
- exit_to_user_mode_prepare(regs);
- instrumentation_end();
- exit_to_user_mode();
-}
-
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 13/31] sched: Move MM CID related functions to sched.h
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (11 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
` (18 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
There is nothing mm specific in that and including mm.h can cause header
recursion hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/mm.h | 25 -------------------------
include/linux/sched.h | 26 ++++++++++++++++++++++++++
2 files changed, 26 insertions(+), 25 deletions(-)
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,31 +2401,6 @@ struct zap_details {
/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
- return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
- /*
- * Use the processor id as a fall-back when the mm cid feature is
- * disabled. This provides functional per-cpu data structure accesses
- * in user-space, althrough it won't provide the memory usage benefits.
- */
- return raw_smp_processor_id();
-}
-#endif
-
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2310,6 +2310,32 @@ static __always_inline void alloc_tag_re
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+ return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm cid feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, althrough it won't provide the memory usage benefits.
+ */
+ return task_cpu(t);
+}
+#endif
+
#ifndef MODULE
#ifndef COMPILE_OFFSETS
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 14/31] rseq: Cache CPU ID and MM CID values
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (12 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
` (17 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/rseq.h | 7 +++++--
include/linux/rseq_types.h | 21 +++++++++++++++++++++
include/trace/events/rseq.h | 4 ++--
kernel/rseq.c | 4 ++++
4 files changed, 32 insertions(+), 4 deletions(-)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -57,6 +57,7 @@ static inline void rseq_virt_userspace_e
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
+ t->rseq.ids.cpu_cid = ~0ULL;
}
static inline void rseq_execve(struct task_struct *t)
@@ -70,10 +71,12 @@ static inline void rseq_execve(struct ta
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM)
+ if (clone_flags & CLONE_VM) {
rseq_reset(t);
- else
+ } else {
t->rseq = current->rseq;
+ t->rseq.ids.cpu_cid = ~0ULL;
+ }
}
#else /* CONFIG_RSEQ */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -31,17 +31,38 @@ struct rseq_event {
};
/**
+ * struct rseq_ids - Cache for ids, which need to be updated
+ * @cpu_cid: Compound of @cpu_id and @mm_cid to make the
+ * compiler emit a single compare on 64-bit
+ * @cpu_id: The CPU ID which was written last to user space
+ * @mm_cid: The MM CID which was written last to user space
+ *
+ * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */
+struct rseq_ids {
+ union {
+ u64 cpu_cid;
+ struct {
+ u32 cpu_id;
+ u32 mm_cid;
+ };
+ };
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
+ * @ids: Storage for cached CPU ID and MM CID
*/
struct rseq_data {
struct rseq __user *usrptr;
u32 len;
u32 sig;
struct rseq_event event;
+ struct rseq_ids ids;
};
#else /* CONFIG_RSEQ */
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
),
TP_fast_assign(
- __entry->cpu_id = raw_smp_processor_id();
+ __entry->cpu_id = t->rseq.ids.cpu_id;
__entry->node_id = cpu_to_node(__entry->cpu_id);
- __entry->mm_cid = task_mm_cid(t);
+ __entry->mm_cid = t->rseq.ids.mm_cid;
),
TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struc
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
+ /* Cache the user space values */
+ t->rseq.ids.cpu_id = cpu_id;
+ t->rseq.ids.mm_cid = mm_cid;
+
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 15/31] rseq: Record interrupt from user space
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (13 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-28 15:26 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
` (16 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.
If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.
This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/irq-entry-common.h | 3 ++-
include/linux/rseq.h | 16 +++++++++++-----
include/linux/rseq_entry.h | 18 ++++++++++++++++++
include/linux/rseq_types.h | 2 ++
4 files changed, 33 insertions(+), 6 deletions(-)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
#include <linux/context_tracking.h>
#include <linux/kmsan.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
#include <linux/tick.h>
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user
static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
+ rseq_note_user_irq_entry();
}
/**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_eve
static __always_inline void rseq_exit_to_user_mode(void)
{
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
- current->rseq.event.events))
- current->rseq.event.events = 0;
- }
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
}
/*
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include <linux/rseq.h>
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+ current->rseq.event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -12,6 +12,7 @@ struct rseq;
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
* @sched_switch: True if the task was scheduled out
+ * @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
*/
struct rseq_event {
@@ -22,6 +23,7 @@ struct rseq_event {
u16 events;
struct {
u8 sched_switch;
+ u8 user_irq;
};
};
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (14 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
` (15 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V4: Fix the fallback stub
V3: Get rid of one indentation level - Mathieu
---
include/linux/rseq_entry.h | 28 ++++++++++++++++++++++++++++
kernel/rseq.c | 19 ++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletion(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,34 @@
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
+#include <linux/tracepoint-defs.h>
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+ if (tracepoint_enabled(rseq_update) && ids)
+ __rseq_trace_update(t);
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ if (tracepoint_enabled(rseq_ip_fixup))
+ __rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINT */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINT */
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -70,7 +70,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/types.h>
#include <linux/ratelimit.h>
#include <asm/ptrace.h>
@@ -91,6 +91,23 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+ trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 17/31] rseq: Expose lightweight statistics in debugfs
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (15 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
` (14 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.
The debugfs readout provides a racy sum of all counters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/rseq.h | 16 ---------
include/linux/rseq_entry.h | 49 +++++++++++++++++++++++++++
init/Kconfig | 12 ++++++
kernel/rseq.c | 79 +++++++++++++++++++++++++++++++++++++++++----
4 files changed, 133 insertions(+), 23 deletions(-)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -29,21 +29,6 @@ static inline void rseq_sched_switch_eve
}
}
-static __always_inline void rseq_exit_to_user_mode(void)
-{
- struct rseq_event *ev = ¤t->rseq.event;
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
- WARN_ON_ONCE(ev->sched_switch);
-
- /*
- * Ensure that event (especially user_irq) is cleared when the
- * interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
- */
- ev->events = 0;
-}
-
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
* which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
@@ -92,7 +77,6 @@ static inline void rseq_sched_switch_eve
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
-static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -2,6 +2,37 @@
#ifndef _LINUX_RSEQ_ENTRY_H
#define _LINUX_RSEQ_ENTRY_H
+/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
+#ifdef CONFIG_RSEQ_STATS
+#include <linux/percpu.h>
+
+struct rseq_stats {
+ unsigned long exit;
+ unsigned long signal;
+ unsigned long slowpath;
+ unsigned long ids;
+ unsigned long cs;
+ unsigned long clear;
+ unsigned long fixup;
+};
+
+DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
+
+/*
+ * Slow path has interrupts and preemption enabled, but the fast path
+ * runs with interrupts disabled so there is no point in having the
+ * preemption checks implied in __this_cpu_inc() for every operation.
+ */
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_stat_inc(which) this_cpu_inc((which))
+#else
+#define rseq_stat_inc(which) raw_cpu_inc((which))
+#endif
+
+#else /* CONFIG_RSEQ_STATS */
+#define rseq_stat_inc(x) do { } while (0)
+#endif /* !CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
@@ -39,8 +70,26 @@ static __always_inline void rseq_note_us
current->rseq.event.user_irq = true;
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_STATS
+ default n
+ bool "Enable lightweight statistics of restartable sequences" if EXPERT
+ depends on RSEQ && DEBUG_FS
+ help
+ Enable lightweight counters which expose information about the
+ frequency of RSEQ operations via debugfs. Mostly interesting for
+ kernel debugging or performance analysis. While lightweight it's
+ still adding code into the user/kernel mode transitions.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -67,12 +67,16 @@
* F1. <failure>
*/
+/* Required to select the proper per_cpu ops for rseq_stats_inc() */
+#define RSEQ_BUILD_SLOW_PATH
+
+#include <linux/debugfs.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq_entry.h>
#include <linux/sched.h>
-#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq_entry.h>
+#include <linux/uaccess.h>
#include <linux/types.h>
-#include <linux/ratelimit.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
@@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_RSEQ_STATS
+DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ struct rseq_stats stats = { };
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
+ stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
+ stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
+ stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
+ stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
+ stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ }
+
+ seq_printf(m, "exit: %16lu\n", stats.exit);
+ seq_printf(m, "signal: %16lu\n", stats.signal);
+ seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "ids: %16lu\n", stats.ids);
+ seq_printf(m, "cs: %16lu\n", stats.cs);
+ seq_printf(m, "clear: %16lu\n", stats.clear);
+ seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ return 0;
+}
+
+static int rseq_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = rseq_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_debugfs_init(void)
+{
+ struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
+
+ debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ return 0;
+}
+__initcall(rseq_debugfs_init);
+#endif /* CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
@@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struc
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
- /*
- * Validate read-only rseq fields.
- */
+ rseq_stat_inc(rseq_stats.ids);
+
+ /* Validate read-only rseq fields on debug kernels */
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
+
if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
@@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs
struct rseq_cs rseq_cs;
int ret;
+ rseq_stat_inc(rseq_stats.cs);
+
ret = rseq_get_rseq_cs(t, &rseq_cs);
if (ret)
return ret;
@@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs
* If not nested over a rseq critical section, restart is useless.
* Clear the rseq_cs pointer and return.
*/
- if (!in_rseq_cs(ip, &rseq_cs))
+ if (!in_rseq_cs(ip, &rseq_cs)) {
+ rseq_stat_inc(rseq_stats.clear);
return clear_rseq_cs(t->rseq.usrptr);
+ }
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
@@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs
ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
+ rseq_stat_inc(rseq_stats.fixup);
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
rseq_cs.abort_ip);
instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
@@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct
if (unlikely(t->flags & PF_EXITING))
return;
+ if (ksig)
+ rseq_stat_inc(rseq_stats.signal);
+ else
+ rseq_stat_inc(rseq_stats.slowpath);
+
/*
* Read and clear the event pending bit first. If the task
* was not preempted or migrated or a signal is on the way,
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 18/31] rseq: Provide static branch for runtime debugging
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (16 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
` (13 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Config based debug is rarely turned on and is not available easily when
things go wrong.
Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V3: Fix __setup() return value - Michael
---
Documentation/admin-guide/kernel-parameters.txt | 4 +
include/linux/rseq_entry.h | 3
init/Kconfig | 14 ++++
kernel/rseq.c | 73 ++++++++++++++++++++++--
4 files changed, 90 insertions(+), 4 deletions(-)
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6443,6 +6443,10 @@
Memory area to be used by remote processor image,
managed by CMA.
+ rseq_debug= [KNL] Enable or disable restartable sequence
+ debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+ Format: <bool>
+
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/tracepoint-defs.h>
@@ -64,6 +65,8 @@ static inline void rseq_trace_ip_fixup(u
unsigned long offset, unsigned long abort_ip) { }
#endif /* !CONFIG_TRACEPOINT */
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1895,10 +1895,24 @@ config RSEQ_STATS
If unsure, say N.
+config RSEQ_DEBUG_DEFAULT_ENABLE
+ default n
+ bool "Enable restartable sequences debug mode by default" if EXPERT
+ depends on RSEQ
+ help
+ This enables the static branch for debug mode of restartable
+ sequences.
+
+ This also can be controlled on the kernel command line via the
+ command line parameter "rseq_debug=0/1" and through debugfs.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
depends on RSEQ && DEBUG_KERNEL
+ select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -95,6 +95,27 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
+static inline void rseq_control_debug(bool on)
+{
+ if (on)
+ static_branch_enable(&rseq_debug_enabled);
+ else
+ static_branch_disable(&rseq_debug_enabled);
+}
+
+static int __init rseq_setup_debug(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+ rseq_control_debug(on);
+ return 1;
+}
+__setup("rseq_debug=", rseq_setup_debug);
+
#ifdef CONFIG_TRACEPOINTS
/*
* Out of line, so the actual update functions can be in a header to be
@@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_DEBUG_FS
#ifdef CONFIG_RSEQ_STATS
DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
-static int rseq_debug_show(struct seq_file *m, void *p)
+static int rseq_stats_show(struct seq_file *m, void *p)
{
struct rseq_stats stats = { };
unsigned int cpu;
@@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_fi
return 0;
}
+static int rseq_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_stats_show, inode->i_private);
+}
+
+static const struct file_operations stat_ops = {
+ .open = rseq_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_stats_init(struct dentry *root_dir)
+{
+ debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
+ return 0;
+}
+#else
+static inline void rseq_stats_init(struct dentry *root_dir) { }
+#endif /* CONFIG_RSEQ_STATS */
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ bool on = static_branch_unlikely(&rseq_debug_enabled);
+
+ seq_printf(m, "%d\n", on);
+ return 0;
+}
+
+static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ bool on;
+
+ if (kstrtobool_from_user(ubuf, count, &on))
+ return -EINVAL;
+
+ rseq_control_debug(on);
+ return count;
+}
+
static int rseq_debug_open(struct inode *inode, struct file *file)
{
return single_open(file, rseq_debug_show, inode->i_private);
}
-static const struct file_operations dfs_ops = {
+static const struct file_operations debug_ops = {
.open = rseq_debug_open,
.read = seq_read,
+ .write = rseq_debug_write,
.llseek = seq_lseek,
.release = single_release,
};
@@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void
{
struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
- debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
+ rseq_stats_init(root_dir);
return 0;
}
__initcall(rseq_debugfs_init);
-#endif /* CONFIG_RSEQ_STATS */
+#endif /* CONFIG_DEBUG_FS */
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 19/31] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (17 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
@ 2025-10-27 8:44 ` Thomas Gleixner
2025-10-28 15:40 ` Mathieu Desnoyers
` (4 more replies)
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
` (12 subsequent siblings)
31 siblings, 5 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:44 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Provide a straight forward implementation to check for and eventually
clear/fixup critical sections in user space.
The non-debug version does only the minimal sanity checks and aims for
efficiency.
There are two attack vectors, which are checked for:
1) An abort IP which is in the kernel address space. That would cause at
least x86 to return to kernel space via IRET.
2) A rogue critical section descriptor with an abort IP pointing to some
arbitrary address, which is not preceded by the RSEQ signature.
If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernels problem.
The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.
Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Update comments and fix typos - Mathieu
V3: Brought back the signature check along with a comment - Mathieu
---
include/linux/rseq_entry.h | 206 +++++++++++++++++++++++++++++++++++++
include/linux/rseq_types.h | 11 +-
kernel/rseq.c | 244 +++++++++++++--------------------------------
3 files changed, 290 insertions(+), 171 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(u
DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
current->rseq.event.user_irq = true;
}
+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ * - If the critical section is invalid, terminate the task.
+ *
+ * - If valid and the instruction pointer is inside, set it to the abort IP
+ *
+ * - If valid and the instruction pointer is outside, clear the critical
+ * section address.
+ *
+ * Returns true, if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when a invalid
+ * section was found. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
+ unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+ unsigned long ip = instruction_pointer(regs);
+ u64 __user *uc_head = (u64 __user *) ucs;
+ u32 usig, __user *uc_sig;
+
+ scoped_user_rw_access(ucs, efault) {
+ /*
+ * Evaluate the user pile and exit if one of the conditions
+ * is not fulfilled.
+ */
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ if (unlikely(start_ip >= tasksize))
+ goto die;
+ /* If outside, just clear the critical section. */
+ if (ip < start_ip)
+ goto clear;
+
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ cs_end = start_ip + offset;
+ /* Check for overflow and wraparound */
+ if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+ goto die;
+
+ /* If not inside, clear it. */
+ if (ip >= cs_end)
+ goto clear;
+
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+ /* Ensure it's "valid" */
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+ goto die;
+ /* Validate that the abort IP is not in the critical section */
+ if (unlikely(abort_ip - start_ip < offset))
+ goto die;
+
+ /*
+ * Check version and flags for 0. No point in emitting
+ * deprecated warnings before dying. That could be done in
+ * the slow path eventually, but *shrug*.
+ */
+ unsafe_get_user(head, uc_head, efault);
+ if (unlikely(head))
+ goto die;
+
+ /* abort_ip - 4 is >= 0. See abort_ip check above */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* If not in interrupt from user context, let it die */
+ if (unlikely(!t->rseq.event.user_irq))
+ goto die;
+ }
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
+#endif /* RSEQ_BUILD_SLOW_PATH */
+
+/*
+ * This only ensures that abort_ip is in the user address space and
+ * validates that it is preceded by the signature.
+ *
+ * No other sanity checks are done here, that's what the debug code is for.
+ */
+static rseq_inline bool
+rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ unsigned long ip = instruction_pointer(regs);
+ u64 start_ip, abort_ip, offset;
+ u32 usig, __user *uc_sig;
+
+ rseq_stat_inc(rseq_stats.cs);
+
+ if (unlikely(csaddr >= TASK_SIZE)) {
+ t->rseq.event.fatal = true;
+ return false;
+ }
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ return rseq_debug_update_user_cs(t, regs, csaddr);
+
+ scoped_user_rw_access(ucs, efault) {
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+
+ /*
+ * No sanity checks. If user space screwed it up, it can
+ * keep the pieces. That's what debug code is for.
+ *
+ * If outside, just clear the critical section.
+ */
+ if (ip - start_ip >= offset)
+ goto clear;
+
+ /*
+ * Two requirements for @abort_ip:
+ * - Must be in user space as x86 IRET would happily return to
+ * the kernel.
+ * - The four bytes preceding the instruction at @abort_ip must
+ * contain the signature.
+ *
+ * The latter protects against the following attack vector:
+ *
+ * An attacker with limited abilities to write, creates a critical
+ * section descriptor, sets the abort IP to a library function or
+ * some other ROP gadget and stores the address of the descriptor
+ * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+ * protection.
+ */
+ if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ goto die;
+
+ /* The address is guaranteed to be >= 0 and < TASK_SIZE */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* Invalidate the critical section */
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ /* Update the instruction pointer */
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -14,10 +14,12 @@ struct rseq;
* @sched_switch: True if the task was scheduled out
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
+ * @error: Compound error code for the slow path to analyze
+ * @fatal: User space data corrupted or invalid
*/
struct rseq_event {
union {
- u32 all;
+ u64 all;
struct {
union {
u16 events;
@@ -28,6 +30,13 @@ struct rseq_event {
};
u8 has_rseq;
+ u8 __pad;
+ union {
+ u16 error;
+ struct {
+ u8 fatal;
+ };
+ };
};
};
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -382,175 +382,18 @@ static int rseq_reset_rseq_cpu_node_id(s
return -EFAULT;
}
-/*
- * Get the user-space pointer value stored in the 'rseq_cs' field.
- */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
-{
- if (!rseq_cs)
- return -EFAULT;
-
-#ifdef CONFIG_64BIT
- if (get_user(*rseq_cs, &rseq->rseq_cs))
- return -EFAULT;
-#else
- if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-#endif
-
- return 0;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq_cs __user *urseq_cs;
- u64 ptr;
- u32 __user *usig;
- u32 sig;
- int ret;
-
- ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
- if (ret)
- return ret;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-static bool rseq_warn_flags(const char *str, u32 flags)
+static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
{
- u32 test_flags;
+ struct rseq __user *urseq = t->rseq.usrptr;
+ u64 csaddr;
- if (!flags)
- return false;
- test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
- test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
- return true;
-}
-
-static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
-{
- u32 flags;
- int ret;
-
- if (rseq_warn_flags("rseq_cs", cs_flags))
- return -EINVAL;
-
- /* Get thread flags. */
- ret = get_user(flags, &t->rseq.usrptr->flags);
- if (ret)
- return ret;
-
- if (rseq_warn_flags("rseq", flags))
- return -EINVAL;
- return 0;
-}
-
-static int clear_rseq_cs(struct rseq __user *rseq)
-{
- /*
- * The rseq_cs field is set to NULL on preemption or signal
- * delivery on top of rseq assembly block, as well as on top
- * of code outside of the rseq assembly block. This performs
- * a lazy clear of the rseq_cs field.
- *
- * Set rseq_cs to NULL.
- */
-#ifdef CONFIG_64BIT
- return put_user(0UL, &rseq->rseq_cs);
-#else
- if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
- return -EFAULT;
- return 0;
-#endif
-}
-
-/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
-{
- unsigned long ip = instruction_pointer(regs);
- struct task_struct *t = current;
- struct rseq_cs rseq_cs;
- int ret;
-
- rseq_stat_inc(rseq_stats.cs);
-
- ret = rseq_get_rseq_cs(t, &rseq_cs);
- if (ret)
- return ret;
-
- /*
- * Handle potentially not being within a critical section.
- * If not nested over a rseq critical section, restart is useless.
- * Clear the rseq_cs pointer and return.
- */
- if (!in_rseq_cs(ip, &rseq_cs)) {
- rseq_stat_inc(rseq_stats.clear);
- return clear_rseq_cs(t->rseq.usrptr);
- }
- ret = rseq_check_flags(t, rseq_cs.flags);
- if (ret < 0)
- return ret;
- if (!abort)
- return 0;
- ret = clear_rseq_cs(t->rseq.usrptr);
- if (ret)
- return ret;
- rseq_stat_inc(rseq_stats.fixup);
- trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
- rseq_cs.abort_ip);
- instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
- return 0;
+ scoped_user_read_access(urseq, efault)
+ unsafe_get_user(csaddr, &urseq->rseq_cs, efault);
+ if (likely(!csaddr))
+ return true;
+ return rseq_update_user_cs(t, regs, csaddr);
+efault:
+ return false;
}
/*
@@ -567,8 +410,8 @@ static int rseq_ip_fixup(struct pt_regs
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
- int ret, sig;
bool event;
+ int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -618,8 +461,7 @@ void __rseq_handle_notify_resume(struct
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
return;
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
+ if (!rseq_handle_cs(t, regs))
goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
@@ -632,6 +474,68 @@ void __rseq_handle_notify_resume(struct
}
#ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Unsigned comparison will be true when ip >= start_ip, and when
+ * ip < start_ip + post_commit_offset.
+ */
+static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
+{
+ return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
+}
+
+/*
+ * If the rseq_cs field of 'struct rseq' contains a valid pointer to
+ * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
+ */
+static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
+{
+ struct rseq __user *urseq = t->rseq.usrptr;
+ struct rseq_cs __user *urseq_cs;
+ u32 __user *usig;
+ u64 ptr;
+ u32 sig;
+ int ret;
+
+ if (get_user(ptr, &rseq->rseq_cs))
+ return -EFAULT;
+
+ /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
+ if (!ptr) {
+ memset(rseq_cs, 0, sizeof(*rseq_cs));
+ return 0;
+ }
+ /* Check that the pointer value fits in the user-space process space. */
+ if (ptr >= TASK_SIZE)
+ return -EINVAL;
+ urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+ if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
+ return -EFAULT;
+
+ if (rseq_cs->start_ip >= TASK_SIZE ||
+ rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
+ rseq_cs->abort_ip >= TASK_SIZE ||
+ rseq_cs->version > 0)
+ return -EINVAL;
+ /* Check for overflow. */
+ if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
+ return -EINVAL;
+ /* Ensure that abort_ip is not in the critical section. */
+ if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
+ return -EINVAL;
+
+ usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
+ ret = get_user(sig, usig);
+ if (ret)
+ return ret;
+
+ if (current->rseq.sig != sig) {
+ printk_ratelimited(KERN_WARNING
+ "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+ sig, current->rseq.sig, current->pid, usig);
+ return -EINVAL;
+ }
+ return 0;
+}
/*
* Terminate the process if a syscall is issued within a restartable
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 20/31] rseq: Replace the original debug implementation
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (18 preceding siblings ...)
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (3 more replies)
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
` (11 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Just utilize the new infrastructure and put the original one to rest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
kernel/rseq.c | 81 ++++++++--------------------------------------------------
1 file changed, 12 insertions(+), 69 deletions(-)
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,84 +475,27 @@ void __rseq_handle_notify_resume(struct
#ifdef CONFIG_DEBUG_RSEQ
/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq __user *urseq = t->rseq.usrptr;
- struct rseq_cs __user *urseq_cs;
- u32 __user *usig;
- u64 ptr;
- u32 sig;
- int ret;
-
- if (get_user(ptr, &rseq->rseq_cs))
- return -EFAULT;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
void rseq_syscall(struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
- struct rseq_cs rseq_cs;
+ u64 csaddr;
- if (!t->rseq.usrptr)
+ if (!t->rseq.event.has_rseq)
+ return;
+ if (!get_user(csaddr, &t->rseq.usrptr->rseq_cs))
+ goto fail;
+ if (likely(!csaddr))
return;
- if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
- force_sig(SIGSEGV);
+ if (unlikely(csaddr >= TASK_SIZE))
+ goto fail;
+ if (rseq_debug_update_user_cs(t, regs, csaddr))
+ return;
+fail:
+ force_sig(SIGSEGV);
}
-
#endif
/*
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 21/31] rseq: Make exit debugging static branch based
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (19 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
` (10 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/rseq_entry.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -285,7 +285,7 @@ static __always_inline void rseq_exit_to
rseq_stat_inc(rseq_stats.exit);
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ if (static_branch_unlikely(&rseq_debug_enabled))
WARN_ON_ONCE(ev->sched_switch);
/*
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (20 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
` (9 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/entry-common.h | 2 +-
include/linux/rseq_entry.h | 9 +++++++++
kernel/rseq.c | 10 ++++++++--
3 files changed, 18 insertions(+), 3 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -146,7 +146,7 @@ static __always_inline void syscall_exit
local_irq_enable();
}
- rseq_syscall(regs);
+ rseq_debug_syscall_return(regs);
/*
* Do one-time syscall specific work. If these work items are
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -296,9 +296,18 @@ static __always_inline void rseq_exit_to
ev->events = 0;
}
+void __rseq_debug_syscall_return(struct pt_regs *regs);
+
+static inline void rseq_debug_syscall_return(struct pt_regs *regs)
+{
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ __rseq_debug_syscall_return(regs);
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -473,12 +473,11 @@ void __rseq_handle_notify_resume(struct
force_sigsegv(sig);
}
-#ifdef CONFIG_DEBUG_RSEQ
/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
-void rseq_syscall(struct pt_regs *regs)
+void __rseq_debug_syscall_return(struct pt_regs *regs)
{
struct task_struct *t = current;
u64 csaddr;
@@ -496,6 +495,13 @@ void rseq_syscall(struct pt_regs *regs)
fail:
force_sig(SIGSEGV);
}
+
+#ifdef CONFIG_DEBUG_RSEQ
+/* Kept around to keep GENERIC_ENTRY=n architectures supported. */
+void rseq_syscall(struct pt_regs *regs)
+{
+ __rseq_debug_syscall_return(regs);
+}
#endif
/*
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 23/31] rseq: Provide and use rseq_set_ids()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (21 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-28 15:47 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
` (8 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Provide a new and straight forward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.
It does all operations in one scoped_user_rw_access() section and retrieves
also the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.
On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.
Use it to replace the whole related zoo in rseq.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Initialize IDs on registration, keep them on fork and lift the
first exit restriction in the debug code - Mathieu
V3: Fixed the node ID comparison in the debug path - Mathieu
---
fs/binfmt_elf.c | 2
include/linux/rseq.h | 16 ++-
include/linux/rseq_entry.h | 89 ++++++++++++++++
include/linux/sched.h | 10 -
kernel/rseq.c | 237 +++++++++------------------------------------
5 files changed, 152 insertions(+), 202 deletions(-)
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
#include <linux/cred.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,6 +5,8 @@
#ifdef CONFIG_RSEQ
#include <linux/sched.h>
+#include <uapi/linux/rseq.h>
+
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -48,7 +50,7 @@ static inline void rseq_virt_userspace_e
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
- t->rseq.ids.cpu_cid = ~0ULL;
+ t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
}
static inline void rseq_execve(struct task_struct *t)
@@ -59,15 +61,19 @@ static inline void rseq_execve(struct ta
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
+ *
+ * On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault
+ * on the COW page on exit to user space, when the child stays on the same
+ * CPU as the parent. That's obviously not guaranteed, but in overcommit
+ * scenarios it is more likely and optimizes for the fork/exec case without
+ * taking the fault.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
+ if (clone_flags & CLONE_VM)
rseq_reset(t);
- } else {
+ else
t->rseq = current->rseq;
- t->rseq.ids.cpu_cid = ~0ULL;
- }
}
#else /* CONFIG_RSEQ */
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEB
#endif
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_ids(struct task_struct *t);
static __always_inline void rseq_note_user_irq_entry(void)
{
@@ -194,6 +195,43 @@ bool rseq_debug_update_user_cs(struct ta
return false;
}
+/*
+ * On debug kernels validate that user space did not mess with it if the
+ * debug branch is enabled.
+ */
+bool rseq_debug_validate_ids(struct task_struct *t)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+ u32 cpu_id, uval, node_id;
+
+ /*
+ * On the first exit after registering the rseq region CPU ID is
+ * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+ */
+ node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+ cpu_to_node(t->rseq.ids.cpu_id) : 0;
+
+ scoped_user_read_access(rseq, efault) {
+ unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+ if (cpu_id != t->rseq.ids.cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->cpu_id, efault);
+ if (uval != cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->node_id, efault);
+ if (uval != node_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->mm_cid, efault);
+ if (uval != t->rseq.ids.mm_cid)
+ goto die;
+ }
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
#endif /* RSEQ_BUILD_SLOW_PATH */
/*
@@ -278,6 +316,57 @@ rseq_update_user_cs(struct task_struct *
efault:
return false;
}
+
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows to put the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+ u32 node_id, u64 *csaddr)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+
+ if (static_branch_unlikely(&rseq_debug_enabled)) {
+ if (!rseq_debug_validate_ids(t))
+ return false;
+ }
+
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+ unsafe_put_user(node_id, &rseq->node_id, efault);
+ unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+ if (csaddr)
+ unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+ }
+
+ /* Cache the new values */
+ t->rseq.ids.cpu_cid = ids->cpu_cid;
+ rseq_stat_inc(rseq_stats.ids);
+ rseq_trace_update(t, ids);
+ return true;
+efault:
+ return false;
+}
static __always_inline void rseq_exit_to_user_mode(void)
{
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
#include <linux/rseq_types.h>
-#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
@@ -1408,15 +1407,6 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
struct rseq_data rseq;
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * This is a place holder to save a copy of the rseq fields for
- * validation of read-only fields. The struct rseq has a
- * variable-length array at the end, so it cannot be used
- * directly. Reserve a size large enough for the known fields.
- */
- char rseq_fields[sizeof(struct rseq)];
-#endif
#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
# define RSEQ_EVENT_GUARD preempt
#endif
-/* The original rseq structure size (including padding) is 32 bytes. */
-#define ORIG_RSEQ_SIZE 32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void
__initcall(rseq_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
- return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- static DEFINE_RATELIMIT_STATE(_rs,
- DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
- u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq.usrptr;
-
- /*
- * Validate fields which are required to be read-only by
- * user-space.
- */
- if (!user_read_access_begin(rseq, t->rseq.len))
- goto efault;
- unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
- unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
- unsafe_get_user(node_id, &rseq->node_id, efault_end);
- unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
- user_read_access_end();
-
- if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
- cpu_id != rseq_kernel_fields(t)->cpu_id ||
- node_id != rseq_kernel_fields(t)->node_id ||
- mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
- pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
- "\tcpu_id_start: %u ?= %u\n"
- "\tcpu_id: %u ?= %u\n"
- "\tnode_id: %u ?= %u\n"
- "\tmm_cid: %u ?= %u\n",
- t->pid, t->comm,
- cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
- cpu_id, rseq_kernel_fields(t)->cpu_id,
- node_id, rseq_kernel_fields(t)->node_id,
- mm_cid, rseq_kernel_fields(t)->mm_cid);
- }
-
- /* For now, only print a console warning on mismatch. */
- return 0;
-
-efault_end:
- user_read_access_end();
-efault:
- return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
- } while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
{
- return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id = raw_smp_processor_id();
- u32 node_id = cpu_to_node(cpu_id);
- u32 mm_cid = task_mm_cid(t);
-
- rseq_stat_inc(rseq_stats.ids);
-
- /* Validate read-only rseq fields on debug kernels */
- if (rseq_validate_ro_fields(t))
- goto efault;
- WARN_ON_ONCE((int) mm_cid < 0);
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /* Cache the user space values */
- t->rseq.ids.cpu_id = cpu_id;
- t->rseq.ids.mm_cid = mm_cid;
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally updated only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- trace_rseq_update(t);
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
-{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
- mm_cid = 0;
-
- /*
- * Validate read-only rseq fields.
- */
- if (rseq_validate_ro_fields(t))
- goto efault;
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- /*
- * Reset all fields to their initial state.
- *
- * All fields have an initial state of 0 except cpu_id which is set to
- * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
- * unregistration can figure out that rseq needs to be registered
- * again.
- */
- rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally reset only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
+ return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
}
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -410,6 +253,8 @@ static bool rseq_handle_cs(struct task_s
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
+ struct rseq_ids ids;
+ u32 node_id;
bool event;
int sig;
@@ -456,6 +301,8 @@ void __rseq_handle_notify_resume(struct
scoped_guard(RSEQ_EVENT_GUARD) {
event = t->rseq.event.sched_switch;
t->rseq.event.sched_switch = false;
+ ids.cpu_id = task_cpu(t);
+ ids.mm_cid = task_mm_cid(t);
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -464,7 +311,8 @@ void __rseq_handle_notify_resume(struct
if (!rseq_handle_cs(t, regs))
goto error;
- if (unlikely(rseq_update_cpu_node_id(t)))
+ node_id = cpu_to_node(ids.cpu_id);
+ if (!rseq_set_ids(t, &ids, node_id))
goto error;
return;
@@ -504,13 +352,33 @@ void rseq_syscall(struct pt_regs *regs)
}
#endif
+static bool rseq_reset_ids(void)
+{
+ struct rseq_ids ids = {
+ .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+ .mm_cid = 0,
+ };
+
+ /*
+ * If this fails, terminate it because this leaves the kernel in
+ * stupid state as exit to user space will try to fixup the ids
+ * again.
+ */
+ if (rseq_set_ids(current, &ids, 0))
+ return true;
+
+ force_sig(SIGSEGV);
+ return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE 32
+
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
- int ret;
-
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -521,9 +389,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
return -EINVAL;
if (current->rseq.sig != sig)
return -EPERM;
- ret = rseq_reset_rseq_cpu_node_id(current);
- if (ret)
- return ret;
+ if (!rseq_reset_ids())
+ return -EFAULT;
rseq_reset(current);
return 0;
}
@@ -563,27 +430,23 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- /*
- * If the rseq_cs pointer is non-NULL on registration, clear it to
- * avoid a potential segfault on return to user-space. The proper thing
- * to do would have been to fail the registration but this would break
- * older libcs that reuse the rseq area for new threads without
- * clearing the fields. Don't bother reading it, just reset it.
- */
- if (put_user(0UL, &rseq->rseq_cs))
- return -EFAULT;
+ scoped_user_write_access(rseq, efault) {
+ /*
+ * If the rseq_cs pointer is non-NULL on registration, clear it to
+ * avoid a potential segfault on return to user-space. The proper thing
+ * to do would have been to fail the registration but this would break
+ * older libcs that reuse the rseq area for new threads without
+ * clearing the fields. Don't bother reading it, just reset it.
+ */
+ unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ /* Initialize IDs in user space */
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+ unsafe_put_user(0U, &rseq->node_id, efault);
+ unsafe_put_user(0U, &rseq->mm_cid, efault);
+ }
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * Initialize the in-kernel rseq fields copy for validation of
- * read-only fields.
- */
- if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
- get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
- get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
- get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
- return -EFAULT;
-#endif
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
@@ -599,6 +462,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
*/
current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
-
return 0;
+
+efault:
+ return -EFAULT;
}
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 24/31] rseq: Separate the signal delivery path
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (22 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-28 15:51 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
` (7 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.
The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.
No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.
The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V3: Move rseq_update_usr() to the next patch - Mathieu
---
include/linux/rseq.h | 21 ++++++++++++++++-----
kernel/rseq.c | 30 ++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 13 deletions(-)
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,22 +7,33 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(NULL, regs);
+ __rseq_handle_notify_resume(regs);
}
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq) {
- current->rseq.event.sched_switch = true;
- __rseq_handle_notify_resume(ksig, regs);
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+ __rseq_signal_deliver(ksig->sig, regs);
+ } else {
+ if (current->rseq.event.has_rseq)
+ __rseq_signal_deliver(ksig->sig, regs);
}
}
+/* Raised from context switch and exevce to force evaluation on exit to user */
static inline void rseq_sched_switch_event(struct task_struct *t)
{
if (t->rseq.event.has_rseq) {
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -250,13 +250,12 @@ static bool rseq_handle_cs(struct task_s
* respect to other threads scheduled on the same CPU, and with respect
* to signal handlers.
*/
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
{
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -275,10 +274,7 @@ void __rseq_handle_notify_resume(struct
if (unlikely(t->flags & PF_EXITING))
return;
- if (ksig)
- rseq_stat_inc(rseq_stats.signal);
- else
- rseq_stat_inc(rseq_stats.slowpath);
+ rseq_stat_inc(rseq_stats.slowpath);
/*
* Read and clear the event pending bit first. If the task
@@ -317,8 +313,26 @@ void __rseq_handle_notify_resume(struct
return;
error:
- sig = ksig ? ksig->sig : 0;
- force_sigsegv(sig);
+ force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+ rseq_stat_inc(rseq_stats.signal);
+ /*
+ * Don't update IDs, they are handled on exit to user if
+ * necessary. The important thing is to abort a critical section of
+ * the interrupted context as after this point the instruction
+ * pointer in @regs points to the signal handler.
+ */
+ if (unlikely(!rseq_handle_cs(current, regs))) {
+ /*
+ * Clear the errors just in case this might survive
+ * magically, but leave the rest intact.
+ */
+ current->rseq.event.error = 0;
+ force_sigsegv(sig);
+ }
}
/*
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (23 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
` (6 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.
Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.
The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.
This problem has been addressed differently, so that it's not longer
required to do that update before entering the guest.
That might be considered a user visible change, when the hosts thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V3: Moved rseq_update_usr() to this one - Mathieu
Documented the KVM visible change - Sean
---
include/linux/rseq_entry.h | 29 +++++++++++++++++
kernel/rseq.c | 76 +++++++++++++++++++--------------------------
2 files changed, 62 insertions(+), 43 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -368,6 +368,35 @@ bool rseq_set_ids_get_csaddr(struct task
return false;
}
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+ struct rseq_ids *ids, u32 node_id)
+{
+ u64 csaddr;
+
+ if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+ return false;
+
+ /*
+ * On architectures which utilize the generic entry code this
+ * allows to skip the critical section when the entry was not from
+ * a user space interrupt, unless debug mode is enabled.
+ */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ if (!static_branch_unlikely(&rseq_debug_enabled)) {
+ if (likely(!t->rseq.event.user_irq))
+ return true;
+ }
+ }
+ if (likely(!csaddr))
+ return true;
+ /* Sigh, this really needs to do work */
+ return rseq_update_user_cs(t, regs, csaddr);
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -239,38 +233,15 @@ static bool rseq_handle_cs(struct task_s
return false;
}
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
+ /* Preserve rseq state and user_irq state for exit to user */
+ const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- /*
- * If invoked from hypervisors before entering the guest via
- * resume_user_mode_work(), then @regs is a NULL pointer.
- *
- * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
- * it before returning from the ioctl() to user space when
- * rseq_event.sched_switch is set.
- *
- * So it's safe to ignore here instead of pointlessly updating it
- * in the vcpu_run() loop.
- */
- if (!regs)
- return;
-
if (unlikely(t->flags & PF_EXITING))
return;
@@ -294,26 +265,45 @@ void __rseq_handle_notify_resume(struct
* with the result handed in to allow the detection of
* inconsistencies.
*/
- scoped_guard(RSEQ_EVENT_GUARD) {
+ scoped_guard(irq) {
event = t->rseq.event.sched_switch;
- t->rseq.event.sched_switch = false;
+ t->rseq.event.all &= evt_mask.all;
ids.cpu_id = task_cpu(t);
ids.mm_cid = task_mm_cid(t);
}
- if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ if (!event)
return;
- if (!rseq_handle_cs(t, regs))
- goto error;
-
node_id = cpu_to_node(ids.cpu_id);
- if (!rseq_set_ids(t, &ids, node_id))
- goto error;
- return;
-error:
- force_sig(SIGSEGV);
+ if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+ /*
+ * Clear the errors just in case this might survive magically, but
+ * leave the rest intact.
+ */
+ t->rseq.event.error = 0;
+ force_sig(SIGSEGV);
+ }
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
+
+ rseq_slowpath_update_usr(regs);
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 26/31] rseq: Optimize event setting
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (24 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-28 15:57 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
` (5 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add a event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
fs/exec.c | 2 -
include/linux/rseq.h | 81 +++++++++++++++++++++++++++++++++++++++++----
include/linux/rseq_types.h | 11 +++++-
kernel/rseq.c | 2 -
kernel/sched/core.c | 7 +++
kernel/sched/sched.h | 5 ++
6 files changed, 95 insertions(+), 13 deletions(-)
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_sched_switch_event(current);
+ rseq_force_update();
current->in_execve = 0;
return retval;
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -11,7 +11,8 @@ void __rseq_handle_notify_resume(struct
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq)
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
__rseq_handle_notify_resume(regs);
}
@@ -33,12 +34,75 @@ static inline void rseq_signal_deliver(s
}
}
-/* Raised from context switch and exevce to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- if (t->rseq.event.has_rseq) {
- t->rseq.event.sched_switch = true;
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+ struct rseq_event *ev = &t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /*
+ * Avoid a boat load of conditionals by using simple logic
+ * to determine whether NOTIFY_RESUME needs to be raised.
+ *
+ * It's required when the CPU or MM CID has changed or
+ * the entry was from user space.
+ */
+ bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+ if (raise) {
+ ev->sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ } else {
+ if (ev->has_rseq) {
+ t->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ }
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+ t->rseq.event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+ /*
+ * Requires a comparison as the switch_mm_cid() code does not
+ * provide a conditional for it readily. So avoid excessive updates
+ * when nothing changes.
+ */
+ if (t->rseq.ids.mm_cid != cid)
+ t->rseq.event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.ids_changed = true;
+ current->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(current);
}
}
@@ -55,7 +119,7 @@ static inline void rseq_sched_switch_eve
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ rseq_raise_notify_resume(current);
}
static inline void rseq_reset(struct task_struct *t)
@@ -91,6 +155,9 @@ static inline void rseq_fork(struct task
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -11,20 +11,27 @@ struct rseq;
* struct rseq_event - Storage for rseq related event management
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
- * @sched_switch: True if the task was scheduled out
+ * @sched_switch: True if the task was scheduled and needs update on
+ * exit to user
+ * @ids_changed: Indicator that IDs need to be updated
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
*/
struct rseq_event {
union {
u64 all;
struct {
union {
- u16 events;
+ u32 events;
struct {
u8 sched_switch;
+ u8 ids_changed;
u8 user_irq;
};
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -465,7 +465,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
* are updated before returning to user-space.
*/
current->rseq.event.has_rseq = true;
- rseq_sched_switch_event(current);
+ rseq_force_update();
return 0;
efault:
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5118,7 +5118,6 @@ prepare_task_switch(struct rq *rq, struc
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5316,6 +5315,12 @@ context_switch(struct rq *rq, struct tas
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
+ /*
+ * Tell rseq that the task was scheduled in. Must be after
+ * switch_mm_cid() to get the TIF flag set.
+ */
+ rseq_sched_switch_event(next);
+
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2208,6 +2208,7 @@ static inline void __set_task_cpu(struct
smp_wmb();
WRITE_ONCE(task_thread_info(p)->cpu, cpu);
p->wake_cpu = cpu;
+ rseq_sched_set_task_cpu(p, cpu);
#endif /* CONFIG_SMP */
}
@@ -3808,8 +3809,10 @@ static inline void switch_mm_cid(struct
mm_cid_put_lazy(prev);
prev->mm_cid = -1;
}
- if (next->mm_cid_active)
+ if (next->mm_cid_active) {
next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+ rseq_sched_set_task_mm_cid(next, next->mm_cid);
+ }
}
#else /* !CONFIG_SCHED_MM_CID: */
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 27/31] rseq: Implement fast path for exit to user
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (25 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-28 16:09 ` Mathieu Desnoyers
` (4 more replies)
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
` (4 subsequent siblings)
31 siblings, 5 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.
This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.
The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:
- notes the fail in the event struct
- raises TIF_NOTIFY_RESUME
- returns false to the caller
The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.
If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.
The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:
load tsk::event::sched_switch
jnz inspect_user_space
mov $0, tsk::event::events
...
leave
So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.
If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.
This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.
Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V5: Reduce the decision to event::sched_switch - Mathieu
---
include/linux/rseq_entry.h | 131 +++++++++++++++++++++++++++++++++++++++++++--
include/linux/rseq_types.h | 3 +
kernel/rseq.c | 2
3 files changed, 133 insertions(+), 3 deletions(-)
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
unsigned long exit;
unsigned long signal;
unsigned long slowpath;
+ unsigned long fastpath;
unsigned long ids;
unsigned long cs;
unsigned long clear;
@@ -245,12 +246,13 @@ rseq_update_user_cs(struct task_struct *
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
unsigned long ip = instruction_pointer(regs);
+ unsigned long tasksize = TASK_SIZE;
u64 start_ip, abort_ip, offset;
u32 usig, __user *uc_sig;
rseq_stat_inc(rseq_stats.cs);
- if (unlikely(csaddr >= TASK_SIZE)) {
+ if (unlikely(csaddr >= tasksize)) {
t->rseq.event.fatal = true;
return false;
}
@@ -287,7 +289,7 @@ rseq_update_user_cs(struct task_struct *
* in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
* protection.
*/
- if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* The address is guaranteed to be >= 0 and < TASK_SIZE */
@@ -397,6 +399,126 @@ static rseq_inline bool rseq_update_usr(
return rseq_update_user_cs(t, regs, csaddr);
}
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ * handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ * so the straight inline operation is:
+ *
+ * - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ * - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ * - One 64-bit load to retrieve the start IP
+ * - One 64-bit load to retrieve the offset for calculating the end
+ * - One 64-bit load to retrieve the abort IP
+ * - One 64-bit load to retrieve the signature
+ * - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking. It
+ * provides protection against a rogue abort IP in kernel space, which
+ * would be exploitable at least on x86, and also against a rouge CS
+ * descriptor by checking the signature at the abort IP. Any fallout from
+ * invalid critical section descriptors is a user space problem. The debug
+ * case provides the full set of checks and terminates the task if a
+ * condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the fail.
+ */
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+ /*
+ * Page faults need to be disabled as this is called with
+ * interrupts disabled
+ */
+ guard(pagefault)();
+ if (likely(!t->rseq.event.ids_changed)) {
+ struct rseq __user *rseq = t->rseq.usrptr;
+ /*
+ * If IDs have not changed rseq_event::user_irq must be true
+ * See rseq_sched_switch_event().
+ */
+ u64 csaddr;
+
+ if (unlikely(!get_user_inline(csaddr, &rseq->rseq_cs)))
+ return false;
+
+ if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+ if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+ return false;
+ }
+ return true;
+ }
+
+ struct rseq_ids ids = {
+ .cpu_id = task_cpu(t),
+ .mm_cid = task_mm_cid(t),
+ };
+ u32 node_id = cpu_to_node(ids.cpu_id);
+
+ return rseq_update_usr(t, regs, &ids, node_id);
+}
+
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ struct task_struct *t = current;
+
+ /*
+ * If the task did not go through schedule or got the flag enforced
+ * by the rseq syscall or execve, then nothing to do here.
+ *
+ * CPU ID and MM CID can only change when going through a context
+ * switch.
+ *
+ * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+ * only when rseq_event::has_rseq is true. That conditional is
+ * required to avoid setting the TIF bit if RSEQ is not registered
+ * for a task. rseq_event::sched_switch is cleared when RSEQ is
+ * unregistered by a task so it's sufficient to check for the
+ * sched_switch bit alone.
+ *
+ * A sane compiler requires three instructions for the nothing to do
+ * case including clearing the events, but your mileage might vary.
+ */
+ if (unlikely((t->rseq.event.sched_switch))) {
+ rseq_stat_inc(rseq_stats.fastpath);
+
+ if (unlikely(!rseq_exit_user_update(regs, t)))
+ return true;
+ }
+ /* Clear state so next entry starts from a clean slate */
+ t->rseq.event.events = 0;
+ return false;
+}
+
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ return false;
+}
+
+#endif /* CONFIG_GENERIC_ENTRY */
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -421,9 +543,12 @@ static inline void rseq_debug_syscall_re
if (static_branch_unlikely(&rseq_debug_enabled))
__rseq_debug_syscall_return(regs);
}
-
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ return false;
+}
static inline void rseq_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,8 @@ struct rseq;
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
+ * is required
*
* @sched_switch and @ids_changed must be adjacent and the combo must be
* 16bit aligned to allow a single store, when both are set at the same
@@ -42,6 +44,7 @@ struct rseq_event {
u16 error;
struct {
u8 fatal;
+ u8 slowpath;
};
};
};
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_fi
stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_fi
seq_printf(m, "exit: %16lu\n", stats.exit);
seq_printf(m, "signal: %16lu\n", stats.signal);
seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "fastp: %16lu\n", stats.fastpath);
seq_printf(m, "ids: %16lu\n", stats.ids);
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 28/31] rseq: Switch to fast path processing on exit to user
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (26 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-28 16:14 ` Mathieu Desnoyers
` (3 more replies)
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
` (3 subsequent siblings)
31 siblings, 4 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.
This only works for architectures which use the generic entry code.
Architectures who still have their own incomplete hacks are not supported
and won't be.
This results in the following improvements:
Kernel build Before After Reduction
exit to user 80692981 80514451
signal checks: 32581 121 99%
slowpath runs: 1201408 1.49% 198 0.00% 100%
fastpath runs: 675941 0.84% N/A
id updates: 1233989 1.53% 50541 0.06% 96%
cs checks: 1125366 1.39% 0 0.00% 100%
cs cleared: 1125366 100% 0 100%
cs fixup: 0 0% 0
RSEQ selftests Before After Reduction
exit to user: 386281778 387373750
signal checks: 35661203 0 100%
slowpath runs: 140542396 36.38% 100 0.00% 100%
fastpath runs: 9509789 2.51% N/A
id updates: 176203599 45.62% 9087994 2.35% 95%
cs checks: 175587856 45.46% 4728394 1.22% 98%
cs cleared: 172359544 98.16% 1319307 27.90% 99%
cs fixup: 3228312 1.84% 3409087 72.10%
The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.
While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.
The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
V4: Move the rseq handling into a separate loop to avoid gotos later on
---
include/linux/irq-entry-common.h | 7 ++-----
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 18 ++++++++++++------
init/Kconfig | 2 +-
kernel/entry/common.c | 26 +++++++++++++++++++-------
kernel/rseq.c | 8 ++++++--
6 files changed, 41 insertions(+), 22 deletions(-)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to
*/
void arch_do_signal_or_restart(struct pt_regs *regs);
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
* exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(regs);
+ rseq_handle_slowpath(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,13 +7,19 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
{
- /* '&' is intentional to spare one conditional branch */
- if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(regs);
+ if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+ if (current->rseq.event.slowpath)
+ __rseq_handle_slowpath(regs);
+ } else {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+ __rseq_handle_slowpath(regs);
+ }
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -152,7 +158,7 @@ static inline void rseq_fork(struct task
}
#else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1941,7 +1941,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
- depends on RSEQ && DEBUG_KERNEL
+ depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,13 +11,8 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- * @regs: Pointer to pt_regs on entry stack
- * @ti_work: TIF work flags as read by the caller
- */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
{
/*
* Before returning to user space ensure that all pending work
@@ -62,6 +57,23 @@ void __weak arch_do_signal_or_restart(st
return ti_work;
}
+/**
+ * exit_to_user_mode_loop - do any pending work before leaving to user space
+ * @regs: Pointer to pt_regs on entry stack
+ * @ti_work: TIF work flags as read by the caller
+ */
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
+{
+ for (;;) {
+ ti_work = __exit_to_user_mode_loop(regs, ti_work);
+
+ if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ return ti_work;
+ ti_work = read_thread_flags();
+ }
+}
+
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -237,7 +237,11 @@ static bool rseq_handle_cs(struct task_s
static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
- /* Preserve rseq state and user_irq state for exit to user */
+ /*
+ * Preserve rseq state and user_irq state. The generic entry code
+ * clears user_irq on the way out, the non-generic entry
+ * architectures are not having user_irq.
+ */
const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
@@ -289,7 +293,7 @@ static void rseq_slowpath_update_usr(str
}
}
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
{
/*
* If invoked from hypervisors before entering the guest via
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 29/31] entry: Split up exit_to_user_mode_prepare()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (27 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
` (2 subsequent siblings)
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required for in the interrupt exit
case.
Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/entry-common.h | 2 -
include/linux/irq-entry-common.h | 42 ++++++++++++++++++++++++++++++++++-----
2 files changed, 38 insertions(+), 6 deletions(-)
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
local_irq_disable_exit_to_user();
- exit_to_user_mode_prepare(regs);
+ syscall_exit_to_user_mode_prepare(regs);
}
/**
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt
unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(str
* 3) call exit_to_user_mode_loop() if any flags from
* EXIT_TO_USER_MODE_WORK are set
* 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work;
@@ -224,15 +226,45 @@ static __always_inline void exit_to_user
ti_work = exit_to_user_mode_loop(regs, ti_work);
arch_exit_to_user_mode_prepare(regs, ti_work);
+}
- rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
}
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
@@ -297,7 +329,7 @@ static __always_inline void irqentry_ent
static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
- exit_to_user_mode_prepare(regs);
+ irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();
}
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode()
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (28 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-10-29 10:23 ` [patch V6 00/31] rseq: Optimize exit to user space Peter Zijlstra
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.
The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.
On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/irq-entry-common.h | 4 ++--
include/linux/rseq_entry.h | 21 +++++++++++++++++----
2 files changed, 19 insertions(+), 6 deletions(-)
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -247,7 +247,7 @@ static __always_inline void __exit_to_us
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -261,7 +261,7 @@ static __always_inline void syscall_exit
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -520,19 +520,31 @@ static __always_inline bool rseq_exit_to
#endif /* CONFIG_GENERIC_ENTRY */
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
rseq_stat_inc(rseq_stats.exit);
- if (static_branch_unlikely(&rseq_debug_enabled))
+ /* Needed to remove the store for the !lockdep case */
+ if (IS_ENABLED(CONFIG_LOCKDEP)) {
WARN_ON_ONCE(ev->sched_switch);
+ ev->events = 0;
+ }
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ lockdep_assert_once(!ev->sched_switch);
/*
* Ensure that event (especially user_irq) is cleared when the
* interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
+ * rseq processing could not clear it.
*/
ev->events = 0;
}
@@ -550,7 +562,8 @@ static inline bool rseq_exit_to_user_mod
{
return false;
}
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
^ permalink raw reply [flat|nested] 142+ messages in thread
* [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (29 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
@ 2025-10-27 8:45 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 more replies)
2025-10-29 10:23 ` [patch V6 00/31] rseq: Optimize exit to user space Peter Zijlstra
31 siblings, 3 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-27 8:45 UTC (permalink / raw)
To: LKML
Cc: Michael Jeanson, Jens Axboe, Mathieu Desnoyers, Peter Zijlstra,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.
Define a seperate TIF_RSEQ in the generic TIF space and enable the full
seperation of fast and slow path for architectures which utilize that.
That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a seperate TIF_RSEQ.
The hypervisor TIF handling does not include the seperate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.
The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
V4: Adjust it to the new outer loop mechanism
V3: Updated the comment for rseq_virt_userspace_exit() - Sean
Added a static assert for TIF_RSEQ != TIF_NOTIFY_RESUME - Sean
---
include/asm-generic/thread_info_tif.h | 3 +++
include/linux/irq-entry-common.h | 2 +-
include/linux/rseq.h | 22 +++++++++++++++-------
include/linux/rseq_entry.h | 27 +++++++++++++++++++++++++--
include/linux/thread_info.h | 5 +++++
kernel/entry/common.c | 10 ++++++++--
6 files changed, 57 insertions(+), 12 deletions(-)
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
+#define TIF_RSEQ 11 // Run RSEQ fast path
+#define _TIF_RSEQ BIT(TIF_RSEQ)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
- _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(s
static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_RSEQ);
}
/* Invoked from context switch to force evaluation on exit to user */
@@ -114,17 +114,25 @@ static inline void rseq_force_update(voi
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
*
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, because __rseq_handle_slowpath()
+ * does nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
*/
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
+ /*
+ * The generic optimization for deferring RSEQ updates until the next
+ * exit relies on having a dedicated TIF_RSEQ.
+ */
+ if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+ current->rseq.event.sched_switch)
rseq_raise_notify_resume(current);
}
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -507,13 +507,36 @@ static __always_inline bool __rseq_exit_
return false;
}
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+static __always_inline bool test_tif_rseq(unsigned long ti_work)
{
+ return ti_work & _TIF_RSEQ;
+}
+
+static __always_inline void clear_tif_rseq(void)
+{
+ static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
+ clear_thread_flag(TIF_RSEQ);
+}
+#else
+static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq(void) { }
+#endif
+
+static __always_inline bool
+rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ if (likely(!test_tif_rseq(ti_work)))
+ return false;
+
if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
current->rseq.event.slowpath = true;
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
return true;
}
+
+ clear_tif_rseq();
return false;
}
@@ -557,7 +580,7 @@ static inline void rseq_debug_syscall_re
}
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
+#ifndef TIF_RSEQ
+# define TIF_RSEQ TIF_NOTIFY_RESUME
+# define _TIF_RSEQ _TIF_NOTIFY_RESUME
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,6 +11,12 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
+#else
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
+#endif
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -18,7 +24,7 @@ static __always_inline unsigned long __e
* Before returning to user space ensure that all pending work
* items have been completed.
*/
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
local_irq_enable_exit_to_user(ti_work);
@@ -68,7 +74,7 @@ static __always_inline unsigned long __e
for (;;) {
ti_work = __exit_to_user_mode_loop(regs, ti_work);
- if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
return ti_work;
ti_work = read_thread_flags();
}
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run()
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
@ 2025-10-28 15:08 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:08 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:44, Thomas Gleixner wrote:
> Hypervisors invoke resume_user_mode_work() before entering the guest, which
> clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> space context available to them, so the rseq notify handler skips
> inspecting the critical section, but updates the CPU/MM CID values
> unconditionally so that the eventual pending rseq event is not lost on the
> way to user space.
[...]
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 15/31] rseq: Record interrupt from user space
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
@ 2025-10-28 15:26 ` Mathieu Desnoyers
2025-10-28 17:02 ` Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 1 reply; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:26 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:44, Thomas Gleixner wrote:
[...]
> @@ -281,6 +281,7 @@ static __always_inline void exit_to_user
> static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
> {
> enter_from_user_mode(regs);
> + rseq_note_user_irq_entry();
> }
>
Looking at x86, both exc_debug_user() and exc_int3() invoke
irqentry_enter_from_user_mode(), but there are various
other traps that can come from userspace (e.g. math_error,
exc_general_protection, ...). Some of those traps don't
necessarily end with a signal delivery to the offending
process. And some of those traps enable interrupts.
So what happens if such a trap is triggered from userspace,
and then scheduling happens on top of this trap ? Is this
skipping rseq ip fixup and rseq fields updates ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 19/31] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
@ 2025-10-28 15:40 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (3 subsequent siblings)
4 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:40 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:44, Thomas Gleixner wrote:
[...]
>
> +/*
> + * Check whether there is a valid critical section and whether the
> + * instruction pointer in @regs is inside the critical section.
> + *
> + * - If the critical section is invalid, terminate the task.
> + *
> + * - If valid and the instruction pointer is inside, set it to the abort IP
nit: "...to the abort IP."
Other than that:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 23/31] rseq: Provide and use rseq_set_ids()
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
@ 2025-10-28 15:47 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:47 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:45, Thomas Gleixner wrote:
[...]
> + unsafe_put_user(0UL, &rseq->rseq_cs, efault);
> + unsafe_put_user(0UL, &rseq->rseq_cs, efault);
Wow! You really want to make sure it's zeroed. ;-)
Other than this:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 24/31] rseq: Separate the signal delivery path
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
@ 2025-10-28 15:51 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:51 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:45, Thomas Gleixner wrote:
> Completely separate the signal delivery path from the notify handler as
> they have different semantics versus the event handling.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 26/31] rseq: Optimize event setting
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
@ 2025-10-28 15:57 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 15:57 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:45, Thomas Gleixner wrote:
[...]
> Add a event flag, which is set when the CPU or MM CID or both change.
"Add an event flag...".
Other than this nit:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 27/31] rseq: Implement fast path for exit to user
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
@ 2025-10-28 16:09 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (3 subsequent siblings)
4 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 16:09 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:45, Thomas Gleixner wrote:
[...]
> + * would be exploitable at least on x86, and also against a rouge CS
rouge -> rogue
> + * descriptor by checking the signature at the abort IP. Any fallout from
> + * invalid critical section descriptors is a user space problem. The debug
> + * case provides the full set of checks and terminates the task if a
> + * condition is not met.
> + *
> + * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
> + * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
> + * slow path there will handle the fail.
fail -> failure
Other than those nits:
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 28/31] rseq: Switch to fast path processing on exit to user
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
@ 2025-10-28 16:14 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 16:14 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-27 04:45, Thomas Gleixner wrote:
> This results in the following improvements:
Awesome! :-)
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 15/31] rseq: Record interrupt from user space
2025-10-28 15:26 ` Mathieu Desnoyers
@ 2025-10-28 17:02 ` Thomas Gleixner
2025-10-28 17:53 ` Mathieu Desnoyers
0 siblings, 1 reply; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-28 17:02 UTC (permalink / raw)
To: Mathieu Desnoyers, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On Tue, Oct 28 2025 at 11:26, Mathieu Desnoyers wrote:
> On 2025-10-27 04:44, Thomas Gleixner wrote:
> [...]
>> @@ -281,6 +281,7 @@ static __always_inline void exit_to_user
>> static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
>> {
>> enter_from_user_mode(regs);
>> + rseq_note_user_irq_entry();
>> }
>>
> Looking at x86, both exc_debug_user() and exc_int3() invoke
> irqentry_enter_from_user_mode(), but there are various
> other traps that can come from userspace (e.g. math_error,
> exc_general_protection, ...). Some of those traps don't
> necessarily end with a signal delivery to the offending
> process. And some of those traps enable interrupts.
They all go through irqentry_enter_from_user_mode(). See
DEFINE_IDTENTRY*() macros. They invoke:
irqentry_enter()
if (user_mode())
irqentry_enter_from_user_mode();
If that wouldn't be the case then all the RCU/NOHZ magic would not work
either. So any exception, trap, interrupt must go through this to
establish state correctly. Whether that's explicit as it's required for
int3 and debug_user or implicit through the IDTENTRY magic.
Thanks,
tglx
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 15/31] rseq: Record interrupt from user space
2025-10-28 17:02 ` Thomas Gleixner
@ 2025-10-28 17:53 ` Mathieu Desnoyers
0 siblings, 0 replies; 142+ messages in thread
From: Mathieu Desnoyers @ 2025-10-28 17:53 UTC (permalink / raw)
To: Thomas Gleixner, LKML
Cc: Michael Jeanson, Jens Axboe, Peter Zijlstra, Paul E. McKenney,
x86, Sean Christopherson, Wei Liu
On 2025-10-28 13:02, Thomas Gleixner wrote:
> On Tue, Oct 28 2025 at 11:26, Mathieu Desnoyers wrote:
>> On 2025-10-27 04:44, Thomas Gleixner wrote:
>> [...]
>>> @@ -281,6 +281,7 @@ static __always_inline void exit_to_user
>>> static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
>>> {
>>> enter_from_user_mode(regs);
>>> + rseq_note_user_irq_entry();
>>> }
>>>
>> Looking at x86, both exc_debug_user() and exc_int3() invoke
>> irqentry_enter_from_user_mode(), but there are various
>> other traps that can come from userspace (e.g. math_error,
>> exc_general_protection, ...). Some of those traps don't
>> necessarily end with a signal delivery to the offending
>> process. And some of those traps enable interrupts.
>
> They all go through irqentry_enter_from_user_mode(). See
> DEFINE_IDTENTRY*() macros. They invoke:
>
> irqentry_enter()
> if (user_mode())
> irqentry_enter_from_user_mode();
>
> If that wouldn't be the case then all the RCU/NOHZ magic would not work
> either. So any exception, trap, interrupt must go through this to
> establish state correctly. Whether that's explicit as it's required for
> int3 and debug_user or implicit through the IDTENTRY magic.
That's what I missed, thanks for the explanation.
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
^ permalink raw reply [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to TIF_RSEQ if supported
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 69c8e3d1610588d677faaa6035e1bd5de9431d6e
Gitweb: https://git.kernel.org/tip/69c8e3d1610588d677faaa6035e1bd5de9431d6e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:26 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:20 +01:00
rseq: Switch to TIF_RSEQ if supported
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.
Define a seperate TIF_RSEQ in the generic TIF space and enable the full
seperation of fast and slow path for architectures which utilize that.
That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a seperate TIF_RSEQ.
The hypervisor TIF handling does not include the seperate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.
The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
---
include/asm-generic/thread_info_tif.h | 3 +++-
include/linux/irq-entry-common.h | 2 +-
include/linux/rseq.h | 22 ++++++++++++++-------
include/linux/rseq_entry.h | 27 ++++++++++++++++++++++++--
include/linux/thread_info.h | 5 +++++-
kernel/entry/common.c | 10 ++++++++--
6 files changed, 57 insertions(+), 12 deletions(-)
diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index ee3793e..da1610a 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
+#define TIF_RSEQ 11 // Run RSEQ fast path
+#define _TIF_RSEQ BIT(TIF_RSEQ)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 3ee0f16..b5baf22 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
- _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index ded4baa..b5e4803 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_RSEQ);
}
/* Invoked from context switch to force evaluation on exit to user */
@@ -114,17 +114,25 @@ static inline void rseq_force_update(void)
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
*
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, because __rseq_handle_slowpath()
+ * does nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
*/
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
+ /*
+ * The generic optimization for deferring RSEQ updates until the next
+ * exit relies on having a dedicated TIF_RSEQ.
+ */
+ if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+ current->rseq.event.sched_switch)
rseq_raise_notify_resume(current);
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 98d70a8..d14719d 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -507,13 +507,36 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg
return false;
}
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+static __always_inline bool test_tif_rseq(unsigned long ti_work)
{
+ return ti_work & _TIF_RSEQ;
+}
+
+static __always_inline void clear_tif_rseq(void)
+{
+ static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
+ clear_thread_flag(TIF_RSEQ);
+}
+#else
+static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq(void) { }
+#endif
+
+static __always_inline bool
+rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ if (likely(!test_tif_rseq(ti_work)))
+ return false;
+
if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
current->rseq.event.slowpath = true;
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
return true;
}
+
+ clear_tif_rseq();
return false;
}
@@ -557,7 +580,7 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
}
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index dd925d8..b40de9b 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
+#ifndef TIF_RSEQ
+# define TIF_RSEQ TIF_NOTIFY_RESUME
+# define _TIF_RSEQ _TIF_NOTIFY_RESUME
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 523a3e7..5c792b3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,6 +11,12 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
+#else
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
+#endif
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -18,7 +24,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
* Before returning to user space ensure that all pending work
* items have been completed.
*/
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
local_irq_enable_exit_to_user(ti_work);
@@ -68,7 +74,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
for (;;) {
ti_work = __exit_to_user_mode_loop(regs, ti_work);
- if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
return ti_work;
ti_work = read_thread_flags();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Split up rseq_exit_to_user_mode()
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 1b3dd1c538a82055a1de699cefdd4ac3b60f9e49
Gitweb: https://git.kernel.org/tip/1b3dd1c538a82055a1de699cefdd4ac3b60f9e49
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:24 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:20 +01:00
rseq: Split up rseq_exit_to_user_mode()
Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.
The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.
On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de
---
include/linux/irq-entry-common.h | 4 ++--
include/linux/rseq_entry.h | 21 +++++++++++++++++----
2 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 9621d40..3ee0f16 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -247,7 +247,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -261,7 +261,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 9bf3d58..98d70a8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -519,19 +519,31 @@ static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
#endif /* CONFIG_GENERIC_ENTRY */
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
rseq_stat_inc(rseq_stats.exit);
- if (static_branch_unlikely(&rseq_debug_enabled))
+ /* Needed to remove the store for the !lockdep case */
+ if (IS_ENABLED(CONFIG_LOCKDEP)) {
WARN_ON_ONCE(ev->sched_switch);
+ ev->events = 0;
+ }
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ lockdep_assert_once(!ev->sched_switch);
/*
* Ensure that event (especially user_irq) is cleared when the
* interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
+ * rseq processing could not clear it.
*/
ev->events = 0;
}
@@ -549,7 +561,8 @@ static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
{
return false;
}
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Split up exit_to_user_mode_prepare()
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: d589306403107aa3ba5f014cb7d17b3d5db3cf94
Gitweb: https://git.kernel.org/tip/d589306403107aa3ba5f014cb7d17b3d5db3cf94
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:21 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:19 +01:00
entry: Split up exit_to_user_mode_prepare()
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required for in the interrupt exit
case.
Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.782234789@linutronix.de
---
include/linux/entry-common.h | 2 +-
include/linux/irq-entry-common.h | 42 +++++++++++++++++++++++++++----
2 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d967184..87efb38 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
local_irq_disable_exit_to_user();
- exit_to_user_mode_prepare(regs);
+ syscall_exit_to_user_mode_prepare(regs);
}
/**
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 8f5ceea..9621d40 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
* 3) call exit_to_user_mode_loop() if any flags from
* EXIT_TO_USER_MODE_WORK are set
* 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work;
@@ -224,15 +226,45 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
ti_work = exit_to_user_mode_loop(regs, ti_work);
arch_exit_to_user_mode_prepare(regs, ti_work);
+}
- rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
}
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
@@ -297,7 +329,7 @@ static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
- exit_to_user_mode_prepare(regs);
+ irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to fast path processing on exit to user
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
2025-10-28 16:14 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 84eeeb00203526c29135d5352833d01e53fc1e16
Gitweb: https://git.kernel.org/tip/84eeeb00203526c29135d5352833d01e53fc1e16
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:19 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:19 +01:00
rseq: Switch to fast path processing on exit to user
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.
This only works for architectures which use the generic entry code.
Architectures who still have their own incomplete hacks are not supported
and won't be.
This results in the following improvements:
Kernel build Before After Reduction
exit to user 80692981 80514451
signal checks: 32581 121 99%
slowpath runs: 1201408 1.49% 198 0.00% 100%
fastpath runs: 675941 0.84% N/A
id updates: 1233989 1.53% 50541 0.06% 96%
cs checks: 1125366 1.39% 0 0.00% 100%
cs cleared: 1125366 100% 0 100%
cs fixup: 0 0% 0
RSEQ selftests Before After Reduction
exit to user: 386281778 387373750
signal checks: 35661203 0 100%
slowpath runs: 140542396 36.38% 100 0.00% 100%
fastpath runs: 9509789 2.51% N/A
id updates: 176203599 45.62% 9087994 2.35% 95%
cs checks: 175587856 45.46% 4728394 1.22% 98%
cs cleared: 172359544 98.16% 1319307 27.90% 99%
cs fixup: 3228312 1.84% 3409087 72.10%
The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.
While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.
The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-----
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 18 ++++++++++++------
init/Kconfig | 2 +-
kernel/entry/common.c | 26 +++++++++++++++++++-------
kernel/rseq.c | 8 ++++++--
6 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index cb31fb8..8f5ceea 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*/
void arch_do_signal_or_restart(struct pt_regs *regs);
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
* exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index dd3bf7d..bf92227 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(regs);
+ rseq_handle_slowpath(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 684b5bd..ded4baa 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,13 +7,19 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
{
- /* '&' is intentional to spare one conditional branch */
- if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(regs);
+ if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+ if (current->rseq.event.slowpath)
+ __rseq_handle_slowpath(regs);
+ } else {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+ __rseq_handle_slowpath(regs);
+ }
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -152,7 +158,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
}
#else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
diff --git a/init/Kconfig b/init/Kconfig
index bde40ab..d1c606e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1941,7 +1941,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
- depends on RSEQ && DEBUG_KERNEL
+ depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 70a16db..523a3e7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,13 +11,8 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- * @regs: Pointer to pt_regs on entry stack
- * @ti_work: TIF work flags as read by the caller
- */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
{
/*
* Before returning to user space ensure that all pending work
@@ -62,6 +57,23 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
+/**
+ * exit_to_user_mode_loop - do any pending work before leaving to user space
+ * @regs: Pointer to pt_regs on entry stack
+ * @ti_work: TIF work flags as read by the caller
+ */
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
+{
+ for (;;) {
+ ti_work = __exit_to_user_mode_loop(regs, ti_work);
+
+ if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ return ti_work;
+ ti_work = read_thread_flags();
+ }
+}
+
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9633b08..cbb2ff7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -237,7 +237,11 @@ efault:
static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
- /* Preserve rseq state and user_irq state for exit to user */
+ /*
+ * Preserve rseq state and user_irq state. The generic entry code
+ * clears user_irq on the way out, the non-generic entry
+ * architectures are not having user_irq.
+ */
const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
@@ -289,7 +293,7 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
}
}
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
{
/*
* If invoked from hypervisors before entering the guest via
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Implement fast path for exit to user
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
2025-10-28 16:09 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-10-29 16:28 ` [patch V6 27/31] " Steven Rostedt
` (2 subsequent siblings)
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 4bcfcad41764c991433db2bf6f477d828f6300d3
Gitweb: https://git.kernel.org/tip/4bcfcad41764c991433db2bf6f477d828f6300d3
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:17 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:19 +01:00
rseq: Implement fast path for exit to user
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.
This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.
The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:
- notes the fail in the event struct
- raises TIF_NOTIFY_RESUME
- returns false to the caller
The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.
If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.
The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:
load tsk::event::sched_switch
jnz inspect_user_space
mov $0, tsk::event::events
...
leave
So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.
If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.
This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.
Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
---
include/linux/rseq_entry.h | 131 +++++++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 3 +-
kernel/rseq.c | 2 +-
3 files changed, 133 insertions(+), 3 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index aa1c046..9bf3d58 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
unsigned long exit;
unsigned long signal;
unsigned long slowpath;
+ unsigned long fastpath;
unsigned long ids;
unsigned long cs;
unsigned long clear;
@@ -245,12 +246,13 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
unsigned long ip = instruction_pointer(regs);
+ unsigned long tasksize = TASK_SIZE;
u64 start_ip, abort_ip, offset;
u32 usig, __user *uc_sig;
rseq_stat_inc(rseq_stats.cs);
- if (unlikely(csaddr >= TASK_SIZE)) {
+ if (unlikely(csaddr >= tasksize)) {
t->rseq.event.fatal = true;
return false;
}
@@ -287,7 +289,7 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
* in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
* protection.
*/
- if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* The address is guaranteed to be >= 0 and < TASK_SIZE */
@@ -397,6 +399,126 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r
return rseq_update_user_cs(t, regs, csaddr);
}
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ * handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ * so the straight inline operation is:
+ *
+ * - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ * - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ * - One 64-bit load to retrieve the start IP
+ * - One 64-bit load to retrieve the offset for calculating the end
+ * - One 64-bit load to retrieve the abort IP
+ * - One 64-bit load to retrieve the signature
+ * - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking. It
+ * provides protection against a rogue abort IP in kernel space, which
+ * would be exploitable at least on x86, and also against a rogue CS
+ * descriptor by checking the signature at the abort IP. Any fallout from
+ * invalid critical section descriptors is a user space problem. The debug
+ * case provides the full set of checks and terminates the task if a
+ * condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the failure.
+ */
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+ /*
+ * Page faults need to be disabled as this is called with
+ * interrupts disabled
+ */
+ guard(pagefault)();
+ if (likely(!t->rseq.event.ids_changed)) {
+ struct rseq __user *rseq = t->rseq.usrptr;
+ /*
+ * If IDs have not changed rseq_event::user_irq must be true
+ * See rseq_sched_switch_event().
+ */
+ u64 csaddr;
+
+ if (unlikely(!get_user_inline(csaddr, &rseq->rseq_cs)))
+ return false;
+
+ if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+ if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+ return false;
+ }
+ return true;
+ }
+
+ struct rseq_ids ids = {
+ .cpu_id = task_cpu(t),
+ .mm_cid = task_mm_cid(t),
+ };
+ u32 node_id = cpu_to_node(ids.cpu_id);
+
+ return rseq_update_usr(t, regs, &ids, node_id);
+}
+
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ struct task_struct *t = current;
+
+ /*
+ * If the task did not go through schedule or got the flag enforced
+ * by the rseq syscall or execve, then nothing to do here.
+ *
+ * CPU ID and MM CID can only change when going through a context
+ * switch.
+ *
+ * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+ * only when rseq_event::has_rseq is true. That conditional is
+ * required to avoid setting the TIF bit if RSEQ is not registered
+ * for a task. rseq_event::sched_switch is cleared when RSEQ is
+ * unregistered by a task so it's sufficient to check for the
+ * sched_switch bit alone.
+ *
+ * A sane compiler requires three instructions for the nothing to do
+ * case including clearing the events, but your mileage might vary.
+ */
+ if (unlikely((t->rseq.event.sched_switch))) {
+ rseq_stat_inc(rseq_stats.fastpath);
+
+ if (unlikely(!rseq_exit_user_update(regs, t)))
+ return true;
+ }
+ /* Clear state so next entry starts from a clean slate */
+ t->rseq.event.events = 0;
+ return false;
+}
+
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ return false;
+}
+
+#endif /* CONFIG_GENERIC_ENTRY */
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -421,9 +543,12 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
if (static_branch_unlikely(&rseq_debug_enabled))
__rseq_debug_syscall_return(regs);
}
-
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ return false;
+}
static inline void rseq_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index a1389ff..9c7a341 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,8 @@ struct rseq;
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
+ * is required
*
* @sched_switch and @ids_changed must be adjacent and the combo must be
* 16bit aligned to allow a single store, when both are set at the same
@@ -42,6 +44,7 @@ struct rseq_event {
u16 error;
struct {
u8 fatal;
+ u8 slowpath;
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 857fe61..9633b08 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
seq_printf(m, "exit: %16lu\n", stats.exit);
seq_printf(m, "signal: %16lu\n", stats.signal);
seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "fastp: %16lu\n", stats.fastpath);
seq_printf(m, "ids: %16lu\n", stats.ids);
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Optimize event setting
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
2025-10-28 15:57 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: ea7a1cb200defdbbdf1ae99405f0eaefe433eb2d
Gitweb: https://git.kernel.org/tip/ea7a1cb200defdbbdf1ae99405f0eaefe433eb2d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:14 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:19 +01:00
rseq: Optimize event setting
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add an event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 81 +++++++++++++++++++++++++++++++++----
include/linux/rseq_types.h | 11 ++++-
kernel/rseq.c | 2 +-
kernel/sched/core.c | 7 ++-
kernel/sched/sched.h | 5 +-
6 files changed, 95 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index e45b298..90e47eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_sched_switch_event(current);
+ rseq_force_update();
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 06c21f6..684b5bd 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -11,7 +11,8 @@ void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq)
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
__rseq_handle_notify_resume(regs);
}
@@ -33,12 +34,75 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
}
}
-/* Raised from context switch and exevce to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- if (t->rseq.event.has_rseq) {
- t->rseq.event.sched_switch = true;
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+ struct rseq_event *ev = &t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /*
+ * Avoid a boat load of conditionals by using simple logic
+ * to determine whether NOTIFY_RESUME needs to be raised.
+ *
+ * It's required when the CPU or MM CID has changed or
+ * the entry was from user space.
+ */
+ bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+ if (raise) {
+ ev->sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ } else {
+ if (ev->has_rseq) {
+ t->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ }
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+ t->rseq.event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+ /*
+ * Requires a comparison as the switch_mm_cid() code does not
+ * provide a conditional for it readily. So avoid excessive updates
+ * when nothing changes.
+ */
+ if (t->rseq.ids.mm_cid != cid)
+ t->rseq.event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.ids_changed = true;
+ current->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(current);
}
}
@@ -55,7 +119,7 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ rseq_raise_notify_resume(current);
}
static inline void rseq_reset(struct task_struct *t)
@@ -91,6 +155,9 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 7c12394..a1389ff 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -11,20 +11,27 @@ struct rseq;
* struct rseq_event - Storage for rseq related event management
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
- * @sched_switch: True if the task was scheduled out
+ * @sched_switch: True if the task was scheduled and needs update on
+ * exit to user
+ * @ids_changed: Indicator that IDs need to be updated
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
*/
struct rseq_event {
union {
u64 all;
struct {
union {
- u16 events;
+ u32 events;
struct {
u8 sched_switch;
+ u8 ids_changed;
u8 user_irq;
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index cbb1ab4..857fe61 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,7 +464,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* are updated before returning to user-space.
*/
current->rseq.event.has_rseq = true;
- rseq_sched_switch_event(current);
+ rseq_force_update();
return 0;
efault:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b75e8e1..579a8e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5118,7 +5118,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5316,6 +5315,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
+ /*
+ * Tell rseq that the task was scheduled in. Must be after
+ * switch_mm_cid() to get the TIF flag set.
+ */
+ rseq_sched_switch_event(next);
+
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 361f910..7fe961d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2208,6 +2208,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
smp_wmb();
WRITE_ONCE(task_thread_info(p)->cpu, cpu);
p->wake_cpu = cpu;
+ rseq_sched_set_task_cpu(p, cpu);
#endif /* CONFIG_SMP */
}
@@ -3806,8 +3807,10 @@ static inline void switch_mm_cid(struct rq *rq,
mm_cid_put_lazy(prev);
prev->mm_cid = -1;
}
- if (next->mm_cid_active)
+ if (next->mm_cid_active) {
next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+ rseq_sched_set_task_mm_cid(next, next->mm_cid);
+ }
}
#else /* !CONFIG_SCHED_MM_CID: */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Rework the TIF_NOTIFY handler
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: e1a607ccd2a4ea8885cd6d9626137b21b559551f
Gitweb: https://git.kernel.org/tip/e1a607ccd2a4ea8885cd6d9626137b21b559551f
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:12 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:18 +01:00
rseq: Rework the TIF_NOTIFY handler
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.
Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.
The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.
This problem has been addressed differently, so that it's not longer
required to do that update before entering the guest.
That might be considered a user visible change, when the hosts thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
---
include/linux/rseq_entry.h | 29 ++++++++++++++-
kernel/rseq.c | 76 ++++++++++++++++---------------------
2 files changed, 62 insertions(+), 43 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 37444e8..aa1c046 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -368,6 +368,35 @@ efault:
return false;
}
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+ struct rseq_ids *ids, u32 node_id)
+{
+ u64 csaddr;
+
+ if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+ return false;
+
+ /*
+ * On architectures which utilize the generic entry code this
+ * allows to skip the critical section when the entry was not from
+ * a user space interrupt, unless debug mode is enabled.
+ */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ if (!static_branch_unlikely(&rseq_debug_enabled)) {
+ if (likely(!t->rseq.event.user_irq))
+ return true;
+ }
+ }
+ if (likely(!csaddr))
+ return true;
+ /* Sigh, this really needs to do work */
+ return rseq_update_user_cs(t, regs, csaddr);
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 4d7ba22..cbb1ab4 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -239,38 +233,15 @@ efault:
return false;
}
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
+ /* Preserve rseq state and user_irq state for exit to user */
+ const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- /*
- * If invoked from hypervisors before entering the guest via
- * resume_user_mode_work(), then @regs is a NULL pointer.
- *
- * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
- * it before returning from the ioctl() to user space when
- * rseq_event.sched_switch is set.
- *
- * So it's safe to ignore here instead of pointlessly updating it
- * in the vcpu_run() loop.
- */
- if (!regs)
- return;
-
if (unlikely(t->flags & PF_EXITING))
return;
@@ -294,26 +265,45 @@ void __rseq_handle_notify_resume(struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- scoped_guard(RSEQ_EVENT_GUARD) {
+ scoped_guard(irq) {
event = t->rseq.event.sched_switch;
- t->rseq.event.sched_switch = false;
+ t->rseq.event.all &= evt_mask.all;
ids.cpu_id = task_cpu(t);
ids.mm_cid = task_mm_cid(t);
}
- if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ if (!event)
return;
- if (!rseq_handle_cs(t, regs))
- goto error;
-
node_id = cpu_to_node(ids.cpu_id);
- if (!rseq_set_ids(t, &ids, node_id))
- goto error;
- return;
-error:
- force_sig(SIGSEGV);
+ if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+ /*
+ * Clear the errors just in case this might survive magically, but
+ * leave the rest intact.
+ */
+ t->rseq.event.error = 0;
+ force_sig(SIGSEGV);
+ }
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
+
+ rseq_slowpath_update_usr(regs);
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Separate the signal delivery path
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
2025-10-28 15:51 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 7315751a4e9e4c00278717c9f344a5929ca9898c
Gitweb: https://git.kernel.org/tip/7315751a4e9e4c00278717c9f344a5929ca9898c
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:10 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:18 +01:00
rseq: Separate the signal delivery path
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.
The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.
No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.
The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
---
include/linux/rseq.h | 21 ++++++++++++++++-----
kernel/rseq.c | 30 ++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index a526074..06c21f6 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,22 +7,33 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(NULL, regs);
+ __rseq_handle_notify_resume(regs);
}
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq) {
- current->rseq.event.sched_switch = true;
- __rseq_handle_notify_resume(ksig, regs);
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+ __rseq_signal_deliver(ksig->sig, regs);
+ } else {
+ if (current->rseq.event.has_rseq)
+ __rseq_signal_deliver(ksig->sig, regs);
}
}
+/* Raised from context switch and exevce to force evaluation on exit to user */
static inline void rseq_sched_switch_event(struct task_struct *t)
{
if (t->rseq.event.has_rseq) {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 32d4ab7..4d7ba22 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -250,13 +250,12 @@ efault:
* respect to other threads scheduled on the same CPU, and with respect
* to signal handlers.
*/
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
{
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -275,10 +274,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
- if (ksig)
- rseq_stat_inc(rseq_stats.signal);
- else
- rseq_stat_inc(rseq_stats.slowpath);
+ rseq_stat_inc(rseq_stats.slowpath);
/*
* Read and clear the event pending bit first. If the task
@@ -317,8 +313,26 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
error:
- sig = ksig ? ksig->sig : 0;
- force_sigsegv(sig);
+ force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+ rseq_stat_inc(rseq_stats.signal);
+ /*
+ * Don't update IDs, they are handled on exit to user if
+ * necessary. The important thing is to abort a critical section of
+ * the interrupted context as after this point the instruction
+ * pointer in @regs points to the signal handler.
+ */
+ if (unlikely(!rseq_handle_cs(current, regs))) {
+ /*
+ * Clear the errors just in case this might survive
+ * magically, but leave the rest intact.
+ */
+ current->rseq.event.error = 0;
+ force_sigsegv(sig);
+ }
}
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_set_ids()
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
2025-10-28 15:47 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: cd806440ebf989ab5cde08d90d33f57b8d487e50
Gitweb: https://git.kernel.org/tip/cd806440ebf989ab5cde08d90d33f57b8d487e50
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:08 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:18 +01:00
rseq: Provide and use rseq_set_ids()
Provide a new and straight forward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.
It does all operations in one scoped_user_rw_access() section and retrieves
also the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.
On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.
Use it to replace the whole related zoo in rseq.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
---
fs/binfmt_elf.c | 2 +-
include/linux/rseq.h | 16 +-
include/linux/rseq_entry.h | 89 ++++++++++++++-
include/linux/sched.h | 10 +--
kernel/rseq.c | 236 +++++++-----------------------------
5 files changed, 151 insertions(+), 202 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e4653bb..3eb734c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
#include <linux/cred.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 2023c50..a526074 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,6 +5,8 @@
#ifdef CONFIG_RSEQ
#include <linux/sched.h>
+#include <uapi/linux/rseq.h>
+
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -48,7 +50,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
- t->rseq.ids.cpu_cid = ~0ULL;
+ t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
}
static inline void rseq_execve(struct task_struct *t)
@@ -59,15 +61,19 @@ static inline void rseq_execve(struct task_struct *t)
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
+ *
+ * On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault
+ * on the COW page on exit to user space, when the child stays on the same
+ * CPU as the parent. That's obviously not guaranteed, but in overcommit
+ * scenarios it is more likely and optimizes for the fork/exec case without
+ * taking the fault.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
+ if (clone_flags & CLONE_VM)
rseq_reset(t);
- } else {
+ else
t->rseq = current->rseq;
- t->rseq.ids.cpu_cid = ~0ULL;
- }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index fb53a6f..37444e8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
#endif
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_ids(struct task_struct *t);
static __always_inline void rseq_note_user_irq_entry(void)
{
@@ -194,6 +195,43 @@ efault:
return false;
}
+/*
+ * On debug kernels validate that user space did not mess with it if the
+ * debug branch is enabled.
+ */
+bool rseq_debug_validate_ids(struct task_struct *t)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+ u32 cpu_id, uval, node_id;
+
+ /*
+ * On the first exit after registering the rseq region CPU ID is
+ * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+ */
+ node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+ cpu_to_node(t->rseq.ids.cpu_id) : 0;
+
+ scoped_user_read_access(rseq, efault) {
+ unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+ if (cpu_id != t->rseq.ids.cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->cpu_id, efault);
+ if (uval != cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->node_id, efault);
+ if (uval != node_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->mm_cid, efault);
+ if (uval != t->rseq.ids.mm_cid)
+ goto die;
+ }
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
#endif /* RSEQ_BUILD_SLOW_PATH */
/*
@@ -279,6 +317,57 @@ efault:
return false;
}
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows to put the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+ u32 node_id, u64 *csaddr)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+
+ if (static_branch_unlikely(&rseq_debug_enabled)) {
+ if (!rseq_debug_validate_ids(t))
+ return false;
+ }
+
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+ unsafe_put_user(node_id, &rseq->node_id, efault);
+ unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+ if (csaddr)
+ unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+ }
+
+ /* Cache the new values */
+ t->rseq.ids.cpu_cid = ids->cpu_cid;
+ rseq_stat_inc(rseq_stats.ids);
+ rseq_trace_update(t, ids);
+ return true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d6e59c8..d4f4a41 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
#include <linux/rseq_types.h>
-#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
@@ -1408,15 +1407,6 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
struct rseq_data rseq;
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * This is a place holder to save a copy of the rseq fields for
- * validation of read-only fields. The struct rseq has a
- * variable-length array at the end, so it cannot be used
- * directly. Reserve a size large enough for the known fields.
- */
- char rseq_fields[sizeof(struct rseq)];
-#endif
#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 5f21270..32d4ab7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
# define RSEQ_EVENT_GUARD preempt
#endif
-/* The original rseq structure size (including padding) is 32 bytes. */
-#define ORIG_RSEQ_SIZE 32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void)
__initcall(rseq_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
- return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- static DEFINE_RATELIMIT_STATE(_rs,
- DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
- u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq.usrptr;
-
- /*
- * Validate fields which are required to be read-only by
- * user-space.
- */
- if (!user_read_access_begin(rseq, t->rseq.len))
- goto efault;
- unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
- unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
- unsafe_get_user(node_id, &rseq->node_id, efault_end);
- unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
- user_read_access_end();
-
- if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
- cpu_id != rseq_kernel_fields(t)->cpu_id ||
- node_id != rseq_kernel_fields(t)->node_id ||
- mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
- pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
- "\tcpu_id_start: %u ?= %u\n"
- "\tcpu_id: %u ?= %u\n"
- "\tnode_id: %u ?= %u\n"
- "\tmm_cid: %u ?= %u\n",
- t->pid, t->comm,
- cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
- cpu_id, rseq_kernel_fields(t)->cpu_id,
- node_id, rseq_kernel_fields(t)->node_id,
- mm_cid, rseq_kernel_fields(t)->mm_cid);
- }
-
- /* For now, only print a console warning on mismatch. */
- return 0;
-
-efault_end:
- user_read_access_end();
-efault:
- return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
- } while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id = raw_smp_processor_id();
- u32 node_id = cpu_to_node(cpu_id);
- u32 mm_cid = task_mm_cid(t);
-
- rseq_stat_inc(rseq_stats.ids);
-
- /* Validate read-only rseq fields on debug kernels */
- if (rseq_validate_ro_fields(t))
- goto efault;
- WARN_ON_ONCE((int) mm_cid < 0);
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /* Cache the user space values */
- t->rseq.ids.cpu_id = cpu_id;
- t->rseq.ids.mm_cid = mm_cid;
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally updated only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- trace_rseq_update(t);
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
- mm_cid = 0;
-
- /*
- * Validate read-only rseq fields.
- */
- if (rseq_validate_ro_fields(t))
- goto efault;
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- /*
- * Reset all fields to their initial state.
- *
- * All fields have an initial state of 0 except cpu_id which is set to
- * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
- * unregistration can figure out that rseq needs to be registered
- * again.
- */
- rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally reset only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
+ return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
}
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -410,6 +253,8 @@ efault:
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
+ struct rseq_ids ids;
+ u32 node_id;
bool event;
int sig;
@@ -456,6 +301,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
scoped_guard(RSEQ_EVENT_GUARD) {
event = t->rseq.event.sched_switch;
t->rseq.event.sched_switch = false;
+ ids.cpu_id = task_cpu(t);
+ ids.mm_cid = task_mm_cid(t);
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -464,7 +311,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!rseq_handle_cs(t, regs))
goto error;
- if (unlikely(rseq_update_cpu_node_id(t)))
+ node_id = cpu_to_node(ids.cpu_id);
+ if (!rseq_set_ids(t, &ids, node_id))
goto error;
return;
@@ -504,13 +352,33 @@ void rseq_syscall(struct pt_regs *regs)
}
#endif
+static bool rseq_reset_ids(void)
+{
+ struct rseq_ids ids = {
+ .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+ .mm_cid = 0,
+ };
+
+ /*
+ * If this fails, terminate it because this leaves the kernel in
+ * stupid state as exit to user space will try to fixup the ids
+ * again.
+ */
+ if (rseq_set_ids(current, &ids, 0))
+ return true;
+
+ force_sig(SIGSEGV);
+ return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE 32
+
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
- int ret;
-
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -521,9 +389,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
return -EINVAL;
if (current->rseq.sig != sig)
return -EPERM;
- ret = rseq_reset_rseq_cpu_node_id(current);
- if (ret)
- return ret;
+ if (!rseq_reset_ids())
+ return -EFAULT;
rseq_reset(current);
return 0;
}
@@ -563,27 +430,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- /*
- * If the rseq_cs pointer is non-NULL on registration, clear it to
- * avoid a potential segfault on return to user-space. The proper thing
- * to do would have been to fail the registration but this would break
- * older libcs that reuse the rseq area for new threads without
- * clearing the fields. Don't bother reading it, just reset it.
- */
- if (put_user(0UL, &rseq->rseq_cs))
- return -EFAULT;
+ scoped_user_write_access(rseq, efault) {
+ /*
+ * If the rseq_cs pointer is non-NULL on registration, clear it to
+ * avoid a potential segfault on return to user-space. The proper thing
+ * to do would have been to fail the registration but this would break
+ * older libcs that reuse the rseq area for new threads without
+ * clearing the fields. Don't bother reading it, just reset it.
+ */
+ unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ /* Initialize IDs in user space */
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+ unsafe_put_user(0U, &rseq->node_id, efault);
+ unsafe_put_user(0U, &rseq->mm_cid, efault);
+ }
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * Initialize the in-kernel rseq fields copy for validation of
- * read-only fields.
- */
- if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
- get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
- get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
- get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
- return -EFAULT;
-#endif
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
@@ -599,6 +461,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
*/
current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
-
return 0;
+
+efault:
+ return -EFAULT;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 785637e834310e88ce00cb9eed2ba7ba03ea561d
Gitweb: https://git.kernel.org/tip/785637e834310e88ce00cb9eed2ba7ba03ea561d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:05 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:17 +01:00
rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de
---
include/linux/entry-common.h | 2 +-
include/linux/rseq_entry.h | 9 +++++++++
kernel/rseq.c | 10 ++++++++--
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 75b194c..d967184 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -146,7 +146,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
local_irq_enable();
}
- rseq_syscall(regs);
+ rseq_debug_syscall_return(regs);
/*
* Do one-time syscall specific work. If these work items are
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5bdcf5b..fb53a6f 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -296,9 +296,18 @@ static __always_inline void rseq_exit_to_user_mode(void)
ev->events = 0;
}
+void __rseq_debug_syscall_return(struct pt_regs *regs);
+
+static inline void rseq_debug_syscall_return(struct pt_regs *regs)
+{
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ __rseq_debug_syscall_return(regs);
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 07d9fa7..5f21270 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -473,12 +473,11 @@ error:
force_sigsegv(sig);
}
-#ifdef CONFIG_DEBUG_RSEQ
/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
-void rseq_syscall(struct pt_regs *regs)
+void __rseq_debug_syscall_return(struct pt_regs *regs)
{
struct task_struct *t = current;
u64 csaddr;
@@ -496,6 +495,13 @@ void rseq_syscall(struct pt_regs *regs)
fail:
force_sig(SIGSEGV);
}
+
+#ifdef CONFIG_DEBUG_RSEQ
+/* Kept around to keep GENERIC_ENTRY=n architectures supported. */
+void rseq_syscall(struct pt_regs *regs)
+{
+ __rseq_debug_syscall_return(regs);
+}
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Make exit debugging static branch based
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 0042fe6474333f3e29984054fe4b3b56d1611847
Gitweb: https://git.kernel.org/tip/0042fe6474333f3e29984054fe4b3b56d1611847
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:02 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:17 +01:00
rseq: Make exit debugging static branch based
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.272660745@linutronix.de
---
include/linux/rseq_entry.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index f9510ce..5bdcf5b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -285,7 +285,7 @@ static __always_inline void rseq_exit_to_user_mode(void)
rseq_stat_inc(rseq_stats.exit);
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ if (static_branch_unlikely(&rseq_debug_enabled))
WARN_ON_ONCE(ev->sched_switch);
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Replace the original debug implementation
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-10-30 21:52 ` [patch V6 20/31] " Prakash Sangappa
` (2 subsequent siblings)
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 9674b4c51f437c35b7a0a385faf6c9bdebc1d3ab
Gitweb: https://git.kernel.org/tip/9674b4c51f437c35b7a0a385faf6c9bdebc1d3ab
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:00 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:17 +01:00
rseq: Replace the original debug implementation
Just utilize the new infrastructure and put the original one to rest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de
---
kernel/rseq.c | 81 +++++++-------------------------------------------
1 file changed, 12 insertions(+), 69 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 12a9b6a..07d9fa7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,84 +475,27 @@ error:
#ifdef CONFIG_DEBUG_RSEQ
/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq __user *urseq = t->rseq.usrptr;
- struct rseq_cs __user *urseq_cs;
- u32 __user *usig;
- u64 ptr;
- u32 sig;
- int ret;
-
- if (get_user(ptr, &rseq->rseq_cs))
- return -EFAULT;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
void rseq_syscall(struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
- struct rseq_cs rseq_cs;
+ u64 csaddr;
- if (!t->rseq.usrptr)
+ if (!t->rseq.event.has_rseq)
return;
- if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
- force_sig(SIGSEGV);
+ if (!get_user(csaddr, &t->rseq.usrptr->rseq_cs))
+ goto fail;
+ if (likely(!csaddr))
+ return;
+ if (unlikely(csaddr >= TASK_SIZE))
+ goto fail;
+ if (rseq_debug_update_user_cs(t, regs, csaddr))
+ return;
+fail:
+ force_sig(SIGSEGV);
}
-
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-10-28 15:40 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-10-29 16:04 ` [patch V6 19/31] " Steven Rostedt
` (2 subsequent siblings)
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 39c6292509385bfc947d99772c1f3ceca7c648cf
Gitweb: https://git.kernel.org/tip/39c6292509385bfc947d99772c1f3ceca7c648cf
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:57 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:16 +01:00
rseq: Provide and use rseq_update_user_cs()
Provide a straight forward implementation to check for and eventually
clear/fixup critical sections in user space.
The non-debug version does only the minimal sanity checks and aims for
efficiency.
There are two attack vectors, which are checked for:
1) An abort IP which is in the kernel address space. That would cause at
least x86 to return to kernel space via IRET.
2) A rogue critical section descriptor with an abort IP pointing to some
arbitrary address, which is not preceded by the RSEQ signature.
If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernels problem.
The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.
Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
---
include/linux/rseq_entry.h | 206 ++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 11 +-
kernel/rseq.c | 246 ++++++++++--------------------------
3 files changed, 291 insertions(+), 172 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ed8e5f8..f9510ce 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
current->rseq.event.user_irq = true;
}
+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ * - If the critical section is invalid, terminate the task.
+ *
+ * - If valid and the instruction pointer is inside, set it to the abort IP.
+ *
+ * - If valid and the instruction pointer is outside, clear the critical
+ * section address.
+ *
+ * Returns true, if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when a invalid
+ * section was found. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
+ unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+ unsigned long ip = instruction_pointer(regs);
+ u64 __user *uc_head = (u64 __user *) ucs;
+ u32 usig, __user *uc_sig;
+
+ scoped_user_rw_access(ucs, efault) {
+ /*
+ * Evaluate the user pile and exit if one of the conditions
+ * is not fulfilled.
+ */
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ if (unlikely(start_ip >= tasksize))
+ goto die;
+ /* If outside, just clear the critical section. */
+ if (ip < start_ip)
+ goto clear;
+
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ cs_end = start_ip + offset;
+ /* Check for overflow and wraparound */
+ if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+ goto die;
+
+ /* If not inside, clear it. */
+ if (ip >= cs_end)
+ goto clear;
+
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+ /* Ensure it's "valid" */
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+ goto die;
+ /* Validate that the abort IP is not in the critical section */
+ if (unlikely(abort_ip - start_ip < offset))
+ goto die;
+
+ /*
+ * Check version and flags for 0. No point in emitting
+ * deprecated warnings before dying. That could be done in
+ * the slow path eventually, but *shrug*.
+ */
+ unsafe_get_user(head, uc_head, efault);
+ if (unlikely(head))
+ goto die;
+
+ /* abort_ip - 4 is >= 0. See abort_ip check above */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* If not in interrupt from user context, let it die */
+ if (unlikely(!t->rseq.event.user_irq))
+ goto die;
+ }
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
+#endif /* RSEQ_BUILD_SLOW_PATH */
+
+/*
+ * This only ensures that abort_ip is in the user address space and
+ * validates that it is preceded by the signature.
+ *
+ * No other sanity checks are done here, that's what the debug code is for.
+ */
+static rseq_inline bool
+rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ unsigned long ip = instruction_pointer(regs);
+ u64 start_ip, abort_ip, offset;
+ u32 usig, __user *uc_sig;
+
+ rseq_stat_inc(rseq_stats.cs);
+
+ if (unlikely(csaddr >= TASK_SIZE)) {
+ t->rseq.event.fatal = true;
+ return false;
+ }
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ return rseq_debug_update_user_cs(t, regs, csaddr);
+
+ scoped_user_rw_access(ucs, efault) {
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+
+ /*
+ * No sanity checks. If user space screwed it up, it can
+ * keep the pieces. That's what debug code is for.
+ *
+ * If outside, just clear the critical section.
+ */
+ if (ip - start_ip >= offset)
+ goto clear;
+
+ /*
+ * Two requirements for @abort_ip:
+ * - Must be in user space as x86 IRET would happily return to
+ * the kernel.
+ * - The four bytes preceding the instruction at @abort_ip must
+ * contain the signature.
+ *
+ * The latter protects against the following attack vector:
+ *
+ * An attacker with limited abilities to write, creates a critical
+ * section descriptor, sets the abort IP to a library function or
+ * some other ROP gadget and stores the address of the descriptor
+ * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+ * protection.
+ */
+ if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ goto die;
+
+ /* The address is guaranteed to be >= 0 and < TASK_SIZE */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* Invalidate the critical section */
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ /* Update the instruction pointer */
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 80f6c39..7c12394 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -14,10 +14,12 @@ struct rseq;
* @sched_switch: True if the task was scheduled out
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
+ * @error: Compound error code for the slow path to analyze
+ * @fatal: User space data corrupted or invalid
*/
struct rseq_event {
union {
- u32 all;
+ u64 all;
struct {
union {
u16 events;
@@ -28,6 +30,13 @@ struct rseq_event {
};
u8 has_rseq;
+ u8 __pad;
+ union {
+ u16 error;
+ struct {
+ u8 fatal;
+ };
+ };
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 679ab8e..12a9b6a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -382,175 +382,18 @@ efault:
return -EFAULT;
}
-/*
- * Get the user-space pointer value stored in the 'rseq_cs' field.
- */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
-{
- if (!rseq_cs)
- return -EFAULT;
-
-#ifdef CONFIG_64BIT
- if (get_user(*rseq_cs, &rseq->rseq_cs))
- return -EFAULT;
-#else
- if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-#endif
-
- return 0;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq_cs __user *urseq_cs;
- u64 ptr;
- u32 __user *usig;
- u32 sig;
- int ret;
-
- ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
- if (ret)
- return ret;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-static bool rseq_warn_flags(const char *str, u32 flags)
-{
- u32 test_flags;
-
- if (!flags)
- return false;
- test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
- test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
- return true;
-}
-
-static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
-{
- u32 flags;
- int ret;
-
- if (rseq_warn_flags("rseq_cs", cs_flags))
- return -EINVAL;
-
- /* Get thread flags. */
- ret = get_user(flags, &t->rseq.usrptr->flags);
- if (ret)
- return ret;
-
- if (rseq_warn_flags("rseq", flags))
- return -EINVAL;
- return 0;
-}
-
-static int clear_rseq_cs(struct rseq __user *rseq)
-{
- /*
- * The rseq_cs field is set to NULL on preemption or signal
- * delivery on top of rseq assembly block, as well as on top
- * of code outside of the rseq assembly block. This performs
- * a lazy clear of the rseq_cs field.
- *
- * Set rseq_cs to NULL.
- */
-#ifdef CONFIG_64BIT
- return put_user(0UL, &rseq->rseq_cs);
-#else
- if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
- return -EFAULT;
- return 0;
-#endif
-}
-
-/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
+static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
- struct task_struct *t = current;
- struct rseq_cs rseq_cs;
- int ret;
-
- rseq_stat_inc(rseq_stats.cs);
-
- ret = rseq_get_rseq_cs(t, &rseq_cs);
- if (ret)
- return ret;
-
- /*
- * Handle potentially not being within a critical section.
- * If not nested over a rseq critical section, restart is useless.
- * Clear the rseq_cs pointer and return.
- */
- if (!in_rseq_cs(ip, &rseq_cs)) {
- rseq_stat_inc(rseq_stats.clear);
- return clear_rseq_cs(t->rseq.usrptr);
- }
- ret = rseq_check_flags(t, rseq_cs.flags);
- if (ret < 0)
- return ret;
- if (!abort)
- return 0;
- ret = clear_rseq_cs(t->rseq.usrptr);
- if (ret)
- return ret;
- rseq_stat_inc(rseq_stats.fixup);
- trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
- rseq_cs.abort_ip);
- instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
- return 0;
+ struct rseq __user *urseq = t->rseq.usrptr;
+ u64 csaddr;
+
+ scoped_user_read_access(urseq, efault)
+ unsafe_get_user(csaddr, &urseq->rseq_cs, efault);
+ if (likely(!csaddr))
+ return true;
+ return rseq_update_user_cs(t, regs, csaddr);
+efault:
+ return false;
}
/*
@@ -567,8 +410,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
- int ret, sig;
bool event;
+ int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -618,8 +461,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
return;
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
+ if (!rseq_handle_cs(t, regs))
goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
@@ -632,6 +474,68 @@ error:
}
#ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Unsigned comparison will be true when ip >= start_ip, and when
+ * ip < start_ip + post_commit_offset.
+ */
+static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
+{
+ return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
+}
+
+/*
+ * If the rseq_cs field of 'struct rseq' contains a valid pointer to
+ * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
+ */
+static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
+{
+ struct rseq __user *urseq = t->rseq.usrptr;
+ struct rseq_cs __user *urseq_cs;
+ u32 __user *usig;
+ u64 ptr;
+ u32 sig;
+ int ret;
+
+ if (get_user(ptr, &rseq->rseq_cs))
+ return -EFAULT;
+
+ /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
+ if (!ptr) {
+ memset(rseq_cs, 0, sizeof(*rseq_cs));
+ return 0;
+ }
+ /* Check that the pointer value fits in the user-space process space. */
+ if (ptr >= TASK_SIZE)
+ return -EINVAL;
+ urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+ if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
+ return -EFAULT;
+
+ if (rseq_cs->start_ip >= TASK_SIZE ||
+ rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
+ rseq_cs->abort_ip >= TASK_SIZE ||
+ rseq_cs->version > 0)
+ return -EINVAL;
+ /* Check for overflow. */
+ if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
+ return -EINVAL;
+ /* Ensure that abort_ip is not in the critical section. */
+ if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
+ return -EINVAL;
+
+ usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
+ ret = get_user(sig, usig);
+ if (ret)
+ return ret;
+
+ if (current->rseq.sig != sig) {
+ printk_ratelimited(KERN_WARNING
+ "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+ sig, current->rseq.sig, current->pid, usig);
+ return -EINVAL;
+ }
+ return 0;
+}
/*
* Terminate the process if a syscall is issued within a restartable
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide static branch for runtime debugging
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5885f2a4824dbf7628fc1fedce27f50657f7dbc8
Gitweb: https://git.kernel.org/tip/5885f2a4824dbf7628fc1fedce27f50657f7dbc8
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:55 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:16 +01:00
rseq: Provide static branch for runtime debugging
Config based debug is rarely turned on and is not available easily when
things go wrong.
Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de
---
Documentation/admin-guide/kernel-parameters.txt | 4 +-
include/linux/rseq_entry.h | 3 +-
init/Kconfig | 14 +++-
kernel/rseq.c | 73 +++++++++++++++-
4 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061..e638274 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6500,6 +6500,10 @@
Memory area to be used by remote processor image,
managed by CMA.
+ rseq_debug= [KNL] Enable or disable restartable sequence
+ debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+ Format: <bool>
+
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ff9080b..ed8e5f8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/tracepoint-defs.h>
@@ -64,6 +65,8 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip) { }
#endif /* !CONFIG_TRACEPOINT */
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/init/Kconfig b/init/Kconfig
index f39fdfb..bde40ab 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1925,10 +1925,24 @@ config RSEQ_STATS
If unsure, say N.
+config RSEQ_DEBUG_DEFAULT_ENABLE
+ default n
+ bool "Enable restartable sequences debug mode by default" if EXPERT
+ depends on RSEQ
+ help
+ This enables the static branch for debug mode of restartable
+ sequences.
+
+ This also can be controlled on the kernel command line via the
+ command line parameter "rseq_debug=0/1" and through debugfs.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
depends on RSEQ && DEBUG_KERNEL
+ select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c0dbe2e..679ab8e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -95,6 +95,27 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
+static inline void rseq_control_debug(bool on)
+{
+ if (on)
+ static_branch_enable(&rseq_debug_enabled);
+ else
+ static_branch_disable(&rseq_debug_enabled);
+}
+
+static int __init rseq_setup_debug(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+ rseq_control_debug(on);
+ return 1;
+}
+__setup("rseq_debug=", rseq_setup_debug);
+
#ifdef CONFIG_TRACEPOINTS
/*
* Out of line, so the actual update functions can be in a header to be
@@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_DEBUG_FS
#ifdef CONFIG_RSEQ_STATS
DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
-static int rseq_debug_show(struct seq_file *m, void *p)
+static int rseq_stats_show(struct seq_file *m, void *p)
{
struct rseq_stats stats = { };
unsigned int cpu;
@@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_file *m, void *p)
return 0;
}
+static int rseq_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_stats_show, inode->i_private);
+}
+
+static const struct file_operations stat_ops = {
+ .open = rseq_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_stats_init(struct dentry *root_dir)
+{
+ debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
+ return 0;
+}
+#else
+static inline void rseq_stats_init(struct dentry *root_dir) { }
+#endif /* CONFIG_RSEQ_STATS */
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ bool on = static_branch_unlikely(&rseq_debug_enabled);
+
+ seq_printf(m, "%d\n", on);
+ return 0;
+}
+
+static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ bool on;
+
+ if (kstrtobool_from_user(ubuf, count, &on))
+ return -EINVAL;
+
+ rseq_control_debug(on);
+ return count;
+}
+
static int rseq_debug_open(struct inode *inode, struct file *file)
{
return single_open(file, rseq_debug_show, inode->i_private);
}
-static const struct file_operations dfs_ops = {
+static const struct file_operations debug_ops = {
.open = rseq_debug_open,
.read = seq_read,
+ .write = rseq_debug_write,
.llseek = seq_lseek,
.release = single_release,
};
@@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void)
{
struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
- debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
+ rseq_stats_init(root_dir);
return 0;
}
__initcall(rseq_debugfs_init);
-#endif /* CONFIG_RSEQ_STATS */
+#endif /* CONFIG_DEBUG_FS */
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Expose lightweight statistics in debugfs
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 18e0aab129d3940b5d588e3eea3d2ee9286d6516
Gitweb: https://git.kernel.org/tip/18e0aab129d3940b5d588e3eea3d2ee9286d6516
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:52 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:16 +01:00
rseq: Expose lightweight statistics in debugfs
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.
The debugfs readout provides a racy sum of all counters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
---
include/linux/rseq.h | 16 +-------
include/linux/rseq_entry.h | 49 +++++++++++++++++++++++-
init/Kconfig | 12 ++++++-
kernel/rseq.c | 79 +++++++++++++++++++++++++++++++++----
4 files changed, 133 insertions(+), 23 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index eb0dd13..2023c50 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -29,21 +29,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
}
}
-static __always_inline void rseq_exit_to_user_mode(void)
-{
- struct rseq_event *ev = ¤t->rseq.event;
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
- WARN_ON_ONCE(ev->sched_switch);
-
- /*
- * Ensure that event (especially user_irq) is cleared when the
- * interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
- */
- ev->events = 0;
-}
-
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
* which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
@@ -92,7 +77,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
-static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5be507a..ff9080b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -2,6 +2,37 @@
#ifndef _LINUX_RSEQ_ENTRY_H
#define _LINUX_RSEQ_ENTRY_H
+/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
+#ifdef CONFIG_RSEQ_STATS
+#include <linux/percpu.h>
+
+struct rseq_stats {
+ unsigned long exit;
+ unsigned long signal;
+ unsigned long slowpath;
+ unsigned long ids;
+ unsigned long cs;
+ unsigned long clear;
+ unsigned long fixup;
+};
+
+DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
+
+/*
+ * Slow path has interrupts and preemption enabled, but the fast path
+ * runs with interrupts disabled so there is no point in having the
+ * preemption checks implied in __this_cpu_inc() for every operation.
+ */
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_stat_inc(which) this_cpu_inc((which))
+#else
+#define rseq_stat_inc(which) raw_cpu_inc((which))
+#endif
+
+#else /* CONFIG_RSEQ_STATS */
+#define rseq_stat_inc(x) do { } while (0)
+#endif /* !CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
@@ -39,8 +70,26 @@ static __always_inline void rseq_note_user_irq_entry(void)
current->rseq.event.user_irq = true;
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad2..f39fdfb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_STATS
+ default n
+ bool "Enable lightweight statistics of restartable sequences" if EXPERT
+ depends on RSEQ && DEBUG_FS
+ help
+ Enable lightweight counters which expose information about the
+ frequency of RSEQ operations via debugfs. Mostly interesting for
+ kernel debugging or performance analysis. While lightweight it's
+ still adding code into the user/kernel mode transitions.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
diff --git a/kernel/rseq.c b/kernel/rseq.c
index f49d311..c0dbe2e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -67,12 +67,16 @@
* F1. <failure>
*/
+/* Required to select the proper per_cpu ops for rseq_stats_inc() */
+#define RSEQ_BUILD_SLOW_PATH
+
+#include <linux/debugfs.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq_entry.h>
#include <linux/sched.h>
-#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq_entry.h>
+#include <linux/uaccess.h>
#include <linux/types.h>
-#include <linux/ratelimit.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
@@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_RSEQ_STATS
+DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ struct rseq_stats stats = { };
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
+ stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
+ stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
+ stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
+ stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
+ stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ }
+
+ seq_printf(m, "exit: %16lu\n", stats.exit);
+ seq_printf(m, "signal: %16lu\n", stats.signal);
+ seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "ids: %16lu\n", stats.ids);
+ seq_printf(m, "cs: %16lu\n", stats.cs);
+ seq_printf(m, "clear: %16lu\n", stats.clear);
+ seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ return 0;
+}
+
+static int rseq_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = rseq_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_debugfs_init(void)
+{
+ struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
+
+ debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ return 0;
+}
+__initcall(rseq_debugfs_init);
+#endif /* CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
@@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
- /*
- * Validate read-only rseq fields.
- */
+ rseq_stat_inc(rseq_stats.ids);
+
+ /* Validate read-only rseq fields on debug kernels */
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
+
if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
@@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
struct rseq_cs rseq_cs;
int ret;
+ rseq_stat_inc(rseq_stats.cs);
+
ret = rseq_get_rseq_cs(t, &rseq_cs);
if (ret)
return ret;
@@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* If not nested over a rseq critical section, restart is useless.
* Clear the rseq_cs pointer and return.
*/
- if (!in_rseq_cs(ip, &rseq_cs))
+ if (!in_rseq_cs(ip, &rseq_cs)) {
+ rseq_stat_inc(rseq_stats.clear);
return clear_rseq_cs(t->rseq.usrptr);
+ }
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
@@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
+ rseq_stat_inc(rseq_stats.fixup);
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
rseq_cs.abort_ip);
instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
@@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
+ if (ksig)
+ rseq_stat_inc(rseq_stats.signal);
+ else
+ rseq_stat_inc(rseq_stats.slowpath);
+
/*
* Read and clear the event pending bit first. If the task
* was not preempted or migrated or a signal is on the way,
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide tracepoint wrappers for inline code
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 2be690996a88b79741a7b00ba434576ca36d1912
Gitweb: https://git.kernel.org/tip/2be690996a88b79741a7b00ba434576ca36d1912
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:50 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:15 +01:00
rseq: Provide tracepoint wrappers for inline code
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
---
include/linux/rseq_entry.h | 28 ++++++++++++++++++++++++++++
kernel/rseq.c | 19 ++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ce30e87..5be507a 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,34 @@
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
+#include <linux/tracepoint-defs.h>
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+ if (tracepoint_enabled(rseq_update) && ids)
+ __rseq_trace_update(t);
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ if (tracepoint_enabled(rseq_ip_fixup))
+ __rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINT */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINT */
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/kernel/rseq.c b/kernel/rseq.c
index ad1e7ce..f49d311 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -70,7 +70,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/types.h>
#include <linux/ratelimit.h>
#include <asm/ptrace.h>
@@ -91,6 +91,23 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+ trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Record interrupt from user space
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
2025-10-28 15:26 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 3cdfc5701dcd290c100ece742bcd13c2e6415cda
Gitweb: https://git.kernel.org/tip/3cdfc5701dcd290c100ece742bcd13c2e6415cda
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:48 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:15 +01:00
rseq: Record interrupt from user space
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.
If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.
This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.905067101@linutronix.de
---
include/linux/irq-entry-common.h | 3 ++-
include/linux/rseq.h | 16 +++++++++++-----
include/linux/rseq_entry.h | 18 ++++++++++++++++++
include/linux/rseq_types.h | 2 ++
4 files changed, 33 insertions(+), 6 deletions(-)
create mode 100644 include/linux/rseq_entry.h
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 83c9d84..cb31fb8 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
#include <linux/context_tracking.h>
#include <linux/kmsan.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
#include <linux/tick.h>
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user_mode(void)
static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
+ rseq_note_user_irq_entry();
}
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 88067a6..eb0dd13 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
- current->rseq.event.events))
- current->rseq.event.events = 0;
- }
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
}
/*
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
new file mode 100644
index 0000000..ce30e87
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include <linux/rseq.h>
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+ current->rseq.event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 40901b0..80f6c39 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -12,6 +12,7 @@ struct rseq;
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
* @sched_switch: True if the task was scheduled out
+ * @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
*/
struct rseq_event {
@@ -22,6 +23,7 @@ struct rseq_event {
u16 events;
struct {
u8 sched_switch;
+ u8 user_irq;
};
};
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Cache CPU ID and MM CID values
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 1a1de8251dbd94829c7e5f8adb1ecbb3d28074ca
Gitweb: https://git.kernel.org/tip/1a1de8251dbd94829c7e5f8adb1ecbb3d28074ca
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:45 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:15 +01:00
rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
---
include/linux/rseq.h | 7 +++++--
include/linux/rseq_types.h | 21 +++++++++++++++++++++
include/trace/events/rseq.h | 4 ++--
kernel/rseq.c | 4 ++++
4 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 8c1c316..88067a6 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -57,6 +57,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
+ t->rseq.ids.cpu_cid = ~0ULL;
}
static inline void rseq_execve(struct task_struct *t)
@@ -70,10 +71,12 @@ static inline void rseq_execve(struct task_struct *t)
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM)
+ if (clone_flags & CLONE_VM) {
rseq_reset(t);
- else
+ } else {
t->rseq = current->rseq;
+ t->rseq.ids.cpu_cid = ~0ULL;
+ }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index f7a60c8..40901b0 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -31,17 +31,38 @@ struct rseq_event {
};
/**
+ * struct rseq_ids - Cache for ids, which need to be updated
+ * @cpu_cid: Compound of @cpu_id and @mm_cid to make the
+ * compiler emit a single compare on 64-bit
+ * @cpu_id: The CPU ID which was written last to user space
+ * @mm_cid: The MM CID which was written last to user space
+ *
+ * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */
+struct rseq_ids {
+ union {
+ u64 cpu_cid;
+ struct {
+ u32 cpu_id;
+ u32 mm_cid;
+ };
+ };
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
+ * @ids: Storage for cached CPU ID and MM CID
*/
struct rseq_data {
struct rseq __user *usrptr;
u32 len;
u32 sig;
struct rseq_event event;
+ struct rseq_ids ids;
};
#else /* CONFIG_RSEQ */
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index 823b47d..ce85d65 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
),
TP_fast_assign(
- __entry->cpu_id = raw_smp_processor_id();
+ __entry->cpu_id = t->rseq.ids.cpu_id;
__entry->node_id = cpu_to_node(__entry->cpu_id);
- __entry->mm_cid = task_mm_cid(t);
+ __entry->mm_cid = t->rseq.ids.mm_cid;
),
TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
diff --git a/kernel/rseq.c b/kernel/rseq.c
index aae6266..ad1e7ce 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
+ /* Cache the user space values */
+ t->rseq.ids.cpu_id = cpu_id;
+ t->rseq.ids.mm_cid = mm_cid;
+
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] sched: Move MM CID related functions to sched.h
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 47a507c0f0c93f6a0ae8bc4dfc049d000e9bd50a
Gitweb: https://git.kernel.org/tip/47a507c0f0c93f6a0ae8bc4dfc049d000e9bd50a
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:42 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:14 +01:00
sched: Move MM CID related functions to sched.h
There is nothing mm specific in that and including mm.h can cause header
recursion hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.778457951@linutronix.de
---
include/linux/mm.h | 25 -------------------------
include/linux/sched.h | 26 ++++++++++++++++++++++++++
2 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33b..17cfbba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,31 +2401,6 @@ struct zap_details {
/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
- return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
- /*
- * Use the processor id as a fall-back when the mm cid feature is
- * disabled. This provides functional per-cpu data structure accesses
- * in user-space, althrough it won't provide the memory usage benefits.
- */
- return raw_smp_processor_id();
-}
-#endif
-
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e64bff3..d6e59c8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2310,6 +2310,32 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+ return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm cid feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, althrough it won't provide the memory usage benefits.
+ */
+ return task_cpu(t);
+}
+#endif
+
#ifndef MODULE
#ifndef COMPILE_OFFSETS
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Inline irqentry_enter/exit_from/to_user_mode()
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 6a2d376e6e9bb4ac79ba1e72beaca5a74669aff6
Gitweb: https://git.kernel.org/tip/6a2d376e6e9bb4ac79ba1e72beaca5a74669aff6
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:40 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:14 +01:00
entry: Inline irqentry_enter/exit_from/to_user_mode()
There is no point to have this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de
---
include/linux/irq-entry-common.h | 13 +++++++++++--
kernel/entry/common.c | 13 -------------
2 files changed, 11 insertions(+), 15 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 9b1f386..83c9d84 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -278,7 +278,10 @@ static __always_inline void exit_to_user_mode(void)
*
* The function establishes state (lockdep, RCU (context tracking), tracing)
*/
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+}
/**
* irqentry_exit_to_user_mode - Interrupt exit work
@@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
* Interrupt exit is not invoking #1 which is the syscall specific one time
* work.
*/
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+ instrumentation_begin();
+ exit_to_user_mode_prepare(regs);
+ instrumentation_end();
+ exit_to_user_mode();
+}
#ifndef irqentry_state
/**
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f62e1d1..70a16db 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -62,19 +62,6 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
-}
-
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
-{
- instrumentation_begin();
- exit_to_user_mode_prepare(regs);
- instrumentation_end();
- exit_to_user_mode();
-}
-
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Remove syscall_enter_from_user_mode_prepare()
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 8a10c38923ff6cc15c9debb806305cbccca3806b
Gitweb: https://git.kernel.org/tip/8a10c38923ff6cc15c9debb806305cbccca3806b
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:38 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:14 +01:00
entry: Remove syscall_enter_from_user_mode_prepare()
Open code the only user in the x86 syscall code and reduce the zoo of
functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
---
arch/x86/entry/syscall_32.c | 3 ++-
include/linux/entry-common.h | 26 +++++---------------------
kernel/entry/syscall-common.c | 8 --------
3 files changed, 7 insertions(+), 30 deletions(-)
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index 2b15ea1..a67a644 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
* fetch EBP before invoking any of the syscall entry work
* functions.
*/
- syscall_enter_from_user_mode_prepare(regs);
+ enter_from_user_mode(regs);
instrumentation_begin();
+ local_irq_enable();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c585221..75b194c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -45,23 +45,6 @@
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
ARCH_SYSCALL_WORK_EXIT)
-/**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs: Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
@@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
* @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after invoking enter_from_user_mode(), enabling interrupts and
+ * extra architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -108,8 +91,9 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
* function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented.
*
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
+ * This is the combination of enter_from_user_mode() and
+ * syscall_enter_from_user_mode_work() to be used when there is no
+ * architecture specific work to be done between the two.
*
* Returns: The original or a modified syscall number. See
* syscall_enter_from_user_mode_work() for further explanation.
diff --git a/kernel/entry/syscall-common.c b/kernel/entry/syscall-common.c
index 66e6ba7..940a597 100644
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret ? : syscall;
}
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
- instrumentation_begin();
- local_irq_enable();
- instrumentation_end();
-}
-
/*
* If SYSCALL_EMU is set, then the only reason to report is when
* SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Cleanup header
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` [tip: core/rseq] entry: Clean up header tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 9ef94cc077010c99ff1e8a32d0719cecd8f79552
Gitweb: https://git.kernel.org/tip/9ef94cc077010c99ff1e8a32d0719cecd8f79552
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:36 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:13 +01:00
entry: Cleanup header
Cleanup the include ordering, kernel-doc and other trivialities before
making further changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.590338411@linutronix.de
---
include/linux/entry-common.h | 8 ++++----
include/linux/irq-entry-common.h | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 7177436..c585221 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
#define __LINUX_ENTRYCOMMON_H
#include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
#include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -37,6 +37,7 @@
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
+
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
@@ -61,8 +62,7 @@
*/
void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
* syscall_enter_from_user_mode_work - Check and handle work before invoking
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index e5941df..9b1f386 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_eqs(void) { return false; }
/**
* enter_from_user_mode - Establish state when coming from user mode
+ * @regs: Pointer to currents pt_regs
*
* Syscall/interrupt entry disables interrupts, but user mode is traced as
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Introduce struct rseq_data
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: d34619cdd24e964ada11874cc792c75a0f507714
Gitweb: https://git.kernel.org/tip/d34619cdd24e964ada11874cc792c75a0f507714
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:33 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:13 +01:00
rseq: Introduce struct rseq_data
In preparation for a major rewrite of this code, provide a data structure
for rseq management.
Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.
Create a storage struct for event management as well and put the
sched_switch event and a indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.
The indicators are explicitly not a bit field. Bit fields generate abysmal
code.
The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.
The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.
This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
---
include/linux/rseq.h | 48 ++++++++++++----------------
include/linux/rseq_types.h | 51 ++++++++++++++++++++++++++++++-
include/linux/sched.h | 14 +-------
kernel/ptrace.c | 6 ++--
kernel/rseq.c | 63 ++++++++++++++++++-------------------
5 files changed, 110 insertions(+), 72 deletions(-)
create mode 100644 include/linux/rseq_types.h
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 78fa4c1..8c1c316 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq)
+ if (current->rseq.event.has_rseq)
__rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq) {
- current->rseq_event_pending = true;
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.sched_switch = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
static inline void rseq_sched_switch_event(struct task_struct *t)
{
- if (t->rseq) {
- t->rseq_event_pending = true;
+ if (t->rseq.event.has_rseq) {
+ t->rseq.event.sched_switch = true;
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}
}
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
- current->rseq_event_pending = false;
+ if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
+ current->rseq.event.events))
+ current->rseq.event.events = 0;
}
}
@@ -49,35 +50,30 @@ static __always_inline void rseq_exit_to_user_mode(void)
*/
static inline void rseq_virt_userspace_exit(void)
{
- if (current->rseq_event_pending)
+ if (current->rseq.event.sched_switch)
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}
+static inline void rseq_reset(struct task_struct *t)
+{
+ memset(&t->rseq, 0, sizeof(t->rseq));
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ rseq_reset(t);
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
- } else {
+ if (clone_flags & CLONE_VM)
+ rseq_reset(t);
+ else
t->rseq = current->rseq;
- t->rseq_len = current->rseq_len;
- t->rseq_sig = current->rseq_sig;
- t->rseq_event_pending = current->rseq_event_pending;
- }
-}
-
-static inline void rseq_execve(struct task_struct *t)
-{
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
new file mode 100644
index 0000000..f7a60c8
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_RSEQ
+struct rseq;
+
+/**
+ * struct rseq_event - Storage for rseq related event management
+ * @all: Compound to initialize and clear the data efficiently
+ * @events: Compound to access events with a single load/store
+ * @sched_switch: True if the task was scheduled out
+ * @has_rseq: True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+ union {
+ u32 all;
+ struct {
+ union {
+ u16 events;
+ struct {
+ u8 sched_switch;
+ };
+ };
+
+ u8 has_rseq;
+ };
+ };
+};
+
+/**
+ * struct rseq_data - Storage for all rseq related data
+ * @usrptr: Pointer to the registered user space RSEQ memory
+ * @len: Length of the RSEQ region
+ * @sig: Signature of critial section abort IPs
+ * @event: Storage for event management
+ */
+struct rseq_data {
+ struct rseq __user *usrptr;
+ u32 len;
+ u32 sig;
+ struct rseq_event event;
+};
+
+#else /* CONFIG_RSEQ */
+struct rseq_data { };
+#endif /* !CONFIG_RSEQ */
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9599525..e64bff3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
#include <linux/task_io_accounting.h>
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
+#include <linux/rseq_types.h>
#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
@@ -1406,16 +1407,8 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
- /*
- * RmW on rseq_event_pending must be performed atomically
- * with respect to preemption.
- */
- bool rseq_event_pending;
-# ifdef CONFIG_DEBUG_RSEQ
+ struct rseq_data rseq;
+#ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
* validation of read-only fields. The struct rseq has a
@@ -1423,7 +1416,6 @@ struct task_struct {
* directly. Reserve a size large enough for the known fields.
*/
char rseq_fields[sizeof(struct rseq)];
-# endif
#endif
#ifdef CONFIG_SCHED_MM_CID
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84ef..392ec2f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
unsigned long size, void __user *data)
{
struct ptrace_rseq_configuration conf = {
- .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
- .rseq_abi_size = task->rseq_len,
- .signature = task->rseq_sig,
+ .rseq_abi_pointer = (u64)(uintptr_t)task->rseq.usrptr,
+ .rseq_abi_size = task->rseq.len,
+ .signature = task->rseq.sig,
.flags = 0,
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 81dddaf..aae6266 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -103,13 +103,13 @@ static int rseq_validate_ro_fields(struct task_struct *t)
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
/*
* Validate fields which are required to be read-only by
* user-space.
*/
- if (!user_read_access_begin(rseq, t->rseq_len))
+ if (!user_read_access_begin(rseq, t->rseq.len))
goto efault;
unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
@@ -147,10 +147,10 @@ efault:
* Update an rseq field and its in-kernel copy in lock-step to keep a coherent
* state.
*/
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
+#define rseq_unsafe_put_user(t, value, field, error_label) \
+ do { \
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
+ rseq_kernel_fields(t)->field = value; \
} while (0)
#else
@@ -160,12 +160,12 @@ static int rseq_validate_ro_fields(struct task_struct *t)
}
#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq->field, error_label)
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
#endif
static int rseq_update_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id = raw_smp_processor_id();
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
@@ -176,7 +176,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
@@ -201,7 +201,7 @@ efault:
static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
mm_cid = 0;
@@ -211,7 +211,7 @@ static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
/*
@@ -272,7 +272,7 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
u32 sig;
int ret;
- ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
+ ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
if (ret)
return ret;
@@ -305,10 +305,10 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
if (ret)
return ret;
- if (current->rseq_sig != sig) {
+ if (current->rseq.sig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq_sig, current->pid, usig);
+ sig, current->rseq.sig, current->pid, usig);
return -EINVAL;
}
return 0;
@@ -338,7 +338,7 @@ static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
return -EINVAL;
/* Get thread flags. */
- ret = get_user(flags, &t->rseq->flags);
+ ret = get_user(flags, &t->rseq.usrptr->flags);
if (ret)
return ret;
@@ -392,13 +392,13 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* Clear the rseq_cs pointer and return.
*/
if (!in_rseq_cs(ip, &rseq_cs))
- return clear_rseq_cs(t->rseq);
+ return clear_rseq_cs(t->rseq.usrptr);
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
if (!abort)
return 0;
- ret = clear_rseq_cs(t->rseq);
+ ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* inconsistencies.
*/
scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
+ event = t->rseq.event.sched_switch;
+ t->rseq.event.sched_switch = false;
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -492,7 +492,7 @@ void rseq_syscall(struct pt_regs *regs)
struct task_struct *t = current;
struct rseq_cs rseq_cs;
- if (!t->rseq)
+ if (!t->rseq.usrptr)
return;
if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
@@ -511,33 +511,31 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
/* Unregister rseq for current thread. */
- if (current->rseq != rseq || !current->rseq)
+ if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
return -EINVAL;
- if (rseq_len != current->rseq_len)
+ if (rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
ret = rseq_reset_rseq_cpu_node_id(current);
if (ret)
return ret;
- current->rseq = NULL;
- current->rseq_sig = 0;
- current->rseq_len = 0;
+ rseq_reset(current);
return 0;
}
if (unlikely(flags))
return -EINVAL;
- if (current->rseq) {
+ if (current->rseq.usrptr) {
/*
* If rseq is already registered, check whether
* the provided address differs from the prior
* one.
*/
- if (current->rseq != rseq || rseq_len != current->rseq_len)
+ if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
/* Already registered. */
return -EBUSY;
@@ -586,15 +584,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
*/
- current->rseq = rseq;
- current->rseq_len = rseq_len;
- current->rseq_sig = sig;
+ current->rseq.usrptr = rseq;
+ current->rseq.len = rseq_len;
+ current->rseq.sig = sig;
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
+ current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
return 0;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid CPU/MM CID updates when no event pending
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 49c347874852230f8a7a419ef1faa871f5b2206e
Gitweb: https://git.kernel.org/tip/49c347874852230f8a7a419ef1faa871f5b2206e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:31 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:13 +01:00
rseq: Avoid CPU/MM CID updates when no event pending
There is no need to update these values unconditionally if there is no
event pending.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de
---
kernel/rseq.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 01e7113..81dddaf 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ return;
+
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq, virt: Retrigger RSEQ after vcpu_run()
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-10-28 15:08 ` Mathieu Desnoyers
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers,
Sean Christopherson, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 7683b565b32e7ee98e6e3be7a16c7587f1c94b20
Gitweb: https://git.kernel.org/tip/7683b565b32e7ee98e6e3be7a16c7587f1c94b20
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:28 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:13 +01:00
rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that the eventual pending rseq event is not lost on the
way to user space.
This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.
It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().
This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.
Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
---
drivers/hv/mshv_root_main.c | 3 +-
include/linux/rseq.h | 17 ++++++++-
kernel/rseq.c | 78 ++++++++++++++++++------------------
virt/kvm/kvm_main.c | 7 +++-
4 files changed, 68 insertions(+), 37 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e3b2bd4..a21a0eb 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
#include <linux/crash_dump.h>
#include <linux/panic_notifier.h>
#include <linux/vmalloc.h>
+#include <linux/rseq.h>
#include "mshv_eventfd.h"
#include "mshv.h"
@@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
}
} while (!vp->run.flags.intercept_suspend);
+ rseq_virt_userspace_exit();
+
return ret;
}
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index b91f837..78fa4c1 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to_user_mode(void)
}
/*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+ if (current->rseq_event_pending)
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct task_struct *t)
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 59adc1a..01e7113 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
int ret, sig;
+ bool event;
+
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
if (unlikely(t->flags & PF_EXITING))
return;
/*
- * If invoked from hypervisors or IO-URING, then @regs is a NULL
- * pointer, so fixup cannot be done. If the syscall which led to
- * this invocation was invoked inside a critical section, then it
- * will either end up in this code again or a possible violation of
- * a syscall inside a critical region can only be detected by the
- * debug code in rseq_syscall() in a debug enabled kernel.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
*/
- if (regs) {
- /*
- * Read and clear the event pending bit first. If the task
- * was not preempted or migrated or a signal is on the way,
- * there is no point in doing any of the heavy lifting here
- * on production kernels. In that case TIF_NOTIFY_RESUME
- * was raised by some other functionality.
- *
- * This is correct because the read/clear operation is
- * guarded against scheduler preemption, which makes it CPU
- * local atomic. If the task is preempted right after
- * re-enabling preemption then TIF_NOTIFY_RESUME is set
- * again and this function is invoked another time _before_
- * the task is able to return to user mode.
- *
- * On a debug kernel, invoke the fixup code unconditionally
- * with the result handed in to allow the detection of
- * inconsistencies.
- */
- bool event;
-
- scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
- }
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
+ }
+
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
return;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a0ae2..4255fcf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
#include <linux/lockdep.h>
#include <linux/kthread.h>
#include <linux/suspend.h>
+#include <linux/rseq.h>
#include <asm/processor.h>
#include <asm/ioctl.h>
@@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_arch_vcpu_ioctl_run(vcpu);
vcpu->wants_to_run = false;
+ /*
+ * FIXME: Remove this hack once all KVM architectures
+ * support the generic TIF bits, i.e. a dedicated TIF_RSEQ.
+ */
+ rseq_virt_userspace_exit();
+
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify the event notification
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 0ebddaab76b21cab606e7c0b39a8b06d733bbbc9
Gitweb: https://git.kernel.org/tip/0ebddaab76b21cab606e7c0b39a8b06d733bbbc9
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:26 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:12 +01:00
rseq: Simplify the event notification
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 66 +++++++-------------------------------
include/linux/sched.h | 10 +++---
include/uapi/linux/rseq.h | 21 ++++--------
kernel/rseq.c | 28 +++++++++-------
kernel/sched/core.c | 5 +---
kernel/sched/membarrier.c | 8 ++---
7 files changed, 48 insertions(+), 92 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e..e45b298 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index c06a8e5..b91f837 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
#define _LINUX_RSEQ_H
#ifdef CONFIG_RSEQ
-
-#include <linux/preempt.h>
#include <linux/sched.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
- RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
- RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
- RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
- RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
- RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
- RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
- if (t->rseq)
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_resume(struct pt_regs *regs)
__rseq_handle_notify_resume(NULL, regs);
}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
if (current->rseq) {
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ current->rseq_event_pending = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
{
- __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
+ if (t->rseq) {
+ t->rseq_event_pending = true;
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ }
}
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
- current->rseq_event_mask = 0;
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+ current->rseq_event_pending = false;
}
}
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
} else {
t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
- t->rseq_event_mask = current->rseq_event_mask;
+ t->rseq_event_pending = current->rseq_event_pending;
}
}
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cbb7340..9599525 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,14 +1407,14 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
+ struct rseq __user *rseq;
+ u32 rseq_len;
+ u32 rseq_sig;
/*
- * RmW on rseq_event_mask must be performed atomically
+ * RmW on rseq_event_pending must be performed atomically
* with respect to preemption.
*/
- unsigned long rseq_event_mask;
+ bool rseq_event_pending;
# ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae..1b76d50 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
/*
* Restartable sequences flags field.
*
- * This field should only be updated by the thread which
- * registered this data structure. Read by the kernel.
- * Mainly used for single-stepping through rseq critical sections
- * with debuggers.
- *
- * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
- * Inhibit instruction sequence block restart on preemption
- * for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
- * Inhibit instruction sequence block restart on signal
- * delivery for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
- * Inhibit instruction sequence block restart on migration for
- * this thread.
+ * This field was initially intended to allow event masking for
+ * single-stepping through rseq critical sections with debuggers.
+ * The kernel does not support this anymore and the relevant bits
+ * are checked for being always false:
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
*/
__u32 flags;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 80af48a..59adc1a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD irq
+#else
+# define RSEQ_EVENT_GUARD preempt
+#endif
+
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
*/
if (regs) {
/*
- * Read and clear the event mask first. If the task was not
- * preempted or migrated or a signal is on the way, there
- * is no point in doing any of the heavy lifting here on
- * production kernels. In that case TIF_NOTIFY_RESUME was
- * raised by some other functionality.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
*
* This is correct because the read/clear operation is
* guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- u32 event_mask;
+ bool event;
scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
- ret = rseq_ip_fixup(regs, !!event_mask);
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
if (unlikely(ret < 0))
goto error;
}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67..b75e8e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3329,7 +3329,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
- rseq_migrate(p);
sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}
@@ -4763,7 +4762,6 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_task_group = tg;
}
#endif
- rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
@@ -4827,7 +4825,6 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -5121,7 +5118,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_preempt(prev);
+ rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 62fba83..6234456 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
* is negligible.
*/
smp_mb();
- rseq_preempt(current);
+ rseq_sched_switch_event(current);
}
static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(int flags, int cpu_id)
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
- * For RSEQ, don't rseq_preempt() the caller. User code
- * is not supposed to issue syscalls at all from inside an
- * rseq critical section.
+ * For RSEQ, don't invoke rseq_sched_switch_event() on the
+ * caller. User code is not supposed to issue syscalls at
+ * all from inside an rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
preempt_disable();
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify registration
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: bd88944825d9579d1240dcc55b5f57635298a121
Gitweb: https://git.kernel.org/tip/bd88944825d9579d1240dcc55b5f57635298a121
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:24 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:12 +01:00
rseq: Simplify registration
There is no point to read the critical section element in the newly
registered user space RSEQ struct first in order to clear it.
Just clear it and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.274661227@linutronix.de
---
kernel/rseq.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 51fafc4..80af48a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
- int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
int ret;
- u64 rseq_cs;
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
* avoid a potential segfault on return to user-space. The proper thing
* to do would have been to fail the registration but this would break
* older libcs that reuse the rseq area for new threads without
- * clearing the fields.
+ * clearing the fields. Don't bother reading it, just reset it.
*/
- if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
- return -EFAULT;
- if (rseq_cs && clear_rseq_cs(rseq))
+ if (put_user(0UL, &rseq->rseq_cs))
return -EFAULT;
#ifdef CONFIG_DEBUG_RSEQ
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Remove the ksig argument from rseq_handle_notify_resume()
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
@ 2025-10-29 10:23 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:23 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: dfd205204dd4c459d5911fd1cfb17f86cae80150
Gitweb: https://git.kernel.org/tip/dfd205204dd4c459d5911fd1cfb17f86cae80150
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:22 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:12 +01:00
rseq: Remove the ksig argument from rseq_handle_notify_resume()
There is no point for this being visible in the resume_to_user_mode()
handling.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.211520245@linutronix.de
---
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 13 +++++++------
2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0..dd3bf7d 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(NULL, regs);
+ rseq_handle_notify_resume(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 21f875a..c06a8e5 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq)
- __rseq_handle_notify_resume(ksig, regs);
+ __rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
struct pt_regs *regs)
{
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
- rseq_handle_notify_resume(ksig, regs);
+ if (current->rseq) {
+ scoped_guard(RSEQ_EVENT_GUARD)
+ __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ __rseq_handle_notify_resume(ksig, regs);
+ }
}
/* rseq_preempt() requires preemption to be disabled. */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* Re: [patch V6 00/31] rseq: Optimize exit to user space
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
` (30 preceding siblings ...)
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
@ 2025-10-29 10:23 ` Peter Zijlstra
31 siblings, 0 replies; 142+ messages in thread
From: Peter Zijlstra @ 2025-10-29 10:23 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Paul E. McKenney, x86, Sean Christopherson, Wei Liu
On Mon, Oct 27, 2025 at 09:44:14AM +0100, Thomas Gleixner wrote:
> Thomas Gleixner (31):
> rseq: Avoid pointless evaluation in __rseq_notify_resume()
> rseq: Condense the inline stubs
> rseq: Move algorithm comment to top
> rseq: Remove the ksig argument from rseq_handle_notify_resume()
> rseq: Simplify registration
> rseq: Simplify the event notification
> rseq, virt: Retrigger RSEQ after vcpu_run()
> rseq: Avoid CPU/MM CID updates when no event pending
> rseq: Introduce struct rseq_data
> entry: Cleanup header
> entry: Remove syscall_enter_from_user_mode_prepare()
> entry: Inline irqentry_enter/exit_from/to_user_mode()
> sched: Move MM CID related functions to sched.h
> rseq: Cache CPU ID and MM CID values
> rseq: Record interrupt from user space
> rseq: Provide tracepoint wrappers for inline code
> rseq: Expose lightweight statistics in debugfs
> rseq: Provide static branch for runtime debugging
> rseq: Provide and use rseq_update_user_cs()
> rseq: Replace the original debug implementation
> rseq: Make exit debugging static branch based
> rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
> rseq: Provide and use rseq_set_ids()
> rseq: Separate the signal delivery path
> rseq: Rework the TIF_NOTIFY handler
> rseq: Optimize event setting
> rseq: Implement fast path for exit to user
> rseq: Switch to fast path processing on exit to user
> entry: Split up exit_to_user_mode_prepare()
> rseq: Split up rseq_exit_to_user_mode()
> rseq: Switch to TIF_RSEQ if supported
Applied to tip/core/rseq.
^ permalink raw reply [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Move algorithm comment to top
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
@ 2025-10-29 10:24 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 83bae0cfe44a8173bbdf94fbce776bc8f8ef8dee
Gitweb: https://git.kernel.org/tip/83bae0cfe44a8173bbdf94fbce776bc8f8ef8dee
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:20 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:11 +01:00
rseq: Move algorithm comment to top
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de
---
kernel/rseq.c | 119 ++++++++++++++++++++++++-------------------------
1 file changed, 59 insertions(+), 60 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 246319d..51fafc4 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
*/
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ * init(rseq_cs)
+ * cpu = TLS->rseq::cpu_id_start
+ * [1] TLS->rseq::rseq_cs = rseq_cs
+ * [start_ip] ----------------------------
+ * [2] if (cpu != TLS->rseq::cpu_id)
+ * goto abort_ip;
+ * [3] <last_instruction_in_cs>
+ * [post_commit_ip] ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1. Userspace stores the address of the struct rseq_cs assembly
+ * block descriptor into the rseq_cs field of the registered
+ * struct rseq TLS area. This update is performed through a single
+ * store within the inline assembly instruction sequence.
+ * [start_ip]
+ *
+ * 2. Userspace tests to check whether the current cpu_id field match
+ * the cpu number loaded before start_ip, branching to abort_ip
+ * in case of a mismatch.
+ *
+ * If the sequence is preempted or interrupted by a signal
+ * at or after start_ip and before post_commit_ip, then the kernel
+ * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ * ip to abort_ip before returning to user-space, so the preempted
+ * execution resumes at abort_ip.
+ *
+ * 3. Userspace critical section final instruction before
+ * post_commit_ip is the commit. The critical section is
+ * self-terminating.
+ * [post_commit_ip]
+ *
+ * 4. <success>
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ * [abort_ip]
+ * F1. <failure>
+ */
+
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struct task_struct *t)
unsafe_put_user(value, &t->rseq->field, error_label)
#endif
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- * init(rseq_cs)
- * cpu = TLS->rseq::cpu_id_start
- * [1] TLS->rseq::rseq_cs = rseq_cs
- * [start_ip] ----------------------------
- * [2] if (cpu != TLS->rseq::cpu_id)
- * goto abort_ip;
- * [3] <last_instruction_in_cs>
- * [post_commit_ip] ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1. Userspace stores the address of the struct rseq_cs assembly
- * block descriptor into the rseq_cs field of the registered
- * struct rseq TLS area. This update is performed through a single
- * store within the inline assembly instruction sequence.
- * [start_ip]
- *
- * 2. Userspace tests to check whether the current cpu_id field match
- * the cpu number loaded before start_ip, branching to abort_ip
- * in case of a mismatch.
- *
- * If the sequence is preempted or interrupted by a signal
- * at or after start_ip and before post_commit_ip, then the kernel
- * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- * ip to abort_ip before returning to user-space, so the preempted
- * execution resumes at abort_ip.
- *
- * 3. Userspace critical section final instruction before
- * post_commit_ip is the commit. The critical section is
- * self-terminating.
- * [post_commit_ip]
- *
- * 4. <success>
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- * [abort_ip]
- * F1. <failure>
- */
-
static int rseq_update_cpu_node_id(struct task_struct *t)
{
struct rseq __user *rseq = t->rseq;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Condense the inline stubs
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
@ 2025-10-29 10:24 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 1bf005a23953e6d0afd16a161041ca6fae753739
Gitweb: https://git.kernel.org/tip/1bf005a23953e6d0afd16a161041ca6fae753739
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:18 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:11 +01:00
rseq: Condense the inline stubs
Scrolling over tons of pointless
{
}
lines to find the actual code is annoying at best.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.085971048@linutronix.de
---
include/linux/rseq.h | 47 ++++++++++---------------------------------
1 file changed, 12 insertions(+), 35 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 7622b73..21f875a 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq_event_mask = 0;
}
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
-
void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
#endif /* _LINUX_RSEQ_H */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid pointless evaluation in __rseq_notify_resume()
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
@ 2025-10-29 10:24 ` tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-10-29 10:24 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: fe9d47cb92603914b184b394a80f37e28274eacb
Gitweb: https://git.kernel.org/tip/fe9d47cb92603914b184b394a80f37e28274eacb
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:16 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 29 Oct 2025 11:07:11 +01:00
rseq: Avoid pointless evaluation in __rseq_notify_resume()
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
Add a exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it reword the convoluted comment why the pt_regs pointer can be
NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-
include/linux/rseq.h | 10 ++++-
kernel/rseq.c | 66 ++++++++++++++++++++-----------
3 files changed, 58 insertions(+), 25 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d643c7c..e5941df 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
-#include <linux/context_tracking.h>
#include <linux/tick.h>
-#include <linux/kmsan.h>
#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
arch_exit_to_user_mode_prepare(regs, ti_work);
+ rseq_exit_to_user_mode();
+
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 69553e7..7622b73 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct task_struct *t)
rseq_set_notify_resume(t);
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+ current->rseq_event_mask = 0;
+ }
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_execve(struct task_struct *t)
{
}
-
+static inline void rseq_exit_to_user_mode(void) { }
#endif
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 2452b73..246319d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *str, u32 flags)
return true;
}
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
{
- u32 flags, event_mask;
+ u32 flags;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
-
- /*
- * Load and clear event mask atomically with respect to
- * scheduler preemption and membarrier IPIs.
- */
- scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
- }
-
- return !!event_mask;
+ return 0;
}
static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs *regs)
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
- ret = rseq_need_restart(t, rseq_cs.flags);
- if (ret <= 0)
+ ret = rseq_check_flags(t, rseq_cs.flags);
+ if (ret < 0)
return ret;
+ if (!abort)
+ return 0;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
/*
- * regs is NULL if and only if the caller is in a syscall path. Skip
- * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
- * kill a misbehaving userspace on debug kernels.
+ * If invoked from hypervisors or IO-URING, then @regs is a NULL
+ * pointer, so fixup cannot be done. If the syscall which led to
+ * this invocation was invoked inside a critical section, then it
+ * will either end up in this code again or a possible violation of
+ * a syscall inside a critical region can only be detected by the
+ * debug code in rseq_syscall() in a debug enabled kernel.
*/
if (regs) {
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
- goto error;
+ /*
+ * Read and clear the event mask first. If the task was not
+ * preempted or migrated or a signal is on the way, there
+ * is no point in doing any of the heavy lifting here on
+ * production kernels. In that case TIF_NOTIFY_RESUME was
+ * raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
+ */
+ u32 event_mask;
+
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+ ret = rseq_ip_fixup(regs, !!event_mask);
+ if (unlikely(ret < 0))
+ goto error;
+ }
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* Re: [patch V6 19/31] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-10-28 15:40 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-10-29 16:04 ` Steven Rostedt
2025-10-29 21:00 ` Thomas Gleixner
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
4 siblings, 1 reply; 142+ messages in thread
From: Steven Rostedt @ 2025-10-29 16:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86, Sean Christopherson,
Wei Liu
On Mon, 27 Oct 2025 09:44:57 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> include/linux/rseq_entry.h | 206 +++++++++++++++++++++++++++++++++++++
> include/linux/rseq_types.h | 11 +-
> kernel/rseq.c | 244 +++++++++++++--------------------------------
> 3 files changed, 290 insertions(+), 171 deletions(-)
> --- a/include/linux/rseq_entry.h
> +++ b/include/linux/rseq_entry.h
> @@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_
> #ifdef CONFIG_RSEQ
> #include <linux/jump_label.h>
> #include <linux/rseq.h>
> +#include <linux/uaccess.h>
>
> #include <linux/tracepoint-defs.h>
>
> @@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(u
>
> DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
>
> +#ifdef RSEQ_BUILD_SLOW_PATH
> +#define rseq_inline
> +#else
> +#define rseq_inline __always_inline
> +#endif
> +
> +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
So rseq_debug_update_user_cs() is always declared.
> +
> static __always_inline void rseq_note_user_irq_entry(void)
> {
> if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
> current->rseq.event.user_irq = true;
> }
>
> +/*
> + * Check whether there is a valid critical section and whether the
> + * instruction pointer in @regs is inside the critical section.
> + *
> + * - If the critical section is invalid, terminate the task.
> + *
> + * - If valid and the instruction pointer is inside, set it to the abort IP
> + *
> + * - If valid and the instruction pointer is outside, clear the critical
> + * section address.
> + *
> + * Returns true, if the section was valid and either fixup or clear was
> + * done, false otherwise.
> + *
> + * In the failure case task::rseq_event::fatal is set when a invalid
> + * section was found. It's clear when the failure was an unresolved page
> + * fault.
> + *
> + * If inlined into the exit to user path with interrupts disabled, the
> + * caller has to protect against page faults with pagefault_disable().
> + *
> + * In preemptible task context this would be counterproductive as the page
> + * faults could not be fully resolved. As a consequence unresolved page
> + * faults in task context are fatal too.
> + */
> +
> +#ifdef RSEQ_BUILD_SLOW_PATH
But the function is only defined if RSEQ_BUILD_SLOW_PATH is defined.
> +/*
> + * The debug version is put out of line, but kept here so the code stays
> + * together.
> + *
> + * @csaddr has already been checked by the caller to be in user space
> + */
> +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
> + unsigned long csaddr)
> +{
> + struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
> + u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
> + unsigned long ip = instruction_pointer(regs);
> + u64 __user *uc_head = (u64 __user *) ucs;
> + u32 usig, __user *uc_sig;
> +
> + scoped_user_rw_access(ucs, efault) {
> + /*
> + * Evaluate the user pile and exit if one of the conditions
> + * is not fulfilled.
> + */
> + unsafe_get_user(start_ip, &ucs->start_ip, efault);
> + if (unlikely(start_ip >= tasksize))
> + goto die;
> + /* If outside, just clear the critical section. */
> + if (ip < start_ip)
> + goto clear;
> +
> + unsafe_get_user(offset, &ucs->post_commit_offset, efault);
> + cs_end = start_ip + offset;
> + /* Check for overflow and wraparound */
> + if (unlikely(cs_end >= tasksize || cs_end < start_ip))
> + goto die;
> +
> + /* If not inside, clear it. */
> + if (ip >= cs_end)
> + goto clear;
> +
> + unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
> + /* Ensure it's "valid" */
> + if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
> + goto die;
> + /* Validate that the abort IP is not in the critical section */
> + if (unlikely(abort_ip - start_ip < offset))
> + goto die;
> +
> + /*
> + * Check version and flags for 0. No point in emitting
> + * deprecated warnings before dying. That could be done in
> + * the slow path eventually, but *shrug*.
> + */
> + unsafe_get_user(head, uc_head, efault);
> + if (unlikely(head))
> + goto die;
> +
> + /* abort_ip - 4 is >= 0. See abort_ip check above */
> + uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
> + unsafe_get_user(usig, uc_sig, efault);
> + if (unlikely(usig != t->rseq.sig))
> + goto die;
> +
> + /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
> + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> + /* If not in interrupt from user context, let it die */
> + if (unlikely(!t->rseq.event.user_irq))
> + goto die;
> + }
> + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
> + instruction_pointer_set(regs, (unsigned long)abort_ip);
> + rseq_stat_inc(rseq_stats.fixup);
> + break;
> + clear:
> + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
> + rseq_stat_inc(rseq_stats.clear);
> + abort_ip = 0ULL;
> + }
> +
> + if (unlikely(abort_ip))
> + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
> + return true;
> +die:
> + t->rseq.event.fatal = true;
> +efault:
> + return false;
> +}
> +
There's no:
#else
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
unsigned long csaddr)
{
return false;
}
> +#endif /* RSEQ_BUILD_SLOW_PATH */
> +
> +/*
> + * This only ensures that abort_ip is in the user address space and
> + * validates that it is preceded by the signature.
> + *
> + * No other sanity checks are done here, that's what the debug code is for.
> + */
> +static rseq_inline bool
> +rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
> +{
> + struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
> + unsigned long ip = instruction_pointer(regs);
> + u64 start_ip, abort_ip, offset;
> + u32 usig, __user *uc_sig;
> +
> + rseq_stat_inc(rseq_stats.cs);
> +
> + if (unlikely(csaddr >= TASK_SIZE)) {
> + t->rseq.event.fatal = true;
> + return false;
> + }
> +
> + if (static_branch_unlikely(&rseq_debug_enabled))
> + return rseq_debug_update_user_cs(t, regs, csaddr);
Wouldn't the above reference to rseq_debug_update_user_cs() fail to build
if RSEQ_BUILD_SLOW_PATH is not defined?
Or am I missing something?
I see that it looks like RSEQ_BUILD_SLOW_PATH is always defined, but why
have this logic if it can't be not defined?
-- Steve
> +
> + scoped_user_rw_access(ucs, efault) {
> + unsafe_get_user(start_ip, &ucs->start_ip, efault);
> + unsafe_get_user(offset, &ucs->post_commit_offset, efault);
> + unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
> +
> + /*
> + * No sanity checks. If user space screwed it up, it can
> + * keep the pieces. That's what debug code is for.
> + *
> + * If outside, just clear the critical section.
> + */
> + if (ip - start_ip >= offset)
> + goto clear;
> +
> + /*
> + * Two requirements for @abort_ip:
> + * - Must be in user space as x86 IRET would happily return to
> + * the kernel.
> + * - The four bytes preceding the instruction at @abort_ip must
> + * contain the signature.
> + *
> + * The latter protects against the following attack vector:
> + *
> + * An attacker with limited abilities to write, creates a critical
> + * section descriptor, sets the abort IP to a library function or
> + * some other ROP gadget and stores the address of the descriptor
> + * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
> + * protection.
> + */
> + if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
> + goto die;
> +
> + /* The address is guaranteed to be >= 0 and < TASK_SIZE */
> + uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
> + unsafe_get_user(usig, uc_sig, efault);
> + if (unlikely(usig != t->rseq.sig))
> + goto die;
> +
> + /* Invalidate the critical section */
> + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
> + /* Update the instruction pointer */
> + instruction_pointer_set(regs, (unsigned long)abort_ip);
> + rseq_stat_inc(rseq_stats.fixup);
> + break;
> + clear:
> + unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
> + rseq_stat_inc(rseq_stats.clear);
> + abort_ip = 0ULL;
> + }
> +
> + if (unlikely(abort_ip))
> + rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
> + return true;
> +die:
> + t->rseq.event.fatal = true;
> +efault:
> + return false;
> +}
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 27/31] rseq: Implement fast path for exit to user
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
2025-10-28 16:09 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-10-29 16:28 ` Steven Rostedt
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
4 siblings, 0 replies; 142+ messages in thread
From: Steven Rostedt @ 2025-10-29 16:28 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86, Sean Christopherson,
Wei Liu
On Mon, 27 Oct 2025 09:45:17 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:
> Implement the actual logic for handling RSEQ updates in a fast path after
> handling the TIF work and at the point where the task is actually returning
> to user space.
>
> This is the right point to do that because at this point the CPU and the MM
> CID are stable and cannot longer change due to yet another reschedule.
"and can no longer change"
> That happens when the task is handling it via TIF_NOTIFY_RESUME in
> resume_user_mode_work(), which is invoked from the exit to user mode work
> loop.
>
> The function is invoked after the TIF work is handled and runs with
> interrupts disabled, which means it cannot resolve page faults. It
> therefore disables page faults and in case the access to the user space
> memory faults, it:
>
> - notes the fail in the event struct
> - raises TIF_NOTIFY_RESUME
> - returns false to the caller
>
> The caller has to go back to the TIF work, which runs with interrupts
> enabled and therefore can resolve the page faults. This happens mostly on
> fork() when the memory is marked COW.
>
> If the user memory inspection finds invalid data, the function returns
> false as well and sets the fatal flag in the event struct along with
> TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
> and terminate the task with SIGSEGV as documented.
>
> The initial decision to invoke any of this is based on one flags in the
"based on one flag in"
-- Steve
> event struct: @sched_switch. The decision is in pseudo ASM:
>
> load tsk::event::sched_switch
> jnz inspect_user_space
> mov $0, tsk::event::events
> ...
> leave
>
> So for the common case where the task was not scheduled out, this really
> boils down to three instructions before going out if the compiler is not
> completely stupid (and yes, some of them are).
>
> If the condition is true, then it checks, whether CPU ID or MM CID have
> changed. If so, then the CPU/MM IDs have to be updated and are thereby
> cached for the next round. The update unconditionally retrieves the user
> space critical section address to spare another user*begin/end() pair. If
> that's not zero and tsk::event::user_irq is set, then the critical section
> is analyzed and acted upon. If either zero or the entry came via syscall
> the critical section analysis is skipped.
>
> If the comparison is false then the critical section has to be analyzed
> because the event flag is then only true when entry from user was by
> interrupt.
>
> This is provided without the actual hookup to let reviewers focus on the
> implementation details. The hookup happens in the next step.
>
> Note: As with quite some other optimizations this depends on the generic
> entry infrastructure and is not enabled to be sucked into random
> architecture implementations.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 19/31] rseq: Provide and use rseq_update_user_cs()
2025-10-29 16:04 ` [patch V6 19/31] " Steven Rostedt
@ 2025-10-29 21:00 ` Thomas Gleixner
2025-10-29 21:53 ` Steven Rostedt
0 siblings, 1 reply; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-29 21:00 UTC (permalink / raw)
To: Steven Rostedt
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86, Sean Christopherson,
Wei Liu
On Wed, Oct 29 2025 at 12:04, Steven Rostedt wrote:
>> +bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
>
> So rseq_debug_update_user_cs() is always declared.
And?
>> +#ifdef RSEQ_BUILD_SLOW_PATH
>
> But the function is only defined if RSEQ_BUILD_SLOW_PATH is defined.
Right.
>
> There's no:
>
> #else
>
> bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
> unsigned long csaddr)
> {
> return false;
> }
And why would you need that?
>> + if (static_branch_unlikely(&rseq_debug_enabled))
>> + return rseq_debug_update_user_cs(t, regs, csaddr);
>
> Wouldn't the above reference to rseq_debug_update_user_cs() fail to build
> if RSEQ_BUILD_SLOW_PATH is not defined?
>
> Or am I missing something?
Yes. None of this is compiled at all when CONFIG_RSEQ=n.
When CONFIG_RSEQ=y then the slowpath muck is compiled into kernel/rseq.c
because that file defines RSEQ_BUILD_SLOW_PATH.
> I see that it looks like RSEQ_BUILD_SLOW_PATH is always defined, but why
> have this logic if it can't be not defined?
The fastpath inline version is compiled into the entry code and that
does not define RSEQ_BUILD_SLOW_PATH and therefore needs the
unconditional forward declaration of the debug function.
I made it this way because I wanted to keep the debug and non-debug
version next to each other as it's simpler that way to keep them in sync
instead of forgetting about the other variant because it's in a
different file.
Thanks,
tglx
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 19/31] rseq: Provide and use rseq_update_user_cs()
2025-10-29 21:00 ` Thomas Gleixner
@ 2025-10-29 21:53 ` Steven Rostedt
0 siblings, 0 replies; 142+ messages in thread
From: Steven Rostedt @ 2025-10-29 21:53 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86, Sean Christopherson,
Wei Liu
On Wed, 29 Oct 2025 22:00:37 +0100
Thomas Gleixner <tglx@linutronix.de> wrote:
> > Or am I missing something?
>
> Yes. None of this is compiled at all when CONFIG_RSEQ=n.
>
> When CONFIG_RSEQ=y then the slowpath muck is compiled into kernel/rseq.c
> because that file defines RSEQ_BUILD_SLOW_PATH.
What I was missing was that the code was in a header file.
I mistook it as being part of rseq.c and not living in
include/linux/rseq_entry.h :-p
-- Steve
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 20/31] rseq: Replace the original debug implementation
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-10-30 21:52 ` Prakash Sangappa
2025-10-31 14:27 ` Thomas Gleixner
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 1 reply; 142+ messages in thread
From: Prakash Sangappa @ 2025-10-30 21:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86@kernel.org,
Sean Christopherson, Wei Liu
> On Oct 27, 2025, at 1:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Just utilize the new infrastructure and put the original one to rest.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
[…]
> void rseq_syscall(struct pt_regs *regs)
> {
> - unsigned long ip = instruction_pointer(regs);
> struct task_struct *t = current;
> - struct rseq_cs rseq_cs;
> + u64 csaddr;
>
> - if (!t->rseq.usrptr)
> + if (!t->rseq.event.has_rseq)
> + return;
> + if (!get_user(csaddr, &t->rseq.usrptr->rseq_cs))
> + goto fail;
I believe get_user() would returns non zero(-EFAULT) on failure.
Noticed test program dies with segv when command line parameter rseq_debug=1 is set.
So, It should be
if (get_user(csaddr, &t->rseq.usrptr->rseq_cs))
goto fail;
-Prakash
> + if (likely(!csaddr))
> return;
> - if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
> - force_sig(SIGSEGV);
> + if (unlikely(csaddr >= TASK_SIZE))
> + goto fail;
> + if (rseq_debug_update_user_cs(t, regs, csaddr))
> + return;
> +fail:
> + force_sig(SIGSEGV);
> }
> -
> #endif
>
> /*
>
>
^ permalink raw reply [flat|nested] 142+ messages in thread
* Re: [patch V6 20/31] rseq: Replace the original debug implementation
2025-10-30 21:52 ` [patch V6 20/31] " Prakash Sangappa
@ 2025-10-31 14:27 ` Thomas Gleixner
0 siblings, 0 replies; 142+ messages in thread
From: Thomas Gleixner @ 2025-10-31 14:27 UTC (permalink / raw)
To: Prakash Sangappa
Cc: LKML, Michael Jeanson, Jens Axboe, Mathieu Desnoyers,
Peter Zijlstra, Paul E. McKenney, x86@kernel.org,
Sean Christopherson, Wei Liu
On Thu, Oct 30 2025 at 21:52, Prakash Sangappa wrote:
>> On Oct 27, 2025, at 1:45 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Just utilize the new infrastructure and put the original one to rest.
>>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> […]
>> void rseq_syscall(struct pt_regs *regs)
>> {
>> - unsigned long ip = instruction_pointer(regs);
>> struct task_struct *t = current;
>> - struct rseq_cs rseq_cs;
>> + u64 csaddr;
>>
>> - if (!t->rseq.usrptr)
>> + if (!t->rseq.event.has_rseq)
>> + return;
>> + if (!get_user(csaddr, &t->rseq.usrptr->rseq_cs))
>> + goto fail;
>
> I believe get_user() would returns non zero(-EFAULT) on failure.
> Noticed test program dies with segv when command line parameter rseq_debug=1 is set.
>
> So, It should be
> if (get_user(csaddr, &t->rseq.usrptr->rseq_cs))
> goto fail;
Indeed
^ permalink raw reply [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to TIF_RSEQ if supported
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 117f37ad484af20ab014e990b238ff04c43ba636
Gitweb: https://git.kernel.org/tip/117f37ad484af20ab014e990b238ff04c43ba636
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:26 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:22 +01:00
rseq: Switch to TIF_RSEQ if supported
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.
Define a seperate TIF_RSEQ in the generic TIF space and enable the full
seperation of fast and slow path for architectures which utilize that.
That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a seperate TIF_RSEQ.
The hypervisor TIF handling does not include the seperate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.
The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
---
include/asm-generic/thread_info_tif.h | 3 ++-
include/linux/irq-entry-common.h | 2 +-
include/linux/rseq.h | 22 ++++++++++++------
include/linux/rseq_entry.h | 32 +++++++++++++++++++++++---
include/linux/thread_info.h | 5 ++++-
kernel/entry/common.c | 10 ++++++--
6 files changed, 61 insertions(+), 13 deletions(-)
diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index ee3793e..da1610a 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
+#define TIF_RSEQ 11 // Run RSEQ fast path
+#define _TIF_RSEQ BIT(TIF_RSEQ)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index bc5d178..72e3f7a 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
- _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index ded4baa..b5e4803 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_RSEQ);
}
/* Invoked from context switch to force evaluation on exit to user */
@@ -114,17 +114,25 @@ static inline void rseq_force_update(void)
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
*
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, because __rseq_handle_slowpath()
+ * does nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
*/
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
+ /*
+ * The generic optimization for deferring RSEQ updates until the next
+ * exit relies on having a dedicated TIF_RSEQ.
+ */
+ if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+ current->rseq.event.sched_switch)
rseq_raise_notify_resume(current);
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 958a63e..c92167f 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -507,18 +507,44 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg
return false;
}
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+static __always_inline bool test_tif_rseq(unsigned long ti_work)
{
+ return ti_work & _TIF_RSEQ;
+}
+
+static __always_inline void clear_tif_rseq(void)
+{
+ static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
+ clear_thread_flag(TIF_RSEQ);
+}
+#else
+static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq(void) { }
+#endif
+
+static __always_inline bool
+rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ if (likely(!test_tif_rseq(ti_work)))
+ return false;
+
if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
current->rseq.event.slowpath = true;
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
return true;
}
+
+ clear_tif_rseq();
return false;
}
#else /* CONFIG_GENERIC_ENTRY */
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ return false;
+}
#endif /* !CONFIG_GENERIC_ENTRY */
static __always_inline void rseq_syscall_exit_to_user_mode(void)
@@ -577,7 +603,7 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
}
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index dd925d8..b40de9b 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
+#ifndef TIF_RSEQ
+# define TIF_RSEQ TIF_NOTIFY_RESUME
+# define _TIF_RSEQ _TIF_NOTIFY_RESUME
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 523a3e7..5c792b3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,6 +11,12 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
+#else
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
+#endif
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -18,7 +24,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
* Before returning to user space ensure that all pending work
* items have been completed.
*/
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
local_irq_enable_exit_to_user(ti_work);
@@ -68,7 +74,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
for (;;) {
ti_work = __exit_to_user_mode_loop(regs, ti_work);
- if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
return ti_work;
ti_work = read_thread_flags();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Split up rseq_exit_to_user_mode()
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: a2fc232771d36c41d1fc4efe005578bcf3e060c7
Gitweb: https://git.kernel.org/tip/a2fc232771d36c41d1fc4efe005578bcf3e060c7
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:24 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:22 +01:00
rseq: Split up rseq_exit_to_user_mode()
Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.
The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.
On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de
---
include/linux/irq-entry-common.h | 6 ++---
include/linux/rseq_entry.h | 36 +++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 5ea6172..bc5d178 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -240,7 +240,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_exit_to_user_mode_legacy();
__exit_to_user_mode_validate();
}
@@ -254,7 +254,7 @@ static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *reg
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -268,7 +268,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 3f13be7..958a63e 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -521,7 +521,37 @@ static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
#endif /* !CONFIG_GENERIC_ENTRY */
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ /* Needed to remove the store for the !lockdep case */
+ if (IS_ENABLED(CONFIG_LOCKDEP)) {
+ WARN_ON_ONCE(ev->sched_switch);
+ ev->events = 0;
+ }
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ lockdep_assert_once(!ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing could not clear it.
+ */
+ ev->events = 0;
+}
+
+/* Required to keep ARM64 working */
+static __always_inline void rseq_exit_to_user_mode_legacy(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -551,7 +581,9 @@ static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
{
return false;
}
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
+static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Split up exit_to_user_mode_prepare()
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 835baf7a6a1cf1ca8ade0dc22599e44beb764846
Gitweb: https://git.kernel.org/tip/835baf7a6a1cf1ca8ade0dc22599e44beb764846
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:21 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:22 +01:00
entry: Split up exit_to_user_mode_prepare()
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required for in the interrupt exit
case.
Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.782234789@linutronix.de
---
arch/arm64/kernel/entry-common.c | 2 +-
include/linux/entry-common.h | 2 +-
include/linux/irq-entry-common.h | 49 +++++++++++++++++++++++++++----
3 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index a9c8171..0a97e26 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -100,7 +100,7 @@ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
{
local_irq_disable();
- exit_to_user_mode_prepare(regs);
+ exit_to_user_mode_prepare_legacy(regs);
local_daif_mask();
mte_check_tfsr_exit();
exit_to_user_mode();
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d967184..87efb38 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
local_irq_disable_exit_to_user();
- exit_to_user_mode_prepare(regs);
+ syscall_exit_to_user_mode_prepare(regs);
}
/**
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 8f5ceea..5ea6172 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
* 3) call exit_to_user_mode_loop() if any flags from
* EXIT_TO_USER_MODE_WORK are set
* 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work;
@@ -224,15 +226,52 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
ti_work = exit_to_user_mode_loop(regs, ti_work);
arch_exit_to_user_mode_prepare(regs, ti_work);
+}
- rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
}
+/* Temporary workaround to keep ARM64 alive */
+static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
@@ -297,7 +336,7 @@ static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
- exit_to_user_mode_prepare(regs);
+ irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to fast path processing on exit to user
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
2025-10-28 16:14 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 4be94db4a3c16346303e226bdc85dcfa0c403437
Gitweb: https://git.kernel.org/tip/4be94db4a3c16346303e226bdc85dcfa0c403437
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:19 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:21 +01:00
rseq: Switch to fast path processing on exit to user
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.
This only works for architectures which use the generic entry code.
Architectures who still have their own incomplete hacks are not supported
and won't be.
This results in the following improvements:
Kernel build Before After Reduction
exit to user 80692981 80514451
signal checks: 32581 121 99%
slowpath runs: 1201408 1.49% 198 0.00% 100%
fastpath runs: 675941 0.84% N/A
id updates: 1233989 1.53% 50541 0.06% 96%
cs checks: 1125366 1.39% 0 0.00% 100%
cs cleared: 1125366 100% 0 100%
cs fixup: 0 0% 0
RSEQ selftests Before After Reduction
exit to user: 386281778 387373750
signal checks: 35661203 0 100%
slowpath runs: 140542396 36.38% 100 0.00% 100%
fastpath runs: 9509789 2.51% N/A
id updates: 176203599 45.62% 9087994 2.35% 95%
cs checks: 175587856 45.46% 4728394 1.22% 98%
cs cleared: 172359544 98.16% 1319307 27.90% 99%
cs fixup: 3228312 1.84% 3409087 72.10%
The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.
While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.
The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-----
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 18 ++++++++++++------
init/Kconfig | 2 +-
kernel/entry/common.c | 26 +++++++++++++++++++-------
kernel/rseq.c | 8 ++++++--
6 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index cb31fb8..8f5ceea 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*/
void arch_do_signal_or_restart(struct pt_regs *regs);
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
* exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index dd3bf7d..bf92227 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(regs);
+ rseq_handle_slowpath(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index abfbeb4..ded4baa 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,13 +7,19 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
{
- /* '&' is intentional to spare one conditional branch */
- if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(regs);
+ if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+ if (current->rseq.event.slowpath)
+ __rseq_handle_slowpath(regs);
+ } else {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+ __rseq_handle_slowpath(regs);
+ }
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -152,7 +158,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
}
#else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
diff --git a/init/Kconfig b/init/Kconfig
index bde40ab..d1c606e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1941,7 +1941,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
- depends on RSEQ && DEBUG_KERNEL
+ depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 70a16db..523a3e7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,13 +11,8 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- * @regs: Pointer to pt_regs on entry stack
- * @ti_work: TIF work flags as read by the caller
- */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
{
/*
* Before returning to user space ensure that all pending work
@@ -62,6 +57,23 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
+/**
+ * exit_to_user_mode_loop - do any pending work before leaving to user space
+ * @regs: Pointer to pt_regs on entry stack
+ * @ti_work: TIF work flags as read by the caller
+ */
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
+{
+ for (;;) {
+ ti_work = __exit_to_user_mode_loop(regs, ti_work);
+
+ if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ return ti_work;
+ ti_work = read_thread_flags();
+ }
+}
+
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c5d6336..395d8b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -237,7 +237,11 @@ efault:
static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
- /* Preserve rseq state and user_irq state for exit to user */
+ /*
+ * Preserve rseq state and user_irq state. The generic entry code
+ * clears user_irq on the way out, the non-generic entry
+ * architectures are not having user_irq.
+ */
const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
@@ -289,7 +293,7 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
}
}
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
{
/*
* If invoked from hypervisors before entering the guest via
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Implement fast path for exit to user
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
` (2 preceding siblings ...)
2025-10-29 16:28 ` [patch V6 27/31] " Steven Rostedt
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 968b4673b3cbaaa027feed243ea6c0f791b3b3a8
Gitweb: https://git.kernel.org/tip/968b4673b3cbaaa027feed243ea6c0f791b3b3a8
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:17 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:21 +01:00
rseq: Implement fast path for exit to user
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.
This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.
The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:
- notes the fail in the event struct
- raises TIF_NOTIFY_RESUME
- returns false to the caller
The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.
If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.
The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:
load tsk::event::sched_switch
jnz inspect_user_space
mov $0, tsk::event::events
...
leave
So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.
If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.
This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.
Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
---
include/linux/rseq_entry.h | 133 +++++++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 3 +-
kernel/rseq.c | 2 +-
3 files changed, 135 insertions(+), 3 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index aa1c046..3f13be7 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
unsigned long exit;
unsigned long signal;
unsigned long slowpath;
+ unsigned long fastpath;
unsigned long ids;
unsigned long cs;
unsigned long clear;
@@ -245,12 +246,13 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
unsigned long ip = instruction_pointer(regs);
+ unsigned long tasksize = TASK_SIZE;
u64 start_ip, abort_ip, offset;
u32 usig, __user *uc_sig;
rseq_stat_inc(rseq_stats.cs);
- if (unlikely(csaddr >= TASK_SIZE)) {
+ if (unlikely(csaddr >= tasksize)) {
t->rseq.event.fatal = true;
return false;
}
@@ -287,7 +289,7 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
* in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
* protection.
*/
- if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* The address is guaranteed to be >= 0 and < TASK_SIZE */
@@ -397,6 +399,128 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r
return rseq_update_user_cs(t, regs, csaddr);
}
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ * handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ * so the straight inline operation is:
+ *
+ * - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ * - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ * - One 64-bit load to retrieve the start IP
+ * - One 64-bit load to retrieve the offset for calculating the end
+ * - One 64-bit load to retrieve the abort IP
+ * - One 64-bit load to retrieve the signature
+ * - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking. It
+ * provides protection against a rogue abort IP in kernel space, which
+ * would be exploitable at least on x86, and also against a rogue CS
+ * descriptor by checking the signature at the abort IP. Any fallout from
+ * invalid critical section descriptors is a user space problem. The debug
+ * case provides the full set of checks and terminates the task if a
+ * condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the failure.
+ */
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+ /*
+ * Page faults need to be disabled as this is called with
+ * interrupts disabled
+ */
+ guard(pagefault)();
+ if (likely(!t->rseq.event.ids_changed)) {
+ struct rseq __user *rseq = t->rseq.usrptr;
+ /*
+ * If IDs have not changed rseq_event::user_irq must be true
+ * See rseq_sched_switch_event().
+ */
+ u64 csaddr;
+
+ if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
+ return false;
+
+ if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+ if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+ return false;
+ }
+ return true;
+ }
+
+ struct rseq_ids ids = {
+ .cpu_id = task_cpu(t),
+ .mm_cid = task_mm_cid(t),
+ };
+ u32 node_id = cpu_to_node(ids.cpu_id);
+
+ return rseq_update_usr(t, regs, &ids, node_id);
+}
+
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ struct task_struct *t = current;
+
+ /*
+ * If the task did not go through schedule or got the flag enforced
+ * by the rseq syscall or execve, then nothing to do here.
+ *
+ * CPU ID and MM CID can only change when going through a context
+ * switch.
+ *
+ * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+ * only when rseq_event::has_rseq is true. That conditional is
+ * required to avoid setting the TIF bit if RSEQ is not registered
+ * for a task. rseq_event::sched_switch is cleared when RSEQ is
+ * unregistered by a task so it's sufficient to check for the
+ * sched_switch bit alone.
+ *
+ * A sane compiler requires three instructions for the nothing to do
+ * case including clearing the events, but your mileage might vary.
+ */
+ if (unlikely((t->rseq.event.sched_switch))) {
+ rseq_stat_inc(rseq_stats.fastpath);
+
+ if (unlikely(!rseq_exit_user_update(regs, t)))
+ return true;
+ }
+ /* Clear state so next entry starts from a clean slate */
+ t->rseq.event.events = 0;
+ return false;
+}
+
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ return false;
+}
+
+#else /* CONFIG_GENERIC_ENTRY */
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
+#endif /* !CONFIG_GENERIC_ENTRY */
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -421,9 +545,12 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
if (static_branch_unlikely(&rseq_debug_enabled))
__rseq_debug_syscall_return(regs);
}
-
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ return false;
+}
static inline void rseq_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index a1389ff..9c7a341 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,8 @@ struct rseq;
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
+ * is required
*
* @sched_switch and @ids_changed must be adjacent and the combo must be
* 16bit aligned to allow a single store, when both are set at the same
@@ -42,6 +44,7 @@ struct rseq_event {
u16 error;
struct {
u8 fatal;
+ u8 slowpath;
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 183dde7..c5d6336 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
seq_printf(m, "exit: %16lu\n", stats.exit);
seq_printf(m, "signal: %16lu\n", stats.signal);
seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "fastp: %16lu\n", stats.fastpath);
seq_printf(m, "ids: %16lu\n", stats.ids);
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Optimize event setting
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
2025-10-28 15:57 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 472a7fb937b82ed1dbafb48f1ae46acac123eba6
Gitweb: https://git.kernel.org/tip/472a7fb937b82ed1dbafb48f1ae46acac123eba6
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:14 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:20 +01:00
rseq: Optimize event setting
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add an event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 81 +++++++++++++++++++++++++++++++++----
include/linux/rseq_types.h | 11 ++++-
kernel/rseq.c | 2 +-
kernel/sched/core.c | 7 ++-
kernel/sched/sched.h | 5 +-
6 files changed, 95 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index e45b298..90e47eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_sched_switch_event(current);
+ rseq_force_update();
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index f5a4318..abfbeb4 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -11,7 +11,8 @@ void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq)
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
__rseq_handle_notify_resume(regs);
}
@@ -33,12 +34,75 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
}
}
-/* Raised from context switch and exevce to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- if (t->rseq.event.has_rseq) {
- t->rseq.event.sched_switch = true;
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+ struct rseq_event *ev = &t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /*
+ * Avoid a boat load of conditionals by using simple logic
+ * to determine whether NOTIFY_RESUME needs to be raised.
+ *
+ * It's required when the CPU or MM CID has changed or
+ * the entry was from user space.
+ */
+ bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+ if (raise) {
+ ev->sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ } else {
+ if (ev->has_rseq) {
+ t->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ }
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+ t->rseq.event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+ /*
+ * Requires a comparison as the switch_mm_cid() code does not
+ * provide a conditional for it readily. So avoid excessive updates
+ * when nothing changes.
+ */
+ if (t->rseq.ids.mm_cid != cid)
+ t->rseq.event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.ids_changed = true;
+ current->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(current);
}
}
@@ -55,7 +119,7 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ rseq_raise_notify_resume(current);
}
static inline void rseq_reset(struct task_struct *t)
@@ -91,6 +155,9 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 7c12394..a1389ff 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -11,20 +11,27 @@ struct rseq;
* struct rseq_event - Storage for rseq related event management
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
- * @sched_switch: True if the task was scheduled out
+ * @sched_switch: True if the task was scheduled and needs update on
+ * exit to user
+ * @ids_changed: Indicator that IDs need to be updated
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
*/
struct rseq_event {
union {
u64 all;
struct {
union {
- u16 events;
+ u32 events;
struct {
u8 sched_switch;
+ u8 ids_changed;
u8 user_irq;
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 148fb21..183dde7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,7 +464,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* are updated before returning to user-space.
*/
current->rseq.event.has_rseq = true;
- rseq_sched_switch_event(current);
+ rseq_force_update();
return 0;
efault:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b75e8e1..579a8e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5118,7 +5118,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5316,6 +5315,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
+ /*
+ * Tell rseq that the task was scheduled in. Must be after
+ * switch_mm_cid() to get the TIF flag set.
+ */
+ rseq_sched_switch_event(next);
+
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3..4838dda 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2209,6 +2209,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
smp_wmb();
WRITE_ONCE(task_thread_info(p)->cpu, cpu);
p->wake_cpu = cpu;
+ rseq_sched_set_task_cpu(p, cpu);
#endif /* CONFIG_SMP */
}
@@ -3807,8 +3808,10 @@ static inline void switch_mm_cid(struct rq *rq,
mm_cid_put_lazy(prev);
prev->mm_cid = -1;
}
- if (next->mm_cid_active)
+ if (next->mm_cid_active) {
next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+ rseq_sched_set_task_mm_cid(next, next->mm_cid);
+ }
}
#else /* !CONFIG_SCHED_MM_CID: */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Rework the TIF_NOTIFY handler
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 0dbdcc20a5c80e1d540f004b49d550c60b19fb5d
Gitweb: https://git.kernel.org/tip/0dbdcc20a5c80e1d540f004b49d550c60b19fb5d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:12 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:20 +01:00
rseq: Rework the TIF_NOTIFY handler
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.
Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.
The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.
This problem has been addressed differently, so that it's not longer
required to do that update before entering the guest.
That might be considered a user visible change, when the hosts thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
---
include/linux/rseq_entry.h | 29 ++++++++++++++-
kernel/rseq.c | 76 ++++++++++++++++---------------------
2 files changed, 62 insertions(+), 43 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 37444e8..aa1c046 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -368,6 +368,35 @@ efault:
return false;
}
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+ struct rseq_ids *ids, u32 node_id)
+{
+ u64 csaddr;
+
+ if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+ return false;
+
+ /*
+ * On architectures which utilize the generic entry code this
+ * allows to skip the critical section when the entry was not from
+ * a user space interrupt, unless debug mode is enabled.
+ */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ if (!static_branch_unlikely(&rseq_debug_enabled)) {
+ if (likely(!t->rseq.event.user_irq))
+ return true;
+ }
+ }
+ if (likely(!csaddr))
+ return true;
+ /* Sigh, this really needs to do work */
+ return rseq_update_user_cs(t, regs, csaddr);
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 13faadc..148fb21 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -239,38 +233,15 @@ efault:
return false;
}
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
+ /* Preserve rseq state and user_irq state for exit to user */
+ const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- /*
- * If invoked from hypervisors before entering the guest via
- * resume_user_mode_work(), then @regs is a NULL pointer.
- *
- * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
- * it before returning from the ioctl() to user space when
- * rseq_event.sched_switch is set.
- *
- * So it's safe to ignore here instead of pointlessly updating it
- * in the vcpu_run() loop.
- */
- if (!regs)
- return;
-
if (unlikely(t->flags & PF_EXITING))
return;
@@ -294,26 +265,45 @@ void __rseq_handle_notify_resume(struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- scoped_guard(RSEQ_EVENT_GUARD) {
+ scoped_guard(irq) {
event = t->rseq.event.sched_switch;
- t->rseq.event.sched_switch = false;
+ t->rseq.event.all &= evt_mask.all;
ids.cpu_id = task_cpu(t);
ids.mm_cid = task_mm_cid(t);
}
- if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ if (!event)
return;
- if (!rseq_handle_cs(t, regs))
- goto error;
-
node_id = cpu_to_node(ids.cpu_id);
- if (!rseq_set_ids(t, &ids, node_id))
- goto error;
- return;
-error:
- force_sig(SIGSEGV);
+ if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+ /*
+ * Clear the errors just in case this might survive magically, but
+ * leave the rest intact.
+ */
+ t->rseq.event.error = 0;
+ force_sig(SIGSEGV);
+ }
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
+
+ rseq_slowpath_update_usr(regs);
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Separate the signal delivery path
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
2025-10-28 15:51 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: f1cad05fc4a1d00251c06d80c71b2b0d758c3346
Gitweb: https://git.kernel.org/tip/f1cad05fc4a1d00251c06d80c71b2b0d758c3346
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:10 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:20 +01:00
rseq: Separate the signal delivery path
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.
The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.
No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.
The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
---
include/linux/rseq.h | 21 ++++++++++++++++-----
kernel/rseq.c | 30 ++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 92f9cd4..f5a4318 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,22 +7,33 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(NULL, regs);
+ __rseq_handle_notify_resume(regs);
}
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq) {
- current->rseq.event.sched_switch = true;
- __rseq_handle_notify_resume(ksig, regs);
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+ __rseq_signal_deliver(ksig->sig, regs);
+ } else {
+ if (current->rseq.event.has_rseq)
+ __rseq_signal_deliver(ksig->sig, regs);
}
}
+/* Raised from context switch and exevce to force evaluation on exit to user */
static inline void rseq_sched_switch_event(struct task_struct *t)
{
if (t->rseq.event.has_rseq) {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 1e4f1d2..13faadc 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -250,13 +250,12 @@ efault:
* respect to other threads scheduled on the same CPU, and with respect
* to signal handlers.
*/
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
{
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -275,10 +274,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
- if (ksig)
- rseq_stat_inc(rseq_stats.signal);
- else
- rseq_stat_inc(rseq_stats.slowpath);
+ rseq_stat_inc(rseq_stats.slowpath);
/*
* Read and clear the event pending bit first. If the task
@@ -317,8 +313,26 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
error:
- sig = ksig ? ksig->sig : 0;
- force_sigsegv(sig);
+ force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+ rseq_stat_inc(rseq_stats.signal);
+ /*
+ * Don't update IDs, they are handled on exit to user if
+ * necessary. The important thing is to abort a critical section of
+ * the interrupted context as after this point the instruction
+ * pointer in @regs points to the signal handler.
+ */
+ if (unlikely(!rseq_handle_cs(current, regs))) {
+ /*
+ * Clear the errors just in case this might survive
+ * magically, but leave the rest intact.
+ */
+ current->rseq.event.error = 0;
+ force_sigsegv(sig);
+ }
}
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_set_ids()
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
2025-10-28 15:47 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5371d55ceec76a7e172d8c5187c0ae7e70784cc7
Gitweb: https://git.kernel.org/tip/5371d55ceec76a7e172d8c5187c0ae7e70784cc7
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:08 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:19 +01:00
rseq: Provide and use rseq_set_ids()
Provide a new and straight forward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.
It does all operations in one scoped_user_rw_access() section and retrieves
also the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.
On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.
Use it to replace the whole related zoo in rseq.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
---
fs/binfmt_elf.c | 2 +-
include/linux/rseq.h | 16 +-
include/linux/rseq_entry.h | 89 ++++++++++++++-
include/linux/sched.h | 10 +--
kernel/rseq.c | 236 +++++++-----------------------------
5 files changed, 151 insertions(+), 202 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e4653bb..3eb734c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
#include <linux/cred.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 7f347c3..92f9cd4 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,6 +5,8 @@
#ifdef CONFIG_RSEQ
#include <linux/sched.h>
+#include <uapi/linux/rseq.h>
+
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -48,7 +50,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
- t->rseq.ids.cpu_cid = ~0ULL;
+ t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
}
static inline void rseq_execve(struct task_struct *t)
@@ -59,15 +61,19 @@ static inline void rseq_execve(struct task_struct *t)
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
+ *
+ * On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault
+ * on the COW page on exit to user space, when the child stays on the same
+ * CPU as the parent. That's obviously not guaranteed, but in overcommit
+ * scenarios it is more likely and optimizes for the fork/exec case without
+ * taking the fault.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
+ if (clone_flags & CLONE_VM)
rseq_reset(t);
- } else {
+ else
t->rseq = current->rseq;
- t->rseq.ids.cpu_cid = ~0ULL;
- }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index fb53a6f..37444e8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
#endif
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_ids(struct task_struct *t);
static __always_inline void rseq_note_user_irq_entry(void)
{
@@ -194,6 +195,43 @@ efault:
return false;
}
+/*
+ * On debug kernels validate that user space did not mess with it if the
+ * debug branch is enabled.
+ */
+bool rseq_debug_validate_ids(struct task_struct *t)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+ u32 cpu_id, uval, node_id;
+
+ /*
+ * On the first exit after registering the rseq region CPU ID is
+ * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+ */
+ node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+ cpu_to_node(t->rseq.ids.cpu_id) : 0;
+
+ scoped_user_read_access(rseq, efault) {
+ unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+ if (cpu_id != t->rseq.ids.cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->cpu_id, efault);
+ if (uval != cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->node_id, efault);
+ if (uval != node_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->mm_cid, efault);
+ if (uval != t->rseq.ids.mm_cid)
+ goto die;
+ }
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
#endif /* RSEQ_BUILD_SLOW_PATH */
/*
@@ -279,6 +317,57 @@ efault:
return false;
}
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows to put the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+ u32 node_id, u64 *csaddr)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+
+ if (static_branch_unlikely(&rseq_debug_enabled)) {
+ if (!rseq_debug_validate_ids(t))
+ return false;
+ }
+
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+ unsafe_put_user(node_id, &rseq->node_id, efault);
+ unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+ if (csaddr)
+ unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+ }
+
+ /* Cache the new values */
+ t->rseq.ids.cpu_cid = ids->cpu_cid;
+ rseq_stat_inc(rseq_stats.ids);
+ rseq_trace_update(t, ids);
+ return true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24a9da7..e47abc8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
#include <linux/rseq_types.h>
-#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
@@ -1408,15 +1407,6 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
struct rseq_data rseq;
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * This is a place holder to save a copy of the rseq fields for
- * validation of read-only fields. The struct rseq has a
- * variable-length array at the end, so it cannot be used
- * directly. Reserve a size large enough for the known fields.
- */
- char rseq_fields[sizeof(struct rseq)];
-#endif
#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9763155..1e4f1d2 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
# define RSEQ_EVENT_GUARD preempt
#endif
-/* The original rseq structure size (including padding) is 32 bytes. */
-#define ORIG_RSEQ_SIZE 32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void)
__initcall(rseq_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
- return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- static DEFINE_RATELIMIT_STATE(_rs,
- DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
- u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq.usrptr;
-
- /*
- * Validate fields which are required to be read-only by
- * user-space.
- */
- if (!user_read_access_begin(rseq, t->rseq.len))
- goto efault;
- unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
- unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
- unsafe_get_user(node_id, &rseq->node_id, efault_end);
- unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
- user_read_access_end();
-
- if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
- cpu_id != rseq_kernel_fields(t)->cpu_id ||
- node_id != rseq_kernel_fields(t)->node_id ||
- mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
- pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
- "\tcpu_id_start: %u ?= %u\n"
- "\tcpu_id: %u ?= %u\n"
- "\tnode_id: %u ?= %u\n"
- "\tmm_cid: %u ?= %u\n",
- t->pid, t->comm,
- cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
- cpu_id, rseq_kernel_fields(t)->cpu_id,
- node_id, rseq_kernel_fields(t)->node_id,
- mm_cid, rseq_kernel_fields(t)->mm_cid);
- }
-
- /* For now, only print a console warning on mismatch. */
- return 0;
-
-efault_end:
- user_read_access_end();
-efault:
- return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
- } while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id = raw_smp_processor_id();
- u32 node_id = cpu_to_node(cpu_id);
- u32 mm_cid = task_mm_cid(t);
-
- rseq_stat_inc(rseq_stats.ids);
-
- /* Validate read-only rseq fields on debug kernels */
- if (rseq_validate_ro_fields(t))
- goto efault;
- WARN_ON_ONCE((int) mm_cid < 0);
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /* Cache the user space values */
- t->rseq.ids.cpu_id = cpu_id;
- t->rseq.ids.mm_cid = mm_cid;
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally updated only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- trace_rseq_update(t);
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
- mm_cid = 0;
-
- /*
- * Validate read-only rseq fields.
- */
- if (rseq_validate_ro_fields(t))
- goto efault;
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- /*
- * Reset all fields to their initial state.
- *
- * All fields have an initial state of 0 except cpu_id which is set to
- * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
- * unregistration can figure out that rseq needs to be registered
- * again.
- */
- rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally reset only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
+ return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
}
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -410,6 +253,8 @@ efault:
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
+ struct rseq_ids ids;
+ u32 node_id;
bool event;
int sig;
@@ -456,6 +301,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
scoped_guard(RSEQ_EVENT_GUARD) {
event = t->rseq.event.sched_switch;
t->rseq.event.sched_switch = false;
+ ids.cpu_id = task_cpu(t);
+ ids.mm_cid = task_mm_cid(t);
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -464,7 +311,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!rseq_handle_cs(t, regs))
goto error;
- if (unlikely(rseq_update_cpu_node_id(t)))
+ node_id = cpu_to_node(ids.cpu_id);
+ if (!rseq_set_ids(t, &ids, node_id))
goto error;
return;
@@ -504,13 +352,33 @@ void rseq_syscall(struct pt_regs *regs)
}
#endif
+static bool rseq_reset_ids(void)
+{
+ struct rseq_ids ids = {
+ .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+ .mm_cid = 0,
+ };
+
+ /*
+ * If this fails, terminate it because this leaves the kernel in
+ * stupid state as exit to user space will try to fixup the ids
+ * again.
+ */
+ if (rseq_set_ids(current, &ids, 0))
+ return true;
+
+ force_sig(SIGSEGV);
+ return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE 32
+
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
- int ret;
-
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -521,9 +389,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
return -EINVAL;
if (current->rseq.sig != sig)
return -EPERM;
- ret = rseq_reset_rseq_cpu_node_id(current);
- if (ret)
- return ret;
+ if (!rseq_reset_ids())
+ return -EFAULT;
rseq_reset(current);
return 0;
}
@@ -563,27 +430,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- /*
- * If the rseq_cs pointer is non-NULL on registration, clear it to
- * avoid a potential segfault on return to user-space. The proper thing
- * to do would have been to fail the registration but this would break
- * older libcs that reuse the rseq area for new threads without
- * clearing the fields. Don't bother reading it, just reset it.
- */
- if (put_user(0UL, &rseq->rseq_cs))
- return -EFAULT;
+ scoped_user_write_access(rseq, efault) {
+ /*
+ * If the rseq_cs pointer is non-NULL on registration, clear it to
+ * avoid a potential segfault on return to user-space. The proper thing
+ * to do would have been to fail the registration but this would break
+ * older libcs that reuse the rseq area for new threads without
+ * clearing the fields. Don't bother reading it, just reset it.
+ */
+ unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ /* Initialize IDs in user space */
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+ unsafe_put_user(0U, &rseq->node_id, efault);
+ unsafe_put_user(0U, &rseq->mm_cid, efault);
+ }
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * Initialize the in-kernel rseq fields copy for validation of
- * read-only fields.
- */
- if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
- get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
- get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
- get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
- return -EFAULT;
-#endif
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
@@ -599,6 +461,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
*/
current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
-
return 0;
+
+efault:
+ return -EFAULT;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 457dc0a3ef5b3cd51c1497638feccb4db5e7943c
Gitweb: https://git.kernel.org/tip/457dc0a3ef5b3cd51c1497638feccb4db5e7943c
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:05 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:19 +01:00
rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de
---
include/linux/entry-common.h | 2 +-
include/linux/rseq_entry.h | 9 +++++++++
kernel/rseq.c | 10 ++++++++--
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 75b194c..d967184 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -146,7 +146,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
local_irq_enable();
}
- rseq_syscall(regs);
+ rseq_debug_syscall_return(regs);
/*
* Do one-time syscall specific work. If these work items are
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5bdcf5b..fb53a6f 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -296,9 +296,18 @@ static __always_inline void rseq_exit_to_user_mode(void)
ev->events = 0;
}
+void __rseq_debug_syscall_return(struct pt_regs *regs);
+
+static inline void rseq_debug_syscall_return(struct pt_regs *regs)
+{
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ __rseq_debug_syscall_return(regs);
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index abd6bfa..9763155 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -473,12 +473,11 @@ error:
force_sigsegv(sig);
}
-#ifdef CONFIG_DEBUG_RSEQ
/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
-void rseq_syscall(struct pt_regs *regs)
+void __rseq_debug_syscall_return(struct pt_regs *regs)
{
struct task_struct *t = current;
u64 csaddr;
@@ -496,6 +495,13 @@ void rseq_syscall(struct pt_regs *regs)
fail:
force_sig(SIGSEGV);
}
+
+#ifdef CONFIG_DEBUG_RSEQ
+/* Kept around to keep GENERIC_ENTRY=n architectures supported. */
+void rseq_syscall(struct pt_regs *regs)
+{
+ __rseq_debug_syscall_return(regs);
+}
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Make exit debugging static branch based
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 10ad7e96d88c286e018c4507add219bbef55bfcf
Gitweb: https://git.kernel.org/tip/10ad7e96d88c286e018c4507add219bbef55bfcf
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:02 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:19 +01:00
rseq: Make exit debugging static branch based
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.272660745@linutronix.de
---
include/linux/rseq_entry.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index f9510ce..5bdcf5b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -285,7 +285,7 @@ static __always_inline void rseq_exit_to_user_mode(void)
rseq_stat_inc(rseq_stats.exit);
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ if (static_branch_unlikely(&rseq_debug_enabled))
WARN_ON_ONCE(ev->sched_switch);
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Replace the original debug implementation
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-10-30 21:52 ` [patch V6 20/31] " Prakash Sangappa
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 502c6554c7c32040cd02ab1779a1eb1ffb371a2f
Gitweb: https://git.kernel.org/tip/502c6554c7c32040cd02ab1779a1eb1ffb371a2f
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:00 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:18 +01:00
rseq: Replace the original debug implementation
Just utilize the new infrastructure and put the original one to rest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de
---
kernel/rseq.c | 81 +++++++-------------------------------------------
1 file changed, 12 insertions(+), 69 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 12a9b6a..abd6bfa 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,84 +475,27 @@ error:
#ifdef CONFIG_DEBUG_RSEQ
/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq __user *urseq = t->rseq.usrptr;
- struct rseq_cs __user *urseq_cs;
- u32 __user *usig;
- u64 ptr;
- u32 sig;
- int ret;
-
- if (get_user(ptr, &rseq->rseq_cs))
- return -EFAULT;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
void rseq_syscall(struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
- struct rseq_cs rseq_cs;
+ u64 csaddr;
- if (!t->rseq.usrptr)
+ if (!t->rseq.event.has_rseq)
return;
- if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
- force_sig(SIGSEGV);
+ if (get_user(csaddr, &t->rseq.usrptr->rseq_cs))
+ goto fail;
+ if (likely(!csaddr))
+ return;
+ if (unlikely(csaddr >= TASK_SIZE))
+ goto fail;
+ if (rseq_debug_update_user_cs(t, regs, csaddr))
+ return;
+fail:
+ force_sig(SIGSEGV);
}
-
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
` (2 preceding siblings ...)
2025-10-29 16:04 ` [patch V6 19/31] " Steven Rostedt
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 693b9917c8b1879bf954b5cbb72da1bd07466282
Gitweb: https://git.kernel.org/tip/693b9917c8b1879bf954b5cbb72da1bd07466282
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:57 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:18 +01:00
rseq: Provide and use rseq_update_user_cs()
Provide a straight forward implementation to check for and eventually
clear/fixup critical sections in user space.
The non-debug version does only the minimal sanity checks and aims for
efficiency.
There are two attack vectors, which are checked for:
1) An abort IP which is in the kernel address space. That would cause at
least x86 to return to kernel space via IRET.
2) A rogue critical section descriptor with an abort IP pointing to some
arbitrary address, which is not preceded by the RSEQ signature.
If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernels problem.
The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.
Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
---
include/linux/rseq_entry.h | 206 ++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 11 +-
kernel/rseq.c | 246 ++++++++++--------------------------
3 files changed, 291 insertions(+), 172 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ed8e5f8..f9510ce 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
current->rseq.event.user_irq = true;
}
+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ * - If the critical section is invalid, terminate the task.
+ *
+ * - If valid and the instruction pointer is inside, set it to the abort IP.
+ *
+ * - If valid and the instruction pointer is outside, clear the critical
+ * section address.
+ *
+ * Returns true, if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when a invalid
+ * section was found. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
+ unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+ unsigned long ip = instruction_pointer(regs);
+ u64 __user *uc_head = (u64 __user *) ucs;
+ u32 usig, __user *uc_sig;
+
+ scoped_user_rw_access(ucs, efault) {
+ /*
+ * Evaluate the user pile and exit if one of the conditions
+ * is not fulfilled.
+ */
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ if (unlikely(start_ip >= tasksize))
+ goto die;
+ /* If outside, just clear the critical section. */
+ if (ip < start_ip)
+ goto clear;
+
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ cs_end = start_ip + offset;
+ /* Check for overflow and wraparound */
+ if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+ goto die;
+
+ /* If not inside, clear it. */
+ if (ip >= cs_end)
+ goto clear;
+
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+ /* Ensure it's "valid" */
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+ goto die;
+ /* Validate that the abort IP is not in the critical section */
+ if (unlikely(abort_ip - start_ip < offset))
+ goto die;
+
+ /*
+ * Check version and flags for 0. No point in emitting
+ * deprecated warnings before dying. That could be done in
+ * the slow path eventually, but *shrug*.
+ */
+ unsafe_get_user(head, uc_head, efault);
+ if (unlikely(head))
+ goto die;
+
+ /* abort_ip - 4 is >= 0. See abort_ip check above */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* If not in interrupt from user context, let it die */
+ if (unlikely(!t->rseq.event.user_irq))
+ goto die;
+ }
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
+#endif /* RSEQ_BUILD_SLOW_PATH */
+
+/*
+ * This only ensures that abort_ip is in the user address space and
+ * validates that it is preceded by the signature.
+ *
+ * No other sanity checks are done here, that's what the debug code is for.
+ */
+static rseq_inline bool
+rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ unsigned long ip = instruction_pointer(regs);
+ u64 start_ip, abort_ip, offset;
+ u32 usig, __user *uc_sig;
+
+ rseq_stat_inc(rseq_stats.cs);
+
+ if (unlikely(csaddr >= TASK_SIZE)) {
+ t->rseq.event.fatal = true;
+ return false;
+ }
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ return rseq_debug_update_user_cs(t, regs, csaddr);
+
+ scoped_user_rw_access(ucs, efault) {
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+
+ /*
+ * No sanity checks. If user space screwed it up, it can
+ * keep the pieces. That's what debug code is for.
+ *
+ * If outside, just clear the critical section.
+ */
+ if (ip - start_ip >= offset)
+ goto clear;
+
+ /*
+ * Two requirements for @abort_ip:
+ * - Must be in user space as x86 IRET would happily return to
+ * the kernel.
+ * - The four bytes preceding the instruction at @abort_ip must
+ * contain the signature.
+ *
+ * The latter protects against the following attack vector:
+ *
+ * An attacker with limited abilities to write, creates a critical
+ * section descriptor, sets the abort IP to a library function or
+ * some other ROP gadget and stores the address of the descriptor
+ * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+ * protection.
+ */
+ if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ goto die;
+
+ /* The address is guaranteed to be >= 0 and < TASK_SIZE */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* Invalidate the critical section */
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ /* Update the instruction pointer */
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 80f6c39..7c12394 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -14,10 +14,12 @@ struct rseq;
* @sched_switch: True if the task was scheduled out
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
+ * @error: Compound error code for the slow path to analyze
+ * @fatal: User space data corrupted or invalid
*/
struct rseq_event {
union {
- u32 all;
+ u64 all;
struct {
union {
u16 events;
@@ -28,6 +30,13 @@ struct rseq_event {
};
u8 has_rseq;
+ u8 __pad;
+ union {
+ u16 error;
+ struct {
+ u8 fatal;
+ };
+ };
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 679ab8e..12a9b6a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -382,175 +382,18 @@ efault:
return -EFAULT;
}
-/*
- * Get the user-space pointer value stored in the 'rseq_cs' field.
- */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
-{
- if (!rseq_cs)
- return -EFAULT;
-
-#ifdef CONFIG_64BIT
- if (get_user(*rseq_cs, &rseq->rseq_cs))
- return -EFAULT;
-#else
- if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-#endif
-
- return 0;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq_cs __user *urseq_cs;
- u64 ptr;
- u32 __user *usig;
- u32 sig;
- int ret;
-
- ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
- if (ret)
- return ret;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-static bool rseq_warn_flags(const char *str, u32 flags)
-{
- u32 test_flags;
-
- if (!flags)
- return false;
- test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
- test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
- return true;
-}
-
-static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
-{
- u32 flags;
- int ret;
-
- if (rseq_warn_flags("rseq_cs", cs_flags))
- return -EINVAL;
-
- /* Get thread flags. */
- ret = get_user(flags, &t->rseq.usrptr->flags);
- if (ret)
- return ret;
-
- if (rseq_warn_flags("rseq", flags))
- return -EINVAL;
- return 0;
-}
-
-static int clear_rseq_cs(struct rseq __user *rseq)
-{
- /*
- * The rseq_cs field is set to NULL on preemption or signal
- * delivery on top of rseq assembly block, as well as on top
- * of code outside of the rseq assembly block. This performs
- * a lazy clear of the rseq_cs field.
- *
- * Set rseq_cs to NULL.
- */
-#ifdef CONFIG_64BIT
- return put_user(0UL, &rseq->rseq_cs);
-#else
- if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
- return -EFAULT;
- return 0;
-#endif
-}
-
-/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
+static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
- struct task_struct *t = current;
- struct rseq_cs rseq_cs;
- int ret;
-
- rseq_stat_inc(rseq_stats.cs);
-
- ret = rseq_get_rseq_cs(t, &rseq_cs);
- if (ret)
- return ret;
-
- /*
- * Handle potentially not being within a critical section.
- * If not nested over a rseq critical section, restart is useless.
- * Clear the rseq_cs pointer and return.
- */
- if (!in_rseq_cs(ip, &rseq_cs)) {
- rseq_stat_inc(rseq_stats.clear);
- return clear_rseq_cs(t->rseq.usrptr);
- }
- ret = rseq_check_flags(t, rseq_cs.flags);
- if (ret < 0)
- return ret;
- if (!abort)
- return 0;
- ret = clear_rseq_cs(t->rseq.usrptr);
- if (ret)
- return ret;
- rseq_stat_inc(rseq_stats.fixup);
- trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
- rseq_cs.abort_ip);
- instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
- return 0;
+ struct rseq __user *urseq = t->rseq.usrptr;
+ u64 csaddr;
+
+ scoped_user_read_access(urseq, efault)
+ unsafe_get_user(csaddr, &urseq->rseq_cs, efault);
+ if (likely(!csaddr))
+ return true;
+ return rseq_update_user_cs(t, regs, csaddr);
+efault:
+ return false;
}
/*
@@ -567,8 +410,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
- int ret, sig;
bool event;
+ int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -618,8 +461,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
return;
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
+ if (!rseq_handle_cs(t, regs))
goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
@@ -632,6 +474,68 @@ error:
}
#ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Unsigned comparison will be true when ip >= start_ip, and when
+ * ip < start_ip + post_commit_offset.
+ */
+static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
+{
+ return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
+}
+
+/*
+ * If the rseq_cs field of 'struct rseq' contains a valid pointer to
+ * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
+ */
+static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
+{
+ struct rseq __user *urseq = t->rseq.usrptr;
+ struct rseq_cs __user *urseq_cs;
+ u32 __user *usig;
+ u64 ptr;
+ u32 sig;
+ int ret;
+
+ if (get_user(ptr, &rseq->rseq_cs))
+ return -EFAULT;
+
+ /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
+ if (!ptr) {
+ memset(rseq_cs, 0, sizeof(*rseq_cs));
+ return 0;
+ }
+ /* Check that the pointer value fits in the user-space process space. */
+ if (ptr >= TASK_SIZE)
+ return -EINVAL;
+ urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+ if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
+ return -EFAULT;
+
+ if (rseq_cs->start_ip >= TASK_SIZE ||
+ rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
+ rseq_cs->abort_ip >= TASK_SIZE ||
+ rseq_cs->version > 0)
+ return -EINVAL;
+ /* Check for overflow. */
+ if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
+ return -EINVAL;
+ /* Ensure that abort_ip is not in the critical section. */
+ if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
+ return -EINVAL;
+
+ usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
+ ret = get_user(sig, usig);
+ if (ret)
+ return ret;
+
+ if (current->rseq.sig != sig) {
+ printk_ratelimited(KERN_WARNING
+ "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+ sig, current->rseq.sig, current->pid, usig);
+ return -EINVAL;
+ }
+ return 0;
+}
/*
* Terminate the process if a syscall is issued within a restartable
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide static branch for runtime debugging
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 01395f041c84fd0be161efbaba3a76dabc16125e
Gitweb: https://git.kernel.org/tip/01395f041c84fd0be161efbaba3a76dabc16125e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:55 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:18 +01:00
rseq: Provide static branch for runtime debugging
Config based debug is rarely turned on and is not available easily when
things go wrong.
Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de
---
Documentation/admin-guide/kernel-parameters.txt | 4 +-
include/linux/rseq_entry.h | 3 +-
init/Kconfig | 14 +++-
kernel/rseq.c | 73 +++++++++++++++-
4 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061..e638274 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6500,6 +6500,10 @@
Memory area to be used by remote processor image,
managed by CMA.
+ rseq_debug= [KNL] Enable or disable restartable sequence
+ debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+ Format: <bool>
+
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ff9080b..ed8e5f8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/tracepoint-defs.h>
@@ -64,6 +65,8 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip) { }
#endif /* !CONFIG_TRACEPOINT */
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/init/Kconfig b/init/Kconfig
index f39fdfb..bde40ab 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1925,10 +1925,24 @@ config RSEQ_STATS
If unsure, say N.
+config RSEQ_DEBUG_DEFAULT_ENABLE
+ default n
+ bool "Enable restartable sequences debug mode by default" if EXPERT
+ depends on RSEQ
+ help
+ This enables the static branch for debug mode of restartable
+ sequences.
+
+ This also can be controlled on the kernel command line via the
+ command line parameter "rseq_debug=0/1" and through debugfs.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
depends on RSEQ && DEBUG_KERNEL
+ select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c0dbe2e..679ab8e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -95,6 +95,27 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
+static inline void rseq_control_debug(bool on)
+{
+ if (on)
+ static_branch_enable(&rseq_debug_enabled);
+ else
+ static_branch_disable(&rseq_debug_enabled);
+}
+
+static int __init rseq_setup_debug(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+ rseq_control_debug(on);
+ return 1;
+}
+__setup("rseq_debug=", rseq_setup_debug);
+
#ifdef CONFIG_TRACEPOINTS
/*
* Out of line, so the actual update functions can be in a header to be
@@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_DEBUG_FS
#ifdef CONFIG_RSEQ_STATS
DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
-static int rseq_debug_show(struct seq_file *m, void *p)
+static int rseq_stats_show(struct seq_file *m, void *p)
{
struct rseq_stats stats = { };
unsigned int cpu;
@@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_file *m, void *p)
return 0;
}
+static int rseq_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_stats_show, inode->i_private);
+}
+
+static const struct file_operations stat_ops = {
+ .open = rseq_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_stats_init(struct dentry *root_dir)
+{
+ debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
+ return 0;
+}
+#else
+static inline void rseq_stats_init(struct dentry *root_dir) { }
+#endif /* CONFIG_RSEQ_STATS */
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ bool on = static_branch_unlikely(&rseq_debug_enabled);
+
+ seq_printf(m, "%d\n", on);
+ return 0;
+}
+
+static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ bool on;
+
+ if (kstrtobool_from_user(ubuf, count, &on))
+ return -EINVAL;
+
+ rseq_control_debug(on);
+ return count;
+}
+
static int rseq_debug_open(struct inode *inode, struct file *file)
{
return single_open(file, rseq_debug_show, inode->i_private);
}
-static const struct file_operations dfs_ops = {
+static const struct file_operations debug_ops = {
.open = rseq_debug_open,
.read = seq_read,
+ .write = rseq_debug_write,
.llseek = seq_lseek,
.release = single_release,
};
@@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void)
{
struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
- debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
+ rseq_stats_init(root_dir);
return 0;
}
__initcall(rseq_debugfs_init);
-#endif /* CONFIG_RSEQ_STATS */
+#endif /* CONFIG_DEBUG_FS */
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Expose lightweight statistics in debugfs
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5f58b34aae52d285674b5f99f6b1bd42e737e190
Gitweb: https://git.kernel.org/tip/5f58b34aae52d285674b5f99f6b1bd42e737e190
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:52 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:17 +01:00
rseq: Expose lightweight statistics in debugfs
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.
The debugfs readout provides a racy sum of all counters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
---
include/linux/rseq.h | 16 +-------
include/linux/rseq_entry.h | 49 +++++++++++++++++++++++-
init/Kconfig | 12 ++++++-
kernel/rseq.c | 79 +++++++++++++++++++++++++++++++++----
4 files changed, 133 insertions(+), 23 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index a200836..7f347c3 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -29,21 +29,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
}
}
-static __always_inline void rseq_exit_to_user_mode(void)
-{
- struct rseq_event *ev = ¤t->rseq.event;
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
- WARN_ON_ONCE(ev->sched_switch);
-
- /*
- * Ensure that event (especially user_irq) is cleared when the
- * interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
- */
- ev->events = 0;
-}
-
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
* which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
@@ -92,7 +77,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
-static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5be507a..ff9080b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -2,6 +2,37 @@
#ifndef _LINUX_RSEQ_ENTRY_H
#define _LINUX_RSEQ_ENTRY_H
+/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
+#ifdef CONFIG_RSEQ_STATS
+#include <linux/percpu.h>
+
+struct rseq_stats {
+ unsigned long exit;
+ unsigned long signal;
+ unsigned long slowpath;
+ unsigned long ids;
+ unsigned long cs;
+ unsigned long clear;
+ unsigned long fixup;
+};
+
+DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
+
+/*
+ * Slow path has interrupts and preemption enabled, but the fast path
+ * runs with interrupts disabled so there is no point in having the
+ * preemption checks implied in __this_cpu_inc() for every operation.
+ */
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_stat_inc(which) this_cpu_inc((which))
+#else
+#define rseq_stat_inc(which) raw_cpu_inc((which))
+#endif
+
+#else /* CONFIG_RSEQ_STATS */
+#define rseq_stat_inc(x) do { } while (0)
+#endif /* !CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
@@ -39,8 +70,26 @@ static __always_inline void rseq_note_user_irq_entry(void)
current->rseq.event.user_irq = true;
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad2..f39fdfb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_STATS
+ default n
+ bool "Enable lightweight statistics of restartable sequences" if EXPERT
+ depends on RSEQ && DEBUG_FS
+ help
+ Enable lightweight counters which expose information about the
+ frequency of RSEQ operations via debugfs. Mostly interesting for
+ kernel debugging or performance analysis. While lightweight it's
+ still adding code into the user/kernel mode transitions.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
diff --git a/kernel/rseq.c b/kernel/rseq.c
index f49d311..c0dbe2e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -67,12 +67,16 @@
* F1. <failure>
*/
+/* Required to select the proper per_cpu ops for rseq_stats_inc() */
+#define RSEQ_BUILD_SLOW_PATH
+
+#include <linux/debugfs.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq_entry.h>
#include <linux/sched.h>
-#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq_entry.h>
+#include <linux/uaccess.h>
#include <linux/types.h>
-#include <linux/ratelimit.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
@@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_RSEQ_STATS
+DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ struct rseq_stats stats = { };
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
+ stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
+ stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
+ stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
+ stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
+ stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ }
+
+ seq_printf(m, "exit: %16lu\n", stats.exit);
+ seq_printf(m, "signal: %16lu\n", stats.signal);
+ seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "ids: %16lu\n", stats.ids);
+ seq_printf(m, "cs: %16lu\n", stats.cs);
+ seq_printf(m, "clear: %16lu\n", stats.clear);
+ seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ return 0;
+}
+
+static int rseq_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = rseq_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_debugfs_init(void)
+{
+ struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
+
+ debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ return 0;
+}
+__initcall(rseq_debugfs_init);
+#endif /* CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
@@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
- /*
- * Validate read-only rseq fields.
- */
+ rseq_stat_inc(rseq_stats.ids);
+
+ /* Validate read-only rseq fields on debug kernels */
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
+
if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
@@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
struct rseq_cs rseq_cs;
int ret;
+ rseq_stat_inc(rseq_stats.cs);
+
ret = rseq_get_rseq_cs(t, &rseq_cs);
if (ret)
return ret;
@@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* If not nested over a rseq critical section, restart is useless.
* Clear the rseq_cs pointer and return.
*/
- if (!in_rseq_cs(ip, &rseq_cs))
+ if (!in_rseq_cs(ip, &rseq_cs)) {
+ rseq_stat_inc(rseq_stats.clear);
return clear_rseq_cs(t->rseq.usrptr);
+ }
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
@@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
+ rseq_stat_inc(rseq_stats.fixup);
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
rseq_cs.abort_ip);
instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
@@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
+ if (ksig)
+ rseq_stat_inc(rseq_stats.signal);
+ else
+ rseq_stat_inc(rseq_stats.slowpath);
+
/*
* Read and clear the event pending bit first. If the task
* was not preempted or migrated or a signal is on the way,
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide tracepoint wrappers for inline code
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 19eaf4863bb5f11740e4ce264683d33e9f651ae3
Gitweb: https://git.kernel.org/tip/19eaf4863bb5f11740e4ce264683d33e9f651ae3
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:50 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:17 +01:00
rseq: Provide tracepoint wrappers for inline code
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
---
include/linux/rseq_entry.h | 28 ++++++++++++++++++++++++++++
kernel/rseq.c | 19 ++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ce30e87..5be507a 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,34 @@
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
+#include <linux/tracepoint-defs.h>
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+ if (tracepoint_enabled(rseq_update) && ids)
+ __rseq_trace_update(t);
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ if (tracepoint_enabled(rseq_ip_fixup))
+ __rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINT */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINT */
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/kernel/rseq.c b/kernel/rseq.c
index ad1e7ce..f49d311 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -70,7 +70,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/types.h>
#include <linux/ratelimit.h>
#include <asm/ptrace.h>
@@ -91,6 +91,23 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+ trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Record interrupt from user space
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
2025-10-28 15:26 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 60cbf3a8e3b17637498dbe5a13c58008ecec09ba
Gitweb: https://git.kernel.org/tip/60cbf3a8e3b17637498dbe5a13c58008ecec09ba
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:48 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:17 +01:00
rseq: Record interrupt from user space
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.
If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.
This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.905067101@linutronix.de
---
include/linux/irq-entry-common.h | 3 ++-
include/linux/rseq.h | 16 +++++++++++-----
include/linux/rseq_entry.h | 18 ++++++++++++++++++
include/linux/rseq_types.h | 2 ++
4 files changed, 33 insertions(+), 6 deletions(-)
create mode 100644 include/linux/rseq_entry.h
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 83c9d84..cb31fb8 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
#include <linux/context_tracking.h>
#include <linux/kmsan.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
#include <linux/tick.h>
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user_mode(void)
static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
+ rseq_note_user_irq_entry();
}
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index d315a92..a200836 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
- current->rseq.event.events))
- current->rseq.event.events = 0;
- }
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
}
/*
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
new file mode 100644
index 0000000..ce30e87
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include <linux/rseq.h>
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+ current->rseq.event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 40901b0..80f6c39 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -12,6 +12,7 @@ struct rseq;
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
* @sched_switch: True if the task was scheduled out
+ * @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
*/
struct rseq_event {
@@ -22,6 +23,7 @@ struct rseq_event {
u16 events;
struct {
u8 sched_switch;
+ u8 user_irq;
};
};
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Cache CPU ID and MM CID values
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 30c6409e123ce98429798c34595227dbe0496c1c
Gitweb: https://git.kernel.org/tip/30c6409e123ce98429798c34595227dbe0496c1c
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:45 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:16 +01:00
rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
---
include/linux/rseq.h | 7 +++++--
include/linux/rseq_types.h | 21 +++++++++++++++++++++
include/trace/events/rseq.h | 4 ++--
kernel/rseq.c | 4 ++++
4 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index ab91b1e..d315a92 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -57,6 +57,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
+ t->rseq.ids.cpu_cid = ~0ULL;
}
static inline void rseq_execve(struct task_struct *t)
@@ -70,10 +71,12 @@ static inline void rseq_execve(struct task_struct *t)
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM)
+ if (clone_flags & CLONE_VM) {
rseq_reset(t);
- else
+ } else {
t->rseq = current->rseq;
+ t->rseq.ids.cpu_cid = ~0ULL;
+ }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index f7a60c8..40901b0 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -31,17 +31,38 @@ struct rseq_event {
};
/**
+ * struct rseq_ids - Cache for ids, which need to be updated
+ * @cpu_cid: Compound of @cpu_id and @mm_cid to make the
+ * compiler emit a single compare on 64-bit
+ * @cpu_id: The CPU ID which was written last to user space
+ * @mm_cid: The MM CID which was written last to user space
+ *
+ * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */
+struct rseq_ids {
+ union {
+ u64 cpu_cid;
+ struct {
+ u32 cpu_id;
+ u32 mm_cid;
+ };
+ };
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
+ * @ids: Storage for cached CPU ID and MM CID
*/
struct rseq_data {
struct rseq __user *usrptr;
u32 len;
u32 sig;
struct rseq_event event;
+ struct rseq_ids ids;
};
#else /* CONFIG_RSEQ */
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index 823b47d..ce85d65 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
),
TP_fast_assign(
- __entry->cpu_id = raw_smp_processor_id();
+ __entry->cpu_id = t->rseq.ids.cpu_id;
__entry->node_id = cpu_to_node(__entry->cpu_id);
- __entry->mm_cid = task_mm_cid(t);
+ __entry->mm_cid = t->rseq.ids.mm_cid;
),
TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
diff --git a/kernel/rseq.c b/kernel/rseq.c
index aae6266..ad1e7ce 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
+ /* Cache the user space values */
+ t->rseq.ids.cpu_id = cpu_id;
+ t->rseq.ids.mm_cid = mm_cid;
+
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] sched: Move MM CID related functions to sched.h
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 73b4efcabadc3278686af88b792bee8355f3f09e
Gitweb: https://git.kernel.org/tip/73b4efcabadc3278686af88b792bee8355f3f09e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:42 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:16 +01:00
sched: Move MM CID related functions to sched.h
There is nothing mm specific in that and including mm.h can cause header
recursion hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.778457951@linutronix.de
---
include/linux/mm.h | 25 -------------------------
include/linux/sched.h | 26 ++++++++++++++++++++++++++
2 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33b..17cfbba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,31 +2401,6 @@ struct zap_details {
/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
- return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
- /*
- * Use the processor id as a fall-back when the mm cid feature is
- * disabled. This provides functional per-cpu data structure accesses
- * in user-space, althrough it won't provide the memory usage benefits.
- */
- return raw_smp_processor_id();
-}
-#endif
-
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1562776..24a9da7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2310,6 +2310,32 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+ return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm cid feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, althrough it won't provide the memory usage benefits.
+ */
+ return task_cpu(t);
+}
+#endif
+
#ifndef MODULE
#ifndef COMPILE_OFFSETS
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Inline irqentry_enter/exit_from/to_user_mode()
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 0df6b9bcdd0f3536b4bb7b2dc828e75f07fcface
Gitweb: https://git.kernel.org/tip/0df6b9bcdd0f3536b4bb7b2dc828e75f07fcface
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:40 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:16 +01:00
entry: Inline irqentry_enter/exit_from/to_user_mode()
There is no point to have this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de
---
include/linux/irq-entry-common.h | 13 +++++++++++--
kernel/entry/common.c | 13 -------------
2 files changed, 11 insertions(+), 15 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 9b1f386..83c9d84 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -278,7 +278,10 @@ static __always_inline void exit_to_user_mode(void)
*
* The function establishes state (lockdep, RCU (context tracking), tracing)
*/
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+}
/**
* irqentry_exit_to_user_mode - Interrupt exit work
@@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
* Interrupt exit is not invoking #1 which is the syscall specific one time
* work.
*/
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+ instrumentation_begin();
+ exit_to_user_mode_prepare(regs);
+ instrumentation_end();
+ exit_to_user_mode();
+}
#ifndef irqentry_state
/**
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f62e1d1..70a16db 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -62,19 +62,6 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
-}
-
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
-{
- instrumentation_begin();
- exit_to_user_mode_prepare(regs);
- instrumentation_end();
- exit_to_user_mode();
-}
-
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Remove syscall_enter_from_user_mode_prepare()
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5e7be1e23bd119d8403cbe3110cfe2ce40306ffe
Gitweb: https://git.kernel.org/tip/5e7be1e23bd119d8403cbe3110cfe2ce40306ffe
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:38 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:16 +01:00
entry: Remove syscall_enter_from_user_mode_prepare()
Open code the only user in the x86 syscall code and reduce the zoo of
functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
---
arch/x86/entry/syscall_32.c | 3 ++-
include/linux/entry-common.h | 26 +++++---------------------
kernel/entry/syscall-common.c | 8 --------
3 files changed, 7 insertions(+), 30 deletions(-)
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index 2b15ea1..a67a644 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
* fetch EBP before invoking any of the syscall entry work
* functions.
*/
- syscall_enter_from_user_mode_prepare(regs);
+ enter_from_user_mode(regs);
instrumentation_begin();
+ local_irq_enable();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c585221..75b194c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -45,23 +45,6 @@
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
ARCH_SYSCALL_WORK_EXIT)
-/**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs: Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
@@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
* @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after invoking enter_from_user_mode(), enabling interrupts and
+ * extra architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -108,8 +91,9 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
* function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented.
*
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
+ * This is the combination of enter_from_user_mode() and
+ * syscall_enter_from_user_mode_work() to be used when there is no
+ * architecture specific work to be done between the two.
*
* Returns: The original or a modified syscall number. See
* syscall_enter_from_user_mode_work() for further explanation.
diff --git a/kernel/entry/syscall-common.c b/kernel/entry/syscall-common.c
index 66e6ba7..940a597 100644
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret ? : syscall;
}
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
- instrumentation_begin();
- local_irq_enable();
- instrumentation_end();
-}
-
/*
* If SYSCALL_EMU is set, then the only reason to report is when
* SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Cleanup header
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` [tip: core/rseq] entry: Clean up header tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: d73c292f40b4ebb8369f5d0273b4dd5fc229b707
Gitweb: https://git.kernel.org/tip/d73c292f40b4ebb8369f5d0273b4dd5fc229b707
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:36 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:15 +01:00
entry: Cleanup header
Cleanup the include ordering, kernel-doc and other trivialities before
making further changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.590338411@linutronix.de
---
include/linux/entry-common.h | 8 ++++----
include/linux/irq-entry-common.h | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 7177436..c585221 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
#define __LINUX_ENTRYCOMMON_H
#include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
#include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -37,6 +37,7 @@
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
+
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
@@ -61,8 +62,7 @@
*/
void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
* syscall_enter_from_user_mode_work - Check and handle work before invoking
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index e5941df..9b1f386 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_eqs(void) { return false; }
/**
* enter_from_user_mode - Establish state when coming from user mode
+ * @regs: Pointer to currents pt_regs
*
* Syscall/interrupt entry disables interrupts, but user mode is traced as
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Introduce struct rseq_data
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 45c4a917848e90a269d1e1839a9a5fad15333771
Gitweb: https://git.kernel.org/tip/45c4a917848e90a269d1e1839a9a5fad15333771
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:33 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:15 +01:00
rseq: Introduce struct rseq_data
In preparation for a major rewrite of this code, provide a data structure
for rseq management.
Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.
Create a storage struct for event management as well and put the
sched_switch event and a indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.
The indicators are explicitly not a bit field. Bit fields generate abysmal
code.
The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.
The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.
This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
---
include/linux/rseq.h | 48 ++++++++++++----------------
include/linux/rseq_types.h | 51 ++++++++++++++++++++++++++++++-
include/linux/sched.h | 14 +-------
kernel/ptrace.c | 6 ++--
kernel/rseq.c | 63 ++++++++++++++++++-------------------
5 files changed, 110 insertions(+), 72 deletions(-)
create mode 100644 include/linux/rseq_types.h
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index c6267f7..ab91b1e 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq)
+ if (current->rseq.event.has_rseq)
__rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq) {
- current->rseq_event_pending = true;
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.sched_switch = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
static inline void rseq_sched_switch_event(struct task_struct *t)
{
- if (t->rseq) {
- t->rseq_event_pending = true;
+ if (t->rseq.event.has_rseq) {
+ t->rseq.event.sched_switch = true;
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}
}
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
- current->rseq_event_pending = false;
+ if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
+ current->rseq.event.events))
+ current->rseq.event.events = 0;
}
}
@@ -49,35 +50,30 @@ static __always_inline void rseq_exit_to_user_mode(void)
*/
static inline void rseq_virt_userspace_exit(void)
{
- if (current->rseq_event_pending)
+ if (current->rseq.event.sched_switch)
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}
+static inline void rseq_reset(struct task_struct *t)
+{
+ memset(&t->rseq, 0, sizeof(t->rseq));
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ rseq_reset(t);
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
- } else {
+ if (clone_flags & CLONE_VM)
+ rseq_reset(t);
+ else
t->rseq = current->rseq;
- t->rseq_len = current->rseq_len;
- t->rseq_sig = current->rseq_sig;
- t->rseq_event_pending = current->rseq_event_pending;
- }
-}
-
-static inline void rseq_execve(struct task_struct *t)
-{
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
new file mode 100644
index 0000000..f7a60c8
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_RSEQ
+struct rseq;
+
+/**
+ * struct rseq_event - Storage for rseq related event management
+ * @all: Compound to initialize and clear the data efficiently
+ * @events: Compound to access events with a single load/store
+ * @sched_switch: True if the task was scheduled out
+ * @has_rseq: True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+ union {
+ u32 all;
+ struct {
+ union {
+ u16 events;
+ struct {
+ u8 sched_switch;
+ };
+ };
+
+ u8 has_rseq;
+ };
+ };
+};
+
+/**
+ * struct rseq_data - Storage for all rseq related data
+ * @usrptr: Pointer to the registered user space RSEQ memory
+ * @len: Length of the RSEQ region
+ * @sig: Signature of critial section abort IPs
+ * @event: Storage for event management
+ */
+struct rseq_data {
+ struct rseq __user *usrptr;
+ u32 len;
+ u32 sig;
+ struct rseq_event event;
+};
+
+#else /* CONFIG_RSEQ */
+struct rseq_data { };
+#endif /* !CONFIG_RSEQ */
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6627c52..1562776 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
#include <linux/task_io_accounting.h>
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
+#include <linux/rseq_types.h>
#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
@@ -1406,16 +1407,8 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
- /*
- * RmW on rseq_event_pending must be performed atomically
- * with respect to preemption.
- */
- bool rseq_event_pending;
-# ifdef CONFIG_DEBUG_RSEQ
+ struct rseq_data rseq;
+#ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
* validation of read-only fields. The struct rseq has a
@@ -1423,7 +1416,6 @@ struct task_struct {
* directly. Reserve a size large enough for the known fields.
*/
char rseq_fields[sizeof(struct rseq)];
-# endif
#endif
#ifdef CONFIG_SCHED_MM_CID
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84ef..392ec2f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
unsigned long size, void __user *data)
{
struct ptrace_rseq_configuration conf = {
- .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
- .rseq_abi_size = task->rseq_len,
- .signature = task->rseq_sig,
+ .rseq_abi_pointer = (u64)(uintptr_t)task->rseq.usrptr,
+ .rseq_abi_size = task->rseq.len,
+ .signature = task->rseq.sig,
.flags = 0,
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 81dddaf..aae6266 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -103,13 +103,13 @@ static int rseq_validate_ro_fields(struct task_struct *t)
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
/*
* Validate fields which are required to be read-only by
* user-space.
*/
- if (!user_read_access_begin(rseq, t->rseq_len))
+ if (!user_read_access_begin(rseq, t->rseq.len))
goto efault;
unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
@@ -147,10 +147,10 @@ efault:
* Update an rseq field and its in-kernel copy in lock-step to keep a coherent
* state.
*/
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
+#define rseq_unsafe_put_user(t, value, field, error_label) \
+ do { \
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
+ rseq_kernel_fields(t)->field = value; \
} while (0)
#else
@@ -160,12 +160,12 @@ static int rseq_validate_ro_fields(struct task_struct *t)
}
#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq->field, error_label)
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
#endif
static int rseq_update_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id = raw_smp_processor_id();
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
@@ -176,7 +176,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
@@ -201,7 +201,7 @@ efault:
static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
mm_cid = 0;
@@ -211,7 +211,7 @@ static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
/*
@@ -272,7 +272,7 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
u32 sig;
int ret;
- ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
+ ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
if (ret)
return ret;
@@ -305,10 +305,10 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
if (ret)
return ret;
- if (current->rseq_sig != sig) {
+ if (current->rseq.sig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq_sig, current->pid, usig);
+ sig, current->rseq.sig, current->pid, usig);
return -EINVAL;
}
return 0;
@@ -338,7 +338,7 @@ static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
return -EINVAL;
/* Get thread flags. */
- ret = get_user(flags, &t->rseq->flags);
+ ret = get_user(flags, &t->rseq.usrptr->flags);
if (ret)
return ret;
@@ -392,13 +392,13 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* Clear the rseq_cs pointer and return.
*/
if (!in_rseq_cs(ip, &rseq_cs))
- return clear_rseq_cs(t->rseq);
+ return clear_rseq_cs(t->rseq.usrptr);
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
if (!abort)
return 0;
- ret = clear_rseq_cs(t->rseq);
+ ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* inconsistencies.
*/
scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
+ event = t->rseq.event.sched_switch;
+ t->rseq.event.sched_switch = false;
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -492,7 +492,7 @@ void rseq_syscall(struct pt_regs *regs)
struct task_struct *t = current;
struct rseq_cs rseq_cs;
- if (!t->rseq)
+ if (!t->rseq.usrptr)
return;
if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
@@ -511,33 +511,31 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
/* Unregister rseq for current thread. */
- if (current->rseq != rseq || !current->rseq)
+ if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
return -EINVAL;
- if (rseq_len != current->rseq_len)
+ if (rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
ret = rseq_reset_rseq_cpu_node_id(current);
if (ret)
return ret;
- current->rseq = NULL;
- current->rseq_sig = 0;
- current->rseq_len = 0;
+ rseq_reset(current);
return 0;
}
if (unlikely(flags))
return -EINVAL;
- if (current->rseq) {
+ if (current->rseq.usrptr) {
/*
* If rseq is already registered, check whether
* the provided address differs from the prior
* one.
*/
- if (current->rseq != rseq || rseq_len != current->rseq_len)
+ if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
/* Already registered. */
return -EBUSY;
@@ -586,15 +584,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
*/
- current->rseq = rseq;
- current->rseq_len = rseq_len;
- current->rseq_sig = sig;
+ current->rseq.usrptr = rseq;
+ current->rseq.len = rseq_len;
+ current->rseq.sig = sig;
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
+ current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
return 0;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid CPU/MM CID updates when no event pending
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: d3fd8e5061bf3506c0cbf974536855470bd142b9
Gitweb: https://git.kernel.org/tip/d3fd8e5061bf3506c0cbf974536855470bd142b9
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:31 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:15 +01:00
rseq: Avoid CPU/MM CID updates when no event pending
There is no need to update these values unconditionally if there is no
event pending.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de
---
kernel/rseq.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 01e7113..81dddaf 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ return;
+
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq, virt: Retrigger RSEQ after vcpu_run()
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-10-28 15:08 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers,
Sean Christopherson, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 2016e8735ac1c36ac7c1afbfe0645d723f147644
Gitweb: https://git.kernel.org/tip/2016e8735ac1c36ac7c1afbfe0645d723f147644
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:28 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:14 +01:00
rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that the eventual pending rseq event is not lost on the
way to user space.
This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.
It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().
This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.
Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
---
drivers/hv/mshv_root_main.c | 3 +-
include/linux/rseq.h | 17 ++++++++-
kernel/rseq.c | 78 ++++++++++++++++++------------------
virt/kvm/kvm_main.c | 7 +++-
4 files changed, 68 insertions(+), 37 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e3b2bd4..a21a0eb 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
#include <linux/crash_dump.h>
#include <linux/panic_notifier.h>
#include <linux/vmalloc.h>
+#include <linux/rseq.h>
#include "mshv_eventfd.h"
#include "mshv.h"
@@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
}
} while (!vp->run.flags.intercept_suspend);
+ rseq_virt_userspace_exit();
+
return ret;
}
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 241067b..c6267f7 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to_user_mode(void)
}
/*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+ if (current->rseq_event_pending)
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct task_struct *t)
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 59adc1a..01e7113 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
int ret, sig;
+ bool event;
+
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
if (unlikely(t->flags & PF_EXITING))
return;
/*
- * If invoked from hypervisors or IO-URING, then @regs is a NULL
- * pointer, so fixup cannot be done. If the syscall which led to
- * this invocation was invoked inside a critical section, then it
- * will either end up in this code again or a possible violation of
- * a syscall inside a critical region can only be detected by the
- * debug code in rseq_syscall() in a debug enabled kernel.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
*/
- if (regs) {
- /*
- * Read and clear the event pending bit first. If the task
- * was not preempted or migrated or a signal is on the way,
- * there is no point in doing any of the heavy lifting here
- * on production kernels. In that case TIF_NOTIFY_RESUME
- * was raised by some other functionality.
- *
- * This is correct because the read/clear operation is
- * guarded against scheduler preemption, which makes it CPU
- * local atomic. If the task is preempted right after
- * re-enabling preemption then TIF_NOTIFY_RESUME is set
- * again and this function is invoked another time _before_
- * the task is able to return to user mode.
- *
- * On a debug kernel, invoke the fixup code unconditionally
- * with the result handed in to allow the detection of
- * inconsistencies.
- */
- bool event;
-
- scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
- }
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
+ }
+
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
return;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a0ae2..4255fcf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
#include <linux/lockdep.h>
#include <linux/kthread.h>
#include <linux/suspend.h>
+#include <linux/rseq.h>
#include <asm/processor.h>
#include <asm/ioctl.h>
@@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_arch_vcpu_ioctl_run(vcpu);
vcpu->wants_to_run = false;
+ /*
+ * FIXME: Remove this hack once all KVM architectures
+ * support the generic TIF bits, i.e. a dedicated TIF_RSEQ.
+ */
+ rseq_virt_userspace_exit();
+
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify the event notification
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 6f9a08b9988964d7e3fea0860c6e1c8f89a35269
Gitweb: https://git.kernel.org/tip/6f9a08b9988964d7e3fea0860c6e1c8f89a35269
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:26 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:14 +01:00
rseq: Simplify the event notification
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 66 +++++++-------------------------------
include/linux/sched.h | 10 +++---
include/uapi/linux/rseq.h | 21 ++++--------
kernel/rseq.c | 28 +++++++++-------
kernel/sched/core.c | 5 +---
kernel/sched/membarrier.c | 8 ++---
7 files changed, 48 insertions(+), 92 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e..e45b298 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index d72ddf7..241067b 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
#define _LINUX_RSEQ_H
#ifdef CONFIG_RSEQ
-
-#include <linux/preempt.h>
#include <linux/sched.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
- RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
- RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
- RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
- RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
- RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
- RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
- if (t->rseq)
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_resume(struct pt_regs *regs)
__rseq_handle_notify_resume(NULL, regs);
}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
if (current->rseq) {
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ current->rseq_event_pending = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
{
- __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
+ if (t->rseq) {
+ t->rseq_event_pending = true;
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ }
}
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
- current->rseq_event_mask = 0;
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+ current->rseq_event_pending = false;
}
}
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
} else {
t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
- t->rseq_event_mask = current->rseq_event_mask;
+ t->rseq_event_pending = current->rseq_event_pending;
}
}
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878..6627c52 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,14 +1407,14 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
+ struct rseq __user *rseq;
+ u32 rseq_len;
+ u32 rseq_sig;
/*
- * RmW on rseq_event_mask must be performed atomically
+ * RmW on rseq_event_pending must be performed atomically
* with respect to preemption.
*/
- unsigned long rseq_event_mask;
+ bool rseq_event_pending;
# ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae..1b76d50 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
/*
* Restartable sequences flags field.
*
- * This field should only be updated by the thread which
- * registered this data structure. Read by the kernel.
- * Mainly used for single-stepping through rseq critical sections
- * with debuggers.
- *
- * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
- * Inhibit instruction sequence block restart on preemption
- * for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
- * Inhibit instruction sequence block restart on signal
- * delivery for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
- * Inhibit instruction sequence block restart on migration for
- * this thread.
+ * This field was initially intended to allow event masking for
+ * single-stepping through rseq critical sections with debuggers.
+ * The kernel does not support this anymore and the relevant bits
+ * are checked for being always false:
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
*/
__u32 flags;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 80af48a..59adc1a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD irq
+#else
+# define RSEQ_EVENT_GUARD preempt
+#endif
+
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
*/
if (regs) {
/*
- * Read and clear the event mask first. If the task was not
- * preempted or migrated or a signal is on the way, there
- * is no point in doing any of the heavy lifting here on
- * production kernels. In that case TIF_NOTIFY_RESUME was
- * raised by some other functionality.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
*
* This is correct because the read/clear operation is
* guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- u32 event_mask;
+ bool event;
scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
- ret = rseq_ip_fixup(regs, !!event_mask);
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
if (unlikely(ret < 0))
goto error;
}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67..b75e8e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3329,7 +3329,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
- rseq_migrate(p);
sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}
@@ -4763,7 +4762,6 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_task_group = tg;
}
#endif
- rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
@@ -4827,7 +4825,6 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -5121,7 +5118,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_preempt(prev);
+ rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 62fba83..6234456 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
* is negligible.
*/
smp_mb();
- rseq_preempt(current);
+ rseq_sched_switch_event(current);
}
static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(int flags, int cpu_id)
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
- * For RSEQ, don't rseq_preempt() the caller. User code
- * is not supposed to issue syscalls at all from inside an
- * rseq critical section.
+ * For RSEQ, don't invoke rseq_sched_switch_event() on the
+ * caller. User code is not supposed to issue syscalls at
+ * all from inside an rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
preempt_disable();
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify registration
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: ae3051d908dda0409d18d8e31a3b89023376ada3
Gitweb: https://git.kernel.org/tip/ae3051d908dda0409d18d8e31a3b89023376ada3
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:24 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:13 +01:00
rseq: Simplify registration
There is no point to read the critical section element in the newly
registered user space RSEQ struct first in order to clear it.
Just clear it and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.274661227@linutronix.de
---
kernel/rseq.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 51fafc4..80af48a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
- int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
int ret;
- u64 rseq_cs;
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
* avoid a potential segfault on return to user-space. The proper thing
* to do would have been to fail the registration but this would break
* older libcs that reuse the rseq area for new threads without
- * clearing the fields.
+ * clearing the fields. Don't bother reading it, just reset it.
*/
- if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
- return -EFAULT;
- if (rseq_cs && clear_rseq_cs(rseq))
+ if (put_user(0UL, &rseq->rseq_cs))
return -EFAULT;
#ifdef CONFIG_DEBUG_RSEQ
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Remove the ksig argument from rseq_handle_notify_resume()
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: fd41ddc3ac6329c242faefccadabbdfa8512fee4
Gitweb: https://git.kernel.org/tip/fd41ddc3ac6329c242faefccadabbdfa8512fee4
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:22 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:13 +01:00
rseq: Remove the ksig argument from rseq_handle_notify_resume()
There is no point for this being visible in the resume_to_user_mode()
handling.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.211520245@linutronix.de
---
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 15 ++++++++-------
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0..dd3bf7d 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(NULL, regs);
+ rseq_handle_notify_resume(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 21f875a..d72ddf7 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq)
- __rseq_handle_notify_resume(ksig, regs);
+ __rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
struct pt_regs *regs)
{
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
- rseq_handle_notify_resume(ksig, regs);
+ if (current->rseq) {
+ scoped_guard(RSEQ_EVENT_GUARD)
+ __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ __rseq_handle_notify_resume(ksig, regs);
+ }
}
/* rseq_preempt() requires preemption to be disabled. */
@@ -103,7 +104,7 @@ static inline void rseq_execve(struct task_struct *t)
#else /* CONFIG_RSEQ */
static inline void rseq_set_notify_resume(struct task_struct *t) { }
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_preempt(struct task_struct *t) { }
static inline void rseq_migrate(struct task_struct *t) { }
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Move algorithm comment to top
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 02cde183ec67135d7ac01898c0ab77e29ad2c19f
Gitweb: https://git.kernel.org/tip/02cde183ec67135d7ac01898c0ab77e29ad2c19f
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:20 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:13 +01:00
rseq: Move algorithm comment to top
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de
---
kernel/rseq.c | 119 ++++++++++++++++++++++++-------------------------
1 file changed, 59 insertions(+), 60 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 246319d..51fafc4 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
*/
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ * init(rseq_cs)
+ * cpu = TLS->rseq::cpu_id_start
+ * [1] TLS->rseq::rseq_cs = rseq_cs
+ * [start_ip] ----------------------------
+ * [2] if (cpu != TLS->rseq::cpu_id)
+ * goto abort_ip;
+ * [3] <last_instruction_in_cs>
+ * [post_commit_ip] ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1. Userspace stores the address of the struct rseq_cs assembly
+ * block descriptor into the rseq_cs field of the registered
+ * struct rseq TLS area. This update is performed through a single
+ * store within the inline assembly instruction sequence.
+ * [start_ip]
+ *
+ * 2. Userspace tests to check whether the current cpu_id field match
+ * the cpu number loaded before start_ip, branching to abort_ip
+ * in case of a mismatch.
+ *
+ * If the sequence is preempted or interrupted by a signal
+ * at or after start_ip and before post_commit_ip, then the kernel
+ * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ * ip to abort_ip before returning to user-space, so the preempted
+ * execution resumes at abort_ip.
+ *
+ * 3. Userspace critical section final instruction before
+ * post_commit_ip is the commit. The critical section is
+ * self-terminating.
+ * [post_commit_ip]
+ *
+ * 4. <success>
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ * [abort_ip]
+ * F1. <failure>
+ */
+
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struct task_struct *t)
unsafe_put_user(value, &t->rseq->field, error_label)
#endif
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- * init(rseq_cs)
- * cpu = TLS->rseq::cpu_id_start
- * [1] TLS->rseq::rseq_cs = rseq_cs
- * [start_ip] ----------------------------
- * [2] if (cpu != TLS->rseq::cpu_id)
- * goto abort_ip;
- * [3] <last_instruction_in_cs>
- * [post_commit_ip] ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1. Userspace stores the address of the struct rseq_cs assembly
- * block descriptor into the rseq_cs field of the registered
- * struct rseq TLS area. This update is performed through a single
- * store within the inline assembly instruction sequence.
- * [start_ip]
- *
- * 2. Userspace tests to check whether the current cpu_id field match
- * the cpu number loaded before start_ip, branching to abort_ip
- * in case of a mismatch.
- *
- * If the sequence is preempted or interrupted by a signal
- * at or after start_ip and before post_commit_ip, then the kernel
- * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- * ip to abort_ip before returning to user-space, so the preempted
- * execution resumes at abort_ip.
- *
- * 3. Userspace critical section final instruction before
- * post_commit_ip is the commit. The critical section is
- * self-terminating.
- * [post_commit_ip]
- *
- * 4. <success>
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- * [abort_ip]
- * F1. <failure>
- */
-
static int rseq_update_cpu_node_id(struct task_struct *t)
{
struct rseq __user *rseq = t->rseq;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Condense the inline stubs
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: a096aaa7b909f5857892e49f23505be01dc091dd
Gitweb: https://git.kernel.org/tip/a096aaa7b909f5857892e49f23505be01dc091dd
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:18 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:13 +01:00
rseq: Condense the inline stubs
Scrolling over tons of pointless
{
}
lines to find the actual code is annoying at best.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.085971048@linutronix.de
---
include/linux/rseq.h | 47 ++++++++++---------------------------------
1 file changed, 12 insertions(+), 35 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 7622b73..21f875a 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq_event_mask = 0;
}
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
-
void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
#endif /* _LINUX_RSEQ_H */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid pointless evaluation in __rseq_notify_resume()
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-03 14:47 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Mathieu Desnoyers, x86,
linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 92cd030734580b8d44443e3006f804a59eb281dd
Gitweb: https://git.kernel.org/tip/92cd030734580b8d44443e3006f804a59eb281dd
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:16 +01:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 03 Nov 2025 15:26:12 +01:00
rseq: Avoid pointless evaluation in __rseq_notify_resume()
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
Add a exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it reword the convoluted comment why the pt_regs pointer can be
NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-
include/linux/rseq.h | 10 ++++-
kernel/rseq.c | 66 ++++++++++++++++++++-----------
3 files changed, 58 insertions(+), 25 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d643c7c..e5941df 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
-#include <linux/context_tracking.h>
#include <linux/tick.h>
-#include <linux/kmsan.h>
#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
arch_exit_to_user_mode_prepare(regs, ti_work);
+ rseq_exit_to_user_mode();
+
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 69553e7..7622b73 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct task_struct *t)
rseq_set_notify_resume(t);
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+ current->rseq_event_mask = 0;
+ }
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_execve(struct task_struct *t)
{
}
-
+static inline void rseq_exit_to_user_mode(void) { }
#endif
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 2452b73..246319d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *str, u32 flags)
return true;
}
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
{
- u32 flags, event_mask;
+ u32 flags;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
-
- /*
- * Load and clear event mask atomically with respect to
- * scheduler preemption and membarrier IPIs.
- */
- scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
- }
-
- return !!event_mask;
+ return 0;
}
static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs *regs)
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
- ret = rseq_need_restart(t, rseq_cs.flags);
- if (ret <= 0)
+ ret = rseq_check_flags(t, rseq_cs.flags);
+ if (ret < 0)
return ret;
+ if (!abort)
+ return 0;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
/*
- * regs is NULL if and only if the caller is in a syscall path. Skip
- * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
- * kill a misbehaving userspace on debug kernels.
+ * If invoked from hypervisors or IO-URING, then @regs is a NULL
+ * pointer, so fixup cannot be done. If the syscall which led to
+ * this invocation was invoked inside a critical section, then it
+ * will either end up in this code again or a possible violation of
+ * a syscall inside a critical region can only be detected by the
+ * debug code in rseq_syscall() in a debug enabled kernel.
*/
if (regs) {
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
- goto error;
+ /*
+ * Read and clear the event mask first. If the task was not
+ * preempted or migrated or a signal is on the way, there
+ * is no point in doing any of the heavy lifting here on
+ * production kernels. In that case TIF_NOTIFY_RESUME was
+ * raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
+ */
+ u32 event_mask;
+
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+ ret = rseq_ip_fixup(regs, !!event_mask);
+ if (unlikely(ret < 0))
+ goto error;
+ }
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to TIF_RSEQ if supported
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 32034df66b5f49626aa450ceaf1849a08d87906e
Gitweb: https://git.kernel.org/tip/32034df66b5f49626aa450ceaf1849a08d87906e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:26 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:35:37 +01:00
rseq: Switch to TIF_RSEQ if supported
TIF_NOTIFY_RESUME is a multiplexing TIF bit, which is suboptimal especially
with the RSEQ fast path depending on it, but not really handling it.
Define a separate TIF_RSEQ in the generic TIF space and enable the full
separation of fast and slow path for architectures which utilize that.
That avoids the hassle with invocations of resume_user_mode_work() from
hypervisors, which clear TIF_NOTIFY_RESUME. It makes the therefore required
re-evaluation at the end of vcpu_run() a NOOP on architectures which
utilize the generic TIF space and have a separate TIF_RSEQ.
The hypervisor TIF handling does not include the separate TIF_RSEQ as there
is no point in doing so. The guest does neither know nor care about the VMM
host applications RSEQ state. That state is only relevant when the ioctl()
returns to user space.
The fastpath implementation still utilizes TIF_NOTIFY_RESUME for failure
handling, but this only happens within exit_to_user_mode_loop(), so
arguably the hypervisor ioctl() code is long done when this happens.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.903622031@linutronix.de
---
include/asm-generic/thread_info_tif.h | 3 ++-
include/linux/irq-entry-common.h | 2 +-
include/linux/rseq.h | 22 ++++++++++++------
include/linux/rseq_entry.h | 32 +++++++++++++++++++++++---
include/linux/thread_info.h | 5 ++++-
kernel/entry/common.c | 10 ++++++--
6 files changed, 61 insertions(+), 13 deletions(-)
diff --git a/include/asm-generic/thread_info_tif.h b/include/asm-generic/thread_info_tif.h
index ee3793e..da1610a 100644
--- a/include/asm-generic/thread_info_tif.h
+++ b/include/asm-generic/thread_info_tif.h
@@ -45,4 +45,7 @@
# define _TIF_RESTORE_SIGMASK BIT(TIF_RESTORE_SIGMASK)
#endif
+#define TIF_RSEQ 11 // Run RSEQ fast path
+#define _TIF_RSEQ BIT(TIF_RSEQ)
+
#endif /* _ASM_GENERIC_THREAD_INFO_TIF_H_ */
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index bc5d178..72e3f7a 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -30,7 +30,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
- _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | _TIF_RSEQ | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index ded4baa..b5e4803 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -42,7 +42,7 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_RSEQ);
}
/* Invoked from context switch to force evaluation on exit to user */
@@ -114,17 +114,25 @@ static inline void rseq_force_update(void)
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
- * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
- * that case just to do it eventually again before returning to user space,
- * the entry resume_user_mode_work() invocation is ignored as the register
- * argument is NULL.
+ * which clears TIF_NOTIFY_RESUME on architectures that don't use the
+ * generic TIF bits and therefore can't provide a separate TIF_RSEQ flag.
*
- * After returning from guest mode, they have to invoke this function to
- * re-raise TIF_NOTIFY_RESUME if necessary.
+ * To avoid updating user space RSEQ in that case just to do it eventually
+ * again before returning to user space, because __rseq_handle_slowpath()
+ * does nothing when invoked with NULL register state.
+ *
+ * After returning from guest mode, before exiting to userspace, hypervisors
+ * must invoke this function to re-raise TIF_NOTIFY_RESUME if necessary.
*/
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
+ /*
+ * The generic optimization for deferring RSEQ updates until the next
+ * exit relies on having a dedicated TIF_RSEQ.
+ */
+ if (!IS_ENABLED(CONFIG_HAVE_GENERIC_TIF_BITS) &&
+ current->rseq.event.sched_switch)
rseq_raise_notify_resume(current);
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 958a63e..c92167f 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -507,18 +507,44 @@ static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *reg
return false;
}
-static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+/* Required to allow conversion to GENERIC_ENTRY w/o GENERIC_TIF_BITS */
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+static __always_inline bool test_tif_rseq(unsigned long ti_work)
{
+ return ti_work & _TIF_RSEQ;
+}
+
+static __always_inline void clear_tif_rseq(void)
+{
+ static_assert(TIF_RSEQ != TIF_NOTIFY_RESUME);
+ clear_thread_flag(TIF_RSEQ);
+}
+#else
+static __always_inline bool test_tif_rseq(unsigned long ti_work) { return true; }
+static __always_inline void clear_tif_rseq(void) { }
+#endif
+
+static __always_inline bool
+rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ if (likely(!test_tif_rseq(ti_work)))
+ return false;
+
if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
current->rseq.event.slowpath = true;
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
return true;
}
+
+ clear_tif_rseq();
return false;
}
#else /* CONFIG_GENERIC_ENTRY */
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
+{
+ return false;
+}
#endif /* !CONFIG_GENERIC_ENTRY */
static __always_inline void rseq_syscall_exit_to_user_mode(void)
@@ -577,7 +603,7 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
}
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
-static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs, unsigned long ti_work)
{
return false;
}
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index dd925d8..b40de9b 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -67,6 +67,11 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif
+#ifndef TIF_RSEQ
+# define TIF_RSEQ TIF_NOTIFY_RESUME
+# define _TIF_RSEQ _TIF_NOTIFY_RESUME
+#endif
+
#ifdef __KERNEL__
#ifndef arch_set_restart_data
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 523a3e7..5c792b3 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,6 +11,12 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
+#ifdef CONFIG_HAVE_GENERIC_TIF_BITS
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK & ~_TIF_RSEQ)
+#else
+#define EXIT_TO_USER_MODE_WORK_LOOP (EXIT_TO_USER_MODE_WORK)
+#endif
+
static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
unsigned long ti_work)
{
@@ -18,7 +24,7 @@ static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *re
* Before returning to user space ensure that all pending work
* items have been completed.
*/
- while (ti_work & EXIT_TO_USER_MODE_WORK) {
+ while (ti_work & EXIT_TO_USER_MODE_WORK_LOOP) {
local_irq_enable_exit_to_user(ti_work);
@@ -68,7 +74,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
for (;;) {
ti_work = __exit_to_user_mode_loop(regs, ti_work);
- if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ if (likely(!rseq_exit_to_user_mode_restart(regs, ti_work)))
return ti_work;
ti_work = read_thread_flags();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Split up rseq_exit_to_user_mode()
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 7a5201ea1907534efe3a6e9c001ef4c0257cb3f0
Gitweb: https://git.kernel.org/tip/7a5201ea1907534efe3a6e9c001ef4c0257cb3f0
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:24 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:35:30 +01:00
rseq: Split up rseq_exit_to_user_mode()
Separate the interrupt and syscall exit handling. Syscall exit does not
require to clear the user_irq bit as it can't be set. On interrupt exit it
can be set when the interrupt did not result in a scheduling event and
therefore the return path did not invoke the TIF work handling, which would
have cleared it.
The debug check for the event state is also not really required even when
debug mode is enabled via the static key. Debug mode is largely aiding user
space by enabling a larger amount of validation checks, which cause a
segfault when a malformed critical section is detected. In production mode
the critical section handling takes the content mostly as is and lets user
space keep the pieces when it screwed up.
On kernel changes in that area the state check is useful, but that can be
done when lockdep is enabled, which is anyway a required test scenario for
fundamental changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.842785700@linutronix.de
---
include/linux/irq-entry-common.h | 6 ++---
include/linux/rseq_entry.h | 36 +++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 5 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 5ea6172..bc5d178 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -240,7 +240,7 @@ static __always_inline void __exit_to_user_mode_validate(void)
static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_exit_to_user_mode_legacy();
__exit_to_user_mode_validate();
}
@@ -254,7 +254,7 @@ static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *reg
static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_syscall_exit_to_user_mode();
__exit_to_user_mode_validate();
}
@@ -268,7 +268,7 @@ static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *re
static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
{
__exit_to_user_mode_prepare(regs);
- rseq_exit_to_user_mode();
+ rseq_irqentry_exit_to_user_mode();
__exit_to_user_mode_validate();
}
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 3f13be7..958a63e 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -521,7 +521,37 @@ static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
#endif /* !CONFIG_GENERIC_ENTRY */
-static __always_inline void rseq_exit_to_user_mode(void)
+static __always_inline void rseq_syscall_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ /* Needed to remove the store for the !lockdep case */
+ if (IS_ENABLED(CONFIG_LOCKDEP)) {
+ WARN_ON_ONCE(ev->sched_switch);
+ ev->events = 0;
+ }
+}
+
+static __always_inline void rseq_irqentry_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ lockdep_assert_once(!ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing could not clear it.
+ */
+ ev->events = 0;
+}
+
+/* Required to keep ARM64 working */
+static __always_inline void rseq_exit_to_user_mode_legacy(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -551,7 +581,9 @@ static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
{
return false;
}
-static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_syscall_exit_to_user_mode(void) { }
+static inline void rseq_irqentry_exit_to_user_mode(void) { }
+static inline void rseq_exit_to_user_mode_legacy(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Split up exit_to_user_mode_prepare()
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 70fe25a3bc53a891f0e6184c12bd55cc524cb13b
Gitweb: https://git.kernel.org/tip/70fe25a3bc53a891f0e6184c12bd55cc524cb13b
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:21 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:35:17 +01:00
entry: Split up exit_to_user_mode_prepare()
exit_to_user_mode_prepare() is used for both interrupts and syscalls, but
there is extra rseq work, which is only required for in the interrupt exit
case.
Split up the function and provide wrappers for syscalls and interrupts,
which allows to separate the rseq exit work in the next step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.782234789@linutronix.de
---
arch/arm64/kernel/entry-common.c | 2 +-
include/linux/entry-common.h | 2 +-
include/linux/irq-entry-common.h | 49 +++++++++++++++++++++++++++----
3 files changed, 46 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index a9c8171..0a97e26 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -100,7 +100,7 @@ static __always_inline void arm64_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void arm64_exit_to_user_mode(struct pt_regs *regs)
{
local_irq_disable();
- exit_to_user_mode_prepare(regs);
+ exit_to_user_mode_prepare_legacy(regs);
local_daif_mask();
mte_check_tfsr_exit();
exit_to_user_mode();
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index d967184..87efb38 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -156,7 +156,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
if (unlikely(work & SYSCALL_WORK_EXIT))
syscall_exit_work(regs, work);
local_irq_disable_exit_to_user();
- exit_to_user_mode_prepare(regs);
+ syscall_exit_to_user_mode_prepare(regs);
}
/**
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 8f5ceea..5ea6172 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -201,7 +201,7 @@ void arch_do_signal_or_restart(struct pt_regs *regs);
unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
- * exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * __exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
* @regs: Pointer to pt_regs on entry stack
*
* 1) check that interrupts are disabled
@@ -209,8 +209,10 @@ unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work
* 3) call exit_to_user_mode_loop() if any flags from
* EXIT_TO_USER_MODE_WORK are set
* 4) check that interrupts are still disabled
+ *
+ * Don't invoke directly, use the syscall/irqentry_ prefixed variants below
*/
-static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
+static __always_inline void __exit_to_user_mode_prepare(struct pt_regs *regs)
{
unsigned long ti_work;
@@ -224,15 +226,52 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
ti_work = exit_to_user_mode_loop(regs, ti_work);
arch_exit_to_user_mode_prepare(regs, ti_work);
+}
- rseq_exit_to_user_mode();
-
+static __always_inline void __exit_to_user_mode_validate(void)
+{
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
lockdep_sys_exit();
}
+/* Temporary workaround to keep ARM64 alive */
+static __always_inline void exit_to_user_mode_prepare_legacy(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * syscall_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
+/**
+ * irqentry_exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
+ * @regs: Pointer to pt_regs on entry stack
+ *
+ * Wrapper around __exit_to_user_mode_prepare() to separate the exit work for
+ * syscalls and interrupts.
+ */
+static __always_inline void irqentry_exit_to_user_mode_prepare(struct pt_regs *regs)
+{
+ __exit_to_user_mode_prepare(regs);
+ rseq_exit_to_user_mode();
+ __exit_to_user_mode_validate();
+}
+
/**
* exit_to_user_mode - Fixup state when exiting to user mode
*
@@ -297,7 +336,7 @@ static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
{
instrumentation_begin();
- exit_to_user_mode_prepare(regs);
+ irqentry_exit_to_user_mode_prepare(regs);
instrumentation_end();
exit_to_user_mode();
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Switch to fast path processing on exit to user
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 3db6b38dfe640207da706b286d4181237391f5bd
Gitweb: https://git.kernel.org/tip/3db6b38dfe640207da706b286d4181237391f5bd
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:19 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:34:39 +01:00
rseq: Switch to fast path processing on exit to user
Now that all bits and pieces are in place, hook the RSEQ handling fast path
function into exit_to_user_mode_prepare() after the TIF work bits have been
handled. If case of fast path failure, TIF_NOTIFY_RESUME has been raised
and the caller needs to take another turn through the TIF handling slow
path.
This only works for architectures which use the generic entry code.
Architectures who still have their own incomplete hacks are not supported
and won't be.
This results in the following improvements:
Kernel build Before After Reduction
exit to user 80692981 80514451
signal checks: 32581 121 99%
slowpath runs: 1201408 1.49% 198 0.00% 100%
fastpath runs: 675941 0.84% N/A
id updates: 1233989 1.53% 50541 0.06% 96%
cs checks: 1125366 1.39% 0 0.00% 100%
cs cleared: 1125366 100% 0 100%
cs fixup: 0 0% 0
RSEQ selftests Before After Reduction
exit to user: 386281778 387373750
signal checks: 35661203 0 100%
slowpath runs: 140542396 36.38% 100 0.00% 100%
fastpath runs: 9509789 2.51% N/A
id updates: 176203599 45.62% 9087994 2.35% 95%
cs checks: 175587856 45.46% 4728394 1.22% 98%
cs cleared: 172359544 98.16% 1319307 27.90% 99%
cs fixup: 3228312 1.84% 3409087 72.10%
The 'cs cleared' and 'cs fixup' percentages are not relative to the exit to
user invocations, they are relative to the actual 'cs check' invocations.
While some of this could have been avoided in the original code, like the
obvious clearing of CS when it's already clear, the main problem of going
through TIF_NOTIFY_RESUME cannot be solved. In some workloads the RSEQ
notify handler is invoked more than once before going out to user
space. Doing this once when everything has stabilized is the only solution
to avoid this.
The initial attempt to completely decouple it from the TIF work turned out
to be suboptimal for workloads, which do a lot of quick and short system
calls. Even if the fast path decision is only 4 instructions (including a
conditional branch), this adds up quickly and becomes measurable when the
rate for actually having to handle rseq is in the low single digit
percentage range of user/kernel transitions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.701201365@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-----
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 18 ++++++++++++------
init/Kconfig | 2 +-
kernel/entry/common.c | 26 +++++++++++++++++++-------
kernel/rseq.c | 8 ++++++--
6 files changed, 41 insertions(+), 22 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index cb31fb8..8f5ceea 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -197,11 +197,8 @@ static __always_inline void arch_exit_to_user_mode(void) { }
*/
void arch_do_signal_or_restart(struct pt_regs *regs);
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- */
-unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work);
+/* Handle pending TIF work */
+unsigned long exit_to_user_mode_loop(struct pt_regs *regs, unsigned long ti_work);
/**
* exit_to_user_mode_prepare - call exit_to_user_mode_loop() if required
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index dd3bf7d..bf92227 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(regs);
+ rseq_handle_slowpath(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index abfbeb4..ded4baa 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,13 +7,19 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct pt_regs *regs);
+void __rseq_handle_slowpath(struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct pt_regs *regs)
+/* Invoked from resume_user_mode_work() */
+static inline void rseq_handle_slowpath(struct pt_regs *regs)
{
- /* '&' is intentional to spare one conditional branch */
- if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(regs);
+ if (IS_ENABLED(CONFIG_GENERIC_ENTRY)) {
+ if (current->rseq.event.slowpath)
+ __rseq_handle_slowpath(regs);
+ } else {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
+ __rseq_handle_slowpath(regs);
+ }
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs);
@@ -152,7 +158,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
}
#else /* CONFIG_RSEQ */
-static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
+static inline void rseq_handle_slowpath(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
diff --git a/init/Kconfig b/init/Kconfig
index bde40ab..d1c606e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1941,7 +1941,7 @@ config RSEQ_DEBUG_DEFAULT_ENABLE
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
- depends on RSEQ && DEBUG_KERNEL
+ depends on RSEQ && DEBUG_KERNEL && !GENERIC_ENTRY
select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 70a16db..523a3e7 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -11,13 +11,8 @@
/* Workaround to allow gradual conversion of architecture code */
void __weak arch_do_signal_or_restart(struct pt_regs *regs) { }
-/**
- * exit_to_user_mode_loop - do any pending work before leaving to user space
- * @regs: Pointer to pt_regs on entry stack
- * @ti_work: TIF work flags as read by the caller
- */
-__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
- unsigned long ti_work)
+static __always_inline unsigned long __exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
{
/*
* Before returning to user space ensure that all pending work
@@ -62,6 +57,23 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
+/**
+ * exit_to_user_mode_loop - do any pending work before leaving to user space
+ * @regs: Pointer to pt_regs on entry stack
+ * @ti_work: TIF work flags as read by the caller
+ */
+__always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
+ unsigned long ti_work)
+{
+ for (;;) {
+ ti_work = __exit_to_user_mode_loop(regs, ti_work);
+
+ if (likely(!rseq_exit_to_user_mode_restart(regs)))
+ return ti_work;
+ ti_work = read_thread_flags();
+ }
+}
+
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c5d6336..395d8b0 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -237,7 +237,11 @@ efault:
static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
- /* Preserve rseq state and user_irq state for exit to user */
+ /*
+ * Preserve rseq state and user_irq state. The generic entry code
+ * clears user_irq on the way out, the non-generic entry
+ * architectures are not having user_irq.
+ */
const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
@@ -289,7 +293,7 @@ static void rseq_slowpath_update_usr(struct pt_regs *regs)
}
}
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+void __rseq_handle_slowpath(struct pt_regs *regs)
{
/*
* If invoked from hypervisors before entering the guest via
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Implement fast path for exit to user
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
` (3 preceding siblings ...)
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 05b44aef709cae5e4274590f050cf35049dcc24e
Gitweb: https://git.kernel.org/tip/05b44aef709cae5e4274590f050cf35049dcc24e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:17 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:34:18 +01:00
rseq: Implement fast path for exit to user
Implement the actual logic for handling RSEQ updates in a fast path after
handling the TIF work and at the point where the task is actually returning
to user space.
This is the right point to do that because at this point the CPU and the MM
CID are stable and cannot longer change due to yet another reschedule.
That happens when the task is handling it via TIF_NOTIFY_RESUME in
resume_user_mode_work(), which is invoked from the exit to user mode work
loop.
The function is invoked after the TIF work is handled and runs with
interrupts disabled, which means it cannot resolve page faults. It
therefore disables page faults and in case the access to the user space
memory faults, it:
- notes the fail in the event struct
- raises TIF_NOTIFY_RESUME
- returns false to the caller
The caller has to go back to the TIF work, which runs with interrupts
enabled and therefore can resolve the page faults. This happens mostly on
fork() when the memory is marked COW.
If the user memory inspection finds invalid data, the function returns
false as well and sets the fatal flag in the event struct along with
TIF_NOTIFY_RESUME. The slow path notify handler has to evaluate that flag
and terminate the task with SIGSEGV as documented.
The initial decision to invoke any of this is based on one flags in the
event struct: @sched_switch. The decision is in pseudo ASM:
load tsk::event::sched_switch
jnz inspect_user_space
mov $0, tsk::event::events
...
leave
So for the common case where the task was not scheduled out, this really
boils down to three instructions before going out if the compiler is not
completely stupid (and yes, some of them are).
If the condition is true, then it checks, whether CPU ID or MM CID have
changed. If so, then the CPU/MM IDs have to be updated and are thereby
cached for the next round. The update unconditionally retrieves the user
space critical section address to spare another user*begin/end() pair. If
that's not zero and tsk::event::user_irq is set, then the critical section
is analyzed and acted upon. If either zero or the entry came via syscall
the critical section analysis is skipped.
If the comparison is false then the critical section has to be analyzed
because the event flag is then only true when entry from user was by
interrupt.
This is provided without the actual hookup to let reviewers focus on the
implementation details. The hookup happens in the next step.
Note: As with quite some other optimizations this depends on the generic
entry infrastructure and is not enabled to be sucked into random
architecture implementations.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.638929615@linutronix.de
---
include/linux/rseq_entry.h | 133 +++++++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 3 +-
kernel/rseq.c | 2 +-
3 files changed, 135 insertions(+), 3 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index aa1c046..3f13be7 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -10,6 +10,7 @@ struct rseq_stats {
unsigned long exit;
unsigned long signal;
unsigned long slowpath;
+ unsigned long fastpath;
unsigned long ids;
unsigned long cs;
unsigned long clear;
@@ -245,12 +246,13 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
{
struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
unsigned long ip = instruction_pointer(regs);
+ unsigned long tasksize = TASK_SIZE;
u64 start_ip, abort_ip, offset;
u32 usig, __user *uc_sig;
rseq_stat_inc(rseq_stats.cs);
- if (unlikely(csaddr >= TASK_SIZE)) {
+ if (unlikely(csaddr >= tasksize)) {
t->rseq.event.fatal = true;
return false;
}
@@ -287,7 +289,7 @@ rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long c
* in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
* protection.
*/
- if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
goto die;
/* The address is guaranteed to be >= 0 and < TASK_SIZE */
@@ -397,6 +399,128 @@ static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *r
return rseq_update_user_cs(t, regs, csaddr);
}
+/*
+ * If you want to use this then convert your architecture to the generic
+ * entry code. I'm tired of building workarounds for people who can't be
+ * bothered to make the maintenance of generic infrastructure less
+ * burdensome. Just sucking everything into the architecture code and
+ * thereby making others chase the horrible hacks and keep them working is
+ * neither acceptable nor sustainable.
+ */
+#ifdef CONFIG_GENERIC_ENTRY
+
+/*
+ * This is inlined into the exit path because:
+ *
+ * 1) It's a one time comparison in the fast path when there is no event to
+ * handle
+ *
+ * 2) The access to the user space rseq memory (TLS) is unlikely to fault
+ * so the straight inline operation is:
+ *
+ * - Four 32-bit stores only if CPU ID/ MM CID need to be updated
+ * - One 64-bit load to retrieve the critical section address
+ *
+ * 3) In the unlikely case that the critical section address is != NULL:
+ *
+ * - One 64-bit load to retrieve the start IP
+ * - One 64-bit load to retrieve the offset for calculating the end
+ * - One 64-bit load to retrieve the abort IP
+ * - One 64-bit load to retrieve the signature
+ * - One store to clear the critical section address
+ *
+ * The non-debug case implements only the minimal required checking. It
+ * provides protection against a rogue abort IP in kernel space, which
+ * would be exploitable at least on x86, and also against a rogue CS
+ * descriptor by checking the signature at the abort IP. Any fallout from
+ * invalid critical section descriptors is a user space problem. The debug
+ * case provides the full set of checks and terminates the task if a
+ * condition is not met.
+ *
+ * In case of a fault or an invalid value, this sets TIF_NOTIFY_RESUME and
+ * tells the caller to loop back into exit_to_user_mode_loop(). The rseq
+ * slow path there will handle the failure.
+ */
+static __always_inline bool rseq_exit_user_update(struct pt_regs *regs, struct task_struct *t)
+{
+ /*
+ * Page faults need to be disabled as this is called with
+ * interrupts disabled
+ */
+ guard(pagefault)();
+ if (likely(!t->rseq.event.ids_changed)) {
+ struct rseq __user *rseq = t->rseq.usrptr;
+ /*
+ * If IDs have not changed rseq_event::user_irq must be true
+ * See rseq_sched_switch_event().
+ */
+ u64 csaddr;
+
+ if (unlikely(get_user_inline(csaddr, &rseq->rseq_cs)))
+ return false;
+
+ if (static_branch_unlikely(&rseq_debug_enabled) || unlikely(csaddr)) {
+ if (unlikely(!rseq_update_user_cs(t, regs, csaddr)))
+ return false;
+ }
+ return true;
+ }
+
+ struct rseq_ids ids = {
+ .cpu_id = task_cpu(t),
+ .mm_cid = task_mm_cid(t),
+ };
+ u32 node_id = cpu_to_node(ids.cpu_id);
+
+ return rseq_update_usr(t, regs, &ids, node_id);
+}
+
+static __always_inline bool __rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ struct task_struct *t = current;
+
+ /*
+ * If the task did not go through schedule or got the flag enforced
+ * by the rseq syscall or execve, then nothing to do here.
+ *
+ * CPU ID and MM CID can only change when going through a context
+ * switch.
+ *
+ * rseq_sched_switch_event() sets the rseq_event::sched_switch bit
+ * only when rseq_event::has_rseq is true. That conditional is
+ * required to avoid setting the TIF bit if RSEQ is not registered
+ * for a task. rseq_event::sched_switch is cleared when RSEQ is
+ * unregistered by a task so it's sufficient to check for the
+ * sched_switch bit alone.
+ *
+ * A sane compiler requires three instructions for the nothing to do
+ * case including clearing the events, but your mileage might vary.
+ */
+ if (unlikely((t->rseq.event.sched_switch))) {
+ rseq_stat_inc(rseq_stats.fastpath);
+
+ if (unlikely(!rseq_exit_user_update(regs, t)))
+ return true;
+ }
+ /* Clear state so next entry starts from a clean slate */
+ t->rseq.event.events = 0;
+ return false;
+}
+
+static __always_inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ if (unlikely(__rseq_exit_to_user_mode_restart(regs))) {
+ current->rseq.event.slowpath = true;
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ return true;
+ }
+ return false;
+}
+
+#else /* CONFIG_GENERIC_ENTRY */
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs) { return false; }
+#endif /* !CONFIG_GENERIC_ENTRY */
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
@@ -421,9 +545,12 @@ static inline void rseq_debug_syscall_return(struct pt_regs *regs)
if (static_branch_unlikely(&rseq_debug_enabled))
__rseq_debug_syscall_return(regs);
}
-
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline bool rseq_exit_to_user_mode_restart(struct pt_regs *regs)
+{
+ return false;
+}
static inline void rseq_exit_to_user_mode(void) { }
static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index a1389ff..9c7a341 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -18,6 +18,8 @@ struct rseq;
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ * @slowpath: Indicator that slow path processing via TIF_NOTIFY_RESUME
+ * is required
*
* @sched_switch and @ids_changed must be adjacent and the combo must be
* 16bit aligned to allow a single store, when both are set at the same
@@ -42,6 +44,7 @@ struct rseq_event {
u16 error;
struct {
u8 fatal;
+ u8 slowpath;
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 183dde7..c5d6336 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -133,6 +133,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.fastpath += data_race(per_cpu(rseq_stats.fastpath, cpu));
stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
@@ -142,6 +143,7 @@ static int rseq_stats_show(struct seq_file *m, void *p)
seq_printf(m, "exit: %16lu\n", stats.exit);
seq_printf(m, "signal: %16lu\n", stats.signal);
seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "fastp: %16lu\n", stats.fastpath);
seq_printf(m, "ids: %16lu\n", stats.ids);
seq_printf(m, "cs: %16lu\n", stats.cs);
seq_printf(m, "clear: %16lu\n", stats.clear);
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Optimize event setting
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:16 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 39a167560a61f913560ba803a96dbe6c15239f5c
Gitweb: https://git.kernel.org/tip/39a167560a61f913560ba803a96dbe6c15239f5c
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:14 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:34:03 +01:00
rseq: Optimize event setting
After removing the various condition bits earlier it turns out that one
extra information is needed to avoid setting event::sched_switch and
TIF_NOTIFY_RESUME unconditionally on every context switch.
The update of the RSEQ user space memory is only required, when either
the task was interrupted in user space and schedules
or
the CPU or MM CID changes in schedule() independent of the entry mode
Right now only the interrupt from user information is available.
Add an event flag, which is set when the CPU or MM CID or both change.
Evaluate this event in the scheduler to decide whether the sched_switch
event and the TIF bit need to be set.
It's an extra conditional in context_switch(), but the downside of
unconditionally handling RSEQ after a context switch to user is way more
significant. The utilized boolean logic minimizes this to a single
conditional branch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.578058898@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 81 +++++++++++++++++++++++++++++++++----
include/linux/rseq_types.h | 11 ++++-
kernel/rseq.c | 2 +-
kernel/sched/core.c | 7 ++-
kernel/sched/sched.h | 5 +-
6 files changed, 95 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index e45b298..90e47eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_sched_switch_event(current);
+ rseq_force_update();
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index f5a4318..abfbeb4 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -11,7 +11,8 @@ void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq)
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.sched_switch & current->rseq.event.has_rseq)
__rseq_handle_notify_resume(regs);
}
@@ -33,12 +34,75 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *reg
}
}
-/* Raised from context switch and exevce to force evaluation on exit to user */
-static inline void rseq_sched_switch_event(struct task_struct *t)
+static inline void rseq_raise_notify_resume(struct task_struct *t)
{
- if (t->rseq.event.has_rseq) {
- t->rseq.event.sched_switch = true;
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+
+/* Invoked from context switch to force evaluation on exit to user */
+static __always_inline void rseq_sched_switch_event(struct task_struct *t)
+{
+ struct rseq_event *ev = &t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /*
+ * Avoid a boat load of conditionals by using simple logic
+ * to determine whether NOTIFY_RESUME needs to be raised.
+ *
+ * It's required when the CPU or MM CID has changed or
+ * the entry was from user space.
+ */
+ bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
+
+ if (raise) {
+ ev->sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ } else {
+ if (ev->has_rseq) {
+ t->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(t);
+ }
+ }
+}
+
+/*
+ * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
+ * update.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
+{
+ t->rseq.event.ids_changed = true;
+}
+
+/*
+ * Invoked from switch_mm_cid() in context switch when the task gets a MM
+ * CID assigned.
+ *
+ * This does not raise TIF_NOTIFY_RESUME as that happens in
+ * rseq_sched_switch_event().
+ */
+static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
+{
+ /*
+ * Requires a comparison as the switch_mm_cid() code does not
+ * provide a conditional for it readily. So avoid excessive updates
+ * when nothing changes.
+ */
+ if (t->rseq.ids.mm_cid != cid)
+ t->rseq.event.ids_changed = true;
+}
+
+/* Enforce a full update after RSEQ registration and when execve() failed */
+static inline void rseq_force_update(void)
+{
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.ids_changed = true;
+ current->rseq.event.sched_switch = true;
+ rseq_raise_notify_resume(current);
}
}
@@ -55,7 +119,7 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static inline void rseq_virt_userspace_exit(void)
{
if (current->rseq.event.sched_switch)
- set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+ rseq_raise_notify_resume(current);
}
static inline void rseq_reset(struct task_struct *t)
@@ -91,6 +155,9 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
+static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
+static inline void rseq_force_update(void) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 7c12394..a1389ff 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -11,20 +11,27 @@ struct rseq;
* struct rseq_event - Storage for rseq related event management
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
- * @sched_switch: True if the task was scheduled out
+ * @sched_switch: True if the task was scheduled and needs update on
+ * exit to user
+ * @ids_changed: Indicator that IDs need to be updated
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
* @error: Compound error code for the slow path to analyze
* @fatal: User space data corrupted or invalid
+ *
+ * @sched_switch and @ids_changed must be adjacent and the combo must be
+ * 16bit aligned to allow a single store, when both are set at the same
+ * time in the scheduler.
*/
struct rseq_event {
union {
u64 all;
struct {
union {
- u16 events;
+ u32 events;
struct {
u8 sched_switch;
+ u8 ids_changed;
u8 user_irq;
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 148fb21..183dde7 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,7 +464,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* are updated before returning to user-space.
*/
current->rseq.event.has_rseq = true;
- rseq_sched_switch_event(current);
+ rseq_force_update();
return 0;
efault:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b75e8e1..579a8e9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5118,7 +5118,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5316,6 +5315,12 @@ context_switch(struct rq *rq, struct task_struct *prev,
/* switch_mm_cid() requires the memory barriers above. */
switch_mm_cid(rq, prev, next);
+ /*
+ * Tell rseq that the task was scheduled in. Must be after
+ * switch_mm_cid() to get the TIF flag set.
+ */
+ rseq_sched_switch_event(next);
+
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index adfb6e3..4838dda 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2209,6 +2209,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
smp_wmb();
WRITE_ONCE(task_thread_info(p)->cpu, cpu);
p->wake_cpu = cpu;
+ rseq_sched_set_task_cpu(p, cpu);
#endif /* CONFIG_SMP */
}
@@ -3807,8 +3808,10 @@ static inline void switch_mm_cid(struct rq *rq,
mm_cid_put_lazy(prev);
prev->mm_cid = -1;
}
- if (next->mm_cid_active)
+ if (next->mm_cid_active) {
next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
+ rseq_sched_set_task_mm_cid(next, next->mm_cid);
+ }
}
#else /* !CONFIG_SCHED_MM_CID: */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Rework the TIF_NOTIFY handler
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: e2d4f42271155045a49b89530f2c06ad8e9f1a1e
Gitweb: https://git.kernel.org/tip/e2d4f42271155045a49b89530f2c06ad8e9f1a1e
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:12 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:54 +01:00
rseq: Rework the TIF_NOTIFY handler
Replace the whole logic with a new implementation, which is shared with
signal delivery and the upcoming exit fast path.
Contrary to the original implementation, this ignores invocations from
KVM/IO-uring, which invoke resume_user_mode_work() with the @regs argument
set to NULL.
The original implementation updated the CPU/Node/MM CID fields, but that
was just a side effect, which was addressing the problem that this
invocation cleared TIF_NOTIFY_RESUME, which in turn could cause an update
on return to user space to be lost.
This problem has been addressed differently, so that it's not longer
required to do that update before entering the guest.
That might be considered a user visible change, when the hosts thread TLS
memory is mapped into the guest, but as this was never intentionally
supported, this abuse of kernel internal implementation details is not
considered an ABI break.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.517640811@linutronix.de
---
include/linux/rseq_entry.h | 29 ++++++++++++++-
kernel/rseq.c | 76 ++++++++++++++++---------------------
2 files changed, 62 insertions(+), 43 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 37444e8..aa1c046 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -368,6 +368,35 @@ efault:
return false;
}
+/*
+ * Update user space with new IDs and conditionally check whether the task
+ * is in a critical section.
+ */
+static rseq_inline bool rseq_update_usr(struct task_struct *t, struct pt_regs *regs,
+ struct rseq_ids *ids, u32 node_id)
+{
+ u64 csaddr;
+
+ if (!rseq_set_ids_get_csaddr(t, ids, node_id, &csaddr))
+ return false;
+
+ /*
+ * On architectures which utilize the generic entry code this
+ * allows to skip the critical section when the entry was not from
+ * a user space interrupt, unless debug mode is enabled.
+ */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ if (!static_branch_unlikely(&rseq_debug_enabled)) {
+ if (likely(!t->rseq.event.user_irq))
+ return true;
+ }
+ }
+ if (likely(!csaddr))
+ return true;
+ /* Sigh, this really needs to do work */
+ return rseq_update_user_cs(t, regs, csaddr);
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 13faadc..148fb21 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -82,12 +82,6 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -239,38 +233,15 @@ efault:
return false;
}
-/*
- * This resume handler must always be executed between any of:
- * - preemption,
- * - signal delivery,
- * and return to user-space.
- *
- * This is how we can ensure that the entire rseq critical section
- * will issue the commit instruction only if executed atomically with
- * respect to other threads scheduled on the same CPU, and with respect
- * to signal handlers.
- */
-void __rseq_handle_notify_resume(struct pt_regs *regs)
+static void rseq_slowpath_update_usr(struct pt_regs *regs)
{
+ /* Preserve rseq state and user_irq state for exit to user */
+ const struct rseq_event evt_mask = { .has_rseq = true, .user_irq = true, };
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- /*
- * If invoked from hypervisors before entering the guest via
- * resume_user_mode_work(), then @regs is a NULL pointer.
- *
- * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
- * it before returning from the ioctl() to user space when
- * rseq_event.sched_switch is set.
- *
- * So it's safe to ignore here instead of pointlessly updating it
- * in the vcpu_run() loop.
- */
- if (!regs)
- return;
-
if (unlikely(t->flags & PF_EXITING))
return;
@@ -294,26 +265,45 @@ void __rseq_handle_notify_resume(struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- scoped_guard(RSEQ_EVENT_GUARD) {
+ scoped_guard(irq) {
event = t->rseq.event.sched_switch;
- t->rseq.event.sched_switch = false;
+ t->rseq.event.all &= evt_mask.all;
ids.cpu_id = task_cpu(t);
ids.mm_cid = task_mm_cid(t);
}
- if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ if (!event)
return;
- if (!rseq_handle_cs(t, regs))
- goto error;
-
node_id = cpu_to_node(ids.cpu_id);
- if (!rseq_set_ids(t, &ids, node_id))
- goto error;
- return;
-error:
- force_sig(SIGSEGV);
+ if (unlikely(!rseq_update_usr(t, regs, &ids, node_id))) {
+ /*
+ * Clear the errors just in case this might survive magically, but
+ * leave the rest intact.
+ */
+ t->rseq.event.error = 0;
+ force_sig(SIGSEGV);
+ }
+}
+
+void __rseq_handle_notify_resume(struct pt_regs *regs)
+{
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
+
+ rseq_slowpath_update_usr(regs);
}
void __rseq_signal_deliver(int sig, struct pt_regs *regs)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Separate the signal delivery path
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 9f6ffd4cebda86841700775de3213f22bb0ea22d
Gitweb: https://git.kernel.org/tip/9f6ffd4cebda86841700775de3213f22bb0ea22d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:10 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:47 +01:00
rseq: Separate the signal delivery path
Completely separate the signal delivery path from the notify handler as
they have different semantics versus the event handling.
The signal delivery only needs to ensure that the interrupted user context
was not in a critical section or the section is aborted before it switches
to the signal frame context. The signal frame context does not have the
original instruction pointer anymore, so that can't be handled on exit to
user space.
No point in updating the CPU/CID ids as they might change again before the
task returns to user space for real.
The fast path optimization, which checks for the 'entry from user via
interrupt' condition is only available for architectures which use the
generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.455429038@linutronix.de
---
include/linux/rseq.h | 21 ++++++++++++++++-----
kernel/rseq.c | 30 ++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 92f9cd4..f5a4318 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -7,22 +7,33 @@
#include <uapi/linux/rseq.h>
-void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
+void __rseq_handle_notify_resume(struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq.event.has_rseq)
- __rseq_handle_notify_resume(NULL, regs);
+ __rseq_handle_notify_resume(regs);
}
+void __rseq_signal_deliver(int sig, struct pt_regs *regs);
+
+/*
+ * Invoked from signal delivery to fixup based on the register context before
+ * switching to the signal delivery context.
+ */
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq.event.has_rseq) {
- current->rseq.event.sched_switch = true;
- __rseq_handle_notify_resume(ksig, regs);
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* '&' is intentional to spare one conditional branch */
+ if (current->rseq.event.has_rseq & current->rseq.event.user_irq)
+ __rseq_signal_deliver(ksig->sig, regs);
+ } else {
+ if (current->rseq.event.has_rseq)
+ __rseq_signal_deliver(ksig->sig, regs);
}
}
+/* Raised from context switch and exevce to force evaluation on exit to user */
static inline void rseq_sched_switch_event(struct task_struct *t)
{
if (t->rseq.event.has_rseq) {
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 1e4f1d2..13faadc 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -250,13 +250,12 @@ efault:
* respect to other threads scheduled on the same CPU, and with respect
* to signal handlers.
*/
-void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
+void __rseq_handle_notify_resume(struct pt_regs *regs)
{
struct task_struct *t = current;
struct rseq_ids ids;
u32 node_id;
bool event;
- int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -275,10 +274,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
- if (ksig)
- rseq_stat_inc(rseq_stats.signal);
- else
- rseq_stat_inc(rseq_stats.slowpath);
+ rseq_stat_inc(rseq_stats.slowpath);
/*
* Read and clear the event pending bit first. If the task
@@ -317,8 +313,26 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
error:
- sig = ksig ? ksig->sig : 0;
- force_sigsegv(sig);
+ force_sig(SIGSEGV);
+}
+
+void __rseq_signal_deliver(int sig, struct pt_regs *regs)
+{
+ rseq_stat_inc(rseq_stats.signal);
+ /*
+ * Don't update IDs, they are handled on exit to user if
+ * necessary. The important thing is to abort a critical section of
+ * the interrupted context as after this point the instruction
+ * pointer in @regs points to the signal handler.
+ */
+ if (unlikely(!rseq_handle_cs(current, regs))) {
+ /*
+ * Clear the errors just in case this might survive
+ * magically, but leave the rest intact.
+ */
+ current->rseq.event.error = 0;
+ force_sigsegv(sig);
+ }
}
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_set_ids()
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 0f085b41880e3140efa6941ff2b8fd43bac4d659
Gitweb: https://git.kernel.org/tip/0f085b41880e3140efa6941ff2b8fd43bac4d659
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:08 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:33 +01:00
rseq: Provide and use rseq_set_ids()
Provide a new and straight forward implementation to set the IDs (CPU ID,
Node ID and MM CID), which can be later inlined into the fast path.
It does all operations in one scoped_user_rw_access() section and retrieves
also the critical section member (rseq::cs_rseq) from user space to avoid
another user..begin/end() pair. This is in preparation for optimizing the
fast path to avoid extra work when not required.
On rseq registration set the CPU ID fields to RSEQ_CPU_ID_UNINITIALIZED and
node and MM CID to zero. That's the same as the kernel internal reset
values. That makes the debug validation in the exit code work correctly on
the first exit to user space.
Use it to replace the whole related zoo in rseq.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.393972266@linutronix.de
---
fs/binfmt_elf.c | 2 +-
include/linux/rseq.h | 16 +-
include/linux/rseq_entry.h | 89 ++++++++++++++-
include/linux/sched.h | 10 +--
kernel/rseq.c | 236 +++++++-----------------------------
5 files changed, 151 insertions(+), 202 deletions(-)
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e4653bb..3eb734c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -46,7 +46,7 @@
#include <linux/cred.h>
#include <linux/dax.h>
#include <linux/uaccess.h>
-#include <linux/rseq.h>
+#include <uapi/linux/rseq.h>
#include <asm/param.h>
#include <asm/page.h>
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 7f347c3..92f9cd4 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -5,6 +5,8 @@
#ifdef CONFIG_RSEQ
#include <linux/sched.h>
+#include <uapi/linux/rseq.h>
+
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -48,7 +50,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
- t->rseq.ids.cpu_cid = ~0ULL;
+ t->rseq.ids.cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
}
static inline void rseq_execve(struct task_struct *t)
@@ -59,15 +61,19 @@ static inline void rseq_execve(struct task_struct *t)
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
+ *
+ * On fork, keep the IDs (CPU, MMCID) of the parent, which avoids a fault
+ * on the COW page on exit to user space, when the child stays on the same
+ * CPU as the parent. That's obviously not guaranteed, but in overcommit
+ * scenarios it is more likely and optimizes for the fork/exec case without
+ * taking the fault.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
+ if (clone_flags & CLONE_VM)
rseq_reset(t);
- } else {
+ else
t->rseq = current->rseq;
- t->rseq.ids.cpu_cid = ~0ULL;
- }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index fb53a6f..37444e8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -75,6 +75,7 @@ DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
#endif
bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+bool rseq_debug_validate_ids(struct task_struct *t);
static __always_inline void rseq_note_user_irq_entry(void)
{
@@ -194,6 +195,43 @@ efault:
return false;
}
+/*
+ * On debug kernels validate that user space did not mess with it if the
+ * debug branch is enabled.
+ */
+bool rseq_debug_validate_ids(struct task_struct *t)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+ u32 cpu_id, uval, node_id;
+
+ /*
+ * On the first exit after registering the rseq region CPU ID is
+ * RSEQ_CPU_ID_UNINITIALIZED and node_id in user space is 0!
+ */
+ node_id = t->rseq.ids.cpu_id != RSEQ_CPU_ID_UNINITIALIZED ?
+ cpu_to_node(t->rseq.ids.cpu_id) : 0;
+
+ scoped_user_read_access(rseq, efault) {
+ unsafe_get_user(cpu_id, &rseq->cpu_id_start, efault);
+ if (cpu_id != t->rseq.ids.cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->cpu_id, efault);
+ if (uval != cpu_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->node_id, efault);
+ if (uval != node_id)
+ goto die;
+ unsafe_get_user(uval, &rseq->mm_cid, efault);
+ if (uval != t->rseq.ids.mm_cid)
+ goto die;
+ }
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
#endif /* RSEQ_BUILD_SLOW_PATH */
/*
@@ -279,6 +317,57 @@ efault:
return false;
}
+/*
+ * Updates CPU ID, Node ID and MM CID and reads the critical section
+ * address, when @csaddr != NULL. This allows to put the ID update and the
+ * read under the same uaccess region to spare a separate begin/end.
+ *
+ * As this is either invoked from a C wrapper with @csaddr = NULL or from
+ * the fast path code with a valid pointer, a clever compiler should be
+ * able to optimize the read out. Spares a duplicate implementation.
+ *
+ * Returns true, if the operation was successful, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when invalid data
+ * was found on debug kernels. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+static rseq_inline
+bool rseq_set_ids_get_csaddr(struct task_struct *t, struct rseq_ids *ids,
+ u32 node_id, u64 *csaddr)
+{
+ struct rseq __user *rseq = t->rseq.usrptr;
+
+ if (static_branch_unlikely(&rseq_debug_enabled)) {
+ if (!rseq_debug_validate_ids(t))
+ return false;
+ }
+
+ scoped_user_rw_access(rseq, efault) {
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id_start, efault);
+ unsafe_put_user(ids->cpu_id, &rseq->cpu_id, efault);
+ unsafe_put_user(node_id, &rseq->node_id, efault);
+ unsafe_put_user(ids->mm_cid, &rseq->mm_cid, efault);
+ if (csaddr)
+ unsafe_get_user(*csaddr, &rseq->rseq_cs, efault);
+ }
+
+ /* Cache the new values */
+ t->rseq.ids.cpu_cid = ids->cpu_cid;
+ rseq_stat_inc(rseq_stats.ids);
+ rseq_trace_update(t, ids);
+ return true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 24a9da7..e47abc8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -42,7 +42,6 @@
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
#include <linux/rseq_types.h>
-#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
#include <linux/rv.h>
@@ -1408,15 +1407,6 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
struct rseq_data rseq;
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * This is a place holder to save a copy of the rseq fields for
- * validation of read-only fields. The struct rseq has a
- * variable-length array at the end, so it cannot be used
- * directly. Reserve a size large enough for the known fields.
- */
- char rseq_fields[sizeof(struct rseq)];
-#endif
#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9763155..1e4f1d2 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -88,13 +88,6 @@
# define RSEQ_EVENT_GUARD preempt
#endif
-/* The original rseq structure size (including padding) is 32 bytes. */
-#define ORIG_RSEQ_SIZE 32
-
-#define RSEQ_CS_NO_RESTART_FLAGS (RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT | \
- RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
- RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
-
DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
static inline void rseq_control_debug(bool on)
@@ -227,159 +220,9 @@ static int __init rseq_debugfs_init(void)
__initcall(rseq_debugfs_init);
#endif /* CONFIG_DEBUG_FS */
-#ifdef CONFIG_DEBUG_RSEQ
-static struct rseq *rseq_kernel_fields(struct task_struct *t)
-{
- return (struct rseq *) t->rseq_fields;
-}
-
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- static DEFINE_RATELIMIT_STATE(_rs,
- DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
- u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq.usrptr;
-
- /*
- * Validate fields which are required to be read-only by
- * user-space.
- */
- if (!user_read_access_begin(rseq, t->rseq.len))
- goto efault;
- unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
- unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
- unsafe_get_user(node_id, &rseq->node_id, efault_end);
- unsafe_get_user(mm_cid, &rseq->mm_cid, efault_end);
- user_read_access_end();
-
- if ((cpu_id_start != rseq_kernel_fields(t)->cpu_id_start ||
- cpu_id != rseq_kernel_fields(t)->cpu_id ||
- node_id != rseq_kernel_fields(t)->node_id ||
- mm_cid != rseq_kernel_fields(t)->mm_cid) && __ratelimit(&_rs)) {
-
- pr_warn("Detected rseq corruption for pid: %d, name: %s\n"
- "\tcpu_id_start: %u ?= %u\n"
- "\tcpu_id: %u ?= %u\n"
- "\tnode_id: %u ?= %u\n"
- "\tmm_cid: %u ?= %u\n",
- t->pid, t->comm,
- cpu_id_start, rseq_kernel_fields(t)->cpu_id_start,
- cpu_id, rseq_kernel_fields(t)->cpu_id,
- node_id, rseq_kernel_fields(t)->node_id,
- mm_cid, rseq_kernel_fields(t)->mm_cid);
- }
-
- /* For now, only print a console warning on mismatch. */
- return 0;
-
-efault_end:
- user_read_access_end();
-efault:
- return -EFAULT;
-}
-
-/*
- * Update an rseq field and its in-kernel copy in lock-step to keep a coherent
- * state.
- */
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
- } while (0)
-
-#else
-static int rseq_validate_ro_fields(struct task_struct *t)
-{
- return 0;
-}
-
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
-#endif
-
-static int rseq_update_cpu_node_id(struct task_struct *t)
-{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id = raw_smp_processor_id();
- u32 node_id = cpu_to_node(cpu_id);
- u32 mm_cid = task_mm_cid(t);
-
- rseq_stat_inc(rseq_stats.ids);
-
- /* Validate read-only rseq fields on debug kernels */
- if (rseq_validate_ro_fields(t))
- goto efault;
- WARN_ON_ONCE((int) mm_cid < 0);
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /* Cache the user space values */
- t->rseq.ids.cpu_id = cpu_id;
- t->rseq.ids.mm_cid = mm_cid;
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally updated only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- trace_rseq_update(t);
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
-}
-
-static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
+static bool rseq_set_ids(struct task_struct *t, struct rseq_ids *ids, u32 node_id)
{
- struct rseq __user *rseq = t->rseq.usrptr;
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
- mm_cid = 0;
-
- /*
- * Validate read-only rseq fields.
- */
- if (rseq_validate_ro_fields(t))
- goto efault;
-
- if (!user_write_access_begin(rseq, t->rseq.len))
- goto efault;
-
- /*
- * Reset all fields to their initial state.
- *
- * All fields have an initial state of 0 except cpu_id which is set to
- * RSEQ_CPU_ID_UNINITIALIZED, so that any user coming in after
- * unregistration can figure out that rseq needs to be registered
- * again.
- */
- rseq_unsafe_put_user(t, cpu_id_start, cpu_id_start, efault_end);
- rseq_unsafe_put_user(t, cpu_id, cpu_id, efault_end);
- rseq_unsafe_put_user(t, node_id, node_id, efault_end);
- rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
-
- /*
- * Additional feature fields added after ORIG_RSEQ_SIZE
- * need to be conditionally reset only if
- * t->rseq_len != ORIG_RSEQ_SIZE.
- */
- user_write_access_end();
- return 0;
-
-efault_end:
- user_write_access_end();
-efault:
- return -EFAULT;
+ return rseq_set_ids_get_csaddr(t, ids, node_id, NULL);
}
static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
@@ -410,6 +253,8 @@ efault:
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
+ struct rseq_ids ids;
+ u32 node_id;
bool event;
int sig;
@@ -456,6 +301,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
scoped_guard(RSEQ_EVENT_GUARD) {
event = t->rseq.event.sched_switch;
t->rseq.event.sched_switch = false;
+ ids.cpu_id = task_cpu(t);
+ ids.mm_cid = task_mm_cid(t);
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -464,7 +311,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!rseq_handle_cs(t, regs))
goto error;
- if (unlikely(rseq_update_cpu_node_id(t)))
+ node_id = cpu_to_node(ids.cpu_id);
+ if (!rseq_set_ids(t, &ids, node_id))
goto error;
return;
@@ -504,13 +352,33 @@ void rseq_syscall(struct pt_regs *regs)
}
#endif
+static bool rseq_reset_ids(void)
+{
+ struct rseq_ids ids = {
+ .cpu_id = RSEQ_CPU_ID_UNINITIALIZED,
+ .mm_cid = 0,
+ };
+
+ /*
+ * If this fails, terminate it because this leaves the kernel in
+ * stupid state as exit to user space will try to fixup the ids
+ * again.
+ */
+ if (rseq_set_ids(current, &ids, 0))
+ return true;
+
+ force_sig(SIGSEGV);
+ return false;
+}
+
+/* The original rseq structure size (including padding) is 32 bytes. */
+#define ORIG_RSEQ_SIZE 32
+
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
- int ret;
-
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
@@ -521,9 +389,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
return -EINVAL;
if (current->rseq.sig != sig)
return -EPERM;
- ret = rseq_reset_rseq_cpu_node_id(current);
- if (ret)
- return ret;
+ if (!rseq_reset_ids())
+ return -EFAULT;
rseq_reset(current);
return 0;
}
@@ -563,27 +430,22 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (!access_ok(rseq, rseq_len))
return -EFAULT;
- /*
- * If the rseq_cs pointer is non-NULL on registration, clear it to
- * avoid a potential segfault on return to user-space. The proper thing
- * to do would have been to fail the registration but this would break
- * older libcs that reuse the rseq area for new threads without
- * clearing the fields. Don't bother reading it, just reset it.
- */
- if (put_user(0UL, &rseq->rseq_cs))
- return -EFAULT;
+ scoped_user_write_access(rseq, efault) {
+ /*
+ * If the rseq_cs pointer is non-NULL on registration, clear it to
+ * avoid a potential segfault on return to user-space. The proper thing
+ * to do would have been to fail the registration but this would break
+ * older libcs that reuse the rseq area for new threads without
+ * clearing the fields. Don't bother reading it, just reset it.
+ */
+ unsafe_put_user(0UL, &rseq->rseq_cs, efault);
+ /* Initialize IDs in user space */
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id_start, efault);
+ unsafe_put_user(RSEQ_CPU_ID_UNINITIALIZED, &rseq->cpu_id, efault);
+ unsafe_put_user(0U, &rseq->node_id, efault);
+ unsafe_put_user(0U, &rseq->mm_cid, efault);
+ }
-#ifdef CONFIG_DEBUG_RSEQ
- /*
- * Initialize the in-kernel rseq fields copy for validation of
- * read-only fields.
- */
- if (get_user(rseq_kernel_fields(current)->cpu_id_start, &rseq->cpu_id_start) ||
- get_user(rseq_kernel_fields(current)->cpu_id, &rseq->cpu_id) ||
- get_user(rseq_kernel_fields(current)->node_id, &rseq->node_id) ||
- get_user(rseq_kernel_fields(current)->mm_cid, &rseq->mm_cid))
- return -EFAULT;
-#endif
/*
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
@@ -599,6 +461,8 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
*/
current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
-
return 0;
+
+efault:
+ return -EFAULT;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: eaa9088d568c84afd72fa32dbe01833aef861d0d
Gitweb: https://git.kernel.org/tip/eaa9088d568c84afd72fa32dbe01833aef861d0d
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:05 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:27 +01:00
rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y
Make the syscall exit debug mechanism available via the static branch on
architectures which utilize the generic entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.333440475@linutronix.de
---
include/linux/entry-common.h | 2 +-
include/linux/rseq_entry.h | 9 +++++++++
kernel/rseq.c | 10 ++++++++--
3 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 75b194c..d967184 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -146,7 +146,7 @@ static __always_inline void syscall_exit_to_user_mode_work(struct pt_regs *regs)
local_irq_enable();
}
- rseq_syscall(regs);
+ rseq_debug_syscall_return(regs);
/*
* Do one-time syscall specific work. If these work items are
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5bdcf5b..fb53a6f 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -296,9 +296,18 @@ static __always_inline void rseq_exit_to_user_mode(void)
ev->events = 0;
}
+void __rseq_debug_syscall_return(struct pt_regs *regs);
+
+static inline void rseq_debug_syscall_return(struct pt_regs *regs)
+{
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ __rseq_debug_syscall_return(regs);
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
static inline void rseq_exit_to_user_mode(void) { }
+static inline void rseq_debug_syscall_return(struct pt_regs *regs) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index abd6bfa..9763155 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -473,12 +473,11 @@ error:
force_sigsegv(sig);
}
-#ifdef CONFIG_DEBUG_RSEQ
/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
-void rseq_syscall(struct pt_regs *regs)
+void __rseq_debug_syscall_return(struct pt_regs *regs)
{
struct task_struct *t = current;
u64 csaddr;
@@ -496,6 +495,13 @@ void rseq_syscall(struct pt_regs *regs)
fail:
force_sig(SIGSEGV);
}
+
+#ifdef CONFIG_DEBUG_RSEQ
+/* Kept around to keep GENERIC_ENTRY=n architectures supported. */
+void rseq_syscall(struct pt_regs *regs)
+{
+ __rseq_debug_syscall_return(regs);
+}
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Make exit debugging static branch based
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: c1cbad8f99b5c73c6af6e96acbfa64eaaaeb085f
Gitweb: https://git.kernel.org/tip/c1cbad8f99b5c73c6af6e96acbfa64eaaaeb085f
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:02 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:20 +01:00
rseq: Make exit debugging static branch based
Disconnect it from the config switch and use the static debug branch. This
is a temporary measure for validating the rework. At the end this check
needs to be hidden behind lockdep as it has nothing to do with the other
debug infrastructure, which mainly aids user space debugging by enabling a
zoo of checks which terminate misbehaving tasks instead of letting them
keep the hard to diagnose pieces.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.272660745@linutronix.de
---
include/linux/rseq_entry.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index f9510ce..5bdcf5b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -285,7 +285,7 @@ static __always_inline void rseq_exit_to_user_mode(void)
rseq_stat_inc(rseq_stats.exit);
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ if (static_branch_unlikely(&rseq_debug_enabled))
WARN_ON_ONCE(ev->sched_switch);
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Replace the original debug implementation
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: f7ee1964ac397bee5c6d1c017557c0eec8856145
Gitweb: https://git.kernel.org/tip/f7ee1964ac397bee5c6d1c017557c0eec8856145
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:45:00 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:33:12 +01:00
rseq: Replace the original debug implementation
Just utilize the new infrastructure and put the original one to rest.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.212510692@linutronix.de
---
kernel/rseq.c | 81 +++++++-------------------------------------------
1 file changed, 12 insertions(+), 69 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 12a9b6a..abd6bfa 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -475,84 +475,27 @@ error:
#ifdef CONFIG_DEBUG_RSEQ
/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq __user *urseq = t->rseq.usrptr;
- struct rseq_cs __user *urseq_cs;
- u32 __user *usig;
- u64 ptr;
- u32 sig;
- int ret;
-
- if (get_user(ptr, &rseq->rseq_cs))
- return -EFAULT;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-/*
* Terminate the process if a syscall is issued within a restartable
* sequence.
*/
void rseq_syscall(struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
- struct rseq_cs rseq_cs;
+ u64 csaddr;
- if (!t->rseq.usrptr)
+ if (!t->rseq.event.has_rseq)
return;
- if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
- force_sig(SIGSEGV);
+ if (get_user(csaddr, &t->rseq.usrptr->rseq_cs))
+ goto fail;
+ if (likely(!csaddr))
+ return;
+ if (unlikely(csaddr >= TASK_SIZE))
+ goto fail;
+ if (rseq_debug_update_user_cs(t, regs, csaddr))
+ return;
+fail:
+ force_sig(SIGSEGV);
}
-
#endif
/*
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide and use rseq_update_user_cs()
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
` (3 preceding siblings ...)
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
4 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: abc850e7616c91ebaa3f5ba3617ab0a104d45039
Gitweb: https://git.kernel.org/tip/abc850e7616c91ebaa3f5ba3617ab0a104d45039
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:57 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:57 +01:00
rseq: Provide and use rseq_update_user_cs()
Provide a straight forward implementation to check for and eventually
clear/fixup critical sections in user space.
The non-debug version does only the minimal sanity checks and aims for
efficiency.
There are two attack vectors, which are checked for:
1) An abort IP which is in the kernel address space. That would cause at
least x86 to return to kernel space via IRET.
2) A rogue critical section descriptor with an abort IP pointing to some
arbitrary address, which is not preceded by the RSEQ signature.
If the section descriptors are invalid then the resulting misbehaviour of
the user space application is not the kernels problem.
The kernel provides a run-time switchable debug slow path, which implements
the full zoo of checks including termination of the task when one of the
gazillion conditions is not met.
Replace the zoo in rseq.c with it and invoke it from the TIF_NOTIFY_RESUME
handler. Move the remainders into the CONFIG_DEBUG_RSEQ section, which will
be replaced and removed in a subsequent step.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.151465632@linutronix.de
---
include/linux/rseq_entry.h | 206 ++++++++++++++++++++++++++++++-
include/linux/rseq_types.h | 11 +-
kernel/rseq.c | 246 ++++++++++--------------------------
3 files changed, 291 insertions(+), 172 deletions(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ed8e5f8..f9510ce 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -36,6 +36,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#ifdef CONFIG_RSEQ
#include <linux/jump_label.h>
#include <linux/rseq.h>
+#include <linux/uaccess.h>
#include <linux/tracepoint-defs.h>
@@ -67,12 +68,217 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_inline
+#else
+#define rseq_inline __always_inline
+#endif
+
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
current->rseq.event.user_irq = true;
}
+/*
+ * Check whether there is a valid critical section and whether the
+ * instruction pointer in @regs is inside the critical section.
+ *
+ * - If the critical section is invalid, terminate the task.
+ *
+ * - If valid and the instruction pointer is inside, set it to the abort IP.
+ *
+ * - If valid and the instruction pointer is outside, clear the critical
+ * section address.
+ *
+ * Returns true, if the section was valid and either fixup or clear was
+ * done, false otherwise.
+ *
+ * In the failure case task::rseq_event::fatal is set when a invalid
+ * section was found. It's clear when the failure was an unresolved page
+ * fault.
+ *
+ * If inlined into the exit to user path with interrupts disabled, the
+ * caller has to protect against page faults with pagefault_disable().
+ *
+ * In preemptible task context this would be counterproductive as the page
+ * faults could not be fully resolved. As a consequence unresolved page
+ * faults in task context are fatal too.
+ */
+
+#ifdef RSEQ_BUILD_SLOW_PATH
+/*
+ * The debug version is put out of line, but kept here so the code stays
+ * together.
+ *
+ * @csaddr has already been checked by the caller to be in user space
+ */
+bool rseq_debug_update_user_cs(struct task_struct *t, struct pt_regs *regs,
+ unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ u64 start_ip, abort_ip, offset, cs_end, head, tasksize = TASK_SIZE;
+ unsigned long ip = instruction_pointer(regs);
+ u64 __user *uc_head = (u64 __user *) ucs;
+ u32 usig, __user *uc_sig;
+
+ scoped_user_rw_access(ucs, efault) {
+ /*
+ * Evaluate the user pile and exit if one of the conditions
+ * is not fulfilled.
+ */
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ if (unlikely(start_ip >= tasksize))
+ goto die;
+ /* If outside, just clear the critical section. */
+ if (ip < start_ip)
+ goto clear;
+
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ cs_end = start_ip + offset;
+ /* Check for overflow and wraparound */
+ if (unlikely(cs_end >= tasksize || cs_end < start_ip))
+ goto die;
+
+ /* If not inside, clear it. */
+ if (ip >= cs_end)
+ goto clear;
+
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+ /* Ensure it's "valid" */
+ if (unlikely(abort_ip >= tasksize || abort_ip < sizeof(*uc_sig)))
+ goto die;
+ /* Validate that the abort IP is not in the critical section */
+ if (unlikely(abort_ip - start_ip < offset))
+ goto die;
+
+ /*
+ * Check version and flags for 0. No point in emitting
+ * deprecated warnings before dying. That could be done in
+ * the slow path eventually, but *shrug*.
+ */
+ unsafe_get_user(head, uc_head, efault);
+ if (unlikely(head))
+ goto die;
+
+ /* abort_ip - 4 is >= 0. See abort_ip check above */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* rseq_event.user_irq is only valid if CONFIG_GENERIC_IRQ_ENTRY=y */
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
+ /* If not in interrupt from user context, let it die */
+ if (unlikely(!t->rseq.event.user_irq))
+ goto die;
+ }
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
+#endif /* RSEQ_BUILD_SLOW_PATH */
+
+/*
+ * This only ensures that abort_ip is in the user address space and
+ * validates that it is preceded by the signature.
+ *
+ * No other sanity checks are done here, that's what the debug code is for.
+ */
+static rseq_inline bool
+rseq_update_user_cs(struct task_struct *t, struct pt_regs *regs, unsigned long csaddr)
+{
+ struct rseq_cs __user *ucs = (struct rseq_cs __user *)(unsigned long)csaddr;
+ unsigned long ip = instruction_pointer(regs);
+ u64 start_ip, abort_ip, offset;
+ u32 usig, __user *uc_sig;
+
+ rseq_stat_inc(rseq_stats.cs);
+
+ if (unlikely(csaddr >= TASK_SIZE)) {
+ t->rseq.event.fatal = true;
+ return false;
+ }
+
+ if (static_branch_unlikely(&rseq_debug_enabled))
+ return rseq_debug_update_user_cs(t, regs, csaddr);
+
+ scoped_user_rw_access(ucs, efault) {
+ unsafe_get_user(start_ip, &ucs->start_ip, efault);
+ unsafe_get_user(offset, &ucs->post_commit_offset, efault);
+ unsafe_get_user(abort_ip, &ucs->abort_ip, efault);
+
+ /*
+ * No sanity checks. If user space screwed it up, it can
+ * keep the pieces. That's what debug code is for.
+ *
+ * If outside, just clear the critical section.
+ */
+ if (ip - start_ip >= offset)
+ goto clear;
+
+ /*
+ * Two requirements for @abort_ip:
+ * - Must be in user space as x86 IRET would happily return to
+ * the kernel.
+ * - The four bytes preceding the instruction at @abort_ip must
+ * contain the signature.
+ *
+ * The latter protects against the following attack vector:
+ *
+ * An attacker with limited abilities to write, creates a critical
+ * section descriptor, sets the abort IP to a library function or
+ * some other ROP gadget and stores the address of the descriptor
+ * in TLS::rseq::rseq_cs. An RSEQ abort would then evade ROP
+ * protection.
+ */
+ if (abort_ip >= TASK_SIZE || abort_ip < sizeof(*uc_sig))
+ goto die;
+
+ /* The address is guaranteed to be >= 0 and < TASK_SIZE */
+ uc_sig = (u32 __user *)(unsigned long)(abort_ip - sizeof(*uc_sig));
+ unsafe_get_user(usig, uc_sig, efault);
+ if (unlikely(usig != t->rseq.sig))
+ goto die;
+
+ /* Invalidate the critical section */
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ /* Update the instruction pointer */
+ instruction_pointer_set(regs, (unsigned long)abort_ip);
+ rseq_stat_inc(rseq_stats.fixup);
+ break;
+ clear:
+ unsafe_put_user(0ULL, &t->rseq.usrptr->rseq_cs, efault);
+ rseq_stat_inc(rseq_stats.clear);
+ abort_ip = 0ULL;
+ }
+
+ if (unlikely(abort_ip))
+ rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+ return true;
+die:
+ t->rseq.event.fatal = true;
+efault:
+ return false;
+}
+
static __always_inline void rseq_exit_to_user_mode(void)
{
struct rseq_event *ev = ¤t->rseq.event;
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 80f6c39..7c12394 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -14,10 +14,12 @@ struct rseq;
* @sched_switch: True if the task was scheduled out
* @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
+ * @error: Compound error code for the slow path to analyze
+ * @fatal: User space data corrupted or invalid
*/
struct rseq_event {
union {
- u32 all;
+ u64 all;
struct {
union {
u16 events;
@@ -28,6 +30,13 @@ struct rseq_event {
};
u8 has_rseq;
+ u8 __pad;
+ union {
+ u16 error;
+ struct {
+ u8 fatal;
+ };
+ };
};
};
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 679ab8e..12a9b6a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -382,175 +382,18 @@ efault:
return -EFAULT;
}
-/*
- * Get the user-space pointer value stored in the 'rseq_cs' field.
- */
-static int rseq_get_rseq_cs_ptr_val(struct rseq __user *rseq, u64 *rseq_cs)
-{
- if (!rseq_cs)
- return -EFAULT;
-
-#ifdef CONFIG_64BIT
- if (get_user(*rseq_cs, &rseq->rseq_cs))
- return -EFAULT;
-#else
- if (copy_from_user(rseq_cs, &rseq->rseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-#endif
-
- return 0;
-}
-
-/*
- * If the rseq_cs field of 'struct rseq' contains a valid pointer to
- * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
- */
-static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
-{
- struct rseq_cs __user *urseq_cs;
- u64 ptr;
- u32 __user *usig;
- u32 sig;
- int ret;
-
- ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
- if (ret)
- return ret;
-
- /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
- if (!ptr) {
- memset(rseq_cs, 0, sizeof(*rseq_cs));
- return 0;
- }
- /* Check that the pointer value fits in the user-space process space. */
- if (ptr >= TASK_SIZE)
- return -EINVAL;
- urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
- if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
- return -EFAULT;
-
- if (rseq_cs->start_ip >= TASK_SIZE ||
- rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
- rseq_cs->abort_ip >= TASK_SIZE ||
- rseq_cs->version > 0)
- return -EINVAL;
- /* Check for overflow. */
- if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
- return -EINVAL;
- /* Ensure that abort_ip is not in the critical section. */
- if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
- return -EINVAL;
-
- usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
- ret = get_user(sig, usig);
- if (ret)
- return ret;
-
- if (current->rseq.sig != sig) {
- printk_ratelimited(KERN_WARNING
- "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq.sig, current->pid, usig);
- return -EINVAL;
- }
- return 0;
-}
-
-static bool rseq_warn_flags(const char *str, u32 flags)
-{
- u32 test_flags;
-
- if (!flags)
- return false;
- test_flags = flags & RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Deprecated flags (%u) in %s ABI structure", test_flags, str);
- test_flags = flags & ~RSEQ_CS_NO_RESTART_FLAGS;
- if (test_flags)
- pr_warn_once("Unknown flags (%u) in %s ABI structure", test_flags, str);
- return true;
-}
-
-static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
-{
- u32 flags;
- int ret;
-
- if (rseq_warn_flags("rseq_cs", cs_flags))
- return -EINVAL;
-
- /* Get thread flags. */
- ret = get_user(flags, &t->rseq.usrptr->flags);
- if (ret)
- return ret;
-
- if (rseq_warn_flags("rseq", flags))
- return -EINVAL;
- return 0;
-}
-
-static int clear_rseq_cs(struct rseq __user *rseq)
-{
- /*
- * The rseq_cs field is set to NULL on preemption or signal
- * delivery on top of rseq assembly block, as well as on top
- * of code outside of the rseq assembly block. This performs
- * a lazy clear of the rseq_cs field.
- *
- * Set rseq_cs to NULL.
- */
-#ifdef CONFIG_64BIT
- return put_user(0UL, &rseq->rseq_cs);
-#else
- if (clear_user(&rseq->rseq_cs, sizeof(rseq->rseq_cs)))
- return -EFAULT;
- return 0;
-#endif
-}
-
-/*
- * Unsigned comparison will be true when ip >= start_ip, and when
- * ip < start_ip + post_commit_offset.
- */
-static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
-{
- return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
-}
-
-static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
+static bool rseq_handle_cs(struct task_struct *t, struct pt_regs *regs)
{
- unsigned long ip = instruction_pointer(regs);
- struct task_struct *t = current;
- struct rseq_cs rseq_cs;
- int ret;
-
- rseq_stat_inc(rseq_stats.cs);
-
- ret = rseq_get_rseq_cs(t, &rseq_cs);
- if (ret)
- return ret;
-
- /*
- * Handle potentially not being within a critical section.
- * If not nested over a rseq critical section, restart is useless.
- * Clear the rseq_cs pointer and return.
- */
- if (!in_rseq_cs(ip, &rseq_cs)) {
- rseq_stat_inc(rseq_stats.clear);
- return clear_rseq_cs(t->rseq.usrptr);
- }
- ret = rseq_check_flags(t, rseq_cs.flags);
- if (ret < 0)
- return ret;
- if (!abort)
- return 0;
- ret = clear_rseq_cs(t->rseq.usrptr);
- if (ret)
- return ret;
- rseq_stat_inc(rseq_stats.fixup);
- trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
- rseq_cs.abort_ip);
- instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
- return 0;
+ struct rseq __user *urseq = t->rseq.usrptr;
+ u64 csaddr;
+
+ scoped_user_read_access(urseq, efault)
+ unsafe_get_user(csaddr, &urseq->rseq_cs, efault);
+ if (likely(!csaddr))
+ return true;
+ return rseq_update_user_cs(t, regs, csaddr);
+efault:
+ return false;
}
/*
@@ -567,8 +410,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
- int ret, sig;
bool event;
+ int sig;
/*
* If invoked from hypervisors before entering the guest via
@@ -618,8 +461,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
return;
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
+ if (!rseq_handle_cs(t, regs))
goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
@@ -632,6 +474,68 @@ error:
}
#ifdef CONFIG_DEBUG_RSEQ
+/*
+ * Unsigned comparison will be true when ip >= start_ip, and when
+ * ip < start_ip + post_commit_offset.
+ */
+static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
+{
+ return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
+}
+
+/*
+ * If the rseq_cs field of 'struct rseq' contains a valid pointer to
+ * user-space, copy 'struct rseq_cs' from user-space and validate its fields.
+ */
+static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
+{
+ struct rseq __user *urseq = t->rseq.usrptr;
+ struct rseq_cs __user *urseq_cs;
+ u32 __user *usig;
+ u64 ptr;
+ u32 sig;
+ int ret;
+
+ if (get_user(ptr, &rseq->rseq_cs))
+ return -EFAULT;
+
+ /* If the rseq_cs pointer is NULL, return a cleared struct rseq_cs. */
+ if (!ptr) {
+ memset(rseq_cs, 0, sizeof(*rseq_cs));
+ return 0;
+ }
+ /* Check that the pointer value fits in the user-space process space. */
+ if (ptr >= TASK_SIZE)
+ return -EINVAL;
+ urseq_cs = (struct rseq_cs __user *)(unsigned long)ptr;
+ if (copy_from_user(rseq_cs, urseq_cs, sizeof(*rseq_cs)))
+ return -EFAULT;
+
+ if (rseq_cs->start_ip >= TASK_SIZE ||
+ rseq_cs->start_ip + rseq_cs->post_commit_offset >= TASK_SIZE ||
+ rseq_cs->abort_ip >= TASK_SIZE ||
+ rseq_cs->version > 0)
+ return -EINVAL;
+ /* Check for overflow. */
+ if (rseq_cs->start_ip + rseq_cs->post_commit_offset < rseq_cs->start_ip)
+ return -EINVAL;
+ /* Ensure that abort_ip is not in the critical section. */
+ if (rseq_cs->abort_ip - rseq_cs->start_ip < rseq_cs->post_commit_offset)
+ return -EINVAL;
+
+ usig = (u32 __user *)(unsigned long)(rseq_cs->abort_ip - sizeof(u32));
+ ret = get_user(sig, usig);
+ if (ret)
+ return ret;
+
+ if (current->rseq.sig != sig) {
+ printk_ratelimited(KERN_WARNING
+ "Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
+ sig, current->rseq.sig, current->pid, usig);
+ return -EINVAL;
+ }
+ return 0;
+}
/*
* Terminate the process if a syscall is issued within a restartable
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide static branch for runtime debugging
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 9c37cb6e80b8fcdddc1236ba42ffd438f511192b
Gitweb: https://git.kernel.org/tip/9c37cb6e80b8fcdddc1236ba42ffd438f511192b
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:55 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:49 +01:00
rseq: Provide static branch for runtime debugging
Config based debug is rarely turned on and is not available easily when
things go wrong.
Provide a static branch to allow permanent integration of debug mechanisms
along with the usual toggles in Kconfig, command line and debugfs.
Requested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.089270547@linutronix.de
---
Documentation/admin-guide/kernel-parameters.txt | 4 +-
include/linux/rseq_entry.h | 3 +-
init/Kconfig | 14 +++-
kernel/rseq.c | 73 +++++++++++++++-
4 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061..e638274 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -6500,6 +6500,10 @@
Memory area to be used by remote processor image,
managed by CMA.
+ rseq_debug= [KNL] Enable or disable restartable sequence
+ debug mode. Defaults to CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE.
+ Format: <bool>
+
rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling
when CONFIG_RT_GROUP_SCHED=y. Defaults to
!CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED.
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ff9080b..ed8e5f8 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -34,6 +34,7 @@ DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
#endif /* !CONFIG_RSEQ_STATS */
#ifdef CONFIG_RSEQ
+#include <linux/jump_label.h>
#include <linux/rseq.h>
#include <linux/tracepoint-defs.h>
@@ -64,6 +65,8 @@ static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
unsigned long offset, unsigned long abort_ip) { }
#endif /* !CONFIG_TRACEPOINT */
+DECLARE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/init/Kconfig b/init/Kconfig
index f39fdfb..bde40ab 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1925,10 +1925,24 @@ config RSEQ_STATS
If unsure, say N.
+config RSEQ_DEBUG_DEFAULT_ENABLE
+ default n
+ bool "Enable restartable sequences debug mode by default" if EXPERT
+ depends on RSEQ
+ help
+ This enables the static branch for debug mode of restartable
+ sequences.
+
+ This also can be controlled on the kernel command line via the
+ command line parameter "rseq_debug=0/1" and through debugfs.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
depends on RSEQ && DEBUG_KERNEL
+ select RSEQ_DEBUG_DEFAULT_ENABLE
help
Enable extra debugging checks for the rseq system call.
diff --git a/kernel/rseq.c b/kernel/rseq.c
index c0dbe2e..679ab8e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -95,6 +95,27 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+DEFINE_STATIC_KEY_MAYBE(CONFIG_RSEQ_DEBUG_DEFAULT_ENABLE, rseq_debug_enabled);
+
+static inline void rseq_control_debug(bool on)
+{
+ if (on)
+ static_branch_enable(&rseq_debug_enabled);
+ else
+ static_branch_disable(&rseq_debug_enabled);
+}
+
+static int __init rseq_setup_debug(char *str)
+{
+ bool on;
+
+ if (kstrtobool(str, &on))
+ return -EINVAL;
+ rseq_control_debug(on);
+ return 1;
+}
+__setup("rseq_debug=", rseq_setup_debug);
+
#ifdef CONFIG_TRACEPOINTS
/*
* Out of line, so the actual update functions can be in a header to be
@@ -112,10 +133,11 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_DEBUG_FS
#ifdef CONFIG_RSEQ_STATS
DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
-static int rseq_debug_show(struct seq_file *m, void *p)
+static int rseq_stats_show(struct seq_file *m, void *p)
{
struct rseq_stats stats = { };
unsigned int cpu;
@@ -140,14 +162,56 @@ static int rseq_debug_show(struct seq_file *m, void *p)
return 0;
}
+static int rseq_stats_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_stats_show, inode->i_private);
+}
+
+static const struct file_operations stat_ops = {
+ .open = rseq_stats_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_stats_init(struct dentry *root_dir)
+{
+ debugfs_create_file("stats", 0444, root_dir, NULL, &stat_ops);
+ return 0;
+}
+#else
+static inline void rseq_stats_init(struct dentry *root_dir) { }
+#endif /* CONFIG_RSEQ_STATS */
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ bool on = static_branch_unlikely(&rseq_debug_enabled);
+
+ seq_printf(m, "%d\n", on);
+ return 0;
+}
+
+static ssize_t rseq_debug_write(struct file *file, const char __user *ubuf,
+ size_t count, loff_t *ppos)
+{
+ bool on;
+
+ if (kstrtobool_from_user(ubuf, count, &on))
+ return -EINVAL;
+
+ rseq_control_debug(on);
+ return count;
+}
+
static int rseq_debug_open(struct inode *inode, struct file *file)
{
return single_open(file, rseq_debug_show, inode->i_private);
}
-static const struct file_operations dfs_ops = {
+static const struct file_operations debug_ops = {
.open = rseq_debug_open,
.read = seq_read,
+ .write = rseq_debug_write,
.llseek = seq_lseek,
.release = single_release,
};
@@ -156,11 +220,12 @@ static int __init rseq_debugfs_init(void)
{
struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
- debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ debugfs_create_file("debug", 0644, root_dir, NULL, &debug_ops);
+ rseq_stats_init(root_dir);
return 0;
}
__initcall(rseq_debugfs_init);
-#endif /* CONFIG_RSEQ_STATS */
+#endif /* CONFIG_DEBUG_FS */
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Expose lightweight statistics in debugfs
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5412910487d0839111e4f2f3a6f33f6c9af9b007
Gitweb: https://git.kernel.org/tip/5412910487d0839111e4f2f3a6f33f6c9af9b007
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:52 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:41 +01:00
rseq: Expose lightweight statistics in debugfs
Analyzing the call frequency without actually using tracing is helpful for
analysis of this infrastructure. The overhead is minimal as it just
increments a per CPU counter associated to each operation.
The debugfs readout provides a racy sum of all counters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084307.027916598@linutronix.de
---
include/linux/rseq.h | 16 +-------
include/linux/rseq_entry.h | 49 +++++++++++++++++++++++-
init/Kconfig | 12 ++++++-
kernel/rseq.c | 79 +++++++++++++++++++++++++++++++++----
4 files changed, 133 insertions(+), 23 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index a200836..7f347c3 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -29,21 +29,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
}
}
-static __always_inline void rseq_exit_to_user_mode(void)
-{
- struct rseq_event *ev = ¤t->rseq.event;
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
- WARN_ON_ONCE(ev->sched_switch);
-
- /*
- * Ensure that event (especially user_irq) is cleared when the
- * interrupt did not result in a schedule and therefore the
- * rseq processing did not clear it.
- */
- ev->events = 0;
-}
-
/*
* KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
* which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
@@ -92,7 +77,6 @@ static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
-static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index 5be507a..ff9080b 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -2,6 +2,37 @@
#ifndef _LINUX_RSEQ_ENTRY_H
#define _LINUX_RSEQ_ENTRY_H
+/* Must be outside the CONFIG_RSEQ guard to resolve the stubs */
+#ifdef CONFIG_RSEQ_STATS
+#include <linux/percpu.h>
+
+struct rseq_stats {
+ unsigned long exit;
+ unsigned long signal;
+ unsigned long slowpath;
+ unsigned long ids;
+ unsigned long cs;
+ unsigned long clear;
+ unsigned long fixup;
+};
+
+DECLARE_PER_CPU(struct rseq_stats, rseq_stats);
+
+/*
+ * Slow path has interrupts and preemption enabled, but the fast path
+ * runs with interrupts disabled so there is no point in having the
+ * preemption checks implied in __this_cpu_inc() for every operation.
+ */
+#ifdef RSEQ_BUILD_SLOW_PATH
+#define rseq_stat_inc(which) this_cpu_inc((which))
+#else
+#define rseq_stat_inc(which) raw_cpu_inc((which))
+#endif
+
+#else /* CONFIG_RSEQ_STATS */
+#define rseq_stat_inc(x) do { } while (0)
+#endif /* !CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
@@ -39,8 +70,26 @@ static __always_inline void rseq_note_user_irq_entry(void)
current->rseq.event.user_irq = true;
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ rseq_stat_inc(rseq_stats.exit);
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
+}
+
#else /* CONFIG_RSEQ */
static inline void rseq_note_user_irq_entry(void) { }
+static inline void rseq_exit_to_user_mode(void) { }
#endif /* !CONFIG_RSEQ */
#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad2..f39fdfb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1913,6 +1913,18 @@ config RSEQ
If unsure, say Y.
+config RSEQ_STATS
+ default n
+ bool "Enable lightweight statistics of restartable sequences" if EXPERT
+ depends on RSEQ && DEBUG_FS
+ help
+ Enable lightweight counters which expose information about the
+ frequency of RSEQ operations via debugfs. Mostly interesting for
+ kernel debugging or performance analysis. While lightweight it's
+ still adding code into the user/kernel mode transitions.
+
+ If unsure, say N.
+
config DEBUG_RSEQ
default n
bool "Enable debugging of rseq() system call" if EXPERT
diff --git a/kernel/rseq.c b/kernel/rseq.c
index f49d311..c0dbe2e 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -67,12 +67,16 @@
* F1. <failure>
*/
+/* Required to select the proper per_cpu ops for rseq_stats_inc() */
+#define RSEQ_BUILD_SLOW_PATH
+
+#include <linux/debugfs.h>
+#include <linux/ratelimit.h>
+#include <linux/rseq_entry.h>
#include <linux/sched.h>
-#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq_entry.h>
+#include <linux/uaccess.h>
#include <linux/types.h>
-#include <linux/ratelimit.h>
#include <asm/ptrace.h>
#define CREATE_TRACE_POINTS
@@ -108,6 +112,56 @@ void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
}
#endif /* CONFIG_TRACEPOINTS */
+#ifdef CONFIG_RSEQ_STATS
+DEFINE_PER_CPU(struct rseq_stats, rseq_stats);
+
+static int rseq_debug_show(struct seq_file *m, void *p)
+{
+ struct rseq_stats stats = { };
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ stats.exit += data_race(per_cpu(rseq_stats.exit, cpu));
+ stats.signal += data_race(per_cpu(rseq_stats.signal, cpu));
+ stats.slowpath += data_race(per_cpu(rseq_stats.slowpath, cpu));
+ stats.ids += data_race(per_cpu(rseq_stats.ids, cpu));
+ stats.cs += data_race(per_cpu(rseq_stats.cs, cpu));
+ stats.clear += data_race(per_cpu(rseq_stats.clear, cpu));
+ stats.fixup += data_race(per_cpu(rseq_stats.fixup, cpu));
+ }
+
+ seq_printf(m, "exit: %16lu\n", stats.exit);
+ seq_printf(m, "signal: %16lu\n", stats.signal);
+ seq_printf(m, "slowp: %16lu\n", stats.slowpath);
+ seq_printf(m, "ids: %16lu\n", stats.ids);
+ seq_printf(m, "cs: %16lu\n", stats.cs);
+ seq_printf(m, "clear: %16lu\n", stats.clear);
+ seq_printf(m, "fixup: %16lu\n", stats.fixup);
+ return 0;
+}
+
+static int rseq_debug_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, rseq_debug_show, inode->i_private);
+}
+
+static const struct file_operations dfs_ops = {
+ .open = rseq_debug_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static int __init rseq_debugfs_init(void)
+{
+ struct dentry *root_dir = debugfs_create_dir("rseq", NULL);
+
+ debugfs_create_file("stats", 0444, root_dir, NULL, &dfs_ops);
+ return 0;
+}
+__initcall(rseq_debugfs_init);
+#endif /* CONFIG_RSEQ_STATS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
@@ -187,12 +241,13 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
- /*
- * Validate read-only rseq fields.
- */
+ rseq_stat_inc(rseq_stats.ids);
+
+ /* Validate read-only rseq fields on debug kernels */
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
+
if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
@@ -403,6 +458,8 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
struct rseq_cs rseq_cs;
int ret;
+ rseq_stat_inc(rseq_stats.cs);
+
ret = rseq_get_rseq_cs(t, &rseq_cs);
if (ret)
return ret;
@@ -412,8 +469,10 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* If not nested over a rseq critical section, restart is useless.
* Clear the rseq_cs pointer and return.
*/
- if (!in_rseq_cs(ip, &rseq_cs))
+ if (!in_rseq_cs(ip, &rseq_cs)) {
+ rseq_stat_inc(rseq_stats.clear);
return clear_rseq_cs(t->rseq.usrptr);
+ }
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
@@ -422,6 +481,7 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
+ rseq_stat_inc(rseq_stats.fixup);
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
rseq_cs.abort_ip);
instruction_pointer_set(regs, (unsigned long)rseq_cs.abort_ip);
@@ -462,6 +522,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(t->flags & PF_EXITING))
return;
+ if (ksig)
+ rseq_stat_inc(rseq_stats.signal);
+ else
+ rseq_stat_inc(rseq_stats.slowpath);
+
/*
* Read and clear the event pending bit first. If the task
* was not preempted or migrated or a signal is on the way,
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Provide tracepoint wrappers for inline code
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: dab344753e021fe84c24f9d8b0b63cb5bcf463d7
Gitweb: https://git.kernel.org/tip/dab344753e021fe84c24f9d8b0b63cb5bcf463d7
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:50 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:35 +01:00
rseq: Provide tracepoint wrappers for inline code
Provide tracepoint wrappers for the upcoming RSEQ exit to user space inline
fast path, so that the header can be safely included by code which defines
actual trace points.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.967114316@linutronix.de
---
include/linux/rseq_entry.h | 28 ++++++++++++++++++++++++++++
kernel/rseq.c | 19 ++++++++++++++++++-
2 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
index ce30e87..5be507a 100644
--- a/include/linux/rseq_entry.h
+++ b/include/linux/rseq_entry.h
@@ -5,6 +5,34 @@
#ifdef CONFIG_RSEQ
#include <linux/rseq.h>
+#include <linux/tracepoint-defs.h>
+
+#ifdef CONFIG_TRACEPOINTS
+DECLARE_TRACEPOINT(rseq_update);
+DECLARE_TRACEPOINT(rseq_ip_fixup);
+void __rseq_trace_update(struct task_struct *t);
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip);
+
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids)
+{
+ if (tracepoint_enabled(rseq_update) && ids)
+ __rseq_trace_update(t);
+}
+
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ if (tracepoint_enabled(rseq_ip_fixup))
+ __rseq_trace_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+
+#else /* CONFIG_TRACEPOINT */
+static inline void rseq_trace_update(struct task_struct *t, struct rseq_ids *ids) { }
+static inline void rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip) { }
+#endif /* !CONFIG_TRACEPOINT */
+
static __always_inline void rseq_note_user_irq_entry(void)
{
if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
diff --git a/kernel/rseq.c b/kernel/rseq.c
index ad1e7ce..f49d311 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -70,7 +70,7 @@
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/types.h>
#include <linux/ratelimit.h>
#include <asm/ptrace.h>
@@ -91,6 +91,23 @@
RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL | \
RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE)
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Out of line, so the actual update functions can be in a header to be
+ * inlined into the exit to user code.
+ */
+void __rseq_trace_update(struct task_struct *t)
+{
+ trace_rseq_update(t);
+}
+
+void __rseq_trace_ip_fixup(unsigned long ip, unsigned long start_ip,
+ unsigned long offset, unsigned long abort_ip)
+{
+ trace_rseq_ip_fixup(ip, start_ip, offset, abort_ip);
+}
+#endif /* CONFIG_TRACEPOINTS */
+
#ifdef CONFIG_DEBUG_RSEQ
static struct rseq *rseq_kernel_fields(struct task_struct *t)
{
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Record interrupt from user space
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 2fc0e4b4126caadfa5772ba69276b350609584dd
Gitweb: https://git.kernel.org/tip/2fc0e4b4126caadfa5772ba69276b350609584dd
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:48 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:23 +01:00
rseq: Record interrupt from user space
For RSEQ the only relevant reason to inspect and eventually fixup (abort)
user space critical sections is when user space was interrupted and the
task was scheduled out.
If the user to kernel entry was from a syscall no fixup is required. If
user space invokes a syscall from a critical section it can keep the
pieces as documented.
This is only supported on architectures which utilize the generic entry
code. If your architecture does not use it, bad luck.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.905067101@linutronix.de
---
include/linux/irq-entry-common.h | 3 ++-
include/linux/rseq.h | 16 +++++++++++-----
include/linux/rseq_entry.h | 18 ++++++++++++++++++
include/linux/rseq_types.h | 2 ++
4 files changed, 33 insertions(+), 6 deletions(-)
create mode 100644 include/linux/rseq_entry.h
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 83c9d84..cb31fb8 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -4,7 +4,7 @@
#include <linux/context_tracking.h>
#include <linux/kmsan.h>
-#include <linux/rseq.h>
+#include <linux/rseq_entry.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
#include <linux/tick.h>
@@ -281,6 +281,7 @@ static __always_inline void exit_to_user_mode(void)
static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
{
enter_from_user_mode(regs);
+ rseq_note_user_irq_entry();
}
/**
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index d315a92..a200836 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -31,11 +31,17 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
- current->rseq.event.events))
- current->rseq.event.events = 0;
- }
+ struct rseq_event *ev = ¤t->rseq.event;
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ))
+ WARN_ON_ONCE(ev->sched_switch);
+
+ /*
+ * Ensure that event (especially user_irq) is cleared when the
+ * interrupt did not result in a schedule and therefore the
+ * rseq processing did not clear it.
+ */
+ ev->events = 0;
}
/*
diff --git a/include/linux/rseq_entry.h b/include/linux/rseq_entry.h
new file mode 100644
index 0000000..ce30e87
--- /dev/null
+++ b/include/linux/rseq_entry.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_ENTRY_H
+#define _LINUX_RSEQ_ENTRY_H
+
+#ifdef CONFIG_RSEQ
+#include <linux/rseq.h>
+
+static __always_inline void rseq_note_user_irq_entry(void)
+{
+ if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY))
+ current->rseq.event.user_irq = true;
+}
+
+#else /* CONFIG_RSEQ */
+static inline void rseq_note_user_irq_entry(void) { }
+#endif /* !CONFIG_RSEQ */
+
+#endif /* _LINUX_RSEQ_ENTRY_H */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index 40901b0..80f6c39 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -12,6 +12,7 @@ struct rseq;
* @all: Compound to initialize and clear the data efficiently
* @events: Compound to access events with a single load/store
* @sched_switch: True if the task was scheduled out
+ * @user_irq: True on interrupt entry from user mode
* @has_rseq: True if the task has a rseq pointer installed
*/
struct rseq_event {
@@ -22,6 +23,7 @@ struct rseq_event {
u16 events;
struct {
u8 sched_switch;
+ u8 user_irq;
};
};
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Cache CPU ID and MM CID values
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 4b7de6df20d43dd651031aef8d818fa5da981dbf
Gitweb: https://git.kernel.org/tip/4b7de6df20d43dd651031aef8d818fa5da981dbf
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:45 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:14 +01:00
rseq: Cache CPU ID and MM CID values
In preparation for rewriting RSEQ exit to user space handling provide
storage to cache the CPU ID and MM CID values which were written to user
space. That prepares for a quick check, which avoids the update when
nothing changed.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.841964081@linutronix.de
---
include/linux/rseq.h | 7 +++++--
include/linux/rseq_types.h | 21 +++++++++++++++++++++
include/trace/events/rseq.h | 4 ++--
kernel/rseq.c | 4 ++++
4 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index ab91b1e..d315a92 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -57,6 +57,7 @@ static inline void rseq_virt_userspace_exit(void)
static inline void rseq_reset(struct task_struct *t)
{
memset(&t->rseq, 0, sizeof(t->rseq));
+ t->rseq.ids.cpu_cid = ~0ULL;
}
static inline void rseq_execve(struct task_struct *t)
@@ -70,10 +71,12 @@ static inline void rseq_execve(struct task_struct *t)
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM)
+ if (clone_flags & CLONE_VM) {
rseq_reset(t);
- else
+ } else {
t->rseq = current->rseq;
+ t->rseq.ids.cpu_cid = ~0ULL;
+ }
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
index f7a60c8..40901b0 100644
--- a/include/linux/rseq_types.h
+++ b/include/linux/rseq_types.h
@@ -31,17 +31,38 @@ struct rseq_event {
};
/**
+ * struct rseq_ids - Cache for ids, which need to be updated
+ * @cpu_cid: Compound of @cpu_id and @mm_cid to make the
+ * compiler emit a single compare on 64-bit
+ * @cpu_id: The CPU ID which was written last to user space
+ * @mm_cid: The MM CID which was written last to user space
+ *
+ * @cpu_id and @mm_cid are updated when the data is written to user space.
+ */
+struct rseq_ids {
+ union {
+ u64 cpu_cid;
+ struct {
+ u32 cpu_id;
+ u32 mm_cid;
+ };
+ };
+};
+
+/**
* struct rseq_data - Storage for all rseq related data
* @usrptr: Pointer to the registered user space RSEQ memory
* @len: Length of the RSEQ region
* @sig: Signature of critial section abort IPs
* @event: Storage for event management
+ * @ids: Storage for cached CPU ID and MM CID
*/
struct rseq_data {
struct rseq __user *usrptr;
u32 len;
u32 sig;
struct rseq_event event;
+ struct rseq_ids ids;
};
#else /* CONFIG_RSEQ */
diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index 823b47d..ce85d65 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -21,9 +21,9 @@ TRACE_EVENT(rseq_update,
),
TP_fast_assign(
- __entry->cpu_id = raw_smp_processor_id();
+ __entry->cpu_id = t->rseq.ids.cpu_id;
__entry->node_id = cpu_to_node(__entry->cpu_id);
- __entry->mm_cid = task_mm_cid(t);
+ __entry->mm_cid = t->rseq.ids.mm_cid;
),
TP_printk("cpu_id=%d node_id=%d mm_cid=%d", __entry->cpu_id,
diff --git a/kernel/rseq.c b/kernel/rseq.c
index aae6266..ad1e7ce 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -184,6 +184,10 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
rseq_unsafe_put_user(t, node_id, node_id, efault_end);
rseq_unsafe_put_user(t, mm_cid, mm_cid, efault_end);
+ /* Cache the user space values */
+ t->rseq.ids.cpu_id = cpu_id;
+ t->rseq.ids.mm_cid = mm_cid;
+
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] sched: Move MM CID related functions to sched.h
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 4fc9225d19ad6289c03340a520d35e3a6d1aebed
Gitweb: https://git.kernel.org/tip/4fc9225d19ad6289c03340a520d35e3a6d1aebed
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:42 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:32:04 +01:00
sched: Move MM CID related functions to sched.h
There is nothing mm specific in that and including mm.h can cause header
recursion hell.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.778457951@linutronix.de
---
include/linux/mm.h | 25 -------------------------
include/linux/sched.h | 26 ++++++++++++++++++++++++++
2 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16b33b..17cfbba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2401,31 +2401,6 @@ struct zap_details {
/* Set in unmap_vmas() to indicate a final unmap call. Only used by hugetlb */
#define ZAP_FLAG_UNMAP ((__force zap_flags_t) BIT(1))
-#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_before_execve(struct task_struct *t);
-void sched_mm_cid_after_execve(struct task_struct *t);
-void sched_mm_cid_fork(struct task_struct *t);
-void sched_mm_cid_exit_signals(struct task_struct *t);
-static inline int task_mm_cid(struct task_struct *t)
-{
- return t->mm_cid;
-}
-#else
-static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
-static inline void sched_mm_cid_fork(struct task_struct *t) { }
-static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
-static inline int task_mm_cid(struct task_struct *t)
-{
- /*
- * Use the processor id as a fall-back when the mm cid feature is
- * disabled. This provides functional per-cpu data structure accesses
- * in user-space, althrough it won't provide the memory usage benefits.
- */
- return raw_smp_processor_id();
-}
-#endif
-
#ifdef CONFIG_MMU
extern bool can_do_mlock(void);
#else
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1562776..24a9da7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2310,6 +2310,32 @@ static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct allo
#define alloc_tag_restore(_tag, _old) do {} while (0)
#endif
+/* Avoids recursive inclusion hell */
+#ifdef CONFIG_SCHED_MM_CID
+void sched_mm_cid_before_execve(struct task_struct *t);
+void sched_mm_cid_after_execve(struct task_struct *t);
+void sched_mm_cid_fork(struct task_struct *t);
+void sched_mm_cid_exit_signals(struct task_struct *t);
+static inline int task_mm_cid(struct task_struct *t)
+{
+ return t->mm_cid;
+}
+#else
+static inline void sched_mm_cid_before_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_after_execve(struct task_struct *t) { }
+static inline void sched_mm_cid_fork(struct task_struct *t) { }
+static inline void sched_mm_cid_exit_signals(struct task_struct *t) { }
+static inline int task_mm_cid(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm cid feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, althrough it won't provide the memory usage benefits.
+ */
+ return task_cpu(t);
+}
+#endif
+
#ifndef MODULE
#ifndef COMPILE_OFFSETS
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Inline irqentry_enter/exit_from/to_user_mode()
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 7702a9c2856794b6bf961b408eba3bacb753bd5b
Gitweb: https://git.kernel.org/tip/7702a9c2856794b6bf961b408eba3bacb753bd5b
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:40 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:31:47 +01:00
entry: Inline irqentry_enter/exit_from/to_user_mode()
There is no point to have this as a function which just inlines
enter_from_user_mode(). The function call overhead is larger than the
function itself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.715309918@linutronix.de
---
include/linux/irq-entry-common.h | 13 +++++++++++--
kernel/entry/common.c | 13 -------------
2 files changed, 11 insertions(+), 15 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index 9b1f386..83c9d84 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -278,7 +278,10 @@ static __always_inline void exit_to_user_mode(void)
*
* The function establishes state (lockdep, RCU (context tracking), tracing)
*/
-void irqentry_enter_from_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_enter_from_user_mode(struct pt_regs *regs)
+{
+ enter_from_user_mode(regs);
+}
/**
* irqentry_exit_to_user_mode - Interrupt exit work
@@ -293,7 +296,13 @@ void irqentry_enter_from_user_mode(struct pt_regs *regs);
* Interrupt exit is not invoking #1 which is the syscall specific one time
* work.
*/
-void irqentry_exit_to_user_mode(struct pt_regs *regs);
+static __always_inline void irqentry_exit_to_user_mode(struct pt_regs *regs)
+{
+ instrumentation_begin();
+ exit_to_user_mode_prepare(regs);
+ instrumentation_end();
+ exit_to_user_mode();
+}
#ifndef irqentry_state
/**
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index f62e1d1..70a16db 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -62,19 +62,6 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
return ti_work;
}
-noinstr void irqentry_enter_from_user_mode(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
-}
-
-noinstr void irqentry_exit_to_user_mode(struct pt_regs *regs)
-{
- instrumentation_begin();
- exit_to_user_mode_prepare(regs);
- instrumentation_end();
- exit_to_user_mode();
-}
-
noinstr irqentry_state_t irqentry_enter(struct pt_regs *regs)
{
irqentry_state_t ret = {
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Remove syscall_enter_from_user_mode_prepare()
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 54a5ab56242f96555999aaa41228f77b4a76e386
Gitweb: https://git.kernel.org/tip/54a5ab56242f96555999aaa41228f77b4a76e386
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:38 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:31:37 +01:00
entry: Remove syscall_enter_from_user_mode_prepare()
Open code the only user in the x86 syscall code and reduce the zoo of
functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.652839989@linutronix.de
---
arch/x86/entry/syscall_32.c | 3 ++-
include/linux/entry-common.h | 26 +++++---------------------
kernel/entry/syscall-common.c | 8 --------
3 files changed, 7 insertions(+), 30 deletions(-)
diff --git a/arch/x86/entry/syscall_32.c b/arch/x86/entry/syscall_32.c
index 2b15ea1..a67a644 100644
--- a/arch/x86/entry/syscall_32.c
+++ b/arch/x86/entry/syscall_32.c
@@ -274,9 +274,10 @@ static noinstr bool __do_fast_syscall_32(struct pt_regs *regs)
* fetch EBP before invoking any of the syscall entry work
* functions.
*/
- syscall_enter_from_user_mode_prepare(regs);
+ enter_from_user_mode(regs);
instrumentation_begin();
+ local_irq_enable();
/* Fetch EBP from where the vDSO stashed it. */
if (IS_ENABLED(CONFIG_X86_64)) {
/*
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index c585221..75b194c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -45,23 +45,6 @@
SYSCALL_WORK_SYSCALL_EXIT_TRAP | \
ARCH_SYSCALL_WORK_EXIT)
-/**
- * syscall_enter_from_user_mode_prepare - Establish state and enable interrupts
- * @regs: Pointer to currents pt_regs
- *
- * Invoked from architecture specific syscall entry code with interrupts
- * disabled. The calling code has to be non-instrumentable. When the
- * function returns all state is correct, interrupts are enabled and the
- * subsequent functions can be instrumented.
- *
- * This handles lockdep, RCU (context tracking) and tracing state, i.e.
- * the functionality provided by enter_from_user_mode().
- *
- * This is invoked when there is extra architecture specific functionality
- * to be done between establishing state and handling user mode entry work.
- */
-void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-
long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
@@ -71,8 +54,8 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work)
* @syscall: The syscall number
*
* Invoked from architecture specific syscall entry code with interrupts
- * enabled after invoking syscall_enter_from_user_mode_prepare() and extra
- * architecture specific work.
+ * enabled after invoking enter_from_user_mode(), enabling interrupts and
+ * extra architecture specific work.
*
* Returns: The original or a modified syscall number
*
@@ -108,8 +91,9 @@ static __always_inline long syscall_enter_from_user_mode_work(struct pt_regs *re
* function returns all state is correct, interrupts are enabled and the
* subsequent functions can be instrumented.
*
- * This is combination of syscall_enter_from_user_mode_prepare() and
- * syscall_enter_from_user_mode_work().
+ * This is the combination of enter_from_user_mode() and
+ * syscall_enter_from_user_mode_work() to be used when there is no
+ * architecture specific work to be done between the two.
*
* Returns: The original or a modified syscall number. See
* syscall_enter_from_user_mode_work() for further explanation.
diff --git a/kernel/entry/syscall-common.c b/kernel/entry/syscall-common.c
index 66e6ba7..940a597 100644
--- a/kernel/entry/syscall-common.c
+++ b/kernel/entry/syscall-common.c
@@ -63,14 +63,6 @@ long syscall_trace_enter(struct pt_regs *regs, long syscall,
return ret ? : syscall;
}
-noinstr void syscall_enter_from_user_mode_prepare(struct pt_regs *regs)
-{
- enter_from_user_mode(regs);
- instrumentation_begin();
- local_irq_enable();
- instrumentation_end();
-}
-
/*
* If SYSCALL_EMU is set, then the only reason to report is when
* SINGLESTEP is set (i.e. PTRACE_SYSEMU_SINGLESTEP). This syscall
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] entry: Clean up header
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 5204be16790f305febbf331d0ec2cead7978b3c3
Gitweb: https://git.kernel.org/tip/5204be16790f305febbf331d0ec2cead7978b3c3
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:36 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:31:14 +01:00
entry: Clean up header
Clean up the include ordering, kernel-doc and other trivialities before
making further changes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.590338411@linutronix.de
---
include/linux/entry-common.h | 8 ++++----
include/linux/irq-entry-common.h | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 7177436..c585221 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -3,11 +3,11 @@
#define __LINUX_ENTRYCOMMON_H
#include <linux/irq-entry-common.h>
+#include <linux/livepatch.h>
#include <linux/ptrace.h>
+#include <linux/resume_user_mode.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
-#include <linux/livepatch.h>
-#include <linux/resume_user_mode.h>
#include <asm/entry-common.h>
#include <asm/syscall.h>
@@ -37,6 +37,7 @@
SYSCALL_WORK_SYSCALL_AUDIT | \
SYSCALL_WORK_SYSCALL_USER_DISPATCH | \
ARCH_SYSCALL_WORK_ENTER)
+
#define SYSCALL_WORK_EXIT (SYSCALL_WORK_SYSCALL_TRACEPOINT | \
SYSCALL_WORK_SYSCALL_TRACE | \
SYSCALL_WORK_SYSCALL_AUDIT | \
@@ -61,8 +62,7 @@
*/
void syscall_enter_from_user_mode_prepare(struct pt_regs *regs);
-long syscall_trace_enter(struct pt_regs *regs, long syscall,
- unsigned long work);
+long syscall_trace_enter(struct pt_regs *regs, long syscall, unsigned long work);
/**
* syscall_enter_from_user_mode_work - Check and handle work before invoking
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index e5941df..9b1f386 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -68,6 +68,7 @@ static __always_inline bool arch_in_rcu_eqs(void) { return false; }
/**
* enter_from_user_mode - Establish state when coming from user mode
+ * @regs: Pointer to currents pt_regs
*
* Syscall/interrupt entry disables interrupts, but user mode is traced as
* interrupts enabled. Also with NO_HZ_FULL RCU might be idle.
@@ -357,6 +358,7 @@ irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs);
* Conditional reschedule with additional sanity checks.
*/
void raw_irqentry_exit_cond_resched(void);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define irqentry_exit_cond_resched_dynamic_enabled raw_irqentry_exit_cond_resched
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Introduce struct rseq_data
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: faba9d250eaec7afa248bba71531a08ccc497aab
Gitweb: https://git.kernel.org/tip/faba9d250eaec7afa248bba71531a08ccc497aab
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:33 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:50 +01:00
rseq: Introduce struct rseq_data
In preparation for a major rewrite of this code, provide a data structure
for rseq management.
Put all the rseq related data into it (except for the debug part), which
allows to simplify fork/execve by using memset() and memcpy() instead of
adding new fields to initialize over and over.
Create a storage struct for event management as well and put the
sched_switch event and a indicator for RSEQ on a task into it as a
start. That uses a union, which allows to mask and clear the whole lot
efficiently.
The indicators are explicitly not a bit field. Bit fields generate abysmal
code.
The boolean members are defined as u8 as that actually guarantees that it
fits. There seem to be strange architecture ABIs which need more than 8
bits for a boolean.
The has_rseq member is redundant vs. task::rseq, but it turns out that
boolean operations and quick checks on the union generate better code than
fiddling with separate entities and data types.
This struct will be extended over time to carry more information.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.527086690@linutronix.de
---
include/linux/rseq.h | 48 ++++++++++++----------------
include/linux/rseq_types.h | 51 ++++++++++++++++++++++++++++++-
include/linux/sched.h | 14 +-------
kernel/ptrace.c | 6 ++--
kernel/rseq.c | 63 ++++++++++++++++++-------------------
5 files changed, 110 insertions(+), 72 deletions(-)
create mode 100644 include/linux/rseq_types.h
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index c6267f7..ab91b1e 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -9,22 +9,22 @@ void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
- if (current->rseq)
+ if (current->rseq.event.has_rseq)
__rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
- if (current->rseq) {
- current->rseq_event_pending = true;
+ if (current->rseq.event.has_rseq) {
+ current->rseq.event.sched_switch = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
static inline void rseq_sched_switch_event(struct task_struct *t)
{
- if (t->rseq) {
- t->rseq_event_pending = true;
+ if (t->rseq.event.has_rseq) {
+ t->rseq.event.sched_switch = true;
set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
}
}
@@ -32,8 +32,9 @@ static inline void rseq_sched_switch_event(struct task_struct *t)
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
- current->rseq_event_pending = false;
+ if (WARN_ON_ONCE(current->rseq.event.has_rseq &&
+ current->rseq.event.events))
+ current->rseq.event.events = 0;
}
}
@@ -49,35 +50,30 @@ static __always_inline void rseq_exit_to_user_mode(void)
*/
static inline void rseq_virt_userspace_exit(void)
{
- if (current->rseq_event_pending)
+ if (current->rseq.event.sched_switch)
set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}
+static inline void rseq_reset(struct task_struct *t)
+{
+ memset(&t->rseq, 0, sizeof(t->rseq));
+}
+
+static inline void rseq_execve(struct task_struct *t)
+{
+ rseq_reset(t);
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
{
- if (clone_flags & CLONE_VM) {
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
- } else {
+ if (clone_flags & CLONE_VM)
+ rseq_reset(t);
+ else
t->rseq = current->rseq;
- t->rseq_len = current->rseq_len;
- t->rseq_sig = current->rseq_sig;
- t->rseq_event_pending = current->rseq_event_pending;
- }
-}
-
-static inline void rseq_execve(struct task_struct *t)
-{
- t->rseq = NULL;
- t->rseq_len = 0;
- t->rseq_sig = 0;
- t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
diff --git a/include/linux/rseq_types.h b/include/linux/rseq_types.h
new file mode 100644
index 0000000..f7a60c8
--- /dev/null
+++ b/include/linux/rseq_types.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_RSEQ_TYPES_H
+#define _LINUX_RSEQ_TYPES_H
+
+#include <linux/types.h>
+
+#ifdef CONFIG_RSEQ
+struct rseq;
+
+/**
+ * struct rseq_event - Storage for rseq related event management
+ * @all: Compound to initialize and clear the data efficiently
+ * @events: Compound to access events with a single load/store
+ * @sched_switch: True if the task was scheduled out
+ * @has_rseq: True if the task has a rseq pointer installed
+ */
+struct rseq_event {
+ union {
+ u32 all;
+ struct {
+ union {
+ u16 events;
+ struct {
+ u8 sched_switch;
+ };
+ };
+
+ u8 has_rseq;
+ };
+ };
+};
+
+/**
+ * struct rseq_data - Storage for all rseq related data
+ * @usrptr: Pointer to the registered user space RSEQ memory
+ * @len: Length of the RSEQ region
+ * @sig: Signature of critial section abort IPs
+ * @event: Storage for event management
+ */
+struct rseq_data {
+ struct rseq __user *usrptr;
+ u32 len;
+ u32 sig;
+ struct rseq_event event;
+};
+
+#else /* CONFIG_RSEQ */
+struct rseq_data { };
+#endif /* !CONFIG_RSEQ */
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6627c52..1562776 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -41,6 +41,7 @@
#include <linux/task_io_accounting.h>
#include <linux/posix-timers_types.h>
#include <linux/restart_block.h>
+#include <linux/rseq_types.h>
#include <uapi/linux/rseq.h>
#include <linux/seqlock_types.h>
#include <linux/kcsan.h>
@@ -1406,16 +1407,8 @@ struct task_struct {
unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */
-#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
- /*
- * RmW on rseq_event_pending must be performed atomically
- * with respect to preemption.
- */
- bool rseq_event_pending;
-# ifdef CONFIG_DEBUG_RSEQ
+ struct rseq_data rseq;
+#ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
* validation of read-only fields. The struct rseq has a
@@ -1423,7 +1416,6 @@ struct task_struct {
* directly. Reserve a size large enough for the known fields.
*/
char rseq_fields[sizeof(struct rseq)];
-# endif
#endif
#ifdef CONFIG_SCHED_MM_CID
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 75a84ef..392ec2f 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -793,9 +793,9 @@ static long ptrace_get_rseq_configuration(struct task_struct *task,
unsigned long size, void __user *data)
{
struct ptrace_rseq_configuration conf = {
- .rseq_abi_pointer = (u64)(uintptr_t)task->rseq,
- .rseq_abi_size = task->rseq_len,
- .signature = task->rseq_sig,
+ .rseq_abi_pointer = (u64)(uintptr_t)task->rseq.usrptr,
+ .rseq_abi_size = task->rseq.len,
+ .signature = task->rseq.sig,
.flags = 0,
};
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 81dddaf..aae6266 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -103,13 +103,13 @@ static int rseq_validate_ro_fields(struct task_struct *t)
DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
u32 cpu_id_start, cpu_id, node_id, mm_cid;
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
/*
* Validate fields which are required to be read-only by
* user-space.
*/
- if (!user_read_access_begin(rseq, t->rseq_len))
+ if (!user_read_access_begin(rseq, t->rseq.len))
goto efault;
unsafe_get_user(cpu_id_start, &rseq->cpu_id_start, efault_end);
unsafe_get_user(cpu_id, &rseq->cpu_id, efault_end);
@@ -147,10 +147,10 @@ efault:
* Update an rseq field and its in-kernel copy in lock-step to keep a coherent
* state.
*/
-#define rseq_unsafe_put_user(t, value, field, error_label) \
- do { \
- unsafe_put_user(value, &t->rseq->field, error_label); \
- rseq_kernel_fields(t)->field = value; \
+#define rseq_unsafe_put_user(t, value, field, error_label) \
+ do { \
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label); \
+ rseq_kernel_fields(t)->field = value; \
} while (0)
#else
@@ -160,12 +160,12 @@ static int rseq_validate_ro_fields(struct task_struct *t)
}
#define rseq_unsafe_put_user(t, value, field, error_label) \
- unsafe_put_user(value, &t->rseq->field, error_label)
+ unsafe_put_user(value, &t->rseq.usrptr->field, error_label)
#endif
static int rseq_update_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id = raw_smp_processor_id();
u32 node_id = cpu_to_node(cpu_id);
u32 mm_cid = task_mm_cid(t);
@@ -176,7 +176,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
WARN_ON_ONCE((int) mm_cid < 0);
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
rseq_unsafe_put_user(t, cpu_id, cpu_id_start, efault_end);
@@ -201,7 +201,7 @@ efault:
static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- struct rseq __user *rseq = t->rseq;
+ struct rseq __user *rseq = t->rseq.usrptr;
u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0,
mm_cid = 0;
@@ -211,7 +211,7 @@ static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
if (rseq_validate_ro_fields(t))
goto efault;
- if (!user_write_access_begin(rseq, t->rseq_len))
+ if (!user_write_access_begin(rseq, t->rseq.len))
goto efault;
/*
@@ -272,7 +272,7 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
u32 sig;
int ret;
- ret = rseq_get_rseq_cs_ptr_val(t->rseq, &ptr);
+ ret = rseq_get_rseq_cs_ptr_val(t->rseq.usrptr, &ptr);
if (ret)
return ret;
@@ -305,10 +305,10 @@ static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
if (ret)
return ret;
- if (current->rseq_sig != sig) {
+ if (current->rseq.sig != sig) {
printk_ratelimited(KERN_WARNING
"Possible attack attempt. Unexpected rseq signature 0x%x, expecting 0x%x (pid=%d, addr=%p).\n",
- sig, current->rseq_sig, current->pid, usig);
+ sig, current->rseq.sig, current->pid, usig);
return -EINVAL;
}
return 0;
@@ -338,7 +338,7 @@ static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
return -EINVAL;
/* Get thread flags. */
- ret = get_user(flags, &t->rseq->flags);
+ ret = get_user(flags, &t->rseq.usrptr->flags);
if (ret)
return ret;
@@ -392,13 +392,13 @@ static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
* Clear the rseq_cs pointer and return.
*/
if (!in_rseq_cs(ip, &rseq_cs))
- return clear_rseq_cs(t->rseq);
+ return clear_rseq_cs(t->rseq.usrptr);
ret = rseq_check_flags(t, rseq_cs.flags);
if (ret < 0)
return ret;
if (!abort)
return 0;
- ret = clear_rseq_cs(t->rseq);
+ ret = clear_rseq_cs(t->rseq.usrptr);
if (ret)
return ret;
trace_rseq_ip_fixup(ip, rseq_cs.start_ip, rseq_cs.post_commit_offset,
@@ -460,8 +460,8 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* inconsistencies.
*/
scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
+ event = t->rseq.event.sched_switch;
+ t->rseq.event.sched_switch = false;
}
if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
@@ -492,7 +492,7 @@ void rseq_syscall(struct pt_regs *regs)
struct task_struct *t = current;
struct rseq_cs rseq_cs;
- if (!t->rseq)
+ if (!t->rseq.usrptr)
return;
if (rseq_get_rseq_cs(t, &rseq_cs) || in_rseq_cs(ip, &rseq_cs))
force_sig(SIGSEGV);
@@ -511,33 +511,31 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
if (flags & ~RSEQ_FLAG_UNREGISTER)
return -EINVAL;
/* Unregister rseq for current thread. */
- if (current->rseq != rseq || !current->rseq)
+ if (current->rseq.usrptr != rseq || !current->rseq.usrptr)
return -EINVAL;
- if (rseq_len != current->rseq_len)
+ if (rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
ret = rseq_reset_rseq_cpu_node_id(current);
if (ret)
return ret;
- current->rseq = NULL;
- current->rseq_sig = 0;
- current->rseq_len = 0;
+ rseq_reset(current);
return 0;
}
if (unlikely(flags))
return -EINVAL;
- if (current->rseq) {
+ if (current->rseq.usrptr) {
/*
* If rseq is already registered, check whether
* the provided address differs from the prior
* one.
*/
- if (current->rseq != rseq || rseq_len != current->rseq_len)
+ if (current->rseq.usrptr != rseq || rseq_len != current->rseq.len)
return -EINVAL;
- if (current->rseq_sig != sig)
+ if (current->rseq.sig != sig)
return -EPERM;
/* Already registered. */
return -EBUSY;
@@ -586,15 +584,16 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* Activate the registration by setting the rseq area address, length
* and signature in the task struct.
*/
- current->rseq = rseq;
- current->rseq_len = rseq_len;
- current->rseq_sig = sig;
+ current->rseq.usrptr = rseq;
+ current->rseq.len = rseq_len;
+ current->rseq.sig = sig;
/*
* If rseq was previously inactive, and has just been
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
+ current->rseq.event.has_rseq = true;
rseq_sched_switch_event(current);
return 0;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid CPU/MM CID updates when no event pending
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 566d8015f7eef11d82cd63dc4e1f620fcfc2a394
Gitweb: https://git.kernel.org/tip/566d8015f7eef11d82cd63dc4e1f620fcfc2a394
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:31 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:43 +01:00
rseq: Avoid CPU/MM CID updates when no event pending
There is no need to update these values unconditionally if there is no
event pending.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.462964916@linutronix.de
---
kernel/rseq.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 01e7113..81dddaf 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -464,11 +464,12 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ if (!IS_ENABLED(CONFIG_DEBUG_RSEQ) && !event)
+ return;
+
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq, virt: Retrigger RSEQ after vcpu_run()
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
` (2 preceding siblings ...)
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
3 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, Sean Christopherson, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 83409986f49f17b14a675f9c598ad50d4c60191b
Gitweb: https://git.kernel.org/tip/83409986f49f17b14a675f9c598ad50d4c60191b
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:28 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:23 +01:00
rseq, virt: Retrigger RSEQ after vcpu_run()
Hypervisors invoke resume_user_mode_work() before entering the guest, which
clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
space context available to them, so the rseq notify handler skips
inspecting the critical section, but updates the CPU/MM CID values
unconditionally so that the eventual pending rseq event is not lost on the
way to user space.
This is a pointless exercise as the task might be rescheduled before
actually returning to user space and it creates unnecessary work in the
vcpu_run() loops.
It's way more efficient to ignore that invocation based on @regs == NULL
and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
vcpu_run() loop before returning from the ioctl().
This ensures that a pending RSEQ update is not lost and the IDs are updated
before returning to user space.
Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
a NOOP.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Sean Christopherson <seanjc@google.com>
Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
---
drivers/hv/mshv_root_main.c | 3 +-
include/linux/rseq.h | 17 ++++++++-
kernel/rseq.c | 78 ++++++++++++++++++------------------
virt/kvm/kvm_main.c | 7 +++-
4 files changed, 68 insertions(+), 37 deletions(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index e3b2bd4..a21a0eb 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -29,6 +29,7 @@
#include <linux/crash_dump.h>
#include <linux/panic_notifier.h>
#include <linux/vmalloc.h>
+#include <linux/rseq.h>
#include "mshv_eventfd.h"
#include "mshv.h"
@@ -560,6 +561,8 @@ static long mshv_run_vp_with_root_scheduler(struct mshv_vp *vp)
}
} while (!vp->run.flags.intercept_suspend);
+ rseq_virt_userspace_exit();
+
return ret;
}
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 241067b..c6267f7 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to_user_mode(void)
}
/*
+ * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
+ * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
+ * that case just to do it eventually again before returning to user space,
+ * the entry resume_user_mode_work() invocation is ignored as the register
+ * argument is NULL.
+ *
+ * After returning from guest mode, they have to invoke this function to
+ * re-raise TIF_NOTIFY_RESUME if necessary.
+ */
+static inline void rseq_virt_userspace_exit(void)
+{
+ if (current->rseq_event_pending)
+ set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+}
+
+/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
*/
@@ -68,6 +84,7 @@ static inline void rseq_execve(struct task_struct *t)
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_sched_switch_event(struct task_struct *t) { }
+static inline void rseq_virt_userspace_exit(void) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 59adc1a..01e7113 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
{
struct task_struct *t = current;
int ret, sig;
+ bool event;
+
+ /*
+ * If invoked from hypervisors before entering the guest via
+ * resume_user_mode_work(), then @regs is a NULL pointer.
+ *
+ * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
+ * it before returning from the ioctl() to user space when
+ * rseq_event.sched_switch is set.
+ *
+ * So it's safe to ignore here instead of pointlessly updating it
+ * in the vcpu_run() loop.
+ */
+ if (!regs)
+ return;
if (unlikely(t->flags & PF_EXITING))
return;
/*
- * If invoked from hypervisors or IO-URING, then @regs is a NULL
- * pointer, so fixup cannot be done. If the syscall which led to
- * this invocation was invoked inside a critical section, then it
- * will either end up in this code again or a possible violation of
- * a syscall inside a critical region can only be detected by the
- * debug code in rseq_syscall() in a debug enabled kernel.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
*/
- if (regs) {
- /*
- * Read and clear the event pending bit first. If the task
- * was not preempted or migrated or a signal is on the way,
- * there is no point in doing any of the heavy lifting here
- * on production kernels. In that case TIF_NOTIFY_RESUME
- * was raised by some other functionality.
- *
- * This is correct because the read/clear operation is
- * guarded against scheduler preemption, which makes it CPU
- * local atomic. If the task is preempted right after
- * re-enabling preemption then TIF_NOTIFY_RESUME is set
- * again and this function is invoked another time _before_
- * the task is able to return to user mode.
- *
- * On a debug kernel, invoke the fixup code unconditionally
- * with the result handed in to allow the detection of
- * inconsistencies.
- */
- bool event;
-
- scoped_guard(RSEQ_EVENT_GUARD) {
- event = t->rseq_event_pending;
- t->rseq_event_pending = false;
- }
-
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
- ret = rseq_ip_fixup(regs, event);
- if (unlikely(ret < 0))
- goto error;
- }
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
+ if (unlikely(ret < 0))
+ goto error;
+ }
+
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
return;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7a0ae2..4255fcf 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
#include <linux/lockdep.h>
#include <linux/kthread.h>
#include <linux/suspend.h>
+#include <linux/rseq.h>
#include <asm/processor.h>
#include <asm/ioctl.h>
@@ -4476,6 +4477,12 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = kvm_arch_vcpu_ioctl_run(vcpu);
vcpu->wants_to_run = false;
+ /*
+ * FIXME: Remove this hack once all KVM architectures
+ * support the generic TIF bits, i.e. a dedicated TIF_RSEQ.
+ */
+ rseq_virt_userspace_exit();
+
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
}
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify the event notification
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: d923739e2e356424cc566143a3323c62cd6ed067
Gitweb: https://git.kernel.org/tip/d923739e2e356424cc566143a3323c62cd6ed067
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:26 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:09 +01:00
rseq: Simplify the event notification
Since commit 0190e4198e47 ("rseq: Deprecate RSEQ_CS_FLAG_NO_RESTART_ON_*
flags") the bits in task::rseq_event_mask are meaningless and just extra
work in terms of setting them individually.
Aside of that the only relevant point where an event has to be raised is
context switch. Neither the CPU nor MM CID can change without going through
a context switch.
Collapse them all into a single boolean which simplifies the code a lot and
remove the pointless invocations which have been sprinkled all over the
place for no value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.336978188@linutronix.de
---
fs/exec.c | 2 +-
include/linux/rseq.h | 66 +++++++-------------------------------
include/linux/sched.h | 10 +++---
include/uapi/linux/rseq.h | 21 ++++--------
kernel/rseq.c | 28 +++++++++-------
kernel/sched/core.c | 5 +---
kernel/sched/membarrier.c | 8 ++---
7 files changed, 48 insertions(+), 92 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 4298e7e..e45b298 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1775,7 +1775,7 @@ out:
force_fatal_sig(SIGSEGV);
sched_mm_cid_after_execve(current);
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
current->in_execve = 0;
return retval;
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index d72ddf7..241067b 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -3,38 +3,8 @@
#define _LINUX_RSEQ_H
#ifdef CONFIG_RSEQ
-
-#include <linux/preempt.h>
#include <linux/sched.h>
-#ifdef CONFIG_MEMBARRIER
-# define RSEQ_EVENT_GUARD irq
-#else
-# define RSEQ_EVENT_GUARD preempt
-#endif
-
-/*
- * Map the event mask on the user-space ABI enum rseq_cs_flags
- * for direct mask checks.
- */
-enum rseq_event_mask_bits {
- RSEQ_EVENT_PREEMPT_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT_BIT,
- RSEQ_EVENT_SIGNAL_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL_BIT,
- RSEQ_EVENT_MIGRATE_BIT = RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT,
-};
-
-enum rseq_event_mask {
- RSEQ_EVENT_PREEMPT = (1U << RSEQ_EVENT_PREEMPT_BIT),
- RSEQ_EVENT_SIGNAL = (1U << RSEQ_EVENT_SIGNAL_BIT),
- RSEQ_EVENT_MIGRATE = (1U << RSEQ_EVENT_MIGRATE_BIT),
-};
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
- if (t->rseq)
- set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
-}
-
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
static inline void rseq_handle_notify_resume(struct pt_regs *regs)
@@ -43,35 +13,27 @@ static inline void rseq_handle_notify_resume(struct pt_regs *regs)
__rseq_handle_notify_resume(NULL, regs);
}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs)
{
if (current->rseq) {
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ current->rseq_event_pending = true;
__rseq_handle_notify_resume(ksig, regs);
}
}
-/* rseq_preempt() requires preemption to be disabled. */
-static inline void rseq_preempt(struct task_struct *t)
+static inline void rseq_sched_switch_event(struct task_struct *t)
{
- __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
-}
-
-/* rseq_migrate() requires preemption to be disabled. */
-static inline void rseq_migrate(struct task_struct *t)
-{
- __set_bit(RSEQ_EVENT_MIGRATE_BIT, &t->rseq_event_mask);
- rseq_set_notify_resume(t);
+ if (t->rseq) {
+ t->rseq_event_pending = true;
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+ }
}
static __always_inline void rseq_exit_to_user_mode(void)
{
if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
- if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
- current->rseq_event_mask = 0;
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_pending))
+ current->rseq_event_pending = false;
}
}
@@ -85,12 +47,12 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
} else {
t->rseq = current->rseq;
t->rseq_len = current->rseq_len;
t->rseq_sig = current->rseq_sig;
- t->rseq_event_mask = current->rseq_event_mask;
+ t->rseq_event_pending = current->rseq_event_pending;
}
}
@@ -99,15 +61,13 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq = NULL;
t->rseq_len = 0;
t->rseq_sig = 0;
- t->rseq_event_mask = 0;
+ t->rseq_event_pending = false;
}
#else /* CONFIG_RSEQ */
-static inline void rseq_set_notify_resume(struct task_struct *t) { }
static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
-static inline void rseq_preempt(struct task_struct *t) { }
-static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_sched_switch_event(struct task_struct *t) { }
static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b469878..6627c52 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1407,14 +1407,14 @@ struct task_struct {
#endif /* CONFIG_NUMA_BALANCING */
#ifdef CONFIG_RSEQ
- struct rseq __user *rseq;
- u32 rseq_len;
- u32 rseq_sig;
+ struct rseq __user *rseq;
+ u32 rseq_len;
+ u32 rseq_sig;
/*
- * RmW on rseq_event_mask must be performed atomically
+ * RmW on rseq_event_pending must be performed atomically
* with respect to preemption.
*/
- unsigned long rseq_event_mask;
+ bool rseq_event_pending;
# ifdef CONFIG_DEBUG_RSEQ
/*
* This is a place holder to save a copy of the rseq fields for
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae..1b76d50 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -114,20 +114,13 @@ struct rseq {
/*
* Restartable sequences flags field.
*
- * This field should only be updated by the thread which
- * registered this data structure. Read by the kernel.
- * Mainly used for single-stepping through rseq critical sections
- * with debuggers.
- *
- * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
- * Inhibit instruction sequence block restart on preemption
- * for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
- * Inhibit instruction sequence block restart on signal
- * delivery for this thread.
- * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
- * Inhibit instruction sequence block restart on migration for
- * this thread.
+ * This field was initially intended to allow event masking for
+ * single-stepping through rseq critical sections with debuggers.
+ * The kernel does not support this anymore and the relevant bits
+ * are checked for being always false:
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_PREEMPT
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_SIGNAL
+ * - RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE
*/
__u32 flags;
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 80af48a..59adc1a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -78,6 +78,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/rseq.h>
+#ifdef CONFIG_MEMBARRIER
+# define RSEQ_EVENT_GUARD irq
+#else
+# define RSEQ_EVENT_GUARD preempt
+#endif
+
/* The original rseq structure size (including padding) is 32 bytes. */
#define ORIG_RSEQ_SIZE 32
@@ -430,11 +436,11 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
*/
if (regs) {
/*
- * Read and clear the event mask first. If the task was not
- * preempted or migrated or a signal is on the way, there
- * is no point in doing any of the heavy lifting here on
- * production kernels. In that case TIF_NOTIFY_RESUME was
- * raised by some other functionality.
+ * Read and clear the event pending bit first. If the task
+ * was not preempted or migrated or a signal is on the way,
+ * there is no point in doing any of the heavy lifting here
+ * on production kernels. In that case TIF_NOTIFY_RESUME
+ * was raised by some other functionality.
*
* This is correct because the read/clear operation is
* guarded against scheduler preemption, which makes it CPU
@@ -447,15 +453,15 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
* with the result handed in to allow the detection of
* inconsistencies.
*/
- u32 event_mask;
+ bool event;
scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
+ event = t->rseq_event_pending;
+ t->rseq_event_pending = false;
}
- if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
- ret = rseq_ip_fixup(regs, !!event_mask);
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
+ ret = rseq_ip_fixup(regs, event);
if (unlikely(ret < 0))
goto error;
}
@@ -584,7 +590,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32
* registered, ensure the cpu_id_start and cpu_id fields
* are updated before returning to user-space.
*/
- rseq_set_notify_resume(current);
+ rseq_sched_switch_event(current);
return 0;
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67..b75e8e1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3329,7 +3329,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
- rseq_migrate(p);
sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}
@@ -4763,7 +4762,6 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_task_group = tg;
}
#endif
- rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
@@ -4827,7 +4825,6 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
@@ -5121,7 +5118,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
kcov_prepare_switch(prev);
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
- rseq_preempt(prev);
+ rseq_sched_switch_event(prev);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 62fba83..6234456 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -199,7 +199,7 @@ static void ipi_rseq(void *info)
* is negligible.
*/
smp_mb();
- rseq_preempt(current);
+ rseq_sched_switch_event(current);
}
static void ipi_sync_rq_state(void *info)
@@ -407,9 +407,9 @@ static int membarrier_private_expedited(int flags, int cpu_id)
* membarrier, we will end up with some thread in the mm
* running without a core sync.
*
- * For RSEQ, don't rseq_preempt() the caller. User code
- * is not supposed to issue syscalls at all from inside an
- * rseq critical section.
+ * For RSEQ, don't invoke rseq_sched_switch_event() on the
+ * caller. User code is not supposed to issue syscalls at
+ * all from inside an rseq critical section.
*/
if (flags != MEMBARRIER_FLAG_SYNC_CORE) {
preempt_disable();
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Simplify registration
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 067b3b41b4dd5bf51d6874206f5c1f72e0684eeb
Gitweb: https://git.kernel.org/tip/067b3b41b4dd5bf51d6874206f5c1f72e0684eeb
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:24 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:05 +01:00
rseq: Simplify registration
There is no point to read the critical section element in the newly
registered user space RSEQ struct first in order to clear it.
Just clear it and be done with it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.274661227@linutronix.de
---
kernel/rseq.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 51fafc4..80af48a 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -492,11 +492,9 @@ void rseq_syscall(struct pt_regs *regs)
/*
* sys_rseq - setup restartable sequences for caller thread.
*/
-SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
- int, flags, u32, sig)
+SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len, int, flags, u32, sig)
{
int ret;
- u64 rseq_cs;
if (flags & RSEQ_FLAG_UNREGISTER) {
if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -557,11 +555,9 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
* avoid a potential segfault on return to user-space. The proper thing
* to do would have been to fail the registration but this would break
* older libcs that reuse the rseq area for new threads without
- * clearing the fields.
+ * clearing the fields. Don't bother reading it, just reset it.
*/
- if (rseq_get_rseq_cs_ptr_val(rseq, &rseq_cs))
- return -EFAULT;
- if (rseq_cs && clear_rseq_cs(rseq))
+ if (put_user(0UL, &rseq->rseq_cs))
return -EFAULT;
#ifdef CONFIG_DEBUG_RSEQ
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Remove the ksig argument from rseq_handle_notify_resume()
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 41b43a6ba3848be8ceec77b8b2a56ddeca6167ed
Gitweb: https://git.kernel.org/tip/41b43a6ba3848be8ceec77b8b2a56ddeca6167ed
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:22 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:30:01 +01:00
rseq: Remove the ksig argument from rseq_handle_notify_resume()
There is no point for this being visible in the resume_to_user_mode()
handling.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.211520245@linutronix.de
---
include/linux/resume_user_mode.h | 2 +-
include/linux/rseq.h | 15 ++++++++-------
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/linux/resume_user_mode.h b/include/linux/resume_user_mode.h
index e0135e0..dd3bf7d 100644
--- a/include/linux/resume_user_mode.h
+++ b/include/linux/resume_user_mode.h
@@ -59,7 +59,7 @@ static inline void resume_user_mode_work(struct pt_regs *regs)
mem_cgroup_handle_over_high(GFP_KERNEL);
blkcg_maybe_throttle_current();
- rseq_handle_notify_resume(NULL, regs);
+ rseq_handle_notify_resume(regs);
}
#endif /* LINUX_RESUME_USER_MODE_H */
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 21f875a..d72ddf7 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -37,19 +37,20 @@ static inline void rseq_set_notify_resume(struct task_struct *t)
void __rseq_handle_notify_resume(struct ksignal *sig, struct pt_regs *regs);
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
+static inline void rseq_handle_notify_resume(struct pt_regs *regs)
{
if (current->rseq)
- __rseq_handle_notify_resume(ksig, regs);
+ __rseq_handle_notify_resume(NULL, regs);
}
static inline void rseq_signal_deliver(struct ksignal *ksig,
struct pt_regs *regs)
{
- scoped_guard(RSEQ_EVENT_GUARD)
- __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
- rseq_handle_notify_resume(ksig, regs);
+ if (current->rseq) {
+ scoped_guard(RSEQ_EVENT_GUARD)
+ __set_bit(RSEQ_EVENT_SIGNAL_BIT, ¤t->rseq_event_mask);
+ __rseq_handle_notify_resume(ksig, regs);
+ }
}
/* rseq_preempt() requires preemption to be disabled. */
@@ -103,7 +104,7 @@ static inline void rseq_execve(struct task_struct *t)
#else /* CONFIG_RSEQ */
static inline void rseq_set_notify_resume(struct task_struct *t) { }
-static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_handle_notify_resume(struct pt_regs *regs) { }
static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
static inline void rseq_preempt(struct task_struct *t) { }
static inline void rseq_migrate(struct task_struct *t) { }
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Move algorithm comment to top
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 77f19e4d4fc90a9364f5055a4daf8b98a76cb303
Gitweb: https://git.kernel.org/tip/77f19e4d4fc90a9364f5055a4daf8b98a76cb303
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:20 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:29:52 +01:00
rseq: Move algorithm comment to top
Move the comment which documents the RSEQ algorithm to the top of the file,
so it does not create horrible diffs later when the actual implementation
is fed into the mincer.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.149519580@linutronix.de
---
kernel/rseq.c | 119 ++++++++++++++++++++++++-------------------------
1 file changed, 59 insertions(+), 60 deletions(-)
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 246319d..51fafc4 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -8,6 +8,65 @@
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
*/
+/*
+ * Restartable sequences are a lightweight interface that allows
+ * user-level code to be executed atomically relative to scheduler
+ * preemption and signal delivery. Typically used for implementing
+ * per-cpu operations.
+ *
+ * It allows user-space to perform update operations on per-cpu data
+ * without requiring heavy-weight atomic operations.
+ *
+ * Detailed algorithm of rseq user-space assembly sequences:
+ *
+ * init(rseq_cs)
+ * cpu = TLS->rseq::cpu_id_start
+ * [1] TLS->rseq::rseq_cs = rseq_cs
+ * [start_ip] ----------------------------
+ * [2] if (cpu != TLS->rseq::cpu_id)
+ * goto abort_ip;
+ * [3] <last_instruction_in_cs>
+ * [post_commit_ip] ----------------------------
+ *
+ * The address of jump target abort_ip must be outside the critical
+ * region, i.e.:
+ *
+ * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
+ *
+ * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
+ * userspace that can handle being interrupted between any of those
+ * instructions, and then resumed to the abort_ip.
+ *
+ * 1. Userspace stores the address of the struct rseq_cs assembly
+ * block descriptor into the rseq_cs field of the registered
+ * struct rseq TLS area. This update is performed through a single
+ * store within the inline assembly instruction sequence.
+ * [start_ip]
+ *
+ * 2. Userspace tests to check whether the current cpu_id field match
+ * the cpu number loaded before start_ip, branching to abort_ip
+ * in case of a mismatch.
+ *
+ * If the sequence is preempted or interrupted by a signal
+ * at or after start_ip and before post_commit_ip, then the kernel
+ * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
+ * ip to abort_ip before returning to user-space, so the preempted
+ * execution resumes at abort_ip.
+ *
+ * 3. Userspace critical section final instruction before
+ * post_commit_ip is the commit. The critical section is
+ * self-terminating.
+ * [post_commit_ip]
+ *
+ * 4. <success>
+ *
+ * On failure at [2], or if interrupted by preempt or signal delivery
+ * between [1] and [3]:
+ *
+ * [abort_ip]
+ * F1. <failure>
+ */
+
#include <linux/sched.h>
#include <linux/uaccess.h>
#include <linux/syscalls.h>
@@ -98,66 +157,6 @@ static int rseq_validate_ro_fields(struct task_struct *t)
unsafe_put_user(value, &t->rseq->field, error_label)
#endif
-/*
- *
- * Restartable sequences are a lightweight interface that allows
- * user-level code to be executed atomically relative to scheduler
- * preemption and signal delivery. Typically used for implementing
- * per-cpu operations.
- *
- * It allows user-space to perform update operations on per-cpu data
- * without requiring heavy-weight atomic operations.
- *
- * Detailed algorithm of rseq user-space assembly sequences:
- *
- * init(rseq_cs)
- * cpu = TLS->rseq::cpu_id_start
- * [1] TLS->rseq::rseq_cs = rseq_cs
- * [start_ip] ----------------------------
- * [2] if (cpu != TLS->rseq::cpu_id)
- * goto abort_ip;
- * [3] <last_instruction_in_cs>
- * [post_commit_ip] ----------------------------
- *
- * The address of jump target abort_ip must be outside the critical
- * region, i.e.:
- *
- * [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
- *
- * Steps [2]-[3] (inclusive) need to be a sequence of instructions in
- * userspace that can handle being interrupted between any of those
- * instructions, and then resumed to the abort_ip.
- *
- * 1. Userspace stores the address of the struct rseq_cs assembly
- * block descriptor into the rseq_cs field of the registered
- * struct rseq TLS area. This update is performed through a single
- * store within the inline assembly instruction sequence.
- * [start_ip]
- *
- * 2. Userspace tests to check whether the current cpu_id field match
- * the cpu number loaded before start_ip, branching to abort_ip
- * in case of a mismatch.
- *
- * If the sequence is preempted or interrupted by a signal
- * at or after start_ip and before post_commit_ip, then the kernel
- * clears TLS->__rseq_abi::rseq_cs, and sets the user-space return
- * ip to abort_ip before returning to user-space, so the preempted
- * execution resumes at abort_ip.
- *
- * 3. Userspace critical section final instruction before
- * post_commit_ip is the commit. The critical section is
- * self-terminating.
- * [post_commit_ip]
- *
- * 4. <success>
- *
- * On failure at [2], or if interrupted by preempt or signal delivery
- * between [1] and [3]:
- *
- * [abort_ip]
- * F1. <failure>
- */
-
static int rseq_update_cpu_node_id(struct task_struct *t)
{
struct rseq __user *rseq = t->rseq;
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Condense the inline stubs
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: fdc0f39d289ebcf46ef44f43460207ef24c94ed7
Gitweb: https://git.kernel.org/tip/fdc0f39d289ebcf46ef44f43460207ef24c94ed7
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:18 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:29:08 +01:00
rseq: Condense the inline stubs
Scrolling over tons of pointless
{
}
lines to find the actual code is annoying at best.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.085971048@linutronix.de
---
include/linux/rseq.h | 47 ++++++++++---------------------------------
1 file changed, 12 insertions(+), 35 deletions(-)
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 7622b73..21f875a 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -101,44 +101,21 @@ static inline void rseq_execve(struct task_struct *t)
t->rseq_event_mask = 0;
}
-#else
-
-static inline void rseq_set_notify_resume(struct task_struct *t)
-{
-}
-static inline void rseq_handle_notify_resume(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_signal_deliver(struct ksignal *ksig,
- struct pt_regs *regs)
-{
-}
-static inline void rseq_preempt(struct task_struct *t)
-{
-}
-static inline void rseq_migrate(struct task_struct *t)
-{
-}
-static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
-{
-}
-static inline void rseq_execve(struct task_struct *t)
-{
-}
+#else /* CONFIG_RSEQ */
+static inline void rseq_set_notify_resume(struct task_struct *t) { }
+static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
+static inline void rseq_preempt(struct task_struct *t) { }
+static inline void rseq_migrate(struct task_struct *t) { }
+static inline void rseq_fork(struct task_struct *t, u64 clone_flags) { }
+static inline void rseq_execve(struct task_struct *t) { }
static inline void rseq_exit_to_user_mode(void) { }
-#endif
+#endif /* !CONFIG_RSEQ */
#ifdef CONFIG_DEBUG_RSEQ
-
void rseq_syscall(struct pt_regs *regs);
-
-#else
-
-static inline void rseq_syscall(struct pt_regs *regs)
-{
-}
-
-#endif
+#else /* CONFIG_DEBUG_RSEQ */
+static inline void rseq_syscall(struct pt_regs *regs) { }
+#endif /* !CONFIG_DEBUG_RSEQ */
#endif /* _LINUX_RSEQ_H */
^ permalink raw reply related [flat|nested] 142+ messages in thread
* [tip: core/rseq] rseq: Avoid pointless evaluation in __rseq_notify_resume()
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
@ 2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2 siblings, 0 replies; 142+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2025-11-04 8:17 UTC (permalink / raw)
To: linux-tip-commits
Cc: Thomas Gleixner, Peter Zijlstra (Intel), Ingo Molnar,
Mathieu Desnoyers, x86, linux-kernel
The following commit has been merged into the core/rseq branch of tip:
Commit-ID: 3ca59da7aa5c7f569b04a511dc8670861d58b509
Gitweb: https://git.kernel.org/tip/3ca59da7aa5c7f569b04a511dc8670861d58b509
Author: Thomas Gleixner <tglx@linutronix.de>
AuthorDate: Mon, 27 Oct 2025 09:44:16 +01:00
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 04 Nov 2025 08:28:38 +01:00
rseq: Avoid pointless evaluation in __rseq_notify_resume()
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered, otherwise it is stale and collects
bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is utilized by quite some infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
Add a exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it reword the convoluted comment why the pt_regs pointer can be
NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/20251027084306.022571576@linutronix.de
---
include/linux/irq-entry-common.h | 7 ++-
include/linux/rseq.h | 10 ++++-
kernel/rseq.c | 66 ++++++++++++++++++++-----------
3 files changed, 58 insertions(+), 25 deletions(-)
diff --git a/include/linux/irq-entry-common.h b/include/linux/irq-entry-common.h
index d643c7c..e5941df 100644
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
-#include <linux/context_tracking.h>
#include <linux/tick.h>
-#include <linux/kmsan.h>
#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user_mode_prepare(struct pt_regs *regs)
arch_exit_to_user_mode_prepare(regs, ti_work);
+ rseq_exit_to_user_mode();
+
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
diff --git a/include/linux/rseq.h b/include/linux/rseq.h
index 69553e7..7622b73 100644
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct task_struct *t)
rseq_set_notify_resume(t);
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+ current->rseq_event_mask = 0;
+ }
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task_struct *t, u64 clone_flags)
static inline void rseq_execve(struct task_struct *t)
{
}
-
+static inline void rseq_exit_to_user_mode(void) { }
#endif
#ifdef CONFIG_DEBUG_RSEQ
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 2452b73..246319d 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *str, u32 flags)
return true;
}
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
{
- u32 flags, event_mask;
+ u32 flags;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
-
- /*
- * Load and clear event mask atomically with respect to
- * scheduler preemption and membarrier IPIs.
- */
- scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
- }
-
- return !!event_mask;
+ return 0;
}
static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs *rseq_cs)
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs *regs)
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
- ret = rseq_need_restart(t, rseq_cs.flags);
- if (ret <= 0)
+ ret = rseq_check_flags(t, rseq_cs.flags);
+ if (ret < 0)
return ret;
+ if (!abort)
+ return 0;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
return;
/*
- * regs is NULL if and only if the caller is in a syscall path. Skip
- * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
- * kill a misbehaving userspace on debug kernels.
+ * If invoked from hypervisors or IO-URING, then @regs is a NULL
+ * pointer, so fixup cannot be done. If the syscall which led to
+ * this invocation was invoked inside a critical section, then it
+ * will either end up in this code again or a possible violation of
+ * a syscall inside a critical region can only be detected by the
+ * debug code in rseq_syscall() in a debug enabled kernel.
*/
if (regs) {
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
- goto error;
+ /*
+ * Read and clear the event mask first. If the task was not
+ * preempted or migrated or a signal is on the way, there
+ * is no point in doing any of the heavy lifting here on
+ * production kernels. In that case TIF_NOTIFY_RESUME was
+ * raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
+ */
+ u32 event_mask;
+
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+ ret = rseq_ip_fixup(regs, !!event_mask);
+ if (unlikely(ret < 0))
+ goto error;
+ }
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
^ permalink raw reply related [flat|nested] 142+ messages in thread
end of thread, other threads:[~2025-11-04 8:17 UTC | newest]
Thread overview: 142+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-27 8:44 [patch V6 00/31] rseq: Optimize exit to user space Thomas Gleixner
2025-10-27 8:44 ` [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 02/31] rseq: Condense the inline stubs Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 03/31] rseq: Move algorithm comment to top Thomas Gleixner
2025-10-29 10:24 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 04/31] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 05/31] rseq: Simplify registration Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 06/31] rseq: Simplify the event notification Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 07/31] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-10-28 15:08 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 08/31] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 09/31] rseq: Introduce struct rseq_data Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 10/31] entry: Cleanup header Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` [tip: core/rseq] entry: Clean up header tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 11/31] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 12/31] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 13/31] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 14/31] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 15/31] rseq: Record interrupt from user space Thomas Gleixner
2025-10-28 15:26 ` Mathieu Desnoyers
2025-10-28 17:02 ` Thomas Gleixner
2025-10-28 17:53 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 16/31] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 17/31] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 18/31] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:44 ` [patch V6 19/31] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-10-28 15:40 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-10-29 16:04 ` [patch V6 19/31] " Steven Rostedt
2025-10-29 21:00 ` Thomas Gleixner
2025-10-29 21:53 ` Steven Rostedt
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 20/31] rseq: Replace the original debug implementation Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-10-30 21:52 ` [patch V6 20/31] " Prakash Sangappa
2025-10-31 14:27 ` Thomas Gleixner
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 21/31] rseq: Make exit debugging static branch based Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 22/31] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 23/31] rseq: Provide and use rseq_set_ids() Thomas Gleixner
2025-10-28 15:47 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 24/31] rseq: Separate the signal delivery path Thomas Gleixner
2025-10-28 15:51 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 25/31] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:17 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 26/31] rseq: Optimize event setting Thomas Gleixner
2025-10-28 15:57 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 27/31] rseq: Implement fast path for exit to user Thomas Gleixner
2025-10-28 16:09 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-10-29 16:28 ` [patch V6 27/31] " Steven Rostedt
2025-11-03 14:47 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 28/31] rseq: Switch to fast path processing on " Thomas Gleixner
2025-10-28 16:14 ` Mathieu Desnoyers
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 29/31] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 30/31] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-27 8:45 ` [patch V6 31/31] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-10-29 10:23 ` [tip: core/rseq] " tip-bot2 for Thomas Gleixner
2025-11-03 14:47 ` tip-bot2 for Thomas Gleixner
2025-11-04 8:16 ` tip-bot2 for Thomas Gleixner
2025-10-29 10:23 ` [patch V6 00/31] rseq: Optimize exit to user space Peter Zijlstra
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox