From: Thomas Gleixner <tglx@linutronix.de>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Jeanson <mjeanson@efficios.com>,
Jens Axboe <axboe@kernel.dk>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
Peter Zijlstra <peterz@infradead.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
x86@kernel.org, Sean Christopherson <seanjc@google.com>,
Wei Liu <wei.liu@kernel.org>
Subject: [patch V6 01/31] rseq: Avoid pointless evaluation in __rseq_notify_resume()
Date: Mon, 27 Oct 2025 09:44:16 +0100 (CET)
Message-ID: <20251027084306.022571576@linutronix.de>
In-Reply-To: 20251027084220.785525188@linutronix.de
From: Thomas Gleixner <tglx@linutronix.de>
The RSEQ critical section mechanism only clears the event mask when a
critical section is registered. Otherwise the mask is stale and keeps
collecting bits.
That means once a critical section is installed the first invocation of
that code when TIF_NOTIFY_RESUME is set will abort the critical section,
even when the TIF bit was not raised by the rseq preempt/migrate/signal
helpers.
This also has a performance implication because TIF_NOTIFY_RESUME is a
multiplexing TIF bit, which is used by quite a bit of infrastructure. That
means every invocation of __rseq_notify_resume() goes unconditionally
through the heavy lifting of user space access and consistency checks even
if there is no reason to do so.
Keeping the stale event mask around when exiting to user space also
prevents it from being utilized by the upcoming time slice extension
mechanism.
Avoid this by reading and clearing the event mask before doing the user
space critical section access with interrupts or preemption disabled, which
ensures that the read and clear operation is CPU local atomic versus
scheduling and the membarrier IPI.
This is correct as after re-enabling interrupts/preemption any relevant
event will set the bit again and raise TIF_NOTIFY_RESUME, which makes the
user space exit code take another round of TIF bit clearing.
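To illustrate the argument above, here is a minimal user space model of
the read-and-clear scheme. It is only a sketch and not kernel code: the
names struct task_model, model_raise_event(), model_read_and_clear() and
model_exit_to_user() are made up for this example, and C11 atomics stand
in for the interrupt/preemption disabled section that the patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-task state; not the kernel's task_struct. */
struct task_model {
	_Atomic uint32_t event_mask;	/* models t->rseq_event_mask */
	atomic_bool notify_resume;	/* models TIF_NOTIFY_RESUME */
};

/* Models the preempt/migrate/signal helpers: set the bit, then raise the flag. */
static void model_raise_event(struct task_model *t, uint32_t bit)
{
	atomic_fetch_or(&t->event_mask, bit);
	atomic_store(&t->notify_resume, true);
}

/*
 * Models the guarded read and clear. The patch achieves atomicity by
 * disabling interrupts or preemption; an atomic exchange is the user
 * space analogue assumed here.
 */
static uint32_t model_read_and_clear(struct task_model *t)
{
	return atomic_exchange(&t->event_mask, 0);
}

/*
 * Models the exit loop: keep going while the notify flag is set and
 * count how often the slow path would have been taken.
 */
static unsigned int model_exit_to_user(struct task_model *t)
{
	unsigned int slow_path_runs = 0;

	while (atomic_exchange(&t->notify_resume, false)) {
		if (model_read_and_clear(t))
			slow_path_runs++;
	}

	/* Single-threaded analogue of the debug check before exiting. */
	assert(atomic_load(&t->event_mask) == 0);
	return slow_path_runs;
}
```

In this model, raising the notify flag without an event bit makes
model_exit_to_user() return 0, which corresponds to skipping the heavy
lifting. Any event raised after the read and clear sets the flag again,
so the loop takes another round before the function returns.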
If the event mask was non-zero, invoke the slow path. On debug kernels the
slow path is invoked unconditionally and the result of the event mask
evaluation is handed in.
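The resulting decision can be summarized in a small stand-alone sketch.
The struct fixup_decision type and model_decide() are hypothetical names
used only for illustration; debug_kernel stands for
IS_ENABLED(CONFIG_DEBUG_RSEQ) and event_mask for the value obtained by
the read and clear operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type, not part of the patch. */
struct fixup_decision {
	bool run_fixup;	/* would rseq_ip_fixup() be invoked at all? */
	bool abort_cs;	/* value handed in as the 'abort' argument */
};

static struct fixup_decision model_decide(bool debug_kernel, uint32_t event_mask)
{
	struct fixup_decision d = {
		/* Debug kernels always take the slow path. */
		.run_fixup = debug_kernel || event_mask != 0,
		/* An abort is only requested when an event was pending. */
		.abort_cs = event_mask != 0,
	};

	/* An abort request always implies that the slow path runs. */
	assert(!d.abort_cs || d.run_fixup);
	return d;
}
```

On a production kernel with an empty mask nothing runs. On a debug
kernel with an empty mask the checks run, but no abort is requested.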
Add an exit path check after the TIF bit loop, which validates on debug
kernels that the event mask is zero before exiting to user space.
While at it, reword the convoluted comment explaining why the pt_regs
pointer can be NULL under certain circumstances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
---
include/linux/irq-entry-common.h | 7 ++--
include/linux/rseq.h | 10 +++++
kernel/rseq.c | 66 ++++++++++++++++++++++++++-------------
3 files changed, 58 insertions(+), 25 deletions(-)
---
--- a/include/linux/irq-entry-common.h
+++ b/include/linux/irq-entry-common.h
@@ -2,11 +2,12 @@
#ifndef __LINUX_IRQENTRYCOMMON_H
#define __LINUX_IRQENTRYCOMMON_H
+#include <linux/context_tracking.h>
+#include <linux/kmsan.h>
+#include <linux/rseq.h>
#include <linux/static_call_types.h>
#include <linux/syscalls.h>
-#include <linux/context_tracking.h>
#include <linux/tick.h>
-#include <linux/kmsan.h>
#include <linux/unwind_deferred.h>
#include <asm/entry-common.h>
@@ -226,6 +227,8 @@ static __always_inline void exit_to_user
arch_exit_to_user_mode_prepare(regs, ti_work);
+ rseq_exit_to_user_mode();
+
/* Ensure that kernel state is sane for a return to userspace */
kmap_assert_nomap();
lockdep_assert_irqs_disabled();
--- a/include/linux/rseq.h
+++ b/include/linux/rseq.h
@@ -66,6 +66,14 @@ static inline void rseq_migrate(struct t
rseq_set_notify_resume(t);
}
+static __always_inline void rseq_exit_to_user_mode(void)
+{
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ)) {
+ if (WARN_ON_ONCE(current->rseq && current->rseq_event_mask))
+ current->rseq_event_mask = 0;
+ }
+}
+
/*
* If parent process has a registered restartable sequences area, the
* child inherits. Unregister rseq for a clone with CLONE_VM set.
@@ -118,7 +126,7 @@ static inline void rseq_fork(struct task
static inline void rseq_execve(struct task_struct *t)
{
}
-
+static inline void rseq_exit_to_user_mode(void) { }
#endif
#ifdef CONFIG_DEBUG_RSEQ
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -324,9 +324,9 @@ static bool rseq_warn_flags(const char *
return true;
}
-static int rseq_need_restart(struct task_struct *t, u32 cs_flags)
+static int rseq_check_flags(struct task_struct *t, u32 cs_flags)
{
- u32 flags, event_mask;
+ u32 flags;
int ret;
if (rseq_warn_flags("rseq_cs", cs_flags))
@@ -339,17 +339,7 @@ static int rseq_need_restart(struct task
if (rseq_warn_flags("rseq", flags))
return -EINVAL;
-
- /*
- * Load and clear event mask atomically with respect to
- * scheduler preemption and membarrier IPIs.
- */
- scoped_guard(RSEQ_EVENT_GUARD) {
- event_mask = t->rseq_event_mask;
- t->rseq_event_mask = 0;
- }
-
- return !!event_mask;
+ return 0;
}
static int clear_rseq_cs(struct rseq __user *rseq)
@@ -380,7 +370,7 @@ static bool in_rseq_cs(unsigned long ip,
return ip - rseq_cs->start_ip < rseq_cs->post_commit_offset;
}
-static int rseq_ip_fixup(struct pt_regs *regs)
+static int rseq_ip_fixup(struct pt_regs *regs, bool abort)
{
unsigned long ip = instruction_pointer(regs);
struct task_struct *t = current;
@@ -398,9 +388,11 @@ static int rseq_ip_fixup(struct pt_regs
*/
if (!in_rseq_cs(ip, &rseq_cs))
return clear_rseq_cs(t->rseq);
- ret = rseq_need_restart(t, rseq_cs.flags);
- if (ret <= 0)
+ ret = rseq_check_flags(t, rseq_cs.flags);
+ if (ret < 0)
return ret;
+ if (!abort)
+ return 0;
ret = clear_rseq_cs(t->rseq);
if (ret)
return ret;
@@ -430,14 +422,44 @@ void __rseq_handle_notify_resume(struct
return;
/*
- * regs is NULL if and only if the caller is in a syscall path. Skip
- * fixup and leave rseq_cs as is so that rseq_sycall() will detect and
- * kill a misbehaving userspace on debug kernels.
+ * If invoked from hypervisors or IO-URING, then @regs is a NULL
+ * pointer, so fixup cannot be done. If the syscall which led to
+ * this invocation was invoked inside a critical section, then it
+ * will either end up in this code again or a possible violation of
+ * a syscall inside a critical region can only be detected by the
+ * debug code in rseq_syscall() in a debug enabled kernel.
*/
if (regs) {
- ret = rseq_ip_fixup(regs);
- if (unlikely(ret < 0))
- goto error;
+ /*
+ * Read and clear the event mask first. If the task was not
+ * preempted or migrated or a signal is on the way, there
+ * is no point in doing any of the heavy lifting here on
+ * production kernels. In that case TIF_NOTIFY_RESUME was
+ * raised by some other functionality.
+ *
+ * This is correct because the read/clear operation is
+ * guarded against scheduler preemption, which makes it CPU
+ * local atomic. If the task is preempted right after
+ * re-enabling preemption then TIF_NOTIFY_RESUME is set
+ * again and this function is invoked another time _before_
+ * the task is able to return to user mode.
+ *
+ * On a debug kernel, invoke the fixup code unconditionally
+ * with the result handed in to allow the detection of
+ * inconsistencies.
+ */
+ u32 event_mask;
+
+ scoped_guard(RSEQ_EVENT_GUARD) {
+ event_mask = t->rseq_event_mask;
+ t->rseq_event_mask = 0;
+ }
+
+ if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event_mask) {
+ ret = rseq_ip_fixup(regs, !!event_mask);
+ if (unlikely(ret < 0))
+ goto error;
+ }
}
if (unlikely(rseq_update_cpu_node_id(t)))
goto error;