From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>,
LKML <linux-kernel@vger.kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
Peter Zijlstra <peterz@infradead.org>,
"Paul E. McKenney" <paulmck@kernel.org>,
Boqun Feng <boqun.feng@gmail.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Sean Christopherson <seanjc@google.com>,
Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
x86@kernel.org, Arnd Bergmann <arnd@arndb.de>,
Heiko Carstens <hca@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
Huacai Chen <chenhuacai@kernel.org>,
Paul Walmsley <paul.walmsley@sifive.com>,
Palmer Dabbelt <palmer@dabbelt.com>
Subject: Re: [patch V2 26/37] rseq: Optimize event setting
Date: Tue, 26 Aug 2025 11:26:02 -0400
Message-ID: <80f966ea-d8a5-401d-ad2f-dba5035cce0c@efficios.com>
In-Reply-To: <20250823161654.935413328@linutronix.de>
On 2025-08-23 12:40, Thomas Gleixner wrote:
> After removing the various condition bits earlier, it turns out that
> one extra piece of information is needed to avoid setting
> event::sched_switch and TIF_NOTIFY_RESUME unconditionally on every
> context switch.
>
> The update of the RSEQ user space memory is only required when either
>
> the task was interrupted in user space and schedules
>
> or
>
> the CPU or MM CID changes in schedule() independent of the entry mode
>
> Right now only the interrupt-from-user information is available.
>
> Add an event flag, which is set when the CPU, the MM CID, or both change.
We should figure out what to do for powerpc's dynamic NUMA node ID
to CPU mapping here.
>
> Evaluate this event in the scheduler to decide whether the sched_switch
> event and the TIF bit need to be set.
>
> It's an extra conditional in context_switch(), but the cost of
> unconditionally handling RSEQ after every context switch to user
> space is far more significant. The boolean logic used minimizes this
> to a single conditional branch.
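Side note: the "single conditional branch" property comes from
combining the u8 flags with bitwise '|' and '&' rather than the
short-circuiting '||'/'&&'. A minimal user-space sketch of the idea
(struct and names are illustrative stand-ins, not the kernel's types):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the u8 flags of struct rseq_event */
struct ev {
	uint8_t sched_switch;
	uint8_t ids_changed;
	uint8_t user_irq;
	uint8_t has_rseq;
};

static bool need_notify(const struct ev *ev)
{
	/*
	 * Bitwise operators evaluate both operands without the
	 * short-circuit branches that '||'/'&&' would emit, so the
	 * compiler can reduce this to loads, OR/AND and one test.
	 */
	return (ev->user_irq | ev->ids_changed) & ev->has_rseq;
}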
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> fs/exec.c | 2 -
> include/linux/rseq.h | 81 +++++++++++++++++++++++++++++++++++++++++----
> include/linux/rseq_types.h | 11 +++++-
> kernel/rseq.c | 2 -
> kernel/sched/core.c | 7 +++
> kernel/sched/sched.h | 5 ++
> 6 files changed, 95 insertions(+), 13 deletions(-)
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1775,7 +1775,7 @@ static int bprm_execve(struct linux_binp
> force_fatal_sig(SIGSEGV);
>
> sched_mm_cid_after_execve(current);
> - rseq_sched_switch_event(current);
> + rseq_force_update();
> current->in_execve = 0;
>
> return retval;
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -9,7 +9,8 @@ void __rseq_handle_notify_resume(struct
>
> static inline void rseq_handle_notify_resume(struct pt_regs *regs)
> {
> - if (current->rseq_event.has_rseq)
> + /* '&' is intentional to spare one conditional branch */
> + if (current->rseq_event.sched_switch & current->rseq_event.has_rseq)
> __rseq_handle_notify_resume(regs);
> }
>
> @@ -31,12 +32,75 @@ static inline void rseq_signal_deliver(s
> }
> }
>
> -/* Raised from context switch and exevce to force evaluation on exit to user */
> -static inline void rseq_sched_switch_event(struct task_struct *t)
> +static inline void rseq_raise_notify_resume(struct task_struct *t)
> {
> - if (t->rseq_event.has_rseq) {
> - t->rseq_event.sched_switch = true;
> - set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> + set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
> +}
> +
> +/* Invoked from context switch to force evaluation on exit to user */
> +static __always_inline void rseq_sched_switch_event(struct task_struct *t)
> +{
> + struct rseq_event *ev = &t->rseq_event;
> +
> + if (IS_ENABLED(CONFIG_GENERIC_IRQ_ENTRY)) {
> + /*
> + * Avoid a boat load of conditionals by using simple logic
> + * to determine whether NOTIFY_RESUME needs to be raised.
> + *
> + * It's required when the CPU or MM CID has changed or
> + * the entry was from user space.
> + */
> + bool raise = (ev->user_irq | ev->ids_changed) & ev->has_rseq;
> +
> + if (raise) {
> + ev->sched_switch = true;
> + rseq_raise_notify_resume(t);
> + }
> + } else {
> + if (ev->has_rseq) {
> + t->rseq_event.sched_switch = true;
> + rseq_raise_notify_resume(t);
> + }
> + }
> +}
> +
> +/*
> + * Invoked from __set_task_cpu() when a task migrates to enforce an IDs
> + * update.
> + *
> + * This does not raise TIF_NOTIFY_RESUME as that happens in
> + * rseq_sched_switch_event().
> + */
> +static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu)
> +{
> + t->rseq_event.ids_changed = true;
> +}
> +
> +/*
> + * Invoked from switch_mm_cid() in context switch when the task gets a MM
> + * CID assigned.
> + *
> + * This does not raise TIF_NOTIFY_RESUME as that happens in
> + * rseq_sched_switch_event().
> + */
> +static __always_inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid)
> +{
> + /*
> + * Requires a comparison as the switch_mm_cid() code does not
> + * provide a conditional for it readily. So avoid excessive updates
> + * when nothing changes.
> + */
> + if (t->rseq_ids.mm_cid != cid)
> + t->rseq_event.ids_changed = true;
> +}
> +
> +/* Enforce a full update after RSEQ registration and when execve() failed */
> +static inline void rseq_force_update(void)
> +{
> + if (current->rseq_event.has_rseq) {
> + current->rseq_event.ids_changed = true;
> + current->rseq_event.sched_switch = true;
> + rseq_raise_notify_resume(current);
> }
> }
>
> @@ -53,7 +117,7 @@ static inline void rseq_sched_switch_eve
> static inline void rseq_virt_userspace_exit(void)
> {
> if (current->rseq_event.sched_switch)
> - set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + rseq_raise_notify_resume(current);
> }
>
> /*
> @@ -90,6 +154,9 @@ static inline void rseq_execve(struct ta
> static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
> static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
> static inline void rseq_sched_switch_event(struct task_struct *t) { }
> +static inline void rseq_sched_set_task_cpu(struct task_struct *t, unsigned int cpu) { }
> +static inline void rseq_sched_set_task_mm_cid(struct task_struct *t, unsigned int cid) { }
> +static inline void rseq_force_update(void) { }
> static inline void rseq_virt_userspace_exit(void) { }
> static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
> static inline void rseq_execve(struct task_struct *t) { }
> --- a/include/linux/rseq_types.h
> +++ b/include/linux/rseq_types.h
> @@ -10,20 +10,27 @@ struct rseq;
> * struct rseq_event - Storage for rseq related event management
> * @all: Compound to initialize and clear the data efficiently
> * @events: Compund to access events with a single load/store
> - * @sched_switch: True if the task was scheduled out
> + * @sched_switch: True if the task was scheduled and needs update on
> + * exit to user
> + * @ids_changed: Indicator that IDs need to be updated
> * @user_irq: True on interrupt entry from user mode
> * @has_rseq: True if the task has a rseq pointer installed
> * @error: Compound error code for the slow path to analyze
> * @fatal: User space data corrupted or invalid
> + *
> + * @sched_switch and @ids_changed must be adjacent and the combo must be
> + * 16bit aligned to allow a single store, when both are set at the same
> + * time in the scheduler.
> */
> struct rseq_event {
> union {
> u64 all;
> struct {
> union {
> - u16 events;
> + u32 events;
> struct {
> u8 sched_switch;
> + u8 ids_changed;
> u8 user_irq;
> };
> };
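The adjacency and alignment constraint documented in the kerneldoc
above is what permits setting both flags with one store. A small
user-space sketch of the technique (the union below is illustrative,
not the kernel's actual layout):

#include <stdint.h>

union switch_flags {
	uint16_t both;		/* requires 16-bit alignment */
	struct {
		uint8_t sched_switch;
		uint8_t ids_changed;
	};
};

static void set_both(union switch_flags *f)
{
	/*
	 * 0x0101 writes 1 into both adjacent u8 fields with a single
	 * aligned 16-bit store, independent of endianness.
	 */
	f->both = 0x0101;
}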
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -459,7 +459,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user
> * are updated before returning to user-space.
> */
> current->rseq_event.has_rseq = true;
> - rseq_sched_switch_event(current);
> + rseq_force_update();
>
> return 0;
> }
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5150,7 +5150,6 @@ prepare_task_switch(struct rq *rq, struc
> kcov_prepare_switch(prev);
> sched_info_switch(rq, prev, next);
> perf_event_task_sched_out(prev, next);
> - rseq_sched_switch_event(prev);
> fire_sched_out_preempt_notifiers(prev, next);
> kmap_local_sched_out();
> prepare_task(next);
> @@ -5348,6 +5347,12 @@ context_switch(struct rq *rq, struct tas
> /* switch_mm_cid() requires the memory barriers above. */
> switch_mm_cid(rq, prev, next);
>
> + /*
> + * Tell rseq that the task was scheduled in. Must be after
> + * switch_mm_cid() to get the TIF flag set.
> + */
> + rseq_sched_switch_event(next);
> +
> prepare_lock_switch(rq, next, rf);
>
> /* Here we just switch the register state and the stack. */
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2181,6 +2181,7 @@ static inline void __set_task_cpu(struct
> smp_wmb();
> WRITE_ONCE(task_thread_info(p)->cpu, cpu);
> p->wake_cpu = cpu;
> + rseq_sched_set_task_cpu(p, cpu);
The combination of the patch "rseq: Simplify the event notification"
and this one ends up moving those three rseq_migrate() calls into
__set_task_cpu():
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be00629f0ba4..695c23939345 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3364,7 +3364,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
- rseq_migrate(p);
sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}
@@ -4795,7 +4794,6 @@ int sched_cgroup_fork(struct task_struct *p, struct kernel_clone_args *kargs)
p->sched_task_group = tg;
}
#endif
- rseq_migrate(p);
/*
* We're setting the CPU for the first time, we don't migrate,
* so use __set_task_cpu().
@@ -4859,7 +4857,6 @@ void wake_up_new_task(struct task_struct *p)
* as we're not fully set-up yet.
*/
p->recent_used_cpu = task_cpu(p);
- rseq_migrate(p);
__set_task_cpu(p, select_task_rq(p, task_cpu(p), &wake_flags));
rq = __task_rq_lock(p, &rf);
update_rq_clock(rq);
AFAIR those were placed in the callers to benefit from the conditional
in set_task_cpu():

	if (task_cpu(p) != new_cpu) {

Perhaps it's not a big deal, but I think it's worth pointing out.
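If keeping that guard is desirable, a sketch of what it could look
like inside the helper (assuming the cached value lives in
rseq_ids.cpu_id, mirroring the rseq_ids.mm_cid comparison above; a
suggestion, not what the patch does):

static __always_inline void rseq_sched_set_task_cpu(struct task_struct *t,
						    unsigned int cpu)
{
	/* Hypothetical: only flag an IDs update on an actual change */
	if (t->rseq_ids.cpu_id != cpu)
		t->rseq_event.ids_changed = true;
}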
Thanks,
Mathieu
> #endif /* CONFIG_SMP */
> }
>
> @@ -3778,8 +3779,10 @@ static inline void switch_mm_cid(struct
> mm_cid_put_lazy(prev);
> prev->mm_cid = -1;
> }
> - if (next->mm_cid_active)
> + if (next->mm_cid_active) {
> next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next, next->mm);
> + rseq_sched_set_task_mm_cid(next, next->mm_cid);
> + }
> }
>
> #else /* !CONFIG_SCHED_MM_CID: */
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com