linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	LKML <linux-kernel@vger.kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>, Paolo Bonzini <pbonzini@redhat.com>,
	Sean Christopherson <seanjc@google.com>,
	Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
	Peter Zijlstra <peterz@infradead.org>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	x86@kernel.org, Arnd Bergmann <arnd@arndb.de>,
	Heiko Carstens <hca@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Huacai Chen <chenhuacai@kernel.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>
Subject: Re: [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run()
Date: Mon, 25 Aug 2025 13:54:19 -0400	[thread overview]
Message-ID: <d8c69b7a-43ca-41b7-af8a-6eb1772c55a4@efficios.com> (raw)
In-Reply-To: <20250823161653.711118277@linutronix.de>

On 2025-08-23 12:39, Thomas Gleixner wrote:
> Hypervisors invoke resume_user_mode_work() before entering the guest, which
> clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user
> space context available to them, so the rseq notify handler skips
> inspecting the critical section, but updates the CPU/MM CID values
> unconditionally so that the eventual pending rseq event is not lost on the
> way to user space.
> 
> This is a pointless exercise as the task might be rescheduled before
> actually returning to user space and it creates unnecessary work in the
> vcpu_run() loops.

One question here: AFAIU, this removes the updates to the cpu_id_start,
cpu_id, mm_cid, and node_id fields on exit to virt usermode. This means
that while the virt guest is running in usermode, the host hypervisor
process has stale rseq fields, until it eventually returns to the
hypervisor's host userspace (from ioctl).

Considering the rseq uapi documentation, this should not matter.
Each of those fields have this statement:

"This field should only be read by the thread which registered this data
structure."

I can however think of use-cases for reading the rseq fields from other
hypervisor threads to figure out information about thread placement.
Doing so would however go against the documented uapi.

I'd rather ask whether anyone is misusing this uapi in that way before
going ahead with the change, just to prevent surprises.

I'm OK with the re-trigger of rseq, as it does indeed appear to fix
an issue, but I'm concerned about the ABI impact of skipping the
rseq_update_cpu_node_id() on return to virt userspace.

Thoughts ?

Thanks,

Mathieu

> 
> It's way more efficient to ignore that invocation based on @regs == NULL
> and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the
> vcpu_run() loop before returning from the ioctl().
> 
> This ensures that a pending RSEQ update is not lost and the IDs are updated
> before returning to user space.
> 
> Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into
> a NOOP.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Cc: Wei Liu <wei.liu@kernel.org>
> Cc: Dexuan Cui <decui@microsoft.com>
> ---
>   drivers/hv/mshv_root_main.c |    2 +
>   include/linux/rseq.h        |   17 +++++++++
>   kernel/rseq.c               |   76 +++++++++++++++++++++++---------------------
>   virt/kvm/kvm_main.c         |    3 +
>   4 files changed, 62 insertions(+), 36 deletions(-)
> 
> --- a/drivers/hv/mshv_root_main.c
> +++ b/drivers/hv/mshv_root_main.c
> @@ -585,6 +585,8 @@ static long mshv_run_vp_with_root_schedu
>   		}
>   	} while (!vp->run.flags.intercept_suspend);
>   
> +	rseq_virt_userspace_exit();
> +
>   	return ret;
>   }
>   
> --- a/include/linux/rseq.h
> +++ b/include/linux/rseq.h
> @@ -38,6 +38,22 @@ static __always_inline void rseq_exit_to
>   }
>   
>   /*
> + * KVM/HYPERV invoke resume_user_mode_work() before entering guest mode,
> + * which clears TIF_NOTIFY_RESUME. To avoid updating user space RSEQ in
> + * that case just to do it eventually again before returning to user space,
> + * the entry resume_user_mode_work() invocation is ignored as the register
> + * argument is NULL.
> + *
> + * After returning from guest mode, they have to invoke this function to
> + * re-raise TIF_NOTIFY_RESUME if necessary.
> + */
> +static inline void rseq_virt_userspace_exit(void)
> +{
> +	if (current->rseq_event_pending)
> +		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> +}
> +
> +/*
>    * If parent process has a registered restartable sequences area, the
>    * child inherits. Unregister rseq for a clone with CLONE_VM set.
>    */
> @@ -68,6 +84,7 @@ static inline void rseq_execve(struct ta
>   static inline void rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { }
>   static inline void rseq_sched_switch_event(struct task_struct *t) { }
> +static inline void rseq_virt_userspace_exit(void) { }
>   static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags) { }
>   static inline void rseq_execve(struct task_struct *t) { }
>   static inline void rseq_exit_to_user_mode(void) { }
> --- a/kernel/rseq.c
> +++ b/kernel/rseq.c
> @@ -422,50 +422,54 @@ void __rseq_handle_notify_resume(struct
>   {
>   	struct task_struct *t = current;
>   	int ret, sig;
> +	bool event;
> +
> +	/*
> +	 * If invoked from hypervisors before entering the guest via
> +	 * resume_user_mode_work(), then @regs is a NULL pointer.
> +	 *
> +	 * resume_user_mode_work() clears TIF_NOTIFY_RESUME and re-raises
> +	 * it before returning from the ioctl() to user space when
> +	 * rseq_event.sched_switch is set.
> +	 *
> +	 * So it's safe to ignore here instead of pointlessly updating it
> +	 * in the vcpu_run() loop.
> +	 */
> +	if (!regs)
> +		return;
>   
>   	if (unlikely(t->flags & PF_EXITING))
>   		return;
>   
>   	/*
> -	 * If invoked from hypervisors or IO-URING, then @regs is a NULL
> -	 * pointer, so fixup cannot be done. If the syscall which led to
> -	 * this invocation was invoked inside a critical section, then it
> -	 * will either end up in this code again or a possible violation of
> -	 * a syscall inside a critical region can only be detected by the
> -	 * debug code in rseq_syscall() in a debug enabled kernel.
> +	 * Read and clear the event pending bit first. If the task
> +	 * was not preempted or migrated or a signal is on the way,
> +	 * there is no point in doing any of the heavy lifting here
> +	 * on production kernels. In that case TIF_NOTIFY_RESUME
> +	 * was raised by some other functionality.
> +	 *
> +	 * This is correct because the read/clear operation is
> +	 * guarded against scheduler preemption, which makes it CPU
> +	 * local atomic. If the task is preempted right after
> +	 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> +	 * again and this function is invoked another time _before_
> +	 * the task is able to return to user mode.
> +	 *
> +	 * On a debug kernel, invoke the fixup code unconditionally
> +	 * with the result handed in to allow the detection of
> +	 * inconsistencies.
>   	 */
> -	if (regs) {
> -		/*
> -		 * Read and clear the event pending bit first. If the task
> -		 * was not preempted or migrated or a signal is on the way,
> -		 * there is no point in doing any of the heavy lifting here
> -		 * on production kernels. In that case TIF_NOTIFY_RESUME
> -		 * was raised by some other functionality.
> -		 *
> -		 * This is correct because the read/clear operation is
> -		 * guarded against scheduler preemption, which makes it CPU
> -		 * local atomic. If the task is preempted right after
> -		 * re-enabling preemption then TIF_NOTIFY_RESUME is set
> -		 * again and this function is invoked another time _before_
> -		 * the task is able to return to user mode.
> -		 *
> -		 * On a debug kernel, invoke the fixup code unconditionally
> -		 * with the result handed in to allow the detection of
> -		 * inconsistencies.
> -		 */
> -		bool event;
> -
> -		scoped_guard(RSEQ_EVENT_GUARD) {
> -			event = t->rseq_event_pending;
> -			t->rseq_event_pending = false;
> -		}
> +	scoped_guard(RSEQ_EVENT_GUARD) {
> +		event = t->rseq_event_pending;
> +		t->rseq_event_pending = false;
> +	}
>   
> -		if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> -			ret = rseq_ip_fixup(regs, event);
> -			if (unlikely(ret < 0))
> -				goto error;
> -		}
> +	if (IS_ENABLED(CONFIG_DEBUG_RSEQ) || event) {
> +		ret = rseq_ip_fixup(regs, event);
> +		if (unlikely(ret < 0))
> +			goto error;
>   	}
> +
>   	if (unlikely(rseq_update_cpu_node_id(t)))
>   		goto error;
>   	return;
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -49,6 +49,7 @@
>   #include <linux/lockdep.h>
>   #include <linux/kthread.h>
>   #include <linux/suspend.h>
> +#include <linux/rseq.h>
>   
>   #include <asm/processor.h>
>   #include <asm/ioctl.h>
> @@ -4466,6 +4467,8 @@ static long kvm_vcpu_ioctl(struct file *
>   		r = kvm_arch_vcpu_ioctl_run(vcpu);
>   		vcpu->wants_to_run = false;
>   
> +		rseq_virt_userspace_exit();
> +
>   		trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
>   		break;
>   	}
> 


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

  reply	other threads:[~2025-08-25 17:54 UTC|newest]

Thread overview: 94+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-23 16:39 [patch V2 00/37] rseq: Optimize exit to user space Thomas Gleixner
2025-08-23 16:39 ` [patch V2 01/37] rseq: Avoid pointless evaluation in __rseq_notify_resume() Thomas Gleixner
2025-08-25 15:39   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 02/37] rseq: Condense the inline stubs Thomas Gleixner
2025-08-25 15:40   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 03/37] resq: Move algorithm comment to top Thomas Gleixner
2025-08-25 15:41   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 04/37] rseq: Remove the ksig argument from rseq_handle_notify_resume() Thomas Gleixner
2025-08-25 15:43   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 05/37] rseq: Simplify registration Thomas Gleixner
2025-08-25 15:44   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 06/37] rseq: Simplify the event notification Thomas Gleixner
2025-08-25 17:36   ` Mathieu Desnoyers
2025-09-02 13:39     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 07/37] rseq, virt: Retrigger RSEQ after vcpu_run() Thomas Gleixner
2025-08-25 17:54   ` Mathieu Desnoyers [this message]
2025-08-25 20:24     ` Sean Christopherson
2025-09-02 15:37       ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 08/37] rseq: Avoid CPU/MM CID updates when no event pending Thomas Gleixner
2025-08-25 18:02   ` Mathieu Desnoyers
2025-09-02 13:41     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 09/37] rseq: Introduce struct rseq_event Thomas Gleixner
2025-08-25 18:11   ` Mathieu Desnoyers
2025-09-02 13:45     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 10/37] entry: Cleanup header Thomas Gleixner
2025-08-25 18:13   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 11/37] entry: Remove syscall_enter_from_user_mode_prepare() Thomas Gleixner
2025-08-23 16:39 ` [patch V2 12/37] entry: Inline irqentry_enter/exit_from/to_user_mode() Thomas Gleixner
2025-08-23 16:39 ` [patch V2 13/37] sched: Move MM CID related functions to sched.h Thomas Gleixner
2025-08-25 18:14   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 14/37] rseq: Cache CPU ID and MM CID values Thomas Gleixner
2025-08-25 18:19   ` Mathieu Desnoyers
2025-09-02 13:48     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 15/37] rseq: Record interrupt from user space Thomas Gleixner
2025-08-25 18:29   ` Mathieu Desnoyers
2025-09-02 13:54     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 16/37] rseq: Provide tracepoint wrappers for inline code Thomas Gleixner
2025-08-25 18:32   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 17/37] rseq: Expose lightweight statistics in debugfs Thomas Gleixner
2025-08-25 18:34   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 18/37] rseq: Provide static branch for runtime debugging Thomas Gleixner
2025-08-25 18:36   ` Mathieu Desnoyers
2025-08-25 20:30   ` Michael Jeanson
2025-09-02 13:56     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 19/37] rseq: Provide and use rseq_update_user_cs() Thomas Gleixner
2025-08-25 19:16   ` Mathieu Desnoyers
2025-09-02 15:19     ` Thomas Gleixner
2025-08-23 16:39 ` [patch V2 20/37] rseq: Replace the debug crud Thomas Gleixner
2025-08-26 14:21   ` Mathieu Desnoyers
2025-08-23 16:39 ` [patch V2 21/37] rseq: Make exit debugging static branch based Thomas Gleixner
2025-08-26 14:23   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 22/37] rseq: Use static branch for syscall exit debug when GENERIC_IRQ_ENTRY=y Thomas Gleixner
2025-08-26 14:28   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 23/37] rseq: Provide and use rseq_set_uids() Thomas Gleixner
2025-08-26 14:52   ` Mathieu Desnoyers
2025-09-02 14:08     ` Thomas Gleixner
2025-09-02 16:33       ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 24/37] rseq: Seperate the signal delivery path Thomas Gleixner
2025-08-26 15:08   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 25/37] rseq: Rework the TIF_NOTIFY handler Thomas Gleixner
2025-08-26 15:12   ` Mathieu Desnoyers
2025-09-02 17:32     ` Thomas Gleixner
2025-09-04  9:52       ` Sean Christopherson
2025-09-04 10:53         ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 26/37] rseq: Optimize event setting Thomas Gleixner
2025-08-26 15:26   ` Mathieu Desnoyers
2025-09-02 14:17     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 27/37] rseq: Implement fast path for exit to user Thomas Gleixner
2025-08-26 15:33   ` Mathieu Desnoyers
2025-09-02 18:31     ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 28/37] rseq: Switch to fast path processing on " Thomas Gleixner
2025-08-26 15:40   ` Mathieu Desnoyers
2025-08-27 13:45     ` Mathieu Desnoyers
2025-09-02 18:36       ` Thomas Gleixner
2025-08-23 16:40 ` [patch V2 29/37] entry: Split up exit_to_user_mode_prepare() Thomas Gleixner
2025-08-26 15:41   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 30/37] rseq: Split up rseq_exit_to_user_mode() Thomas Gleixner
2025-08-26 15:45   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 31/37] asm-generic: Provide generic TIF infrastructure Thomas Gleixner
2025-08-23 20:37   ` Arnd Bergmann
2025-08-25 19:33   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 32/37] x86: Use generic TIF bits Thomas Gleixner
2025-08-25 19:34   ` Mathieu Desnoyers
2025-08-23 16:40 ` [patch V2 33/37] s390: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 34/37] loongarch: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 35/37] riscv: " Thomas Gleixner
2025-08-23 16:40 ` [patch V2 36/37] rseq: Switch to TIF_RSEQ if supported Thomas Gleixner
2025-08-25 19:39   ` Mathieu Desnoyers
2025-08-25 20:02   ` Sean Christopherson
2025-09-02 11:03     ` Thomas Gleixner
2025-09-04 10:08       ` Sean Christopherson
2025-08-23 16:40 ` [patch V2 37/37] entry/rseq: Optimize for TIF_RSEQ on exit Thomas Gleixner
2025-08-25 19:43   ` Mathieu Desnoyers
2025-08-25 15:10 ` [patch V2 00/37] rseq: Optimize exit to user space Mathieu Desnoyers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d8c69b7a-43ca-41b7-af8a-6eb1772c55a4@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=arnd@arndb.de \
    --cc=axboe@kernel.dk \
    --cc=boqun.feng@gmail.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=chenhuacai@kernel.org \
    --cc=decui@microsoft.com \
    --cc=hca@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    --cc=paulmck@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=seanjc@google.com \
    --cc=svens@linux.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).