[PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task
@ 2025-08-26  0:40 Sean Christopherson
  2025-08-26  0:40 ` [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed Sean Christopherson
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-08-26  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel,
	Sebastian Andrzej Siewior

Fix a bug where KVM attempts to wake a vhost task that has already exited in
response to a fatal signal, and tack on a few cleanups to harden against
introducing similar bugs in the future.

Somehow, this only started causing problems when commit 56180dd20c19 ("futex:
Use RCU-based per-CPU reference counting instead of rcuref_t") landed.  I have
no idea why the futex changes exposed the bug, and I don't care all that much,
as this is firmly a KVM bug.

Sean Christopherson (3):
  vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task
    was killed
  vhost_task: Allow caller to omit handle_sigkill() callback
  KVM: x86/mmu: Don't register a sigkill callback for NX hugepage
    recovery tasks

 arch/x86/kvm/mmu/mmu.c           |  9 ++----
 include/linux/sched/vhost_task.h |  1 +
 kernel/vhost_task.c              | 52 +++++++++++++++++++++++++++++---
 3 files changed, 51 insertions(+), 11 deletions(-)


base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26  0:40 [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Sean Christopherson
@ 2025-08-26  0:40 ` Sean Christopherson
  2025-08-26  7:52   ` Michael S. Tsirkin
  2025-08-26  0:40 ` [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback Sean Christopherson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2025-08-26  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel,
	Sebastian Andrzej Siewior

Add a vhost_task_wake_safe() variant to handle the case where a vhost task
has exited due to a signal, i.e. before being explicitly stopped by the
owner of the task, and use the "safe" API in KVM when waking NX hugepage
recovery tasks.  This fixes a bug where KVM will attempt to wake a task
that has exited, which ultimately results in all manner of badness, e.g.

  Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
  CPU: 51 UID: 0 PID: 53807 Comm: tee Tainted: G S         O        6.17.0-smp--38183c31756a-next #826 NONE
  Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
  Hardware name: Google LLC Indus/Indus_QC_03, BIOS 30.110.0 09/13/2024
  RIP: 0010:queued_spin_lock_slowpath+0x123/0x250
  Code: ... <48> 89 8c 02 c0 da 47 a2 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8
  RSP: 0018:ffffbf55cffe7cf8 EFLAGS: 00010006
  RAX: ff0e899fff0e8562 RBX: 0000000000d00000 RCX: ffffa39b40aefac0
  RDX: 0000000000000030 RSI: fffffffffffffff8 RDI: ffffa39d0592e68c
  RBP: 0000000000d00000 R08: 00000000ffffff80 R09: 0000000400000000
  R10: ffffa36cce4fe401 R11: 0000000000000800 R12: 0000000000000003
  R13: 0000000000000000 R14: ffffa39d0592e68c R15: ffffa39b9e672000
  FS:  00007f233b2e9740(0000) GS:ffffa39b9e672000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f233b39fda0 CR3: 00000004d031f002 CR4: 00000000007726f0
  PKRU: 55555554
  Call Trace:
   <TASK>
   _raw_spin_lock_irqsave+0x50/0x60
   try_to_wake_up+0x4f/0x5d0
   set_nx_huge_pages+0xe4/0x1c0 [kvm]
   param_attr_store+0x89/0xf0
   module_attr_store+0x1e/0x30
   kernfs_fop_write_iter+0xe4/0x160
   vfs_write+0x2cb/0x420
   ksys_write+0x7f/0xf0
   do_syscall_64+0x6f/0x1f0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7f233b4178b3
  R13: 0000000000000002 R14: 00000000226ff3d0 R15: 0000000000000002
   </TASK>

Provide an API in vhost task instead of forcing KVM to solve the problem,
as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
along with a new lock to protect said flag.  In general, forcing simple
usage of vhost task to care about signals _and_ take non-trivial action to
do the right thing isn't developer friendly, and is likely to lead to
similar bugs in the future.

Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c           |  2 +-
 include/linux/sched/vhost_task.h |  1 +
 kernel/vhost_task.c              | 42 +++++++++++++++++++++++++++++---
 3 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e838cb6c9e1..d11730467fd4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7376,7 +7376,7 @@ static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
 	struct vhost_task *nx_thread = READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);
 
 	if (nx_thread)
-		vhost_task_wake(nx_thread);
+		vhost_task_wake_safe(nx_thread);
 }
 
 static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 25446c5d3508..5d5c187088f7 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -10,5 +10,6 @@ struct vhost_task *vhost_task_create(bool (*fn)(void *),
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 void vhost_task_wake(struct vhost_task *vtsk);
+void vhost_task_wake_safe(struct vhost_task *vtsk);
 
 #endif /* _LINUX_SCHED_VHOST_TASK_H */
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index bc738fa90c1d..5aa8ddf88d01 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -67,18 +67,54 @@ static int vhost_task_fn(void *data)
 	do_exit(0);
 }
 
+static void __vhost_task_wake(struct vhost_task *vtsk)
+{
+	wake_up_process(vtsk->task);
+}
+
 /**
  * vhost_task_wake - wakeup the vhost_task
  * @vtsk: vhost_task to wake
  *
- * wake up the vhost_task worker thread
+ * Wake up the vhost_task worker thread.  The caller is responsible for ensuring
+ * that the task hasn't exited.
  */
 void vhost_task_wake(struct vhost_task *vtsk)
 {
-	wake_up_process(vtsk->task);
+	/*
+	 * Checking VHOST_TASK_FLAGS_KILLED can race with signal delivery, but
+	 * a race can only result in false negatives and this is just a sanity
+	 * check, i.e. if KILLED is set, the caller is buggy no matter what.
+	 */
+	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
+		return;
+
+	__vhost_task_wake(vtsk);
 }
 EXPORT_SYMBOL_GPL(vhost_task_wake);
 
+/**
+ * vhost_task_wake_safe - wakeup the vhost_task if it hasn't been killed
+ * @vtsk: vhost_task to wake
+ *
+ * Wake up the vhost_task worker thread if the task hasn't exited, e.g. due to
+ * a signal.
+ */
+void vhost_task_wake_safe(struct vhost_task *vtsk)
+{
+	guard(mutex)(&vtsk->exit_mutex);
+
+	/* Attempting to wake a task that has been explicitly stopped is a bug. */
+	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))
+		return;
+
+	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
+		return;
+
+	__vhost_task_wake(vtsk);
+}
+EXPORT_SYMBOL_GPL(vhost_task_wake_safe);
+
 /**
  * vhost_task_stop - stop a vhost_task
  * @vtsk: vhost_task to stop
@@ -91,7 +127,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
 	mutex_lock(&vtsk->exit_mutex);
 	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
 		set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
-		vhost_task_wake(vtsk);
+		__vhost_task_wake(vtsk);
 	}
 	mutex_unlock(&vtsk->exit_mutex);
 
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback
  2025-08-26  0:40 [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Sean Christopherson
  2025-08-26  0:40 ` [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed Sean Christopherson
@ 2025-08-26  0:40 ` Sean Christopherson
  2025-08-26  6:29   ` Sebastian Andrzej Siewior
  2025-08-26  0:40 ` [PATCH 3/3] KVM: x86/mmu: Don't register a sigkill callback for NX hugepage recovery tasks Sean Christopherson
  2025-08-28  2:19 ` [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Lei Yang
  3 siblings, 1 reply; 11+ messages in thread
From: Sean Christopherson @ 2025-08-26  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel,
	Sebastian Andrzej Siewior

Now that vhost_task provides an API to safely wake a task without relying
on the caller to react to signalas, make handle_sigkill() optional and
WARN if the "unsafe" vhost_task_wake() is used without hooking sigkill.
Requiring the user to react to sigkill adds no meaningful value, e.g. it
didn't help KVM do anything useful, and adding a sanity check in
vhost_task_wake() gives developers a hint as to what needs to be done in
response to sigkill.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 kernel/vhost_task.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index 5aa8ddf88d01..e0ec6bfe61e6 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -59,7 +59,8 @@ static int vhost_task_fn(void *data)
 	 */
 	if (!test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
 		set_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags);
-		vtsk->handle_sigkill(vtsk->data);
+		if (vtsk->handle_sigkill)
+			vtsk->handle_sigkill(vtsk->data);
 	}
 	mutex_unlock(&vtsk->exit_mutex);
 	complete(&vtsk->exited);
@@ -81,6 +82,13 @@ static void __vhost_task_wake(struct vhost_task *vtsk)
  */
 void vhost_task_wake(struct vhost_task *vtsk)
 {
+	/*
+	 * Waking the task without taking exit_mutex is safe if and only if the
+	 * implementation hooks sigkill, as that's the only way the caller can
+	 * know if the task has exited prematurely due to a signal.
+	 */
+	WARN_ON_ONCE(!vtsk->handle_sigkill);
+
 	/*
 	 * Checking VHOST_TASK_FLAGS_KILLED can race with signal delivery, but
 	 * a race can only result in false negatives and this is just a sanity
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] KVM: x86/mmu: Don't register a sigkill callback for NX hugepage recovery tasks
  2025-08-26  0:40 [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Sean Christopherson
  2025-08-26  0:40 ` [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed Sean Christopherson
  2025-08-26  0:40 ` [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback Sean Christopherson
@ 2025-08-26  0:40 ` Sean Christopherson
  2025-08-28  2:19 ` [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Lei Yang
  3 siblings, 0 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-08-26  0:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Michael S. Tsirkin,
	Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel,
	Sebastian Andrzej Siewior

Don't register a sigkill callback with vhost_task when creating NX hugepage
recovery threads now that said callback is optional.  In addition to
removing what is effectively dead code, not registering a sigkill "handler"
also guards against improper use of vhost_task_wake().

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d11730467fd4..dd90cf8a8170 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7677,10 +7677,6 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-static void kvm_nx_huge_page_recovery_worker_kill(void *data)
-{
-}
-
 static bool kvm_nx_huge_page_recovery_worker(void *data)
 {
 	struct kvm *kvm = data;
@@ -7713,8 +7709,7 @@ static int kvm_mmu_start_lpage_recovery(struct once *once)
 	struct vhost_task *nx_thread;
 
 	kvm->arch.nx_huge_page_last = get_jiffies_64();
-	nx_thread = vhost_task_create(kvm_nx_huge_page_recovery_worker,
-				      kvm_nx_huge_page_recovery_worker_kill,
+	nx_thread = vhost_task_create(kvm_nx_huge_page_recovery_worker, NULL,
 				      kvm, "kvm-nx-lpage-recovery");
 
 	if (IS_ERR(nx_thread))
-- 
2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback
  2025-08-26  0:40 ` [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback Sean Christopherson
@ 2025-08-26  6:29   ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-26  6:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev, linux-kernel

On 2025-08-25 17:40:10 [-0700], Sean Christopherson wrote:
> Now that vhost_task provides an API to safely wake a task without relying
> on the caller to react to signalas, make handle_sigkill() optional and
                                  ^
Sounds Spanish :)

…

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26  0:40 ` [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed Sean Christopherson
@ 2025-08-26  7:52   ` Michael S. Tsirkin
  2025-08-26 14:03     ` Sean Christopherson
  0 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-08-26  7:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jason Wang, kvm, virtualization, netdev,
	linux-kernel, Sebastian Andrzej Siewior

On Mon, Aug 25, 2025 at 05:40:09PM -0700, Sean Christopherson wrote:
> Add a vhost_task_wake_safe() variant to handle the case where a vhost task
> has exited due to a signal, i.e. before being explicitly stopped by the
> owner of the task, and use the "safe" API in KVM when waking NX hugepage
> recovery tasks.  This fixes a bug where KVM will attempt to wake a task
> that has exited, which ultimately results in all manner of badness, e.g.
> 
>   Oops: general protection fault, probably for non-canonical address 0xff0e899fa1566052: 0000 [#1] SMP
>   CPU: 51 UID: 0 PID: 53807 Comm: tee Tainted: G S         O        6.17.0-smp--38183c31756a-next #826 NONE
>   Tainted: [S]=CPU_OUT_OF_SPEC, [O]=OOT_MODULE
>   Hardware name: Google LLC Indus/Indus_QC_03, BIOS 30.110.0 09/13/2024
>   RIP: 0010:queued_spin_lock_slowpath+0x123/0x250
>   Code: ... <48> 89 8c 02 c0 da 47 a2 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8
>   RSP: 0018:ffffbf55cffe7cf8 EFLAGS: 00010006
>   RAX: ff0e899fff0e8562 RBX: 0000000000d00000 RCX: ffffa39b40aefac0
>   RDX: 0000000000000030 RSI: fffffffffffffff8 RDI: ffffa39d0592e68c
>   RBP: 0000000000d00000 R08: 00000000ffffff80 R09: 0000000400000000
>   R10: ffffa36cce4fe401 R11: 0000000000000800 R12: 0000000000000003
>   R13: 0000000000000000 R14: ffffa39d0592e68c R15: ffffa39b9e672000
>   FS:  00007f233b2e9740(0000) GS:ffffa39b9e672000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007f233b39fda0 CR3: 00000004d031f002 CR4: 00000000007726f0
>   PKRU: 55555554
>   Call Trace:
>    <TASK>
>    _raw_spin_lock_irqsave+0x50/0x60
>    try_to_wake_up+0x4f/0x5d0
>    set_nx_huge_pages+0xe4/0x1c0 [kvm]
>    param_attr_store+0x89/0xf0
>    module_attr_store+0x1e/0x30
>    kernfs_fop_write_iter+0xe4/0x160
>    vfs_write+0x2cb/0x420
>    ksys_write+0x7f/0xf0
>    do_syscall_64+0x6f/0x1f0
>    entry_SYSCALL_64_after_hwframe+0x4b/0x53
>   RIP: 0033:0x7f233b4178b3
>   R13: 0000000000000002 R14: 00000000226ff3d0 R15: 0000000000000002
>    </TASK>
> 
> Provide an API in vhost task instead of forcing KVM to solve the problem,
> as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
> along with a new lock to protect said flag.  In general, forcing simple
> usage of vhost task to care about signals _and_ take non-trivial action to
> do the right thing isn't developer friendly, and is likely to lead to
> similar bugs in the future.
> 
> Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
> Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>

OK but I dislike the API.

Default APIs should be safe. So vhost_task_wake_safe should be
vhost_task_wake

This also reduces the changes to kvm.


It does not look like we need the "unsafe" variant, so pls drop it.
If we do need it, it should be called __vhost_task_wake.






> ---
>  arch/x86/kvm/mmu/mmu.c           |  2 +-
>  include/linux/sched/vhost_task.h |  1 +
>  kernel/vhost_task.c              | 42 +++++++++++++++++++++++++++++---
>  3 files changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e838cb6c9e1..d11730467fd4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7376,7 +7376,7 @@ static void kvm_wake_nx_recovery_thread(struct kvm *kvm)
>  	struct vhost_task *nx_thread = READ_ONCE(kvm->arch.nx_huge_page_recovery_thread);
>  
>  	if (nx_thread)
> -		vhost_task_wake(nx_thread);
> +		vhost_task_wake_safe(nx_thread);
>  }
>  
>  static int get_nx_huge_pages(char *buffer, const struct kernel_param *kp)
> diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
> index 25446c5d3508..5d5c187088f7 100644
> --- a/include/linux/sched/vhost_task.h
> +++ b/include/linux/sched/vhost_task.h
> @@ -10,5 +10,6 @@ struct vhost_task *vhost_task_create(bool (*fn)(void *),
>  void vhost_task_start(struct vhost_task *vtsk);
>  void vhost_task_stop(struct vhost_task *vtsk);
>  void vhost_task_wake(struct vhost_task *vtsk);
> +void vhost_task_wake_safe(struct vhost_task *vtsk);
>  
>  #endif /* _LINUX_SCHED_VHOST_TASK_H */
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index bc738fa90c1d..5aa8ddf88d01 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c
> @@ -67,18 +67,54 @@ static int vhost_task_fn(void *data)
>  	do_exit(0);
>  }
>  
> +static void __vhost_task_wake(struct vhost_task *vtsk)
> +{
> +	wake_up_process(vtsk->task);
> +}
> +
>  /**
>   * vhost_task_wake - wakeup the vhost_task
>   * @vtsk: vhost_task to wake
>   *
> - * wake up the vhost_task worker thread
> + * Wake up the vhost_task worker thread.  The caller is responsible for ensuring
> + * that the task hasn't exited.
>   */
>  void vhost_task_wake(struct vhost_task *vtsk)
>  {
> -	wake_up_process(vtsk->task);
> +	/*
> +	 * Checking VHOST_TASK_FLAGS_KILLED can race with signal delivery, but
> +	 * a race can only result in false negatives and this is just a sanity
> +	 * check, i.e. if KILLED is set, the caller is buggy no matter what.
> +	 */
> +	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
> +		return;
> +
> +	__vhost_task_wake(vtsk);
>  }
>  EXPORT_SYMBOL_GPL(vhost_task_wake);
>  
> +/**
> + * vhost_task_wake_safe - wakeup the vhost_task if it hasn't been killed
> + * @vtsk: vhost_task to wake
> + *
> + * Wake up the vhost_task worker thread if the task hasn't exited, e.g. due to
> + * a signal.
> + */
> +void vhost_task_wake_safe(struct vhost_task *vtsk)
> +{
> +	guard(mutex)(&vtsk->exit_mutex);
> +
> +	/* Attempting to wake a task that has been explicitly stopped is a bug. */
> +	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))
> +		return;
> +
> +	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> +		return;
> +
> +	__vhost_task_wake(vtsk);
> +}
> +EXPORT_SYMBOL_GPL(vhost_task_wake_safe);
> +
>  /**
>   * vhost_task_stop - stop a vhost_task
>   * @vtsk: vhost_task to stop
> @@ -91,7 +127,7 @@ void vhost_task_stop(struct vhost_task *vtsk)
>  	mutex_lock(&vtsk->exit_mutex);
>  	if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
>  		set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);
> -		vhost_task_wake(vtsk);
> +		__vhost_task_wake(vtsk);
>  	}
>  	mutex_unlock(&vtsk->exit_mutex);
>  
> -- 
> 2.51.0.261.g7ce5a0a67e-goog


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26  7:52   ` Michael S. Tsirkin
@ 2025-08-26 14:03     ` Sean Christopherson
  2025-08-26 14:15       ` Sebastian Andrzej Siewior
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sean Christopherson @ 2025-08-26 14:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Jason Wang, kvm, virtualization, netdev,
	linux-kernel, Sebastian Andrzej Siewior

On Tue, Aug 26, 2025, Michael S. Tsirkin wrote:
> On Mon, Aug 25, 2025 at 05:40:09PM -0700, Sean Christopherson wrote:
> > Provide an API in vhost task instead of forcing KVM to solve the problem,
> > as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
> > along with a new lock to protect said flag.  In general, forcing simple
> > usage of vhost task to care about signals _and_ take non-trivial action to
> > do the right thing isn't developer friendly, and is likely to lead to
> > similar bugs in the future.
> > 
> > Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
> > Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> > Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> OK but I dislike the API.

FWIW, I don't love it either.

> Default APIs should be safe. So vhost_task_wake_safe should be
> vhost_task_wake
> 
> This also reduces the changes to kvm.
> 
> 
> It does not look like we need the "unsafe" variant, so pls drop it.

vhost_vq_work_queue() calls

  vhost_worker_queue()
  |
  -> worker->ops->wakeup(worker)
     |
     -> vhost_task_wakeup()
        |
        -> vhost_task_wake()

while holding RCU and so can't sleep.

	rcu_read_lock();
	worker = rcu_dereference(vq->worker);
	if (worker) {
		queued = true;
		vhost_worker_queue(worker, work);
	}
	rcu_read_unlock();

And the call from __vhost_worker_flush() is done while holding a vhost_worker.mutex.
That's probably ok?  But there are many paths that lead to __vhost_worker_flush(),
which makes it difficult to audit all flows.  So even if there is an easy change
for the RCU conflict, I wouldn't be comfortable adding a mutex_lock() to so many
flows in a patch that needs to go to stable@.

> If we do need it, it should be called __vhost_task_wake.

I initially had that, but didn't like that vhost_task_wake() wouldn't call
__vhost_task_wake(), i.e. wouldn't follow the semi-standard pattern of the
no-underscores function being a wrapper for the double-underscores function.

I'm definitely not opposed to that though (or any other naming options).  Sans
comments, this was my other idea for names:


static void ____vhost_task_wake(struct vhost_task *vtsk)
{
	wake_up_process(vtsk->task);
}

void __vhost_task_wake(struct vhost_task *vtsk)
{
	WARN_ON_ONCE(!vtsk->handle_sigkill);

	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
		return;

	____vhost_task_wake(vtsk);
}
EXPORT_SYMBOL_GPL(__vhost_task_wake);

void vhost_task_wake(struct vhost_task *vtsk)
{
	guard(mutex)(&vtsk->exit_mutex);

	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))
		return;

	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
		return;

	____vhost_task_wake(vtsk);
}
EXPORT_SYMBOL_GPL(vhost_task_wake);

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26 14:03     ` Sean Christopherson
@ 2025-08-26 14:15       ` Sebastian Andrzej Siewior
  2025-08-26 14:40       ` Michael S. Tsirkin
  2025-08-26 14:43       ` Michael S. Tsirkin
  2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-08-26 14:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Michael S. Tsirkin, Paolo Bonzini, Jason Wang, kvm,
	virtualization, netdev, linux-kernel

On 2025-08-26 07:03:33 [-0700], Sean Christopherson wrote:
> And the call from __vhost_worker_flush() is done while holding a vhost_worker.mutex.
> That's probably ok?  But there are many paths that lead to __vhost_worker_flush(),
> which makes it difficult to audit all flows.  So even if there is an easy change
> for the RCU conflict, I wouldn't be comfortable adding a mutex_lock() to so many
> flows in a patch that needs to go to stable@.

If I may throw something else into the mix: If you do "early"
get_task_struct() on the thread (within the thread), then you could wake
it even after its do_exit() since the task_struct would remain valid.
Once you remove it from all structs where it can be found, you would do
the final put_task_struct().

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26 14:03     ` Sean Christopherson
  2025-08-26 14:15       ` Sebastian Andrzej Siewior
@ 2025-08-26 14:40       ` Michael S. Tsirkin
  2025-08-26 14:43       ` Michael S. Tsirkin
  2 siblings, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-08-26 14:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jason Wang, kvm, virtualization, netdev,
	linux-kernel, Sebastian Andrzej Siewior

On Tue, Aug 26, 2025 at 07:03:33AM -0700, Sean Christopherson wrote:
> On Tue, Aug 26, 2025, Michael S. Tsirkin wrote:
> > On Mon, Aug 25, 2025 at 05:40:09PM -0700, Sean Christopherson wrote:
> > > Provide an API in vhost task instead of forcing KVM to solve the problem,
> > > as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
> > > along with a new lock to protect said flag.  In general, forcing simple
> > > usage of vhost task to care about signals _and_ take non-trivial action to
> > > do the right thing isn't developer friendly, and is likely to lead to
> > > similar bugs in the future.
> > > 
> > > Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
> > > Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> > > Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > 
> > OK but I dislike the API.
> 
> FWIW, I don't love it either.
> 
> > Default APIs should be safe. So vhost_task_wake_safe should be
> > vhost_task_wake
> > 
> > This also reduces the changes to kvm.
> > 
> > 
> > It does not look like we need the "unsafe" variant, so pls drop it.
> 
> vhost_vq_work_queue() calls
> 
>   vhost_worker_queue()
>   |
>   -> worker->ops->wakeup(worker)
>      |
>      -> vhost_task_wakeup()
>         |
>         -> vhost_task_wake()
> 
> while holding RCU and so can't sleep.
> 
> 	rcu_read_lock();
> 	worker = rcu_dereference(vq->worker);
> 	if (worker) {
> 		queued = true;
> 		vhost_worker_queue(worker, work);
> 	}
> 	rcu_read_unlock();
> 
> And the call from __vhost_worker_flush() is done while holding a vhost_worker.mutex.
> That's probably ok?  But there are many paths that lead to __vhost_worker_flush(),
> which makes it difficult to audit all flows.  So even if there is an easy change
> for the RCU conflict, I wouldn't be comfortable adding a mutex_lock() to so many
> flows in a patch that needs to go to stable@.
> 
> > If we do need it, it should be called __vhost_task_wake.
> 
> I initially had that, but didn't like that vhost_task_wake() wouldn't call
> __vhost_task_wake(), i.e. wouldn't follow the semi-standard pattern of the
> no-underscores function being a wrapper for the double-underscores function.

Eh. that's not really a standard. the standard is that __ is an unsafe
variant.

> I'm definitely not opposed to that though (or any other naming options).  Sans
> comments, this was my other idea for names:
> 
> 
> static void ____vhost_task_wake(struct vhost_task *vtsk)

That's way too many __. Just vhost_task_wake_up_process will do.

> {
> 	wake_up_process(vtsk->task);
> }



Pls add docs explaining the usage of __vhost_task_wake
and vhost_task_wake respectively.

> void __vhost_task_wake(struct vhost_task *vtsk)
> {
> 	WARN_ON_ONCE(!vtsk->handle_sigkill);
> 
> 	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
> 		return;

Add comments here please explaining why we warn.

> 	____vhost_task_wake(vtsk);
> }
> EXPORT_SYMBOL_GPL(__vhost_task_wake);



> void vhost_task_wake(struct vhost_task *vtsk)


> {
> 	guard(mutex)(&vtsk->exit_mutex);
> 
> 	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))

Add comments here please explaining why we warn.

> 		return;
> 
> 	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> 		return;
> 
> 	____vhost_task_wake(vtsk);
> }
> EXPORT_SYMBOL_GPL(vhost_task_wake);


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed
  2025-08-26 14:03     ` Sean Christopherson
  2025-08-26 14:15       ` Sebastian Andrzej Siewior
  2025-08-26 14:40       ` Michael S. Tsirkin
@ 2025-08-26 14:43       ` Michael S. Tsirkin
  2 siblings, 0 replies; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-08-26 14:43 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Jason Wang, kvm, virtualization, netdev,
	linux-kernel, Sebastian Andrzej Siewior

On Tue, Aug 26, 2025 at 07:03:33AM -0700, Sean Christopherson wrote:
> On Tue, Aug 26, 2025, Michael S. Tsirkin wrote:
> > On Mon, Aug 25, 2025 at 05:40:09PM -0700, Sean Christopherson wrote:
> > > Provide an API in vhost task instead of forcing KVM to solve the problem,
> > > as KVM would literally just add an equivalent to VHOST_TASK_FLAGS_KILLED,
> > > along with a new lock to protect said flag.  In general, forcing simple
> > > usage of vhost task to care about signals _and_ take non-trivial action to
> > > do the right thing isn't developer friendly, and is likely to lead to
> > > similar bugs in the future.
> > > 
> > > Debugged-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Link: https://lore.kernel.org/all/aKkLEtoDXKxAAWju@google.com
> > > Link: https://lore.kernel.org/all/aJ_vEP2EHj6l0xRT@google.com
> > > Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Fixes: d96c77bd4eeb ("KVM: x86: switch hugepage recovery thread to vhost_task")
> > > Cc: stable@vger.kernel.org
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > 
> > OK but I dislike the API.
> 
> FWIW, I don't love it either.
> 
> > Default APIs should be safe. So vhost_task_wake_safe should be
> > vhost_task_wake
> > 
> > This also reduces the changes to kvm.
> > 
> > 
> > It does not look like we need the "unsafe" variant, so pls drop it.
> 
> vhost_vq_work_queue() calls
> 
>   vhost_worker_queue()
>   |
>   -> worker->ops->wakeup(worker)
>      |
>      -> vhost_task_wakeup()
>         |
>         -> vhost_task_wake()
> 
> while holding RCU and so can't sleep.
> 
> 	rcu_read_lock();
> 	worker = rcu_dereference(vq->worker);
> 	if (worker) {
> 		queued = true;
> 		vhost_worker_queue(worker, work);
> 	}
> 	rcu_read_unlock();

OK so this needs to change to call the __ variant then.


> And the call from __vhost_worker_flush() is done while holding a vhost_worker.mutex.
> That's probably ok?  But there are many paths that lead to __vhost_worker_flush(),
> which makes it difficult to audit all flows.  So even if there is an easy change
> for the RCU conflict, I wouldn't be comfortable adding a mutex_lock() to so many
> flows in a patch that needs to go to stable@.
> 
> > If we do need it, it should be called __vhost_task_wake.
> 
> I initially had that, but didn't like that vhost_task_wake() wouldn't call
> __vhost_task_wake(), i.e. wouldn't follow the semi-standard pattern of the
> no-underscores function being a wrapper for the double-underscores function.
> 
> I'm definitely not opposed to that though (or any other naming options).  Sans
> comments, this was my other idea for names:
> 
> 
> static void ____vhost_task_wake(struct vhost_task *vtsk)
> {
> 	wake_up_process(vtsk->task);
> }
> 
> void __vhost_task_wake(struct vhost_task *vtsk)
> {
> 	WARN_ON_ONCE(!vtsk->handle_sigkill);
> 
> 	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)))
> 		return;
> 
> 	____vhost_task_wake(vtsk);
> }
> EXPORT_SYMBOL_GPL(__vhost_task_wake);
> 
> void vhost_task_wake(struct vhost_task *vtsk)
> {
> 	guard(mutex)(&vtsk->exit_mutex);
> 
> 	if (WARN_ON_ONCE(test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)))
> 		return;
> 
> 	if (test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags))
> 		return;
> 
> 	____vhost_task_wake(vtsk);
> }
> EXPORT_SYMBOL_GPL(vhost_task_wake);


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task
  2025-08-26  0:40 [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Sean Christopherson
                   ` (2 preceding siblings ...)
  2025-08-26  0:40 ` [PATCH 3/3] KVM: x86/mmu: Don't register a sigkill callback for NX hugepage recovery tasks Sean Christopherson
@ 2025-08-28  2:19 ` Lei Yang
  3 siblings, 0 replies; 11+ messages in thread
From: Lei Yang @ 2025-08-28  2:19 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Michael S. Tsirkin, Jason Wang, kvm,
	virtualization, netdev, linux-kernel, Sebastian Andrzej Siewior

Tested this series of patches with vhost-net regression tests,
everything works fine.

Tested-by: Lei Yang <leiyang@redhat.com>

On Tue, Aug 26, 2025 at 8:40 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Fix a bug where KVM attempts to wake a vhost task that has already exited in
> response to a fatal signal, and tack on a few cleanups to harden against
> introducing similar bugs in the future.
>
> Somehow, this only started causing problems when commit 56180dd20c19 ("futex:
> Use RCU-based per-CPU reference counting instead of rcuref_t") landed.  I have
> no idea why the futex changes exposed the bug, and I don't care all that much,
> as this is firmly a KVM bug.
>
> Sean Christopherson (3):
>   vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task
>     was killed
>   vhost_task: Allow caller to omit handle_sigkill() callback
>   KVM: x86/mmu: Don't register a sigkill callback for NX hugepage
>     recovery tasks
>
>  arch/x86/kvm/mmu/mmu.c           |  9 ++----
>  include/linux/sched/vhost_task.h |  1 +
>  kernel/vhost_task.c              | 52 +++++++++++++++++++++++++++++---
>  3 files changed, 51 insertions(+), 11 deletions(-)
>
>
> base-commit: 1b237f190eb3d36f52dffe07a40b5eb210280e00
> --
> 2.51.0.261.g7ce5a0a67e-goog
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-08-28  2:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-26  0:40 [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Sean Christopherson
2025-08-26  0:40 ` [PATCH 1/3] vhost_task: KVM: Don't wake KVM x86's recovery thread if vhost task was killed Sean Christopherson
2025-08-26  7:52   ` Michael S. Tsirkin
2025-08-26 14:03     ` Sean Christopherson
2025-08-26 14:15       ` Sebastian Andrzej Siewior
2025-08-26 14:40       ` Michael S. Tsirkin
2025-08-26 14:43       ` Michael S. Tsirkin
2025-08-26  0:40 ` [PATCH 2/3] vhost_task: Allow caller to omit handle_sigkill() callback Sean Christopherson
2025-08-26  6:29   ` Sebastian Andrzej Siewior
2025-08-26  0:40 ` [PATCH 3/3] KVM: x86/mmu: Don't register a sigkill callback for NX hugepage recovery tasks Sean Christopherson
2025-08-28  2:19 ` [PATCH 0/3] vhost_task: KVM: Fix a race where KVM wakes an exited task Lei Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).