* [PATCH 1/3] KVM: Set file_operations.owner appropriately for all such structures
2023-10-18 20:46 [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
@ 2023-10-18 20:46 ` Sean Christopherson
2023-10-18 20:46 ` [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
` (2 subsequent siblings)
3 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2023-10-18 20:46 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Al Viro, David Matlack
Set .owner for all KVM-owned file types so that the KVM module is pinned
until any files with callbacks back into KVM are completely freed. Using
"struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
while there are active VMs, doesn't provide full protection.
Userspace can invoke delete_module() the instant the last reference to KVM
is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
then it's possible for KVM to be preempted and deleted/unloaded before KVM
fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
in, it will jump to a code page that is no longer mapped.
Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is
reachable, e.g. VMs are active, then the vendor module is loaded.
To reduce the probability of forgetting to set .owner entirely, use
THIS_MODULE for stats files where KVM does not call back into vendor code.
This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes
several other file types that have been buggy since their introduction.
Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"")
Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/debugfs.c | 1 +
virt/kvm/kvm_main.c | 11 ++++++++---
2 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index ee8c4c3496ed..eea6ea7f14af 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -182,6 +182,7 @@ static int kvm_mmu_rmaps_stat_release(struct inode *inode, struct file *file)
}
static const struct file_operations mmu_rmaps_stat_fops = {
+ .owner = THIS_MODULE,
.open = kvm_mmu_rmaps_stat_open,
.read = seq_read,
.llseek = seq_lseek,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 486800a7024b..1e65a506985f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3887,7 +3887,7 @@ static int kvm_vcpu_release(struct inode *inode, struct file *filp)
return 0;
}
-static const struct file_operations kvm_vcpu_fops = {
+static struct file_operations kvm_vcpu_fops = {
.release = kvm_vcpu_release,
.unlocked_ioctl = kvm_vcpu_ioctl,
.mmap = kvm_vcpu_mmap,
@@ -4081,6 +4081,7 @@ static int kvm_vcpu_stats_release(struct inode *inode, struct file *file)
}
static const struct file_operations kvm_vcpu_stats_fops = {
+ .owner = THIS_MODULE,
.read = kvm_vcpu_stats_read,
.release = kvm_vcpu_stats_release,
.llseek = noop_llseek,
@@ -4431,7 +4432,7 @@ static int kvm_device_release(struct inode *inode, struct file *filp)
return 0;
}
-static const struct file_operations kvm_device_fops = {
+static struct file_operations kvm_device_fops = {
.unlocked_ioctl = kvm_device_ioctl,
.release = kvm_device_release,
KVM_COMPAT(kvm_device_ioctl),
@@ -4759,6 +4760,7 @@ static int kvm_vm_stats_release(struct inode *inode, struct file *file)
}
static const struct file_operations kvm_vm_stats_fops = {
+ .owner = THIS_MODULE,
.read = kvm_vm_stats_read,
.release = kvm_vm_stats_release,
.llseek = noop_llseek,
@@ -5060,7 +5062,7 @@ static long kvm_vm_compat_ioctl(struct file *filp,
}
#endif
-static const struct file_operations kvm_vm_fops = {
+static struct file_operations kvm_vm_fops = {
.release = kvm_vm_release,
.unlocked_ioctl = kvm_vm_ioctl,
.llseek = noop_llseek,
@@ -6095,6 +6097,9 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
goto err_async_pf;
kvm_chardev_ops.owner = module;
+ kvm_vm_fops.owner = module;
+ kvm_vcpu_fops.owner = module;
+ kvm_device_fops.owner = module;
kvm_preempt_ops.sched_in = kvm_sched_in;
kvm_preempt_ops.sched_out = kvm_sched_out;
--
2.42.0.655.g421f12c284-goog
^ permalink raw reply related [flat|nested] 9+ messages in thread

* [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed
2023-10-18 20:46 [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
2023-10-18 20:46 ` [PATCH 1/3] KVM: Set file_operations.owner appropriately for all such structures Sean Christopherson
@ 2023-10-18 20:46 ` Sean Christopherson
2023-10-23 18:24 ` David Matlack
2023-10-24 8:21 ` Xu Yilun
2023-10-18 20:46 ` [PATCH 3/3] Revert "KVM: Prevent module exit until all VMs are freed" Sean Christopherson
2023-12-01 23:30 ` [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
3 siblings, 2 replies; 9+ messages in thread
From: Sean Christopherson @ 2023-10-18 20:46 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Al Viro, David Matlack
Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
completion queue, i.e. when a VM and all its vCPUs are being destroyed.
KVM must ensure that none of its workqueue callbacks is running when the
last reference to the KVM _module_ is put. Gifting a reference to the
associated VM prevents the workqueue callback from dereferencing freed
vCPU/VM memory, but does not prevent the KVM module from being unloaded
before the callback completes.
Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: events async_pf_execute [kvm]
RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
Call Trace:
<TASK>
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
---[ end trace 0000000000000000 ]---
INFO: task kworker/8:1:251 blocked for more than 120 seconds.
Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
Workqueue: events async_pf_execute [kvm]
Call Trace:
<TASK>
__schedule+0x33f/0xa40
schedule+0x53/0xc0
schedule_timeout+0x12a/0x140
__wait_for_common+0x8d/0x1d0
__flush_work.isra.0+0x19f/0x2c0
kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
kvm_put_kvm+0x1c1/0x320 [kvm]
async_pf_execute+0x198/0x260 [kvm]
process_one_work+0x145/0x2d0
worker_thread+0x27e/0x3a0
kthread+0xba/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20
</TASK>
If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
then there's no need to gift async_pf_execute() a reference because all
invocations of async_pf_execute() will be forced to complete before the
vCPU and its VM are destroyed/freed. And that in turn fixes the module
unloading bug as __fput() won't do module_put() on the last vCPU reference
until the vCPU has been freed, e.g. if closing the vCPU file also puts the
last reference to the KVM module.
Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
freed") *tried* to fix the module refcounting issue by having VMs grab a
reference to the module, but that only made the bug slightly harder to hit
as it gave async_pf_execute() a bit more time to complete before the KVM
module could be unloaded.
Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
Cc: stable@vger.kernel.org
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/async_pf.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index e033c79d528e..7aeb9d1f43b1 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -87,7 +87,6 @@ static void async_pf_execute(struct work_struct *work)
__kvm_vcpu_wake_up(vcpu);
mmput(mm);
- kvm_put_kvm(vcpu->kvm);
}
void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
@@ -114,7 +113,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
#else
if (cancel_work_sync(&work->work)) {
mmput(work->mm);
- kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
kmem_cache_free(async_pf_cache, work);
}
#endif
@@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
list_first_entry(&vcpu->async_pf.done,
typeof(*work), link);
list_del(&work->link);
+
+ spin_unlock(&vcpu->async_pf.lock);
+
+ /*
+ * The async #PF is "done", but KVM must wait for the work item
+ * itself, i.e. async_pf_execute(), to run to completion. If
+ * KVM is a module, KVM must ensure *no* code owned by the KVM
+ * (the module) can be run after the last call to module_put(),
+ * i.e. after the last reference to the last vCPU's file is put.
+ */
+ flush_work(&work->work);
kmem_cache_free(async_pf_cache, work);
+ spin_lock(&vcpu->async_pf.lock);
}
spin_unlock(&vcpu->async_pf.lock);
@@ -186,7 +196,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
work->arch = *arch;
work->mm = current->mm;
mmget(work->mm);
- kvm_get_kvm(work->vcpu->kvm);
INIT_WORK(&work->work, async_pf_execute);
--
2.42.0.655.g421f12c284-goog
* Re: [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed
2023-10-18 20:46 ` [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
@ 2023-10-23 18:24 ` David Matlack
2023-10-24 8:21 ` Xu Yilun
1 sibling, 0 replies; 9+ messages in thread
From: David Matlack @ 2023-10-23 18:24 UTC (permalink / raw)
To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, linux-kernel, Al Viro
On 2023-10-18 01:46 PM, Sean Christopherson wrote:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, i.e. when a VM and all its vCPUs are being destroyed.
nit:
... or when the guest toggles CR0.PG or async_pf support.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put. Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
[...]
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
Blegh! Thanks for the fix.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
2023-10-18 20:46 ` [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
2023-10-23 18:24 ` David Matlack
@ 2023-10-24 8:21 ` Xu Yilun
2023-10-24 15:49 ` Sean Christopherson
1 sibling, 1 reply; 9+ messages in thread
From: Xu Yilun @ 2023-10-24 8:21 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Al Viro, David Matlack
On Wed, Oct 18, 2023 at 01:46:23PM -0700, Sean Christopherson wrote:
> Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its
> completion queue, i.e. when a VM and all its vCPUs are being destroyed.
> KVM must ensure that none of its workqueue callbacks is running when the
> last reference to the KVM _module_ is put. Gifting a reference to the
> associated VM prevents the workqueue callback from dereferencing freed
> vCPU/VM memory, but does not prevent the KVM module from being unloaded
> before the callback completes.
>
> Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from
> async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will
> result in deadlock. async_pf_execute() can't return until kvm_put_kvm()
> finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes:
>
> WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm]
> Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass
> CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> Workqueue: events async_pf_execute [kvm]
> RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm]
> Call Trace:
> <TASK>
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
> ---[ end trace 0000000000000000 ]---
> INFO: task kworker/8:1:251 blocked for more than 120 seconds.
> Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000
> Workqueue: events async_pf_execute [kvm]
> Call Trace:
> <TASK>
> __schedule+0x33f/0xa40
> schedule+0x53/0xc0
> schedule_timeout+0x12a/0x140
> __wait_for_common+0x8d/0x1d0
> __flush_work.isra.0+0x19f/0x2c0
> kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm]
> kvm_arch_destroy_vm+0x78/0x1b0 [kvm]
> kvm_put_kvm+0x1c1/0x320 [kvm]
> async_pf_execute+0x198/0x260 [kvm]
> process_one_work+0x145/0x2d0
> worker_thread+0x27e/0x3a0
> kthread+0xba/0xe0
> ret_from_fork+0x2d/0x50
> ret_from_fork_asm+0x11/0x20
> </TASK>
>
> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue,
> then there's no need to gift async_pf_execute() a reference because all
> invocations of async_pf_execute() will be forced to complete before the
> vCPU and its VM are destroyed/freed. And that in turn fixes the module
> unloading bug as __fput() won't do module_put() on the last vCPU reference
> until the vCPU has been freed, e.g. if closing the vCPU file also puts the
> last reference to the KVM module.
>
> Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are
> freed") *tried* to fix the module refcounting issue by having VMs grab a
> reference to the module, but that only made the bug slightly harder to hit
> as it gave async_pf_execute() a bit more time to complete before the KVM
> module could be unloaded.
>
> Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
> Cc: stable@vger.kernel.org
> Cc: David Matlack <dmatlack@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> virt/kvm/async_pf.c | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
> index e033c79d528e..7aeb9d1f43b1 100644
> --- a/virt/kvm/async_pf.c
> +++ b/virt/kvm/async_pf.c
> @@ -87,7 +87,6 @@ static void async_pf_execute(struct work_struct *work)
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
> - kvm_put_kvm(vcpu->kvm);
> }
>
> void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> @@ -114,7 +113,6 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> #else
> if (cancel_work_sync(&work->work)) {
> mmput(work->mm);
> - kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
> kmem_cache_free(async_pf_cache, work);
> }
> #endif
> @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> list_first_entry(&vcpu->async_pf.done,
> typeof(*work), link);
> list_del(&work->link);
> +
> + spin_unlock(&vcpu->async_pf.lock);
> +
> + /*
> + * The async #PF is "done", but KVM must wait for the work item
> + * itself, i.e. async_pf_execute(), to run to completion. If
> + * KVM is a module, KVM must ensure *no* code owned by the KVM
> + * (the module) can be run after the last call to module_put(),
> + * i.e. after the last reference to the last vCPU's file is put.
> + */
> + flush_work(&work->work);
I see the flush_work() is inside the check:
while (!list_empty(&vcpu->async_pf.done))
Is it possible that all async #PFs are already completed, but the work item,
i.e. async_pf_execute(), has not yet finished before this check? E.g. the
work is scheduled out after kvm_arch_async_page_present_queued() and
all APF_READY requests have been handled. In that case the work
synchronization will be skipped...
Thanks,
Yilun
> kmem_cache_free(async_pf_cache, work);
> + spin_lock(&vcpu->async_pf.lock);
> }
> spin_unlock(&vcpu->async_pf.lock);
>
> @@ -186,7 +196,6 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> work->arch = *arch;
> work->mm = current->mm;
> mmget(work->mm);
> - kvm_get_kvm(work->vcpu->kvm);
>
> INIT_WORK(&work->work, async_pf_execute);
>
> --
> 2.42.0.655.g421f12c284-goog
>
* Re: [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed
2023-10-24 8:21 ` Xu Yilun
@ 2023-10-24 15:49 ` Sean Christopherson
2023-10-25 4:25 ` Xu Yilun
0 siblings, 1 reply; 9+ messages in thread
From: Sean Christopherson @ 2023-10-24 15:49 UTC (permalink / raw)
To: Xu Yilun; +Cc: Paolo Bonzini, kvm, linux-kernel, Al Viro, David Matlack
On Tue, Oct 24, 2023, Xu Yilun wrote:
> On Wed, Oct 18, 2023 at 01:46:23PM -0700, Sean Christopherson wrote:
> > @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > list_first_entry(&vcpu->async_pf.done,
> > typeof(*work), link);
> > list_del(&work->link);
> > +
> > + spin_unlock(&vcpu->async_pf.lock);
> > +
> > + /*
> > + * The async #PF is "done", but KVM must wait for the work item
> > + * itself, i.e. async_pf_execute(), to run to completion. If
> > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > + * (the module) can be run after the last call to module_put(),
> > + * i.e. after the last reference to the last vCPU's file is put.
> > + */
> > + flush_work(&work->work);
>
> I see the flush_work() is inside the check:
>
> while (!list_empty(&vcpu->async_pf.done))
>
> Is it possible all async_pf are already completed but the work item,
> i.e. async_pf_execute, is not completed before this check? That the
> work is scheduled out after kvm_arch_async_page_present_queued() and
> all APF_READY requests have been handled. In this case the work
> synchronization will be skipped...
Good gravy. Yes, I assumed KVM wouldn't be so crazy to delete the work before it
completed, but I obviously didn't see this comment in async_pf_execute():
/*
* apf may be freed by kvm_check_async_pf_completion() after
* this point
*/
The most straightforward fix I see is to also flush the work in
kvm_check_async_pf_completion(), and then delete the comment. The downside is
that there's a small chance a vCPU could be delayed waiting for the work to
complete, but that's a very, very small chance, and likely a very small delay.
kvm_arch_async_page_present_queued() unconditionally makes a new request, i.e. will
effectively delay entering the guest, so the remaining work is really just:
trace_kvm_async_pf_completed(addr, cr2_or_gpa);
__kvm_vcpu_wake_up(vcpu);
mmput(mm);
And mmput() can't drop the last reference to the page tables if the vCPU is
still alive.
I think I'll spin off the async #PF fix to a separate series. There are other
tangentially related cleanups that can be done, e.g. there's no reason to pin the
page tables while work is queued, async_pf_execute() can do mmget_not_zero() and
then bail if the process is dying. Then there's no need to do mmput() when
canceling work.
* Re: [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed
2023-10-24 15:49 ` Sean Christopherson
@ 2023-10-25 4:25 ` Xu Yilun
0 siblings, 0 replies; 9+ messages in thread
From: Xu Yilun @ 2023-10-25 4:25 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm, linux-kernel, Al Viro, David Matlack
On Tue, Oct 24, 2023 at 08:49:24AM -0700, Sean Christopherson wrote:
> On Tue, Oct 24, 2023, Xu Yilun wrote:
> > On Wed, Oct 18, 2023 at 01:46:23PM -0700, Sean Christopherson wrote:
> > > @@ -126,7 +124,19 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
> > > list_first_entry(&vcpu->async_pf.done,
> > > typeof(*work), link);
> > > list_del(&work->link);
> > > +
> > > + spin_unlock(&vcpu->async_pf.lock);
> > > +
> > > + /*
> > > + * The async #PF is "done", but KVM must wait for the work item
> > > + * itself, i.e. async_pf_execute(), to run to completion. If
> > > + * KVM is a module, KVM must ensure *no* code owned by the KVM
> > > + * (the module) can be run after the last call to module_put(),
> > > + * i.e. after the last reference to the last vCPU's file is put.
> > > + */
> > > + flush_work(&work->work);
> >
> > I see the flush_work() is inside the check:
> >
> > while (!list_empty(&vcpu->async_pf.done))
> >
> > Is it possible all async_pf are already completed but the work item,
> > i.e. async_pf_execute, is not completed before this check? That the
> > work is scheduled out after kvm_arch_async_page_present_queued() and
> > all APF_READY requests have been handled. In this case the work
> > synchronization will be skipped...
>
> Good gravy. Yes, I assumed KVM wouldn't be so crazy to delete the work before it
> completed, but I obviously didn't see this comment in async_pf_execute():
>
> /*
> * apf may be freed by kvm_check_async_pf_completion() after
> * this point
> */
>
> The most straightforward fix I see is to also flush the work in
> kvm_check_async_pf_completion(), and then delete the comment. The downside is
> that there's a small chance a vCPU could be delayed waiting for the work to
> complete, but that's a very, very small chance, and likely a very small delay.
> kvm_arch_async_page_present_queued() unconditionally makes a new request, i.e. will
> effectively delay entering the guest, so the remaining work is really just:
>
> trace_kvm_async_pf_completed(addr, cr2_or_gpa);
>
> __kvm_vcpu_wake_up(vcpu);
>
> mmput(mm);
>
> And mmput() can't drop the last reference to the page tables if the vCPU is
> still alive.
OK, seems the impact is minor. I'm good with it.
Thanks,
Yilun
>
> I think I'll spin off the async #PF fix to a separate series. There are other
> tangentially related cleanups that can be done, e.g. there's no reason to pin the
> page tables while work is queued, async_pf_execute() can do mmget_not_zero() and
> then bail if the process is dying. Then there's no need to do mmput() when
> canceling work.
* [PATCH 3/3] Revert "KVM: Prevent module exit until all VMs are freed"
2023-10-18 20:46 [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
2023-10-18 20:46 ` [PATCH 1/3] KVM: Set file_operations.owner appropriately for all such structures Sean Christopherson
2023-10-18 20:46 ` [PATCH 2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed Sean Christopherson
@ 2023-10-18 20:46 ` Sean Christopherson
2023-12-01 23:30 ` [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
3 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2023-10-18 20:46 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Al Viro, David Matlack
Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
was actually due to failure to flush a workqueue, not a lack of module
refcounting. Pinning the KVM module until kvm_vm_destroy() doesn't
prevent use-after-free due to the module being unloaded, as userspace can
invoke delete_module() the instant the last reference to KVM is put, i.e.
can cause all KVM code to be unmapped while KVM is actively executing said
code.
Generally speaking, the many instances of module_put(THIS_MODULE)
notwithstanding, outside of a few special paths, a module can never safely
put the last reference to itself without creating deadlock, i.e. something
external to the module *must* put the last reference. In other words,
having VMs grab a reference to the KVM module is futile, pointless, and as
evidenced by the now-reverted commit 70375c2d8fa3 ("Revert "KVM: set owner
of cpu and vm file operations""), actively dangerous.
This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit
5f6de5cbebee925a612856fce6f9182bb3eee0db.
Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
virt/kvm/kvm_main.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1e65a506985f..3b1b9e8dd70c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -115,8 +115,6 @@ EXPORT_SYMBOL_GPL(kvm_debugfs_dir);
static const struct file_operations stat_fops_per_vm;
-static struct file_operations kvm_chardev_ops;
-
static long kvm_vcpu_ioctl(struct file *file, unsigned int ioctl,
unsigned long arg);
#ifdef CONFIG_KVM_COMPAT
@@ -1157,9 +1155,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
if (!kvm)
return ERR_PTR(-ENOMEM);
- /* KVM is pinned via open("/dev/kvm"), the fd passed to this ioctl(). */
- __module_get(kvm_chardev_ops.owner);
-
KVM_MMU_LOCK_INIT(kvm);
mmgrab(current->mm);
kvm->mm = current->mm;
@@ -1279,7 +1274,6 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
out_err_no_srcu:
kvm_arch_free_vm(kvm);
mmdrop(current->mm);
- module_put(kvm_chardev_ops.owner);
return ERR_PTR(r);
}
@@ -1348,7 +1342,6 @@ static void kvm_destroy_vm(struct kvm *kvm)
preempt_notifier_dec();
hardware_disable_all();
mmdrop(mm);
- module_put(kvm_chardev_ops.owner);
}
void kvm_get_kvm(struct kvm *kvm)
--
2.42.0.655.g421f12c284-goog
* Re: [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s)
2023-10-18 20:46 [PATCH 0/3] KVM: Fix KVM-owned file refcounting of KVM module(s) Sean Christopherson
` (2 preceding siblings ...)
2023-10-18 20:46 ` [PATCH 3/3] Revert "KVM: Prevent module exit until all VMs are freed" Sean Christopherson
@ 2023-12-01 23:30 ` Sean Christopherson
3 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2023-12-01 23:30 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini
Cc: kvm, linux-kernel, Al Viro, David Matlack
On Wed, 18 Oct 2023 13:46:21 -0700, Sean Christopherson wrote:
> Clean up a KVM module refcounting mess that Al pointed out in the context
> of the guest_memfd series. The worst behavior was recently introduced by
> an ill-fated attempt to fix a bug in x86's async #PF code. Instead of
> fixing the underlying bug of not flushing a workqueue (see patch 2), KVM
> fudged around the bug by gifting every VM a reference to the KVM module.
>
> That made the reproducer happy (hopefully there was actually a reproducer
> at one point), but it didn't fully fix the use-after-free bug, it just made
> the bug harder to hit. E.g. as pointed out by Al, if kvm_destroy_vm() is
> preempted after putting the last KVM module reference, KVM can be unloaded
> before kvm_destroy_vm() completes, and scheduling back in the associated
> task will explode (preemption isn't strictly required, it's just the most
> obvious path to failure).
>
> [...]
Applied 1 and 3 (the .owner fixes) to kvm-x86 fixes. I'll follow-up with a
separate series to tackle the async #PF mess.
[1/3] KVM: Set file_operations.owner appropriately for all such structures
https://github.com/kvm-x86/linux/commit/087e15206d6a
[2/3] KVM: Always flush async #PF workqueue when vCPU is being destroyed
(no commit info)
[3/3] Revert "KVM: Prevent module exit until all VMs are freed"
https://github.com/kvm-x86/linux/commit/ea61294befd3
--
https://github.com/kvm-x86/linux/tree/next