public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Paul Mackerras <paulus@ozlabs.org>
To: "Cédric Le Goater" <clg@kaod.org>, kvm@vger.kernel.org
Cc: kvm-ppc@vger.kernel.org, David Gibson <david@gibson.dropbear.id.au>
Subject: [PATCH v2] KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device
Date: Mon, 29 Apr 2019 11:24:03 +1000	[thread overview]
Message-ID: <20190429012403.GA11154@blackberry> (raw)
In-Reply-To: <20190426061014.GB12768@blackberry>

Now that we have the possibility of a XIVE or XICS-on-XIVE device being
released while the VM is still running, we need to be careful about
races and potential use-after-free bugs.  Although the kvmppc_xive
struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu
structs are freed, and they are used extensively in both the XIVE native
and XICS-on-XIVE code.

There are various ways in which XIVE code gets invoked:

- VCPU entry and exit, which do push and pull operations on the XIVE hardware
- one_reg get and set functions (vcpu->mutex is held)
- XICS hypercalls (but only inside guest execution, not from
  kvmppc_pseries_do_hcall)
- device creation calls (kvm->lock is held)
- device callbacks - get/set attribute, mmap, pagefault, release/destroy
- set_mapped/clr_mapped calls (kvm->lock is held)
- connect_vcpu calls
- debugfs file read callbacks

Inside a device release function, we know that userspace cannot have an
open file descriptor referring to the device, nor can it have any mmapped
regions from the device.  Therefore the device callbacks are excluded,
as are the connect_vcpu calls (since they need a fd for the device).
Further, since the caller holds the kvm->lock mutex, no other device
creation calls or set/clr_mapped calls can be executing concurrently.

To exclude VCPU execution and XICS hypercalls, we temporarily set
kvm->arch.mmu_ready to 0.  This forces any VCPU task that is trying to
enter the guest to take the kvm->lock mutex, which is held by the caller
of the release function.  Then, sending an IPI to all other CPUs forces
any VCPU currently executing in the guest to exit.

Finally, we take the vcpu->mutex for each VCPU around the process of
cleaning up and freeing its XIVE data structures, in order to exclude
any one_reg get/set calls.

To exclude the debugfs read callbacks, we just need to ensure that
debugfs_remove is called before freeing any data structures.  Once it
returns we know that no CPU can be executing the callbacks (for our
kvmppc_xive instance).

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
---
v2: move debugfs_remove call to eliminate race

 arch/powerpc/kvm/book3s_xive.c        | 51 ++++++++++++++++++++++++++++-------
 arch/powerpc/kvm/book3s_xive_native.c | 43 ++++++++++++++++++++++++-----
 2 files changed, 78 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 922689b..4280cd8 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -846,7 +846,8 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
 
 	/*
 	 * We can't update the state of a "pushed" VCPU, but that
-	 * shouldn't happen.
+	 * shouldn't happen because the vcpu->mutex makes running a
+	 * vcpu mutually exclusive with doing one_reg get/set on it.
 	 */
 	if (WARN_ON(vcpu->arch.xive_pushed))
 		return -EIO;
@@ -1835,7 +1836,7 @@ void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 }
 
 /*
- * Called when device fd is closed
+ * Called when device fd is closed.  kvm->lock is held.
  */
 static void kvmppc_xive_release(struct kvm_device *dev)
 {
@@ -1843,21 +1844,46 @@ static void kvmppc_xive_release(struct kvm_device *dev)
 	struct kvm *kvm = xive->kvm;
 	struct kvm_vcpu *vcpu;
 	int i;
+	int was_ready;
 
 	pr_devel("Releasing xive device\n");
 
+	debugfs_remove(xive->dentry);
+
 	/*
-	 * When releasing the KVM device fd, the vCPUs can still be
-	 * running and we should clean up the vCPU interrupt
-	 * presenters first.
+	 * Clearing mmu_ready temporarily while holding kvm->lock
+	 * is a way of ensuring that no vcpus can enter the guest
+	 * until we drop kvm->lock.  Doing kick_all_cpus_sync()
+	 * ensures that any vcpu executing inside the guest has
+	 * exited the guest.  Once kick_all_cpus_sync() has finished,
+	 * we know that no vcpu can be executing the XIVE push or
+	 * pull code, or executing a XICS hcall.
+	 *
+	 * Since this is the device release function, we know that
+	 * userspace does not have any open fd referring to the
+	 * device.  Therefore there can not be any of the device
+	 * attribute set/get functions being executed concurrently,
+	 * and similarly, the connect_vcpu and set/clr_mapped
+	 * functions also cannot be being executed.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
-		kvmppc_xive_cleanup_vcpu(vcpu);
+	was_ready = kvm->arch.mmu_ready;
+	kvm->arch.mmu_ready = 0;
+	kick_all_cpus_sync();
 
-	debugfs_remove(xive->dentry);
+	/*
+	 * We should clean up the vCPU interrupt presenters first.
+	 */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/*
+		 * Take vcpu->mutex to ensure that no one_reg get/set ioctl
+		 * (i.e. kvmppc_xive_[gs]et_icp) can be done concurrently.
+		 */
+		mutex_lock(&vcpu->mutex);
+		kvmppc_xive_cleanup_vcpu(vcpu);
+		mutex_unlock(&vcpu->mutex);
+	}
 
-	if (kvm)
-		kvm->arch.xive = NULL;
+	kvm->arch.xive = NULL;
 
 	/* Mask and free interrupts */
 	for (i = 0; i <= xive->max_sbid; i++) {
@@ -1870,6 +1896,8 @@ static void kvmppc_xive_release(struct kvm_device *dev)
 	if (xive->vp_base != XIVE_INVALID_VP)
 		xive_native_free_vp_block(xive->vp_base);
 
+	kvm->arch.mmu_ready = was_ready;
+
 	/*
 	 * A reference of the kvmppc_xive pointer is now kept under
 	 * the xive_devices struct of the machine for reuse. It is
@@ -1906,6 +1934,9 @@ struct kvmppc_xive *kvmppc_xive_get_device(struct kvm *kvm, u32 type)
 	return xive;
 }
 
+/*
+ * Create a XICS device with XIVE backend.  kvm->lock is held.
+ */
 static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 0497272a..5e14df1 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -973,21 +973,47 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
 	struct kvm *kvm = xive->kvm;
 	struct kvm_vcpu *vcpu;
 	int i;
+	int was_ready;
 
 	debugfs_remove(xive->dentry);
 
 	pr_devel("Releasing xive native device\n");
 
 	/*
-	 * When releasing the KVM device fd, the vCPUs can still be
-	 * running and we should clean up the vCPU interrupt
-	 * presenters first.
+	 * Clearing mmu_ready temporarily while holding kvm->lock
+	 * is a way of ensuring that no vcpus can enter the guest
+	 * until we drop kvm->lock.  Doing kick_all_cpus_sync()
+	 * ensures that any vcpu executing inside the guest has
+	 * exited the guest.  Once kick_all_cpus_sync() has finished,
+	 * we know that no vcpu can be executing the XIVE push or
+	 * pull code or accessing the XIVE MMIO regions.
+	 *
+	 * Since this is the device release function, we know that
+	 * userspace does not have any open fd or mmap referring to
+	 * the device.  Therefore there can not be any of the
+	 * device attribute set/get, mmap, or page fault functions
+	 * being executed concurrently, and similarly, the
+	 * connect_vcpu and set/clr_mapped functions also cannot
+	 * be being executed.
 	 */
-	kvm_for_each_vcpu(i, vcpu, kvm)
+	was_ready = kvm->arch.mmu_ready;
+	kvm->arch.mmu_ready = 0;
+	kick_all_cpus_sync();
+
+	/*
+	 * We should clean up the vCPU interrupt presenters first.
+	 */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/*
+		 * Take vcpu->mutex to ensure that no one_reg get/set ioctl
+		 * (i.e. kvmppc_xive_native_[gs]et_vp) can be being done.
+		 */
+		mutex_lock(&vcpu->mutex);
 		kvmppc_xive_native_cleanup_vcpu(vcpu);
+		mutex_unlock(&vcpu->mutex);
+	}
 
-	if (kvm)
-		kvm->arch.xive = NULL;
+	kvm->arch.xive = NULL;
 
 	for (i = 0; i <= xive->max_sbid; i++) {
 		if (xive->src_blocks[i])
@@ -999,6 +1025,8 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
 	if (xive->vp_base != XIVE_INVALID_VP)
 		xive_native_free_vp_block(xive->vp_base);
 
+	kvm->arch.mmu_ready = was_ready;
+
 	/*
 	 * A reference of the kvmppc_xive pointer is now kept under
 	 * the xive_devices struct of the machine for reuse. It is
@@ -1009,6 +1037,9 @@ static void kvmppc_xive_native_release(struct kvm_device *dev)
 	kfree(dev);
 }
 
+/*
+ * Create a XIVE device.  kvm->lock is held.
+ */
 static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
-- 
2.7.4


  reply	other threads:[~2019-04-29  1:24 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-18 10:39 [PATCH v6 00/17] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 01/17] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 02/17] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 03/17] KVM: PPC: Book3S HV: XIVE: introduce a new capability KVM_CAP_PPC_IRQ_XIVE Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 04/17] KVM: PPC: Book3S HV: XIVE: add a control to initialize a source Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 05/17] KVM: PPC: Book3S HV: XIVE: add a control to configure " Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 06/17] KVM: PPC: Book3S HV: XIVE: add controls for the EQ configuration Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 07/17] KVM: PPC: Book3S HV: XIVE: add a global reset control Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 08/17] KVM: PPC: Book3S HV: XIVE: add a control to sync the sources Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 09/17] KVM: PPC: Book3S HV: XIVE: add a control to dirty the XIVE EQ pages Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 10/17] KVM: PPC: Book3S HV: XIVE: add get/set accessors for the VP XIVE state Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 11/17] KVM: introduce a 'mmap' method for KVM devices Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 12/17] KVM: PPC: Book3S HV: XIVE: add a TIMA mapping Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 13/17] KVM: PPC: Book3S HV: XIVE: add a mapping for the source ESB pages Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 14/17] KVM: PPC: Book3S HV: XIVE: add passthrough support Cédric Le Goater
2019-04-25  7:07   ` Paul Mackerras
2019-05-06 15:16     ` Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 15/17] KVM: PPC: Book3S HV: XIVE: activate XIVE exploitation mode Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 16/17] KVM: introduce a 'release' method for KVM devices Cédric Le Goater
2019-04-18 10:39 ` [PATCH v6 17/17] KVM: PPC: Book3S HV: XIVE: replace the 'destroy' method by a 'release' method Cédric Le Goater
2019-04-23  6:04   ` David Gibson
2019-04-26  6:10   ` [PATCH] KVM: PPC: Book3S HV: XIVE: Prevent races when releasing device Paul Mackerras
2019-04-29  1:24     ` Paul Mackerras [this message]
2019-05-06 16:05       ` [PATCH v2] " Cédric Le Goater
2019-04-30 10:11 ` [PATCH v6 00/17] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
2019-05-06 16:10   ` Cédric Le Goater

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190429012403.GA11154@blackberry \
    --to=paulus@ozlabs.org \
    --cc=clg@kaod.org \
    --cc=david@gibson.dropbear.id.au \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox