From: Sean Christopherson <seanjc@google.com>
To: Michal Luczaj <mhal@rbox.co>
Cc: Will Deacon <will@kernel.org>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>,
Alexander Potapenko <glider@google.com>,
Marc Zyngier <maz@kernel.org>
Subject: Re: [PATCH] KVM: Fix error path in kvm_vm_ioctl_create_vcpu() on xa_store() failure
Date: Mon, 5 Aug 2024 15:56:49 -0700 [thread overview]
Message-ID: <ZrFYsSPaDWUHOl0N@google.com> (raw)
In-Reply-To: <07987fc3-5c47-4e77-956c-dae4bdf4bc2b@rbox.co>
On Sun, Aug 04, 2024, Michal Luczaj wrote:
> On 8/1/24 14:41, Will Deacon wrote:
> > On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote:
> >> [...]
> >> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and
> >> vcpu_array, with no way of setting both atomically. Given that xa_store() should
> >> never fail, I vote we do the simple thing and deliberately leak the memory.
> >
> > I'm inclined to agree. This conversation did momentarily get me worried
> > about the window between the successful create_vcpu_fd() and the
> > xa_store(), but it looks like 'kvm->online_vcpus' protects that.
> >
> > I'll spin a v2 leaking the vCPU, then.
>
> But perhaps you're right. The window you've described may be an issue.
> For example:
>
> static u64 get_time_ref_counter(struct kvm *kvm)
> {
> ...
> vcpu = kvm_get_vcpu(kvm, 0); // may still be NULL
> tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64)
> + hv->tsc_ref.tsc_offset;
> }
>
> u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> {
> return vcpu->arch.l1_tsc_offset +
> kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio);
> }
>
> After stuffing msleep() between fd install and vcpu_array store:
>
> [ 125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38
> [ 125.296203] #PF: supervisor read access in kernel mode
> [ 125.296266] #PF: error_code(0x0000) - not-present page
> [ 125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0
> [ 125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19
> [ 125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm]
> [ 125.297376] Call Trace:
> [ 125.297430] <TASK>
> [ 125.297919] get_time_ref_counter+0x70/0x90 [kvm]
> [ 125.298039] kvm_hv_get_msr_common+0xc1/0x7d0 [kvm]
> [ 125.298150] __kvm_get_msr+0x72/0xf0 [kvm]
> [ 125.298421] do_get_msr+0x16/0x50 [kvm]
> [ 125.298531] msr_io+0x9d/0x110 [kvm]
> [ 125.298626] kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm]
> [ 125.299345] kvm_vcpu_ioctl+0x6cc/0x920 [kvm]
> [ 125.299540] __x64_sys_ioctl+0x90/0xd0
> [ 125.299582] do_syscall_64+0x93/0x180
> [ 125.300206] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 125.300243] RIP: 0033:0x7f2d64aded2d
>
> So, is get_time_ref_counter() broken (with a trivial fix) or should it be
> considered a regression after commit afb2acb2e3a3
> ("KVM: Fix vcpu_array[0] races")?
The latter, though arguably afb2acb2e3a3 isn't really a regression since it
essentially just reverts back to the pre-Xarray code, i.e. the bug was always
there, it was just temporarily masked by a worst bug.
I don't think we want to go down the path of declaring get_time_ref_counter()
broken, because that is going to result in an impossible programming model.
Ha! We can kill two birds with one stone. If we take vcpu->mutex before installing
the file descriptor, and hold it until online_vcpus is bumped, userspace
Argh, so close, kvm_arch_vcpu_async_ioctl() throws a wrench in that idea. Double
argh, whether or not an ioctl is async is buried in arch code.
I still think it makes sense to grab vcpu->mutex for synchronous ioctls. That
way there's no vibisle change to userspace, and we can lean on that code to reject
the async ioctls, as I can't imagine there's a practical use case for emitting an
an async ioctl without first doing a synchronous ioctl. E.g. in addition to the
below patch, plus changes to add kvm_arch_is_async_vcpu_ioctl():
/*
* Some architectures have vcpu ioctls that are asynchronous to vcpu
* execution; mutex_lock() would break them. Disallow asynchronous
* ioctls until the vCPU is fully online. This can only happen if
* userspace has *never* a done a synchronous ioctl, as acquiring the
* vCPU's mutex ensures the vCPU is online, i.e. isn't a restriction
* for any practical use case.
*/
if (kvm_arch_is_async_vcpu_ioctl(ioctl)) {
if (vcpu->vcpu_idx < atomic_read(&kvm->online_vcpus))
return -EINVAL;
return kvm_vcpu_async_ioctl(filp, ioctl, arg);
}
Alternatively, we could go for the super simple change and cross our fingers that
no "real" VMM emits vCPU ioctls before KVM_CREATE_VCPU returns.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..9ae9022a015f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4450,6 +4450,9 @@ static long kvm_vcpu_ioctl(struct file *filp,
if (unlikely(_IOC_TYPE(ioctl) != KVMIO))
return -EINVAL;
+ if (unlikely(vcpu->vcpu_idx < atomic_read(&kvm->online_vcpus)))
+ return -EINVAL;
+
/*
* Some architectures have vcpu ioctls that are asynchronous to vcpu
* execution; mutex_lock() would break them.
The mutex approach, sans async ioctl support:
---
virt/kvm/kvm_main.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d0788d0a72cc..0a9c390b18a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4269,12 +4269,6 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
mutex_lock(&kvm->lock);
-#ifdef CONFIG_LOCKDEP
- /* Ensure that lockdep knows vcpu->mutex is taken *inside* kvm->lock */
- mutex_lock(&vcpu->mutex);
- mutex_unlock(&vcpu->mutex);
-#endif
-
if (kvm_get_vcpu_by_id(kvm, id)) {
r = -EEXIST;
goto unlock_vcpu_destroy;
@@ -4285,15 +4279,29 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
if (r)
goto unlock_vcpu_destroy;
- /* Now it's all set up, let userspace reach it */
+ /*
+ * Now it's all set up, let userspace reach it. Grab the vCPU's mutex
+ * so that userspace can't invoke vCPU ioctl()s until the vCPU is fully
+ * visibile (per online_vcpus), e.g. so that KVM doesn't get tricked
+ * into a NULL-pointer dereference because KVM thinks the _current_
+ * vCPU doesn't exist. As a bonus, taking vcpu->mutex ensures lockdep
+ * knows it's taken *inside* kvm->lock.
+ */
+ mutex_lock(&vcpu->mutex);
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu);
if (r < 0)
goto kvm_put_xa_release;
+ /*
+ * xa_store() should never fail, see xa_reserve() above. Leak the vCPU
+ * if the impossible happens, as userspace already has access to the
+ * vCPU, i.e. freeing the vCPU before userspace puts its file reference
+ * would trigger a use-after-free.
+ */
if (KVM_BUG_ON(xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, 0), kvm)) {
- r = -EINVAL;
- goto kvm_put_xa_release;
+ mutex_unlock(&vcpu->mutex);
+ return -EINVAL;
}
/*
@@ -4302,6 +4310,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
*/
smp_wmb();
atomic_inc(&kvm->online_vcpus);
+ mutex_unlock(&vcpu->mutex);
mutex_unlock(&kvm->lock);
kvm_arch_vcpu_postcreate(vcpu);
@@ -4309,6 +4318,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, unsigned long id)
return r;
kvm_put_xa_release:
+ mutex_unlock(&vcpu->mutex);
kvm_put_kvm_no_destroy(kvm);
xa_release(&kvm->vcpu_array, vcpu->vcpu_idx);
unlock_vcpu_destroy:
base-commit: 332d2c1d713e232e163386c35a3ba0c1b90df83f
--
next prev parent reply other threads:[~2024-08-05 22:56 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-07-30 15:56 [PATCH] KVM: Fix error path in kvm_vm_ioctl_create_vcpu() on xa_store() failure Will Deacon
2024-07-30 18:55 ` Michal Luczaj
2024-07-30 23:31 ` Sean Christopherson
2024-07-31 13:31 ` Will Deacon
2024-07-31 15:49 ` Michal Luczaj
2024-07-31 16:18 ` Sean Christopherson
2024-07-31 19:27 ` Michal Luczaj
2024-08-01 12:41 ` Will Deacon
2024-08-04 21:05 ` Michal Luczaj
2024-08-05 22:56 ` Sean Christopherson [this message]
2024-08-05 23:02 ` Paolo Bonzini
2024-08-06 16:59 ` Michal Luczaj
2024-08-07 21:58 ` Sean Christopherson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZrFYsSPaDWUHOl0N@google.com \
--to=seanjc@google.com \
--cc=glider@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maz@kernel.org \
--cc=mhal@rbox.co \
--cc=pbonzini@redhat.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.