Re: [PATCH] KVM: Addressing a possible race in kvm_vcpu_on_spin:

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Breno Leitao <leitao@debian.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	rbc@meta.com, paulmck@kernel.org,
	 "open list:KERNEL VIRTUAL MACHINE (KVM)" <kvm@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] KVM: Addressing a possible race in kvm_vcpu_on_spin:
Date: Thu, 9 May 2024 09:42:48 -0700	[thread overview]
Message-ID: <Zjz9CLAIxRXlWe0F@google.com> (raw)
In-Reply-To: <20240509090146.146153-1-leitao@debian.org>

On Thu, May 09, 2024, Breno Leitao wrote:
> There are two workflow paths that access the same address
> simultaneously, creating a potential data race in kvm_vcpu_on_spin. This
> occurs when one workflow reads kvm->last_boosted_vcpu while another
> parallel path writes to it.
> 
> KCSAN produces the following output when enabled.
> 
> 	BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
> 
> 	write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
> 	kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
> 	handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> 	vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> 	vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> 	kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> 	kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> 	__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> 	__x64_sys_ioctl (fs/ioctl.c:890)
> 	x64_sys_call (arch/x86/entry/syscall_64.c:33)
> 	do_syscall_64 (arch/x86/entry/common.c:?)
> 	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> 
> 	read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
> 	kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
> 	handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> 	vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> 	vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> 	kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> 	kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> 	__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> 	__x64_sys_ioctl (fs/ioctl.c:890)
> 	x64_sys_call (arch/x86/entry/syscall_64.c:33)
> 	do_syscall_64 (arch/x86/entry/common.c:?)
> 	entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> 
> 	value changed: 0x00000012 -> 0x00000000
> 
> Given that both operations occur simultaneously without any locking
> mechanisms in place, let's ensure atomicity to prevent possible data
> corruption. We'll achieve this by employing READ_ONCE() for the reading
> operation and WRITE_ONCE() for the writing operation.

Please state changelogs as a commands, e.g.

  Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure ...

And I think it's worth calling out that corruption is _extremely_ unlikely to
happen in practice.  It would require the compiler to generate truly awful code,
and it would require a VM with >256 vCPUs.

That said, I do think this should be sent to stable kernels, as it's (very, very)
theoretically possible to generate an out-of-bounds access, and this seems like a
super safe fix.  How about this?

---
KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin() 

Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic.  In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:

  CPU0                              CPU1
  last_boosted_vcpu = 0xff;                                       

                                    (last_boosted_vcpu = 0x100)
                                    last_boosted_vcpu[15:8] = 0x01;
  i = (last_boosted_vcpu = 0x1ff)                                               
                                    last_boosted_vcpu[7:0] = 0x00;

  vcpu = kvm->vcpu_array[0x1ff];

As detected by KCSAN:

  BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

  write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  value changed: 0x00000012 -> 0x00000000

Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
---

> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  virt/kvm/kvm_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff0a20565f90..9768307d5e6c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4066,12 +4066,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> +	int last_boosted_vcpu;
>  	unsigned long i;
>  	int yielded = 0;
>  	int try = 3;
>  	int pass;
>  
> +	last_boosted_vcpu = READ_ONCE(me->kvm->last_boosted_vcpu);

Nit, this could opportunistically use "kvm" without the silly me->kvm.

>  	kvm_vcpu_set_in_spin_loop(me, true);
>  	/*
>  	 * We boost the priority of a VCPU that is runnable but not
> @@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  
>  			yielded = kvm_vcpu_yield_to(vcpu);
>  			if (yielded > 0) {
> -				kvm->last_boosted_vcpu = i;
> +				WRITE_ONCE(kvm->last_boosted_vcpu, i);
>  				break;
>  			} else if (yielded < 0) {
>  				try--;

Side topic #1: am I the only one that finds these loops unnecessarily hard to
read?  Unless I'm misreading the code, it's really just an indirect way of looping
over all vCPUs, starting at last_boosted_vcpu+1 and the wrapping.

IMO, reworking it to be like this is more straightforward:

	int nr_vcpus, start, i, idx, yielded;
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int try = 3;

	nr_vcpus = atomic_read(&kvm->online_vcpus);
	if (nr_vcpus < 2)
		return;

	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
	smp_rmb();

	kvm_vcpu_set_in_spin_loop(me, true);

	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
	for (i = 0; i < nr_vcpus; i++) {
		idx = (start + i) % nr_vcpus;
		if (idx == me->vcpu_idx)
			continue;

		vcpu = xa_load(&kvm->vcpu_array, idx);
		if (!READ_ONCE(vcpu->ready))
			continue;
		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
			continue;

		/*
		 * Treat the target vCPU as being in-kernel if it has a pending
		 * interrupt, as the vCPU trying to yield may be spinning
		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
		 * for the purposes of directed yield.
		 */
		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
			continue;

		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
			continue;

		yielded = kvm_vcpu_yield_to(vcpu);
		if (yielded > 0) {
			WRITE_ONCE(kvm->last_boosted_vcpu, i);
			break;
		} else if (yielded < 0 && !--try) {
			break;
		}
	}

	kvm_vcpu_set_in_spin_loop(me, false);

	/* Ensure vcpu is not eligible during next spinloop */
	kvm_vcpu_set_dy_eligible(me, false);

Side topic #2, intercepting PAUSE on x86 when there's only a single vCPU in the
VM is silly.  I don't know if it's worth the complexity, but we could defer
enabling PLE exiting until a second vCPU is created, e.g. via a new request.

Hmm, but x86 at least already has KVM_X86_DISABLE_EXITS_PAUSE, so this could be
more easily handled in userspace, e.g. by disabing PAUSE exiting if userspace
knows it's creating a single-vCPU VM.

next prev parent reply	other threads:[~2024-05-09 16:42 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-09  9:01 [PATCH] KVM: Addressing a possible race in kvm_vcpu_on_spin: Breno Leitao
2024-05-09 16:42 ` Sean Christopherson [this message]
2024-05-10  9:11   ` Breno Leitao
2024-05-10 14:39     ` Sean Christopherson
2024-05-10 15:52       ` Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zjz9CLAIxRXlWe0F@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=leitao@debian.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=rbc@meta.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.