* [PATCH v2 0/2] Fix for a very old KVM bug in the segment cache
@ 2024-07-16 2:20 Maxim Levitsky
2024-07-16 2:20 ` [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear Maxim Levitsky
2024-07-16 2:20 ` [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields Maxim Levitsky
0 siblings, 2 replies; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-16 2:20 UTC (permalink / raw)
To: kvm
Cc: Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
Sean Christopherson, Borislav Petkov, linux-kernel, Dave Hansen,
Thomas Gleixner, Maxim Levitsky
Hi,
Recently, while trying to understand why the pmu_counters_test
selftest sometimes fails when run nested, I stumbled
upon a very interesting and old bug:

It turns out that KVM caches guest segment state,
but this cache doesn't have any protection against concurrent use.

This usually works because the cache is per-vCPU and should
only be accessed by the vCPU thread; however, there is an exception:
If full preemption is enabled in the host kernel,
the vCPU thread can be preempted, for
example during vmx_vcpu_reset.

vmx_vcpu_reset resets the segment cache bitmask and then initializes
the segments in the vmcs; however, if the vCPU is preempted in the
middle of this code, kvm_arch_vcpu_put is called, which
reads SS's AR bytes to determine if the vCPU is in kernel mode,
and thus caches the old value.

Later vmx_vcpu_reset sets SS's AR field to the correct value
in the vmcs, but the cache still contains an invalid value, which
can later leak, for example via KVM_GET_SREGS and the like.
In particular, kvm selftests do KVM_GET_SREGS
and then KVM_SET_SREGS, with the broken SS AR field passed as-is,
which leads to a VM entry failure.
This is not a nested-specific issue; I was actually able
to reproduce it on bare metal, but due to timing it happens
much more often nested. The only requirement for this to happen
is that full preemption is enabled in the kernel which runs the selftest.

pmu_counters_test reproduces this issue well because it creates
lots of short-lived VMs, but as noted above, the issue
is not related to the PMU.
To fix this issue, all places in KVM which write to the segment
cache are now wrapped with vmx_write_segment_cache_start/end,
which disable preemption.
V2: incorporated Paolo's suggestion of having
vmx_write_segment_cache_start/end functions (thanks!)
Best regards,
Maxim Levitsky
Maxim Levitsky (2):
KVM: nVMX: use vmx_segment_cache_clear
KVM: VMX: disable preemption when touching segment fields
arch/x86/kvm/vmx/nested.c | 5 ++++-
arch/x86/kvm/vmx/vmx.c | 29 +++++++++++++++++++----------
arch/x86/kvm/vmx/vmx.h | 17 +++++++++++++++++
3 files changed, 40 insertions(+), 11 deletions(-)
--
2.26.3
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear
  2024-07-16  2:20 [PATCH v2 0/2] Fix for a very old KVM bug in the segment cache Maxim Levitsky
@ 2024-07-16  2:20 ` Maxim Levitsky
  2024-07-16 21:07   ` Sean Christopherson
  2024-07-16  2:20 ` [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields Maxim Levitsky
  1 sibling, 1 reply; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-16  2:20 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Sean Christopherson, Borislav Petkov, linux-kernel, Dave Hansen,
	Thomas Gleixner, Maxim Levitsky

In prepare_vmcs02_rare, call vmx_segment_cache_clear instead
of setting the segment_cache.bitmask directly.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c | 5 +++--
 arch/x86/kvm/vmx/vmx.c    | 4 ----
 arch/x86/kvm/vmx/vmx.h    | 5 +++++
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 643935a0f70ab..d3ca1a772ae67 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2469,6 +2469,9 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
 			   HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
+
+		vmx_segment_cache_clear(vmx);
+
 		vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 		vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
 		vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
@@ -2505,8 +2508,6 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
 		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
 		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
-
-		vmx->segment_cache.bitmask = 0;
 	}
 
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b3c83c06f8265..fa9f307d9b18b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -524,10 +524,6 @@ static const struct kvm_vmx_segment_field {
 	VMX_SEGMENT_FIELD(LDTR),
 };
 
-static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
-{
-	vmx->segment_cache.bitmask = 0;
-}
 
 static unsigned long host_idt_base;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 7b64e271a9319..1689f0d59f435 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -755,4 +755,9 @@ static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
 	return lapic_in_kernel(vcpu) && enable_ipiv;
 }
 
+static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
+{
+	vmx->segment_cache.bitmask = 0;
+}
+
 #endif /* __KVM_X86_VMX_H */
-- 
2.26.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear
  2024-07-16  2:20 ` [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear Maxim Levitsky
@ 2024-07-16 21:07   ` Sean Christopherson
  2024-07-24 18:18     ` Maxim Levitsky
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2024-07-16 21:07 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner

On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> In prepare_vmcs02_rare, call vmx_segment_cache_clear, instead
> of setting the segment_cache.bitmask directly.
> 
> No functional change intended.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 5 +++--
>  arch/x86/kvm/vmx/vmx.c    | 4 ----
>  arch/x86/kvm/vmx/vmx.h    | 5 +++++
>  3 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 643935a0f70ab..d3ca1a772ae67 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2469,6 +2469,9 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  
>  	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
>  			   HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
> +
> +		vmx_segment_cache_clear(vmx);
> +
>  		vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
>  		vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
>  		vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> @@ -2505,8 +2508,6 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  		vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
>  		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
>  		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
> -
> -		vmx->segment_cache.bitmask = 0;

This actually exacerbates the bug that you're trying to fix in patch 2.
Clearing segment_cache.bitmask _after_ writing the relevant state limits
the stale data to only the accessor that's running in IRQ context
(kvm_arch_vcpu_put()).

Clearing segment_cache.bitmask _before_ writing the relevant state means
that kvm_arch_vcpu_put() _and_ all future readers will be exposed to the
stale data, because the stale data cached by kvm_arch_vcpu_put() won't be
marked invalid.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear
  2024-07-16 21:07   ` Sean Christopherson
@ 2024-07-24 18:18     ` Maxim Levitsky
  0 siblings, 0 replies; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-24 18:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner

On Tue, 2024-07-16 at 14:07 -0700, Sean Christopherson wrote:
> On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > In prepare_vmcs02_rare, call vmx_segment_cache_clear, instead
> > of setting the segment_cache.bitmask directly.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 5 +++--
> >  arch/x86/kvm/vmx/vmx.c    | 4 ----
> >  arch/x86/kvm/vmx/vmx.h    | 5 +++++
> >  3 files changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 643935a0f70ab..d3ca1a772ae67 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2469,6 +2469,9 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> >  
> >  	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
> >  			   HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
> > +
> > +		vmx_segment_cache_clear(vmx);
> > +
> >  		vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
> >  		vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
> >  		vmcs_write16(GUEST_SS_SELECTOR, vmcs12->guest_ss_selector);
> > @@ -2505,8 +2508,6 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> >  		vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
> >  		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
> >  		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
> > -
> > -		vmx->segment_cache.bitmask = 0;
> 
> This actually exacerbates the bug that you're trying fix in patch 2.
> Clearing segment_cache.bitmask _after_ writing the relevant state limits
> the stale data to only the accessor that's running in IRQ context
> (kvm_arch_vcpu_put()).
> 
> Clearing segment_cache.bitmask _before_ writing the relevant state means
> that kvm_arch_vcpu_put() _and_ all future readers will be exposed to the
> stale data, because the stale data cached by kvm_arch_vcpu_put() won't be
> marked invalid.

I noticed that after I sent the patch series; this makes sense.

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 8+ messages in thread
* [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields
  2024-07-16  2:20 [PATCH v2 0/2] Fix for a very old KVM bug in the segment cache Maxim Levitsky
  2024-07-16  2:20 ` [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear Maxim Levitsky
@ 2024-07-16  2:20 ` Maxim Levitsky
  2024-07-16 22:36   ` Sean Christopherson
  1 sibling, 1 reply; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-16  2:20 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Sean Christopherson, Borislav Petkov, linux-kernel, Dave Hansen,
	Thomas Gleixner, Maxim Levitsky

VMX code uses a segment cache to avoid reading guest segment fields.

The cache is invalidated each time a segment field (e.g. base/access
rights/etc.) is written, and then the new value of the field is written.

However, if the vCPU is preempted between these two events, and the
segment field is read (e.g. KVM reads SS's access rights to check
whether the vCPU is in kernel mode), then the old field value will get
cached and never updated.

Usually a lock would be required to avoid such a race, but since vCPU
segments are only accessed by their vCPU thread, we can avoid a lock
and only disable preemption in the places where the segment cache is
invalidated and segment fields are updated.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/nested.c |  4 +++-
 arch/x86/kvm/vmx/vmx.c    | 25 +++++++++++++++++++------
 arch/x86/kvm/vmx/vmx.h    | 14 +++++++++++++-
 3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d3ca1a772ae67..b6597fe5d011d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2470,7 +2470,7 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
 			   HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2)) {
 
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 
 		vmcs_write16(GUEST_ES_SELECTOR, vmcs12->guest_es_selector);
 		vmcs_write16(GUEST_CS_SELECTOR, vmcs12->guest_cs_selector);
@@ -2508,6 +2508,8 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		vmcs_writel(GUEST_TR_BASE, vmcs12->guest_tr_base);
 		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
 		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
+
+		vmx_write_segment_cache_end(vmx);
 	}
 
 	if (!hv_evmcs || !(hv_evmcs->hv_clean_fields &
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa9f307d9b18b..26a5efd34aef7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2171,12 +2171,14 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 #ifdef CONFIG_X86_64
 	case MSR_FS_BASE:
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 		vmcs_writel(GUEST_FS_BASE, data);
+		vmx_write_segment_cache_end(vmx);
 		break;
 	case MSR_GS_BASE:
-		vmx_segment_cache_clear(vmx);
+		vmx_write_segment_cache_start(vmx);
 		vmcs_writel(GUEST_GS_BASE, data);
+		vmx_write_segment_cache_end(vmx);
 		break;
 	case MSR_KERNEL_GS_BASE:
 		vmx_write_guest_kernel_gs_base(vmx, data);
@@ -3088,7 +3090,7 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
 
 	vmx->rmode.vm86_active = 1;
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
 
 	vmcs_writel(GUEST_TR_BASE, kvm_vmx->tss_addr);
 	vmcs_write32(GUEST_TR_LIMIT, RMODE_TSS_SIZE - 1);
@@ -3109,6 +3111,8 @@ static void enter_rmode(struct kvm_vcpu *vcpu)
 	fix_rmode_seg(VCPU_SREG_DS, &vmx->rmode.segs[VCPU_SREG_DS]);
 	fix_rmode_seg(VCPU_SREG_GS, &vmx->rmode.segs[VCPU_SREG_GS]);
 	fix_rmode_seg(VCPU_SREG_FS, &vmx->rmode.segs[VCPU_SREG_FS]);
+
+	vmx_write_segment_cache_end(vmx);
 }
 
 int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
@@ -3139,8 +3143,9 @@ static void enter_lmode(struct kvm_vcpu *vcpu)
 {
 	u32 guest_tr_ar;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	vmx_segment_cache_clear(to_vmx(vcpu));
+	vmx_write_segment_cache_start(vmx);
 
 	guest_tr_ar = vmcs_read32(GUEST_TR_AR_BYTES);
 	if ((guest_tr_ar & VMX_AR_TYPE_MASK) != VMX_AR_TYPE_BUSY_64_TSS) {
@@ -3150,6 +3155,9 @@ static void enter_lmode(struct kvm_vcpu *vcpu)
 			     (guest_tr_ar & ~VMX_AR_TYPE_MASK)
 			     | VMX_AR_TYPE_BUSY_64_TSS);
 	}
+
+	vmx_write_segment_cache_end(vmx);
+
 	vmx_set_efer(vcpu, vcpu->arch.efer | EFER_LMA);
 }
 
@@ -3571,7 +3579,7 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	const struct kvm_vmx_segment_field *sf = &kvm_vmx_segment_fields[seg];
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
 
 	if (vmx->rmode.vm86_active && seg != VCPU_SREG_LDTR) {
 		vmx->rmode.segs[seg] = *var;
@@ -3601,6 +3609,8 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 		var->type |= 0x1; /* Accessed */
 
 	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
+
+	vmx_write_segment_cache_end(vmx);
 }
 
 void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
@@ -4870,7 +4880,8 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx->hv_deadline_tsc = -1;
 	kvm_set_cr8(vcpu, 0);
 
-	vmx_segment_cache_clear(vmx);
+	vmx_write_segment_cache_start(vmx);
+
 	kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);
 
 	seg_setup(VCPU_SREG_CS);
@@ -4899,6 +4910,8 @@ void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	vmx_write_segment_cache_end(vmx);
+
 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1689f0d59f435..cba14911032cd 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -755,9 +755,21 @@ static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
 	return lapic_in_kernel(vcpu) && enable_ipiv;
 }
 
-static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
+static inline void vmx_write_segment_cache_start(struct vcpu_vmx *vmx)
+{
+	/*
+	 * The VMX segment cache can be accessed during preemption
+	 * (e.g. to determine the guest's CPL).
+	 *
+	 * To avoid caching a wrong value during such an access, disable
+	 * preemption.
+	 */
+	preempt_disable();
+}
+
+static inline void vmx_write_segment_cache_end(struct vcpu_vmx *vmx)
 {
 	vmx->segment_cache.bitmask = 0;
+	preempt_enable();
 }
 
 #endif /* __KVM_X86_VMX_H */
-- 
2.26.3

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields
  2024-07-16  2:20 ` [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields Maxim Levitsky
@ 2024-07-16 22:36   ` Sean Christopherson
  2024-07-25 12:59     ` Maxim Levitsky
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2024-07-16 22:36 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner

On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> VMX code uses segment cache to avoid reading guest segment fields.
> 
> The cache is reset each time a segment's field (e.g base/access rights/etc)
> is written, and then a new value of this field is written.
> 
> However if the vCPU is preempted between these two events, and this
> segment field is read (e.g kvm reads SS's access rights to check
> if the vCPU is in kernel mode), then old field value will get
> cached and never updated.

It'd be super helpful to include the gory details about how kvm_arch_vcpu_put()
reads stale data.  Without that information, it's very hard to figure out how
getting preempted is problematic.

  vmx_vcpu_reset resets the segment cache bitmask and then initializes
  the segments in the vmcs, however if the vcpus is preempted in the
  middle of this code, the kvm_arch_vcpu_put is called which
  reads SS's AR bytes to determine if the vCPU is in the kernel mode,
  which caches the old value.

> Usually a lock is required to avoid such race but since vCPU segments
> are only accessed by its vCPU thread, we can avoid a lock and
> only disable preemption, in places where the segment cache
> is invalidated and segment fields are updated.

This doesn't fully fix the problem.  It's not just kvm_sched_out() =>
kvm_arch_vcpu_put() that's problematic, it's any path that executes KVM code
in interrupt context.  And it's not just limited to segment registers, any
register that is conditionally cached via arch.regs_avail is susceptible to
races.

Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes
and RIP in NMI and/or IRQ context when handling a PMI.

A few possible ideas.

 1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.

 2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
    and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
    vmx_segment_cache_test_set() is invoked from IRQ or NMI context.

 3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to
    kvm_after_interrupt(), rename preempted_in_kernel to something like
    "exited_in_kernel" and snapshot it before kvm_after_interrupt(), and add
    the same hardening as #2.

    This is doable because kvm_guest_state() should never read guest state for
    PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
    write guest state in that window.  And the intent of the "preempted in
    kernel" check is to query vCPU state at the time of exit.

 5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes
    preemption).

My vote is probably for #2 or #4.  I definitely think we need WARNs in the
caching code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable
state, i.e. I am fairly confident we can restrict it to checking CPL.

I don't hate this patch by any means, but I don't love disabling preemption in
a bunch of flows just so that the preempted_in_kernel logic works.

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields
  2024-07-16 22:36   ` Sean Christopherson
@ 2024-07-25 12:59     ` Maxim Levitsky
  2024-07-25 17:37       ` Maxim Levitsky
  0 siblings, 1 reply; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-25 12:59 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner

On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > VMX code uses segment cache to avoid reading guest segment fields.
> > 
> > The cache is reset each time a segment's field (e.g base/access rights/etc)
> > is written, and then a new value of this field is written.
> > 
> > However if the vCPU is preempted between these two events, and this
> > segment field is read (e.g kvm reads SS's access rights to check
> > if the vCPU is in kernel mode), then old field value will get
> > cached and never updated.
> 
> It'd be super helpful to include the gory details about how kvm_arch_vcpu_put()
> reads stale data.  Without that information, it's very hard to figure out how
> getting preempted is problematic.

I will do this in the next version of this patch.

>   vmx_vcpu_reset resets the segment cache bitmask and then initializes
>   the segments in the vmcs, however if the vcpus is preempted in the
>   middle of this code, the kvm_arch_vcpu_put is called which
>   reads SS's AR bytes to determine if the vCPU is in the kernel mode,
>   which caches the old value.
> 
> > Usually a lock is required to avoid such race but since vCPU segments
> > are only accessed by its vCPU thread, we can avoid a lock and
> > only disable preemption, in places where the segment cache
> > is invalidated and segment fields are updated.
> 
> This doesn't fully fix the problem.  It's not just kvm_sched_out() =>
> kvm_arch_vcpu_put() that's problematic, it's any path that executes KVM code
> in interrupt context.  And it's not just limited to segment registers, any
> register that is conditionally cached via arch.regs_avail is susceptible to
> races.
> 
> Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes
> and RIP in NMI and/or IRQ context when handling a PMI.
> 
> A few possible ideas.
> 
>  1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.

This IMHO is the best solution. For the segment cache it's easy to do; the
code will be contained in the vmx_read_guest_seg_* functions.

For other VMX registers, this can be a lot of work due to the way the code
is scattered around. Still probably doable.

>  2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
>     and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
>     vmx_segment_cache_test_set() is invoked from IRQ or NMI context.

I agree on this; this is actually one of the suggestions I had originally.
(I didn't notice kvm_arch_vcpu_get_ip though.)

I think I will implement this suggestion.

>  3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to
>     kvm_after_interrupt(), rename preempted_in_kernel to something like
>     "exited_in_kernel" and snapshot it before kvm_after_interrupt(), and add
>     the same hardening as #2.
> 
>     This is doable because kvm_guest_state() should never read guest state for
>     PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
>     write guest state in that window.  And the intent of the "preempted in
>     kernel" check is to query vCPU state at the time of exit.
> 
>  5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes
>     preemption).
> 
> My vote is probably for #2 or #4.

#4 causes a NULL pointer dereference here :)

> I definitely think we need WARNs in the caching
> code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable
> state, i.e. I am fairly confident we can restrict it to checking CPL.
> 
> I don't hate this patch by any means, but I don't love disabling preemption
> in a bunch of flows just so that the preempted_in_kernel logic works.

Thanks for the suggestions!

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields
  2024-07-25 12:59     ` Maxim Levitsky
@ 2024-07-25 17:37       ` Maxim Levitsky
  0 siblings, 0 replies; 8+ messages in thread
From: Maxim Levitsky @ 2024-07-25 17:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Paolo Bonzini, Ingo Molnar, H. Peter Anvin, x86,
	Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner

On Thu, 2024-07-25 at 08:59 -0400, Maxim Levitsky wrote:
> On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> > On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > > VMX code uses segment cache to avoid reading guest segment fields.
> > > 
> > > The cache is reset each time a segment's field (e.g base/access rights/etc)
> > > is written, and then a new value of this field is written.
> > > 
> > > However if the vCPU is preempted between these two events, and this
> > > segment field is read (e.g kvm reads SS's access rights to check
> > > if the vCPU is in kernel mode), then old field value will get
> > > cached and never updated.
> > 
> > It'd be super helpful to include the gory details about how
> > kvm_arch_vcpu_put() reads stale data.  Without that information, it's very
> > hard to figure out how getting preempted is problematic.
> 
> I will do this in the next version of this patch.
> 
> > > Usually a lock is required to avoid such race but since vCPU segments
> > > are only accessed by its vCPU thread, we can avoid a lock and
> > > only disable preemption, in places where the segment cache
> > > is invalidated and segment fields are updated.
> > 
> > This doesn't fully fix the problem.  It's not just kvm_sched_out() =>
> > kvm_arch_vcpu_put() that's problematic, it's any path that executes KVM
> > code in interrupt context.  And it's not just limited to segment registers,
> > any register that is conditionally cached via arch.regs_avail is
> > susceptible to races.
> > 
> > Specifically, kvm_guest_state() and kvm_guest_get_ip() will read
> > SS.AR_bytes and RIP in NMI and/or IRQ context when handling a PMI.
> > 
> > A few possible ideas.
> > 
> >  1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.
> 
> This IMHO is the best solution. For the segment cache it's easy to do; the
> code will be contained in the vmx_read_guest_seg_* functions.
> 
> For other VMX registers, this can be a lot of work due to the way the code
> is scattered around. Still probably doable.
> 
> >  2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
> >     and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
> >     vmx_segment_cache_test_set() is invoked from IRQ or NMI context.
> 
> I agree on this; this is actually one of the suggestions I had originally.
> (I didn't notice kvm_arch_vcpu_get_ip though.)
> 
> I think I will implement this suggestion.
> 
> >  3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to
> >     kvm_after_interrupt(), rename preempted_in_kernel to something like
> >     "exited_in_kernel" and snapshot it before kvm_after_interrupt(), and
> >     add the same hardening as #2.
> > 
> >  5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes
> >     preemption).
> > 
> > My vote is probably for #2 or #4.
> 
> #4 causes a NULL pointer dereference here :)
> 
> Thanks for the suggestions!

Hi,

I decided to keep it simple. I'll send a patch which moves the call to
vmx_segment_cache_clear to after we're done with the segment initialization
in vmx_vcpu_reset, and later I'll write a refactoring/hardening patch to make
sure that we don't read the cache from interrupt context.

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-07-25 17:37 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-16  2:20 [PATCH v2 0/2] Fix for a very old KVM bug in the segment cache Maxim Levitsky
2024-07-16  2:20 ` [PATCH v2 1/2] KVM: nVMX: use vmx_segment_cache_clear Maxim Levitsky
2024-07-16 21:07   ` Sean Christopherson
2024-07-24 18:18     ` Maxim Levitsky
2024-07-16  2:20 ` [PATCH v2 2/2] KVM: VMX: disable preemption when touching segment fields Maxim Levitsky
2024-07-16 22:36   ` Sean Christopherson
2024-07-25 12:59     ` Maxim Levitsky
2024-07-25 17:37       ` Maxim Levitsky