public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
@ 2024-09-20  7:59 Markku Ahvenjärvi
  2024-09-20  7:59 ` [PATCH 1/1] " Markku Ahvenjärvi
  0 siblings, 1 reply; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-09-20  7:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: mankku, janne.karhunen, kvm, linux-kernel

Hello,

We experience guest hangs when running a hypervisor under KVM on
VMX. The L1 guest hypervisor in this particular case is pKVM for Intel
Architecture [1]. The hang occurs when a second nested guest is launched (the
first being the de-privileged host). We observed that the external interrupt
vmexit would not be passed to L1; instead, L0 would attempt to resume L2.

We isolated the problem to VPPR not being updated on nested vmlaunch/vmresume,
which causes vmx_has_apicv_interrupt() in nested_vmx_enter_non_root_mode()
to miss interrupts. Updating VPPR in vmx_has_apicv_interrupt() ensures it is
up-to-date.

We don't fully understand why the VPPR problem appears with pKVM-IA as L1 but
not with normal KVM as L1. On pKVM-IA some of the host functionality is moved
from VMX root to non-root mode; I would appreciate it if someone could clarify
why normal KVM as L1 is seemingly unaffected.

Thanks,
Markku

[1]: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com

Markku Ahvenjärvi (1):
  KVM: nVMX: update VPPR on vmlaunch/vmresume

 arch/x86/kvm/lapic.c      | 9 +++++----
 arch/x86/kvm/lapic.h      | 1 +
 arch/x86/kvm/vmx/nested.c | 5 +++--
 3 files changed, 9 insertions(+), 6 deletions(-)

-- 
2.44.1


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-09-20  7:59 [PATCH 0/1] KVM: nVMX: update VPPR on vmlaunch/vmresume Markku Ahvenjärvi
@ 2024-09-20  7:59 ` Markku Ahvenjärvi
  2024-09-20  8:18   ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-09-20  7:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin
  Cc: mankku, janne.karhunen, kvm, linux-kernel

Certain hypervisors running under KVM on VMX suffered L1 hangs after
launching a nested guest. External interrupts were not processed on
vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
allowing the L1 hypervisor to process the events.

The patch ensures VPPR is updated when checking for pending
interrupts.

Signed-off-by: Markku Ahvenjärvi <mankku@gmail.com>
Signed-off-by: Janne Karhunen <janne.karhunen@gmail.com>
---
 arch/x86/kvm/lapic.c      | 9 +++++----
 arch/x86/kvm/lapic.h      | 1 +
 arch/x86/kvm/vmx/nested.c | 5 +++--
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5bb481aefcbc..7747c7d672ea 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -952,7 +952,7 @@ static int apic_has_interrupt_for_ppr(struct kvm_lapic *apic, u32 ppr)
 	return highest_irr;
 }
 
-static bool __apic_update_ppr(struct kvm_lapic *apic, u32 *new_ppr)
+bool __kvm_apic_update_ppr(struct kvm_lapic *apic, u32 *new_ppr)
 {
 	u32 tpr, isrv, ppr, old_ppr;
 	int isr;
@@ -973,12 +973,13 @@ static bool __apic_update_ppr(struct kvm_lapic *apic, u32 *new_ppr)
 
 	return ppr < old_ppr;
 }
+EXPORT_SYMBOL_GPL(__kvm_apic_update_ppr);
 
 static void apic_update_ppr(struct kvm_lapic *apic)
 {
 	u32 ppr;
 
-	if (__apic_update_ppr(apic, &ppr) &&
+	if (__kvm_apic_update_ppr(apic, &ppr) &&
 	    apic_has_interrupt_for_ppr(apic, ppr) != -1)
 		kvm_make_request(KVM_REQ_EVENT, apic->vcpu);
 }
@@ -2895,7 +2896,7 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	if (!kvm_apic_present(vcpu))
 		return -1;
 
-	__apic_update_ppr(apic, &ppr);
+	__kvm_apic_update_ppr(apic, &ppr);
 	return apic_has_interrupt_for_ppr(apic, ppr);
 }
 EXPORT_SYMBOL_GPL(kvm_apic_has_interrupt);
@@ -2954,7 +2955,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
 		 * triggered KVM_REQ_EVENT already.
 		 */
 		apic_set_isr(vector, apic);
-		__apic_update_ppr(apic, &ppr);
+		__kvm_apic_update_ppr(apic, &ppr);
 	}
 
 	return vector;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 7ef8ae73e82d..1d0bc13a6794 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -106,6 +106,7 @@ int kvm_apic_compare_prio(struct kvm_vcpu *vcpu1, struct kvm_vcpu *vcpu2);
 void kvm_apic_clear_irr(struct kvm_vcpu *vcpu, int vec);
 bool __kvm_apic_update_irr(u32 *pir, void *regs, int *max_irr);
 bool kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir, int *max_irr);
+bool __kvm_apic_update_ppr(struct kvm_lapic *apic, u32 *new_ppr);
 void kvm_apic_update_ppr(struct kvm_vcpu *vcpu);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 		     struct dest_map *dest_map);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2392a7ef254d..dacc92b150dd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3431,10 +3431,11 @@ static int nested_vmx_check_permission(struct kvm_vcpu *vcpu)
 
 static u8 vmx_has_apicv_interrupt(struct kvm_vcpu *vcpu)
 {
+	u32 vppr;
 	u8 rvi = vmx_get_rvi();
-	u8 vppr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_PROCPRI);
+	__kvm_apic_update_ppr(vcpu->arch.apic, &vppr);
 
-	return ((rvi & 0xf0) > (vppr & 0xf0));
+	return ((rvi & 0xf0) > (u8) (vppr & 0xf0));
 }
 
 static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
-- 
2.44.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-09-20  7:59 ` [PATCH 1/1] " Markku Ahvenjärvi
@ 2024-09-20  8:18   ` Sean Christopherson
  2024-09-20 12:40     ` Markku Ahvenjärvi
  2024-10-02 12:42     ` Markku Ahvenjärvi
  0 siblings, 2 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-09-20  8:18 UTC (permalink / raw)
  To: Markku Ahvenjärvi
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, janne.karhunen, kvm,
	linux-kernel

On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> Certain hypervisors running under KVM on VMX suffered L1 hangs after
> launching a nested guest. External interrupts were not processed on
> vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
> allowing the L1 hypervisor to process the events.
> 
> The patch ensures VPPR is updated when checking for pending
> interrupts.

This is architecturally incorrect, PPR isn't refreshed at VM-Enter.

Aha!  I wonder if the missing PPR update is due to the nested VM-Enter path
directly clearing IRR when processing a posted interrupt.

On top of https://github.com/kvm-x86/linux/tree/next, does this fix things?

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..a8255c6f0d51 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3731,7 +3731,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
            kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
                vmx->nested.pi_pending = true;
                kvm_make_request(KVM_REQ_EVENT, vcpu);
-               kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
+               kvm_apic_ack_interrupt(vcpu, vmx->nested.posted_intr_nv);
        }
 
        /* Hide L1D cache contents from the nested guest.  */

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-09-20  8:18   ` Sean Christopherson
@ 2024-09-20 12:40     ` Markku Ahvenjärvi
  2024-10-02 12:42     ` Markku Ahvenjärvi
  1 sibling, 0 replies; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-09-20 12:40 UTC (permalink / raw)
  To: seanjc
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mankku,
	mingo, pbonzini, tglx, x86

> This is architecturally incorrect, PPR isn't refreshed at VM-Enter.

Ah, I see. Thanks for pointing that out.

> Aha!  I wonder if the missing PPR update is due to the nested VM-Enter path
> directly clearing IRR when processing a posted interrupt.
> 
> On top of https://github.com/kvm-x86/linux/tree/next, does this fix things?
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a8e7bc04d9bf..a8255c6f0d51 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3731,7 +3731,7 @@ static int nested_vmx_run(struct kvm_vcpu *vcpu, bool launch)
>             kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
>                 vmx->nested.pi_pending = true;
>                 kvm_make_request(KVM_REQ_EVENT, vcpu);
> -               kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
> +               kvm_apic_ack_interrupt(vcpu, vmx->nested.posted_intr_nv);
>         }
> 
>         /* Hide L1D cache contents from the nested guest.  */

No, unfortunately.

The vmexits L1 is not receiving originate from the de-privileged host (e.g.
timer ticks), running in VMX non-root mode.

Thanks,
Markku

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-09-20  8:18   ` Sean Christopherson
  2024-09-20 12:40     ` Markku Ahvenjärvi
@ 2024-10-02 12:42     ` Markku Ahvenjärvi
  2024-10-02 15:52       ` Sean Christopherson
  1 sibling, 1 reply; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-10-02 12:42 UTC (permalink / raw)
  To: seanjc
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mankku,
	mingo, pbonzini, tglx, x86

Hi Sean,

> On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> > Certain hypervisors running under KVM on VMX suffered L1 hangs after
> > launching a nested guest. External interrupts were not processed on
> > vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
> > allowing the L1 hypervisor to process the events.
> > 
> > The patch ensures VPPR is updated when checking for pending
> > interrupts.
>
> This is architecturally incorrect, PPR isn't refreshed at VM-Enter.

I looked into this and found the following from Intel manual:

"30.1.3 PPR Virtualization

The processor performs PPR virtualization in response to the following
operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.

..."

The section "27.3.2.5 Updating Non-Register State" further explains the VM
enter:

"If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
the values of RVI and SVI from the guest interrupt-status field in the VMCS
(see Section 25.4.2). After doing so, the logical processor first causes PPR
virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
(Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
VMX non-root operation immediately after VM entry (including any specified
event injection) completes; ..."

According to that, PPR is supposed to be refreshed at VM-Enter, or am I
missing something here?

Kind regards,
Markku

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-02 12:42     ` Markku Ahvenjärvi
@ 2024-10-02 15:52       ` Sean Christopherson
  2024-10-02 16:49         ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-02 15:52 UTC (permalink / raw)
  To: Markku Ahvenjärvi
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mingo,
	pbonzini, tglx, x86

On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
> Hi Sean,
> 
> > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> > > Certain hypervisors running under KVM on VMX suffered L1 hangs after
> > > launching a nested guest. External interrupts were not processed on
> > > vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
> > > allowing the L1 hypervisor to process the events.
> > > 
> > > The patch ensures VPPR is updated when checking for pending
> > > interrupts.
> >
> > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
> 
> I looked into this and found the following from Intel manual:
> 
> "30.1.3 PPR Virtualization
> 
> The processor performs PPR virtualization in response to the following
> operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.
> 
> ..."
> 
> The section "27.3.2.5 Updating Non-Register State" further explains the VM
> enter:
> 
> "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
> the values of RVI and SVI from the guest interrupt-status field in the VMCS
> (see Section 25.4.2). After doing so, the logical processor first causes PPR
> virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
> (Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
> VMX non-root operation immediately after VM entry (including any specified
> event injection) completes; ..."
> 
> According to that, PPR is supposed to be refreshed at VM-Enter, or am I
> missing something here?

Huh, I missed that.  It makes sense I guess; VM-Enter processes pending virtual
interrupts, so it stands to reason that VM-Enter would refresh PPR as well.

Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when nested
posted interrupts are in use:

	/* Emulate processing of posted interrupts on VM-Enter. */
	if (nested_cpu_has_posted_intr(vmcs12) &&
	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
		vmx->nested.pi_pending = true;
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
	}

I'm still curious as to what's different about your setup, but certainly not
curious enough to hold up a fix.

Anyways, back to the code, I think we can and should shoot for a more complete
cleanup (on top of a minimal fix).  As Chao suggested[*], the above nested posted
interrupt code shouldn't exist, as KVM should handle nested posted interrupts as
part of vmx_check_nested_events(), which honors event priority.  And I see a way,
albeit a bit of an ugly way, to avoid regressing performance when there's pending
nested posted interrupt at VM-Enter.

The other aspect of this code is that I don't think we need to limit the check
to APICv, i.e. KVM can simply check kvm_apic_has_interrupt() after VM-Enter
succeeds (the funky pre-check is necessary to read RVI from vmcs01, with the
event request deferred until KVM knows VM-Enter will be successful).

Arguably, that's probably more correct, as PPR virtualization should only occur
if VM-Enter is successful (or at least gets past the VM-Fail checks).

So, for an immediate fix, I _think_ we can do:

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..784b61c9810b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
         * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
         * unconditionally.
         */
-       if (unlikely(evaluate_pending_interrupts))
+       if (unlikely(evaluate_pending_interrupts) ||
+           kvm_apic_has_interrupt(vcpu))
                kvm_make_request(KVM_REQ_EVENT, vcpu);
 
        /*

and then eventually make nested_vmx_enter_non_root_mode() look like the below.

Can you verify that the above fixes your setup?  If it does, I'll put together a
small series with that change and the cleanups I have in mind.

Thanks much!

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..77f0695784d8 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3483,7 +3483,6 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
        enum vm_entry_failure_code entry_failure_code;
-       bool evaluate_pending_interrupts;
        union vmx_exit_reason exit_reason = {
                .basic = EXIT_REASON_INVALID_STATE,
                .failed_vmentry = 1,
@@ -3502,13 +3501,6 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 
        kvm_service_local_tlb_flush_requests(vcpu);
 
-       evaluate_pending_interrupts = exec_controls_get(vmx) &
-               (CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
-       if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
-               evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
-       if (!evaluate_pending_interrupts)
-               evaluate_pending_interrupts |= kvm_apic_has_pending_init_or_sipi(vcpu);
-
        if (!vmx->nested.nested_run_pending ||
            !(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
                vmx->nested.pre_vmenter_debugctl = vmcs_read64(GUEST_IA32_DEBUGCTL);
@@ -3591,9 +3583,13 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
         * Re-evaluate pending events if L1 had a pending IRQ/NMI/INIT/SIPI
         * when it executed VMLAUNCH/VMRESUME, as entering non-root mode can
         * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
-        * unconditionally.
+        * unconditionally.  Take care to pull data from vmcs01 as appropriate,
+        * e.g. when checking for interrupt windows, as vmcs02 is now loaded.
         */
-       if (unlikely(evaluate_pending_interrupts))
+       if ((__exec_controls_get(&vmx->vmcs01) & (CPU_BASED_INTR_WINDOW_EXITING |
+                                                 CPU_BASED_NMI_WINDOW_EXITING)) ||
+           kvm_apic_has_pending_init_or_sipi(vcpu) ||
+           kvm_apic_has_interrupt(vcpu))
                kvm_make_request(KVM_REQ_EVENT, vcpu);
 
        /*


[*] https://lore.kernel.org/all/Zp%2FC5IlwfzC5DCsl@chao-email

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-02 15:52       ` Sean Christopherson
@ 2024-10-02 16:49         ` Sean Christopherson
  2024-10-02 17:20           ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-02 16:49 UTC (permalink / raw)
  To: Markku Ahvenjärvi
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mingo,
	pbonzini, tglx, x86

On Wed, Oct 02, 2024, Sean Christopherson wrote:
> On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
> > Hi Sean,
> > 
> > > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> > > > Certain hypervisors running under KVM on VMX suffered L1 hangs after
> > > > launching a nested guest. External interrupts were not processed on
> > > > vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
> > > > allowing the L1 hypervisor to process the events.
> > > > 
> > > > The patch ensures VPPR is updated when checking for pending
> > > > interrupts.
> > >
> > > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
> > 
> > I looked into this and found the following from Intel manual:
> > 
> > "30.1.3 PPR Virtualization
> > 
> > The processor performs PPR virtualization in response to the following
> > operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.
> > 
> > ..."
> > 
> > The section "27.3.2.5 Updating Non-Register State" further explains the VM
> > enter:
> > 
> > "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
> > the values of RVI and SVI from the guest interrupt-status field in the VMCS
> > (see Section 25.4.2). After doing so, the logical processor first causes PPR
> > virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
> > (Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
> > VMX non-root operation immediately after VM entry (including any specified
> > event injection) completes; ..."
> > 
> > According to that, PPR is supposed to be refreshed at VM-Enter, or am I
> > missing something here?
> 
> Huh, I missed that.  It makes sense I guess; VM-Enter processes pending virtual
> interrupts, so it stands to reason that VM-Enter would refresh PPR as well.
> 
> Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
> interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when nested
> posted interrupts are in use:
> 
> 	/* Emulate processing of posted interrupts on VM-Enter. */
> 	if (nested_cpu_has_posted_intr(vmcs12) &&
> 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
> 		vmx->nested.pi_pending = true;
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
> 	}
> 
> I'm still curious as to what's different about your setup, but certainly not
> curious enough to hold up a fix.

Actually, none of the above is even relevant.  PPR virtualization in the nested
VM-Enter case would be for _L2's_ vPPR, not L1's.  And that virtualization is
performed by hardware (vmcs02 has the correct RVI, SVI, and vAPIC information
for L2).

Which means my initial instinct that KVM is missing a PPR update somewhere is
likely correct.

That said, I'm inclined to go with the below fix anyways, because KVM updates
PPR _constantly_, e.g. every time kvm_vcpu_has_events() is invoked with IRQs
enabled.  Which means that trying to avoid a PPR update on VM-Enter just to be
pedantically accurate is ridiculous.

> So, for an immediate fix, I _think_ we can do:
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a8e7bc04d9bf..784b61c9810b 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>          * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
>          * unconditionally.
>          */
> -       if (unlikely(evaluate_pending_interrupts))
> +       if (unlikely(evaluate_pending_interrupts) ||
> +           kvm_apic_has_interrupt(vcpu))
>                 kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
>         /*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-02 16:49         ` Sean Christopherson
@ 2024-10-02 17:20           ` Sean Christopherson
  2024-10-03 11:29             ` Markku Ahvenjärvi
  2024-10-10 11:00             ` Chao Gao
  0 siblings, 2 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-10-02 17:20 UTC (permalink / raw)
  To: Markku Ahvenjärvi
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mingo,
	pbonzini, tglx, x86

On Wed, Oct 02, 2024, Sean Christopherson wrote:
> On Wed, Oct 02, 2024, Sean Christopherson wrote:
> > On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
> > > Hi Sean,
> > > 
> > > > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
> > > > > Certain hypervisors running under KVM on VMX suffered L1 hangs after
> > > > > launching a nested guest. External interrupts were not processed on
> > > > > vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
> > > > > allowing the L1 hypervisor to process the events.
> > > > > 
> > > > > The patch ensures VPPR is updated when checking for pending
> > > > > interrupts.
> > > >
> > > > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
> > > 
> > > I looked into this and found the following from Intel manual:
> > > 
> > > "30.1.3 PPR Virtualization
> > > 
> > > The processor performs PPR virtualization in response to the following
> > > operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.
> > > 
> > > ..."
> > > 
> > > The section "27.3.2.5 Updating Non-Register State" further explains the VM
> > > enter:
> > > 
> > > "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
> > > the values of RVI and SVI from the guest interrupt-status field in the VMCS
> > > (see Section 25.4.2). After doing so, the logical processor first causes PPR
> > > virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
> > > (Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
> > > VMX non-root operation immediately after VM entry (including any specified
> > > event injection) completes; ..."
> > > 
> > > According to that, PPR is supposed to be refreshed at VM-Enter, or am I
> > > missing something here?
> > 
> > Huh, I missed that.  It makes sense I guess; VM-Enter processes pending virtual
> > interrupts, so it stands to reason that VM-Enter would refresh PPR as well.
> > 
> > Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
> > interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when nested
> > posted interrupts are in use:
> > 
> > 	/* Emulate processing of posted interrupts on VM-Enter. */
> > 	if (nested_cpu_has_posted_intr(vmcs12) &&
> > 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
> > 		vmx->nested.pi_pending = true;
> > 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> > 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
> > 	}
> > 
> > I'm still curious as to what's different about your setup, but certainly not
> > curious enough to hold up a fix.
> 
> Actually, none of the above is even relevant.  PPR virtualization in the nested
> VM-Enter case would be for _L2's_ vPPR, not L1's.  And that virtualization is
> performed by hardware (vmcs02 has the correct RVI, SVI, and vAPIC information
> for L2).
> 
> Which means my initial instinct that KVM is missing a PPR update somewhere is
> likely correct.

Talking to myself :-)

Assuming it actually fixes your issue, this is what I'm planning on posting.  I
suspect KVM botches something when the deprivileged host is active, but given
that the below will allow for additional cleanups, and practically speaking doesn't
have any downsides, I don't see any reason to withhold the hack-a-fix.  Though
hopefully we'll someday figure out exactly what's broken.

---
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 2 Oct 2024 08:53:23 -0700
Subject: [PATCH] KVM: nVMX: Explicitly update vPPR on successful nested
 VM-Enter
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Request pending event evaluation after successful nested VM-Enter if L1
has a pending IRQ, which in turn refreshes vPPR based on vTPR and vISR.
This fixes an issue where KVM will fail to deliver a pending IRQ to L1
when running an atypical hypervisor in L1, e.g. the pKVM port to VMX.

Odds are very good that refreshing L1's vPPR is papering over a missed
PPR update somewhere, e.g. the underlying bug is likely related to the fact
that pKVM passes through its APIC to the deprivileged host (which is an
L2 guest from KVM's perspective).

However, KVM updates PPR _constantly_, even when PPR technically shouldn't
be refreshed, e.g. kvm_vcpu_has_events() re-evaluates PPR if IRQs are
unblocked, by way of the same kvm_apic_has_interrupt() check.  Ditto for
nested VM-Enter itself, when nested posted interrupts are enabled.  Thus,
trying to avoid a PPR update on VM-Enter just to be pedantically accurate
is ridiculous, given the behavior elsewhere in KVM.

Unconditionally checking for interrupts will also allow for additional
cleanups, e.g. the manual RVI check earlier in VM-Enter emulation by
vmx_has_apicv_interrupt() can be dropped, and the aforementioned nested
posted interrupt logic can be massaged to better honor priority between
concurrent events.

Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com
Reported-by: Markku Ahvenjärvi <mankku@gmail.com>
Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com
Suggested-by: Markku Ahvenjärvi <mankku@gmail.com>
Cc: Janne Karhunen <janne.karhunen@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a8e7bc04d9bf..784b61c9810b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	 * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
 	 * unconditionally.
 	 */
-	if (unlikely(evaluate_pending_interrupts))
+	if (unlikely(evaluate_pending_interrupts) ||
+	    kvm_apic_has_interrupt(vcpu))
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	/*

base-commit: e32cde8d2bd7d251a8f9b434143977ddf13dcec6
-- 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-02 17:20           ` Sean Christopherson
@ 2024-10-03 11:29             ` Markku Ahvenjärvi
  2024-10-10 11:00             ` Chao Gao
  1 sibling, 0 replies; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-10-03 11:29 UTC (permalink / raw)
  To: seanjc
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mankku,
	mingo, pbonzini, tglx, x86

Hi Sean,

> Talking to myself :-)

A good monologue. :)

As you said, PPR is updated constantly via kvm_vcpu_has_events() anyway.

Actually, the implicit updates, especially in kvm_apic_has_interrupt(), make it
a bit difficult to follow when they take place. It is quite easy to
accidentally make an update in an architecturally incorrect place. It could be
useful to make the updates explicit.

> Assuming it actually fixes your issue, this is what I'm planning on posting.  I
> suspect KVM botches something when the deprivileged host is active, but given
> that the below will allow for additional cleanups, and practically speaking doesn't
> have any downsides, I don't see any reason to withhold the hack-a-fix.  Though
> hopefully we'll someday figure out exactly what's broken.

It fixes the issue. Thanks a lot Sean!

Kind regards,
Markku

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-02 17:20           ` Sean Christopherson
  2024-10-03 11:29             ` Markku Ahvenjärvi
@ 2024-10-10 11:00             ` Chao Gao
  2024-10-14 10:57               ` Markku Ahvenjärvi
  2024-10-16 18:54               ` Sean Christopherson
  1 sibling, 2 replies; 14+ messages in thread
From: Chao Gao @ 2024-10-10 11:00 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Markku Ahvenjärvi, bp, dave.hansen, hpa, janne.karhunen, kvm,
	linux-kernel, mingo, pbonzini, tglx, x86

On Wed, Oct 02, 2024 at 10:20:11AM -0700, Sean Christopherson wrote:
>On Wed, Oct 02, 2024, Sean Christopherson wrote:
>> On Wed, Oct 02, 2024, Sean Christopherson wrote:
>> > On Wed, Oct 02, 2024, Markku Ahvenjärvi wrote:
>> > > Hi Sean,
>> > > 
>> > > > On Fri, Sep 20, 2024, Markku Ahvenjärvi wrote:
>> > > > > Certain hypervisors running under KVM on VMX suffered L1 hangs after
>> > > > > launching a nested guest. External interrupts were not processed on
>> > > > > vmlaunch/vmresume due to a stale VPPR, and the L2 guest would resume without
>> > > > > allowing the L1 hypervisor to process the events.
>> > > > > 
>> > > > > The patch ensures VPPR is updated when checking for pending
>> > > > > interrupts.
>> > > >
>> > > > This is architecturally incorrect, PPR isn't refreshed at VM-Enter.
>> > > 
>> > > I looked into this and found the following from Intel manual:
>> > > 
>> > > "30.1.3 PPR Virtualization
>> > > 
>> > > The processor performs PPR virtualization in response to the following
>> > > operations: (1) VM entry; (2) TPR virtualization; and (3) EOI virtualization.
>> > > 
>> > > ..."
>> > > 
>> > > The section "27.3.2.5 Updating Non-Register State" further explains the VM
>> > > enter:
>> > > 
>> > > "If the “virtual-interrupt delivery” VM-execution control is 1, VM entry loads
>> > > the values of RVI and SVI from the guest interrupt-status field in the VMCS
>> > > (see Section 25.4.2). After doing so, the logical processor first causes PPR
>> > > virtualization (Section 30.1.3) and then evaluates pending virtual interrupts
>> > > (Section 30.2.1). If a virtual interrupt is recognized, it may be delivered in
>> > > VMX non-root operation immediately after VM entry (including any specified
>> > > event injection) completes; ..."
>> > > 
>> > > According to that, PPR is supposed to be refreshed at VM-Enter, or am I
>> > > missing something here?
>> > 
>> > Huh, I missed that.  It makes sense I guess; VM-Enter processes pending virtual
>> > interrupts, so it stands that VM-Enter would refresh PPR as well.
>> > 
>> > Ugh, and looking again, KVM refreshes PPR every time it checks for a pending
>> > interrupt, including the VM-Enter case (via kvm_apic_has_interrupt()) when nested
>> > posted interrupts are in use:
>> > 
>> > 	/* Emulate processing of posted interrupts on VM-Enter. */
>> > 	if (nested_cpu_has_posted_intr(vmcs12) &&
>> > 	    kvm_apic_has_interrupt(vcpu) == vmx->nested.posted_intr_nv) {
>> > 		vmx->nested.pi_pending = true;
>> > 		kvm_make_request(KVM_REQ_EVENT, vcpu);
>> > 		kvm_apic_clear_irr(vcpu, vmx->nested.posted_intr_nv);
>> > 	}
>> > 
>> > I'm still curious as to what's different about your setup, but certainly not
>> > curious enough to hold up a fix.
>> 
>> Actually, none of the above is even relevant.  PPR virtualization in the nested
>> VM-Enter case would be for _L2's_ vPPR, not L1's.  And that virtualization is
>> performed by hardware (vmcs02 has the correct RVI, SVI, and vAPIC information
>> for L2).
>> 
>> Which means my initial instinct that KVM is missing a PPR update somewhere is
>> likely correct.
>
>Talking to myself :-)
>
>Assuming it actually fixes your issue, this is what I'm planning on posting.  I
>suspect KVM botches something when the deprivileged host is active, but given
>that the below will allow for additional cleanups, and practically speaking doesn't
>have any downsides, I don't see any reason to withhold the hack-a-fix.  Though
>hopefully we'll someday figure out exactly what's broken.

The issue is that KVM does not properly update vmcs01's SVI. In this case, L1
does not intercept EOI MSR writes from the deprivileged host (L2), so KVM
emulates EOI writes by clearing the highest bit in vISR and updating vPPR.
However, SVI in vmcs01 is not updated, causing it to retain the interrupt vector
that was just EOI'd. On the next VM-entry to L1, the CPU performs PPR
virtualization, setting vPPR to SVI & 0xf0, which results in an incorrect vPPR.
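The failure mode described above follows directly from the SDM's PPR-virtualization rule (section 30.1.3): vPPR is derived from vTPR and SVI, so a stale SVI keeps vPPR high and masks pending interrupts. A minimal sketch of that rule (function and variable names are illustrative, not actual KVM symbols):

```c
#include <stdint.h>

/*
 * Sketch of PPR virtualization per SDM 30.1.3: if the priority class
 * (bits 7:4) of VTPR is at least that of SVI, VPPR takes VTPR;
 * otherwise VPPR takes SVI with the sub-class bits cleared.
 */
static uint8_t virtualize_ppr(uint8_t vtpr, uint8_t svi)
{
	if ((vtpr & 0xf0) >= (svi & 0xf0))
		return vtpr;		/* VPPR <- VTPR */

	return svi & 0xf0;		/* VPPR <- SVI & 0xF0 */
}
```

With vTPR of 0 and a stale SVI still holding the just-EOI'd vector (say 0x51), this yields a vPPR of 0x50, blocking delivery of any pending vector in class 0x50 or below.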

Can you try this fix?

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a93ac1b9be9..3d24194a648d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -122,6 +122,8 @@
 #define KVM_REQ_HV_TLB_FLUSH \
 	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
+#define KVM_REQ_UPDATE_HWAPIC_ISR \
+	KVM_ARCH_REQ_FLAGS(35, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 
 #define CR0_RESERVED_BITS                                               \
 	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
@@ -764,6 +766,7 @@ struct kvm_vcpu_arch {
 	u64 apic_base;
 	struct kvm_lapic *apic;    /* kernel irqchip context */
 	bool load_eoi_exitmap_pending;
+	bool update_hwapic_isr;
 	DECLARE_BITMAP(ioapic_handled_vectors, 256);
 	unsigned long apic_attention;
 	int32_t apic_arb_prio;
diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index b1eb46e26b2e..a8dad16161e4 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -220,6 +220,11 @@ static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
 		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
 	}
 
+	if (vcpu->arch.update_hwapic_isr) {
+		vcpu->arch.update_hwapic_isr = false;
+		kvm_make_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu);
+	}
+
 	vcpu->stat.guest_mode = 0;
 }
 
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5bb481aefcbc..d6a03c30f085 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
 	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
 		return;
 
+	if (is_guest_mode(apic->vcpu))
+		apic->vcpu->arch.update_hwapic_isr = true;
+
 	/*
 	 * We do get here for APIC virtualization enabled if the guest
 	 * uses the Hyper-V APIC enlightenment.  In this case we may need
@@ -3068,6 +3071,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 	return 0;
 }
 
+void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic->apicv_active)
+		kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
+}
+
 void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 {
 	struct hrtimer *timer;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 7ef8ae73e82d..ffa0c0e8bda9 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -266,6 +266,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
 bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
 void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
 bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu);
+void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu);
 
 static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 34b52b49f5e6..d90add3fbe99 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10968,6 +10968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 #endif
 		if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
 			kvm_vcpu_update_apicv(vcpu);
+		if (kvm_check_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu))
+			kvm_vcpu_update_hwapic_isr(vcpu);
 		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
 			kvm_check_async_pf_completion(vcpu);
 		if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))

>
>---
>From: Sean Christopherson <seanjc@google.com>
>Date: Wed, 2 Oct 2024 08:53:23 -0700
>Subject: [PATCH] KVM: nVMX: Explicitly update vPPR on successful nested
> VM-Enter
>MIME-Version: 1.0
>Content-Type: text/plain; charset=UTF-8
>Content-Transfer-Encoding: 8bit
>
>Request pending event evaluation after successful nested VM-Enter if L1
>has a pending IRQ, which in turn refreshes vPPR based on vTPR and vISR.
>This fixes an issue where KVM will fail to deliver a pending IRQ to L1
>when running an atypical hypervisor in L1, e.g. the pKVM port to VMX.
>
>Odds are very good that refreshing L1's vPPR is papering over a missed
>PPR update somewhere, e.g. the underlying bug is likely related to the fact
>that pKVM passes through its APIC to the deprivileged host (which is an
>L2 guest from KVM's perspective).
>
>However, KVM updates PPR _constantly_, even when PPR technically shouldn't
>be refreshed, e.g. kvm_vcpu_has_events() re-evaluates PPR if IRQs are
>unblocked, by way of the same kvm_apic_has_interrupt() check.  Ditto for
>nested VM-Enter itself, when nested posted interrupts are enabled.  Thus,
>trying to avoid a PPR update on VM-Enter just to be pedantically accurate
>is ridiculous, given the behavior elsewhere in KVM.
>
>Unconditionally checking for interrupts will also allow for additional
>cleanups, e.g. the manual RVI check earlier in VM-Enter emulation by
>vmx_has_apicv_interrupt() can be dropped, and the aforementioned nested
>posted interrupt logic can be massaged to better honor priority between
>concurrent events.
>
>Link: https://lore.kernel.org/kvm/20230312180048.1778187-1-jason.cj.chen@intel.com
>Reported-by: Markku Ahvenjärvi <mankku@gmail.com>
>Closes: https://lore.kernel.org/all/20240920080012.74405-1-mankku@gmail.com
>Suggested-by: Markku Ahvenjärvi <mankku@gmail.com>
>Cc: Janne Karhunen <janne.karhunen@gmail.com>
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
> arch/x86/kvm/vmx/nested.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index a8e7bc04d9bf..784b61c9810b 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -3593,7 +3593,8 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> 	 * effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
> 	 * unconditionally.
> 	 */
>-	if (unlikely(evaluate_pending_interrupts))
>+	if (unlikely(evaluate_pending_interrupts) ||
>+	    kvm_apic_has_interrupt(vcpu))
> 		kvm_make_request(KVM_REQ_EVENT, vcpu);
> 
> 	/*
>
>base-commit: e32cde8d2bd7d251a8f9b434143977ddf13dcec6
>-- 
>


* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-10 11:00             ` Chao Gao
@ 2024-10-14 10:57               ` Markku Ahvenjärvi
  2024-10-16 18:54               ` Sean Christopherson
  1 sibling, 0 replies; 14+ messages in thread
From: Markku Ahvenjärvi @ 2024-10-14 10:57 UTC (permalink / raw)
  To: chao.gao
  Cc: bp, dave.hansen, hpa, janne.karhunen, kvm, linux-kernel, mankku,
	mingo, pbonzini, seanjc, tglx, x86

Hi Chao,

> The issue is that KVM does not properly update vmcs01's SVI. In this case, L1
> does not intercept EOI MSR writes from the deprivileged host (L2), so KVM
> emulates EOI writes by clearing the highest bit in vISR and updating vPPR.
> However, SVI in vmcs01 is not updated, causing it to retain the interrupt vector
> that was just EOI'd. On the next VM-entry to L1, the CPU performs PPR
> virtualization, setting vPPR to SVI & 0xf0, which results in an incorrect vPPR.
> 
> Can you try this fix?

I tried, it also fixes the issue.

Thank you Chao, I really appreciate the explanation, it makes sense now.

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4a93ac1b9be9..3d24194a648d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -122,6 +122,8 @@
>  #define KVM_REQ_HV_TLB_FLUSH \
>         KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE   KVM_ARCH_REQ(34)
> +#define KVM_REQ_UPDATE_HWAPIC_ISR \
> +       KVM_ARCH_REQ_FLAGS(35, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
> 
>  #define CR0_RESERVED_BITS                                               \
>         (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -764,6 +766,7 @@ struct kvm_vcpu_arch {
>         u64 apic_base;
>         struct kvm_lapic *apic;    /* kernel irqchip context */
>         bool load_eoi_exitmap_pending;
> +       bool update_hwapic_isr;
>         DECLARE_BITMAP(ioapic_handled_vectors, 256);
>         unsigned long apic_attention;
>         int32_t apic_arb_prio;
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index b1eb46e26b2e..a8dad16161e4 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -220,6 +220,11 @@ static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
>                 kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
>         }
> 
> +       if (vcpu->arch.update_hwapic_isr) {
> +               vcpu->arch.update_hwapic_isr = false;
> +               kvm_make_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu);
> +       }
> +
>         vcpu->stat.guest_mode = 0;
>  }
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5bb481aefcbc..d6a03c30f085 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
>         if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
>                 return;
> 
> +       if (is_guest_mode(apic->vcpu))
> +               apic->vcpu->arch.update_hwapic_isr = true;
> +
>         /*
>          * We do get here for APIC virtualization enabled if the guest
>          * uses the Hyper-V APIC enlightenment.  In this case we may need
> @@ -3068,6 +3071,14 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
>         return 0;
>  }
> 
> +void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu)
> +{
> +       struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +       if (apic->apicv_active)
> +               kvm_x86_call(hwapic_isr_update)(apic_find_highest_isr(apic));
> +}
> +
>  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
>  {
>         struct hrtimer *timer;
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 7ef8ae73e82d..ffa0c0e8bda9 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -266,6 +266,7 @@ void kvm_lapic_expired_hv_timer(struct kvm_vcpu *vcpu);
>  bool kvm_lapic_hv_timer_in_use(struct kvm_vcpu *vcpu);
>  void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu);
>  bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu);
> +void kvm_vcpu_update_hwapic_isr(struct kvm_vcpu *vcpu);
> 
>  static inline enum lapic_mode kvm_apic_mode(u64 apic_base)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 34b52b49f5e6..d90add3fbe99 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10968,6 +10968,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  #endif
>                 if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
>                         kvm_vcpu_update_apicv(vcpu);
> +               if (kvm_check_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu))
> +                       kvm_vcpu_update_hwapic_isr(vcpu);
>                 if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
>                         kvm_check_async_pf_completion(vcpu);
>                 if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))

Kind regards,
Markku


* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-10 11:00             ` Chao Gao
  2024-10-14 10:57               ` Markku Ahvenjärvi
@ 2024-10-16 18:54               ` Sean Christopherson
  2024-10-17 13:27                 ` Chao Gao
  1 sibling, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2024-10-16 18:54 UTC (permalink / raw)
  To: Chao Gao
  Cc: Markku Ahvenjärvi, bp, dave.hansen, hpa, janne.karhunen, kvm,
	linux-kernel, mingo, pbonzini, tglx, x86

On Thu, Oct 10, 2024, Chao Gao wrote:
> The issue is that KVM does not properly update vmcs01's SVI. In this case, L1
> does not intercept EOI MSR writes from the deprivileged host (L2), so KVM

Oof.  It's not simply that L1 doesn't intercept EOI, it's also that L1 doesn't
have APICv enabled.

Note, the only reason there aren't a slew of other issues in this setup is that
KVM takes all APICv-related controls from vmcs12.  I.e. if L1 is sharing its APIC
with L2, then KVM will effectively disable APICv when running L2.  Which is
rather unfortunate for the pKVM-on-KVM case (lost performance), but functionally
it's ok.

E.g. if KVM enabled APICv in an APIC-passthrough scenario, then KVM would need
to load the correct EOI bitmaps when running L2, and keep them up-to-date for
both L1 and L2.

> emulates EOI writes by clearing the highest bit in vISR and updating vPPR.
> However, SVI in vmcs01 is not updated, causing it to retain the interrupt vector
> that was just EOI'd. On the next VM-entry to L1, the CPU performs PPR
> virtualization, setting vPPR to SVI & 0xf0, which results in an incorrect vPPR.
> 
> Can you try this fix?
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4a93ac1b9be9..3d24194a648d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -122,6 +122,8 @@
>  #define KVM_REQ_HV_TLB_FLUSH \
>  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> +#define KVM_REQ_UPDATE_HWAPIC_ISR \
> +	KVM_ARCH_REQ_FLAGS(35, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -764,6 +766,7 @@ struct kvm_vcpu_arch {
>  	u64 apic_base;
>  	struct kvm_lapic *apic;    /* kernel irqchip context */
>  	bool load_eoi_exitmap_pending;
> +	bool update_hwapic_isr;

Obviously not your fault, but I'm really beginning to hate all of the flags we're
accumulating, e.g. in addition to load_eoi_exitmap_pending and potentially
update_hwapic_isr, VMX also has:

	bool change_vmcs01_virtual_apic_mode;
	bool reload_vmcs01_apic_access_page;
	bool update_vmcs01_cpu_dirty_logging;
	bool update_vmcs01_apicv_status;

Doesn't (and shouldn't) need to be handled as part of this, but I wonder if we
>can clean things up a bit by implementing something along the lines of deferred
requests, e.g. nested_vmx_defer_request(NESTED_VMX_REQ_XXX, vcpu).  We'd still
need to add each request, but it would cut down on some of the boilerplate, e.g.

	if (vmx->nested.change_vmcs01_virtual_apic_mode) {
		vmx->nested.change_vmcs01_virtual_apic_mode = false;
		vmx_set_virtual_apic_mode(vcpu);
	}

would become

	if (nested_vmx_check_request(NESTED_VMX_REQ_VAPIC_MODE, vcpu))
		vmx_set_virtual_apic_mode(vcpu);

An alternative idea would be to use KVM_REQ_XXX directly, and shuffle them from
vmx->nested to vcpu, but that would pollute KVM_REQ_XXX for the cases where KVM
doesn't need a "normal" request.

Another idea would be to temporarily switch to vmcs01 as needed, which would
probably be ok for most cases since they are uncommon events, but for EOIs in
this situation, the overhead would be non-trivial and completely unnecessary.
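One shape the deferred-request idea above could take is a single bitmask replacing the per-event bool flags. All names here are hypothetical, mirroring the helpers suggested in the text rather than existing KVM code:

```c
#include <stdbool.h>

/* Hypothetical deferred-request IDs, one bit each. */
enum nested_vmx_req {
	NESTED_VMX_REQ_VAPIC_MODE,
	NESTED_VMX_REQ_APIC_ACCESS_PAGE,
	NESTED_VMX_REQ_HWAPIC_ISR,
};

struct nested_state {
	unsigned long deferred_reqs;	/* replaces the bool flags */
};

static void nested_vmx_defer_request(struct nested_state *n, int req)
{
	n->deferred_reqs |= 1ul << req;
}

/* Test-and-clear, so each deferred request fires exactly once. */
static bool nested_vmx_check_request(struct nested_state *n, int req)
{
	if (!(n->deferred_reqs & (1ul << req)))
		return false;

	n->deferred_reqs &= ~(1ul << req);
	return true;
}
```

This keeps the test-and-clear semantics of the existing bool flags while cutting the boilerplate down to one line per site.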

>  	DECLARE_BITMAP(ioapic_handled_vectors, 256);
>  	unsigned long apic_attention;
>  	int32_t apic_arb_prio;
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index b1eb46e26b2e..a8dad16161e4 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -220,6 +220,11 @@ static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
>  		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
>  	}
>  
> +	if (vcpu->arch.update_hwapic_isr) {
> +		vcpu->arch.update_hwapic_isr = false;
> +		kvm_make_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu);

I don't think we need a new request for this, KVM can refresh SVI directly in
nested_vmx_vmexit(), e.g. along with change_vmcs01_virtual_apic_mode and friends.

> +	}
> +
>  	vcpu->stat.guest_mode = 0;
>  }
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5bb481aefcbc..d6a03c30f085 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
>  	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
>  		return;
>  
> +	if (is_guest_mode(apic->vcpu))

As above, I think this needs to be

	if (is_guest_mode(apic->vcpu) && !nested_cpu_has_vid(get_vmcs12(vcpu)))

because if virtual interrupt delivery is enabled, then EOIs are virtualized.
Which means that this needs to be handled in vmx_hwapic_isr_update().

>Hmm, actually, there's more to it, I think.  If virtual interrupt delivery is
disabled for L2, then writing vmcs02 is pointless because GUEST_INTR_STATUS is
unused by the CPU.

Argh, but hwapic_isr_update() doesn't take @vcpu.  That's easy enough to fix,
just annoying.

Chao, can you provide your SoB for your code?  If you've no objections, I'll
massage it to avoid using a request and write a changelog, and then post it as
part of a small series.

Thanks again!

> +		apic->vcpu->arch.update_hwapic_isr = true;
> +
>  	/*
>  	 * We do get here for APIC virtualization enabled if the guest
>  	 * uses the Hyper-V APIC enlightenment.  In this case we may need


* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-16 18:54               ` Sean Christopherson
@ 2024-10-17 13:27                 ` Chao Gao
  2024-10-17 16:05                   ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Chao Gao @ 2024-10-17 13:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Markku Ahvenjärvi, bp, dave.hansen, hpa, janne.karhunen, kvm,
	linux-kernel, mingo, pbonzini, tglx, x86

>
>I don't think we need a new request for this, KVM can refresh SVI directly in
>nested_vmx_vmexit(), e.g. along with change_vmcs01_virtual_apic_mode and friends.

Agree.

>
>> +	}
>> +
>>  	vcpu->stat.guest_mode = 0;
>>  }
>>  
>> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> index 5bb481aefcbc..d6a03c30f085 100644
>> --- a/arch/x86/kvm/lapic.c
>> +++ b/arch/x86/kvm/lapic.c
>> @@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
>>  	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
>>  		return;
>>  
>> +	if (is_guest_mode(apic->vcpu))
>
>As above, I think this needs to be
>
>	if (is_guest_mode(apic->vcpu) && !nested_cpu_has_vid(get_vmcs12(vcpu)))
>
>because if virtual interrupt delivery is enabled, then EOIs are virtualized.
>Which means that this needs to be handled in vmx_hwapic_isr_update().

I'm not sure if nested_cpu_has_vid() is necessary. My understanding is that
when a bit in the vCPU's vISR is cleared, the vCPU's SVI (i.e., SVI in vmcs01)
may be stale and so needs an update if vmcs01 isn't the active VMCS (i.e., the
vCPU is in guest mode).
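Conceptually, refreshing a stale SVI just means recomputing the highest in-service vector from vISR, which is what apic_find_highest_isr() does in the fix above. A minimal sketch of that scan (using a flat bitmap for illustration, not the real 16-byte-strided APIC register layout):

```c
#define APIC_VECTORS	256

/*
 * Return the highest vector with an in-service bit set, or -1 if
 * none.  After an emulated EOI clears a bit in vISR, this is the
 * value that SVI must be reprogrammed with.
 */
static int find_highest_isr(const unsigned int isr[APIC_VECTORS / 32])
{
	int vec;

	for (vec = APIC_VECTORS - 1; vec >= 0; vec--)
		if (isr[vec / 32] & (1u << (vec % 32)))
			return vec;

	return -1;	/* no in-service interrupt */
}
```

In the bug scenario, the emulated EOI clears the highest vISR bit but vmcs01's SVI is never reprogrammed with the new result of this scan, so hardware keeps computing vPPR from the EOI'd vector.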

If L1 enables VID and EOIs from L2 are virtualized by KVM (L0), KVM shouldn't
call this function in the first place. Because KVM should update the
'virt-APIC' page in VMCS12, rather than updating the vISR of the vCPU.

>
>Hmm, actually, there's more to it, I think.  If virtual interrupt delivery is
>disabled for L2, then writing vmcs02 is pointless because GUEST_INTR_STATUS is
>unused by the CPU.
>
>Argh, but hwapic_isr_update() doesn't take @vcpu.  That's easy enough to fix,
>just annoying.
>
>Chao, can you provide your SoB for your code?  If you've no objections, I'll
>massage it to avoid using a request and write a changelog, and then post it as
>part of a small series.

Sure.

Signed-off-by: Chao Gao <chao.gao@intel.com>

Thanks for your help.

>
>Thanks again!
>
>> +		apic->vcpu->arch.update_hwapic_isr = true;
>> +
>>  	/*
>>  	 * We do get here for APIC virtualization enabled if the guest
>>  	 * uses the Hyper-V APIC enlightenment.  In this case we may need


* Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume
  2024-10-17 13:27                 ` Chao Gao
@ 2024-10-17 16:05                   ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2024-10-17 16:05 UTC (permalink / raw)
  To: Chao Gao
  Cc: Markku Ahvenjärvi, bp, dave.hansen, hpa, janne.karhunen, kvm,
	linux-kernel, mingo, pbonzini, tglx, x86

On Thu, Oct 17, 2024, Chao Gao wrote:
> >> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> >> index 5bb481aefcbc..d6a03c30f085 100644
> >> --- a/arch/x86/kvm/lapic.c
> >> +++ b/arch/x86/kvm/lapic.c
> >> @@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
> >>  	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
> >>  		return;
> >>  
> >> +	if (is_guest_mode(apic->vcpu))
> >
> >As above, I think this needs to be
> >
> >	if (is_guest_mode(apic->vcpu) && !nested_cpu_has_vid(get_vmcs12(vcpu)))
> >
> >because if virtual interrupt delivery is enabled, then EOIs are virtualized.
> >Which means that this needs to be handled in vmx_hwapic_isr_update().
> 
> I'm not sure if nested_cpu_has_vid() is necessary. My understanding is that
> when a bit in the vCPU's vISR is cleared, the vCPU's SVI (i.e., SVI in vmcs01)
> may be stale and so needs an update if vmcs01 isn't the active VMCS (i.e., the
> vCPU is in guest mode).
> 
> If L1 enables VID and EOIs from L2 are virtualized by KVM (L0), KVM shouldn't
> call this function in the first place. Because KVM should update the
> 'virt-APIC' page in VMCS12, rather than updating the vISR of the vCPU.

Ah, right.  And KVM handles that in nested_vmx_l1_wants_exit(), by forwarding all
APICv exits to L1:

	case EXIT_REASON_APIC_ACCESS:
	case EXIT_REASON_APIC_WRITE:
	case EXIT_REASON_EOI_INDUCED:
		/*
		 * The controls for "virtualize APIC accesses," "APIC-
		 * register virtualization," and "virtual-interrupt
		 * delivery" only come from vmcs12.
		 */
		return true;


end of thread, other threads:[~2024-10-17 16:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-20  7:59 [PATCH 0/1] KVM: nVMX: update VPPR on vmlaunch/vmresume Markku Ahvenjärvi
2024-09-20  7:59 ` [PATCH 1/1] " Markku Ahvenjärvi
2024-09-20  8:18   ` Sean Christopherson
2024-09-20 12:40     ` Markku Ahvenjärvi
2024-10-02 12:42     ` Markku Ahvenjärvi
2024-10-02 15:52       ` Sean Christopherson
2024-10-02 16:49         ` Sean Christopherson
2024-10-02 17:20           ` Sean Christopherson
2024-10-03 11:29             ` Markku Ahvenjärvi
2024-10-10 11:00             ` Chao Gao
2024-10-14 10:57               ` Markku Ahvenjärvi
2024-10-16 18:54               ` Sean Christopherson
2024-10-17 13:27                 ` Chao Gao
2024-10-17 16:05                   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox