* [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on @ 2025-08-19 13:32 Maciej S. Szmigiero 2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw) To: Paolo Bonzini, Sean Christopherson Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would *not* get copied into the V_TPR field of VMCB. AVIC does sync between these two fields, however it does so only on explicit guest writes to one of these fields, not on a bare VMRUN. This is especially true when it is the userspace setting LAPIC state via KVM_SET_LAPIC ioctl() since userspace does not have access to the guest VMCB. Practice shows that it is the V_TPR that is actually used by the AVIC to decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), so any leftover value in V_TPR will cause serious interrupt delivery issues in the guest when AVIC is enabled. Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and similar code paths when AVIC is enabled. Add also a relevant set of tests to xapic_state_test so hopefully we'll be protected against getting such regressions in the future. Yes, this breaks real guests when AVIC is enabled. Specifically, the one OS that sometimes needs different handling and its name begins with letter 'W'. KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs KVM: selftests: Test TPR / CR8 sync and interrupt masking arch/x86/kvm/svm/avic.c | 23 ++ .../testing/selftests/kvm/include/x86/apic.h | 5 + .../selftests/kvm/x86/xapic_state_test.c | 265 +++++++++++++++++- 3 files changed, 290 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero @ 2025-08-19 13:32 ` Maciej S. Szmigiero 2025-08-21 20:38 ` Sean Christopherson 2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero 2025-08-21 8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao 2 siblings, 1 reply; 12+ messages in thread From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw) To: Paolo Bonzini, Sean Christopherson Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would not get copied into the V_TPR field of VMCB. AVIC does sync between these two fields, however it does so only on explicit guest writes to one of these fields, not on a bare VMRUN. This is especially true when it is the userspace setting LAPIC state via KVM_SET_LAPIC ioctl() since userspace does not have access to the guest VMCB. Practice shows that it is the V_TPR that is actually used by the AVIC to decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), so any leftover value in V_TPR will cause serious interrupt delivery issues in the guest when AVIC is enabled. Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and similar code paths when AVIC is enabled. Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> --- arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index a34c5c3b164e..877bc3db2c6e 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm) void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) { + struct vcpu_svm *svm = to_svm(vcpu); + u64 cr8; + avic_handle_dfr_update(vcpu); avic_handle_ldr_update(vcpu); + + /* Running nested should have inhibited AVIC. */ + if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu))) + return; + + /* + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. + * + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() + * is inhibited so any set TPR LAPIC state would not get reflected + * in V_TPR. + * + * Practice shows that it is the V_TPR that is actually used by the + * AVIC to decide whether to issue pending interrupts to the CPU, not + * TPR in TASKPRI. + */ + cr8 = kvm_get_cr8(vcpu); + svm->vmcb->control.int_ctl &= ~V_TPR_MASK; + svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK; + WARN_ON_ONCE(!vmcb_is_dirty(svm->vmcb, VMCB_INTR)); } static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero @ 2025-08-21 20:38 ` Sean Christopherson 2025-08-22 9:04 ` Naveen N Rao 2025-08-22 23:20 ` Maciej S. Szmigiero 0 siblings, 2 replies; 12+ messages in thread From: Sean Christopherson @ 2025-08-21 20:38 UTC (permalink / raw) To: Maciej S. Szmigiero Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is > inhibited so any changed TPR in the LAPIC state would not get copied into > the V_TPR field of VMCB. > > AVIC does sync between these two fields, however it does so only on > explicit guest writes to one of these fields, not on a bare VMRUN. > > This is especially true when it is the userspace setting LAPIC state via > KVM_SET_LAPIC ioctl() since userspace does not have access to the guest > VMCB. > > Practice shows that it is the V_TPR that is actually used by the AVIC to > decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), > so any leftover value in V_TPR will cause serious interrupt delivery issues > in the guest when AVIC is enabled. > > Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in > avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and > similar code paths when AVIC is enabled. > > Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> > --- > arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++ > 1 file changed, 23 insertions(+) > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c > index a34c5c3b164e..877bc3db2c6e 100644 > --- a/arch/x86/kvm/svm/avic.c > +++ b/arch/x86/kvm/svm/avic.c > @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm) > > void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) > { > + struct vcpu_svm *svm = to_svm(vcpu); > + u64 cr8; > + > avic_handle_dfr_update(vcpu); > avic_handle_ldr_update(vcpu); > + > + /* Running nested should have inhibited AVIC. */ > + if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu))) > + return; > + > + /* > + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. > + * > + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() > + * is inhibited so any set TPR LAPIC state would not get reflected > + * in V_TPR. Hmm, I think that code is straight up wrong. There's no justification, just a claim: commit 3bbf3565f48ce3999b5a12cde946f81bd4475312 Author: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> AuthorDate: Wed May 4 14:09:51 2016 -0500 Commit: Paolo Bonzini <pbonzini@redhat.com> CommitDate: Wed May 18 18:04:31 2016 +0200 svm: Do not intercept CR8 when enable AVIC When enable AVIC: * Do not intercept CR8 since this should be handled by AVIC HW. * Also, we don't need to sync cr8/V_TPR and APIC backing page. <====== Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> That claim assumes APIC[TPR] will _never_ be modified by anything other than hardware. That's obviously false for state restore from userspace, and it's also technically false at steady state, e.g. if KVM managed to trigger emulation of a store to the APIC page, then KVM would bypass the automatic harware sync. There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC except for the initial ioctl), but could again set APIC[TPR] without updating V_TPR. So, rather than manually do the update during state restore, my vote is to restore the sync logic. And if we want to optimize that code (seems unnecessary), then we should hook all TPR writes. diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index d9931c6c4bc6..1bfebe40854f 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) struct vcpu_svm *svm = to_svm(vcpu); u64 cr8; - if (nested_svm_virtualize_tpr(vcpu) || - kvm_vcpu_apicv_active(vcpu)) + if (nested_svm_virtualize_tpr(vcpu)) return; cr8 = kvm_get_cr8(vcpu); > + * > + * Practice shows that it is the V_TPR that is actually used by the > + * AVIC to decide whether to issue pending interrupts to the CPU, not > + * TPR in TASKPRI. FWIW, the APM kinda sorta alludes to this: Enabling AVIC also affects CR8 behavior independent of V_INTR_MASKING enable (bit 24): writes to CR8 affect the V_TPR and update the backing page and reads from CR8 return V_TPR. > + */ > + cr8 = kvm_get_cr8(vcpu); > + svm->vmcb->control.int_ctl &= ~V_TPR_MASK; > + svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK; > + WARN_ON_ONCE(!vmcb_is_dirty(svm->vmcb, VMCB_INTR)); > } > > static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd) ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-21 20:38 ` Sean Christopherson @ 2025-08-22 9:04 ` Naveen N Rao 2025-08-22 20:54 ` Sean Christopherson 2025-08-22 23:20 ` Maciej S. Szmigiero 1 sibling, 1 reply; 12+ messages in thread From: Naveen N Rao @ 2025-08-22 9:04 UTC (permalink / raw) To: Sean Christopherson Cc: Maciej S. Szmigiero, Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel On Thu, Aug 21, 2025 at 01:38:17PM -0700, Sean Christopherson wrote: > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote: > > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > > > When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is > > inhibited so any changed TPR in the LAPIC state would not get copied into > > the V_TPR field of VMCB. > > > > AVIC does sync between these two fields, however it does so only on > > explicit guest writes to one of these fields, not on a bare VMRUN. > > > > This is especially true when it is the userspace setting LAPIC state via > > KVM_SET_LAPIC ioctl() since userspace does not have access to the guest > > VMCB. > > > > Practice shows that it is the V_TPR that is actually used by the AVIC to > > decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), > > so any leftover value in V_TPR will cause serious interrupt delivery issues > > in the guest when AVIC is enabled. > > > > Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in > > avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and > > similar code paths when AVIC is enabled. > > > > Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") > > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> > > --- > > arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++ > > 1 file changed, 23 insertions(+) > > > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c > > index a34c5c3b164e..877bc3db2c6e 100644 > > --- a/arch/x86/kvm/svm/avic.c > > +++ b/arch/x86/kvm/svm/avic.c > > @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm) > > > > void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) > > { > > + struct vcpu_svm *svm = to_svm(vcpu); > > + u64 cr8; > > + > > avic_handle_dfr_update(vcpu); > > avic_handle_ldr_update(vcpu); > > + > > + /* Running nested should have inhibited AVIC. */ > > + if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu))) > > + return; > > > > + > > + /* > > + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. > > + * > > + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() > > + * is inhibited so any set TPR LAPIC state would not get reflected > > + * in V_TPR. > > Hmm, I think that code is straight up wrong. There's no justification, just a > claim: > > commit 3bbf3565f48ce3999b5a12cde946f81bd4475312 > Author: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> > AuthorDate: Wed May 4 14:09:51 2016 -0500 > Commit: Paolo Bonzini <pbonzini@redhat.com> > CommitDate: Wed May 18 18:04:31 2016 +0200 > > svm: Do not intercept CR8 when enable AVIC > > When enable AVIC: > * Do not intercept CR8 since this should be handled by AVIC HW. > * Also, we don't need to sync cr8/V_TPR and APIC backing page. <====== > > Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> > [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo] > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > > That claim assumes APIC[TPR] will _never_ be modified by anything other than > hardware. It also isn't clear to me why only sync_lapic_to_cr8() was gated when AVIC was enabled, while sync_cr8_to_lapic() continues to copy V_TRP to the backing page. If AVIC is enabled, then the AVIC hardware updates both the backing page and V_TPR on a guest write to TPR. > That's obviously false for state restore from userspace, and it's also > technically false at steady state, e.g. if KVM managed to trigger emulation of a > store to the APIC page, then KVM would bypass the automatic harware sync. Do you mean emulation due to AVIC being inhibited? I initially thought this could be a problem, but in this scenario, AVIC would be disabled on the next VMRUN, so we will end up sync'ing TPR from the lapic to V_TPR. > > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC > except for the initial ioctl), but could again set APIC[TPR] without updating > V_TPR. > > So, rather than manually do the update during state restore, my vote > is to restore the sync logic. And if we want to optimize that code > (seems unnecessary), then we should hook all TPR writes. I guess you mean apic_set_tpr()? We will need to hook into that in addition to updating avic_apicv_post_state_restore() since KVM_SET_LAPIC just does a memcpy of the register state. > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index d9931c6c4bc6..1bfebe40854f 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) > struct vcpu_svm *svm = to_svm(vcpu); > u64 cr8; > > - if (nested_svm_virtualize_tpr(vcpu) || > - kvm_vcpu_apicv_active(vcpu)) > + if (nested_svm_virtualize_tpr(vcpu)) > return; > > cr8 = kvm_get_cr8(vcpu); I agree that this is a simpler fix, so would be good to do for backport ease. The code in sync_lapic_to_cr8 ends up being a function call to kvm_get_cr8() and ~6 instructions, which isn't that much. But if we can gate sync'ing V_TPR to the backing page in sync_cr8_to_lapic() as well, then it might be good to do so. - Naveen ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-22 9:04 ` Naveen N Rao @ 2025-08-22 20:54 ` Sean Christopherson 0 siblings, 0 replies; 12+ messages in thread From: Sean Christopherson @ 2025-08-22 20:54 UTC (permalink / raw) To: Naveen N Rao Cc: Maciej S. Szmigiero, Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel On Fri, Aug 22, 2025, Naveen N Rao wrote: > On Thu, Aug 21, 2025 at 01:38:17PM -0700, Sean Christopherson wrote: > > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote: > > > + /* > > > + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. > > > + * > > > + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() > > > + * is inhibited so any set TPR LAPIC state would not get reflected > > > + * in V_TPR. > > > > Hmm, I think that code is straight up wrong. There's no justification, just a > > claim: > > > > commit 3bbf3565f48ce3999b5a12cde946f81bd4475312 > > Author: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> > > AuthorDate: Wed May 4 14:09:51 2016 -0500 > > Commit: Paolo Bonzini <pbonzini@redhat.com> > > CommitDate: Wed May 18 18:04:31 2016 +0200 > > > > svm: Do not intercept CR8 when enable AVIC > > > > When enable AVIC: > > * Do not intercept CR8 since this should be handled by AVIC HW. > > * Also, we don't need to sync cr8/V_TPR and APIC backing page. <====== > > > > Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> > > [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo] > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > > > > That claim assumes APIC[TPR] will _never_ be modified by anything other than > > hardware. > > It also isn't clear to me why only sync_lapic_to_cr8() was gated when > AVIC was enabled, while sync_cr8_to_lapic() continues to copy V_TRP to > the backing page. If AVIC is enabled, then the AVIC hardware updates > both the backing page and V_TPR on a guest write to TPR. > > > That's obviously false for state restore from userspace, and it's also > > technically false at steady state, e.g. if KVM managed to trigger emulation of a > > store to the APIC page, then KVM would bypass the automatic harware sync. > > Do you mean emulation due to AVIC being inhibited? I initially thought > this could be a problem, but in this scenario, AVIC would be disabled on > the next VMRUN, so we will end up sync'ing TPR from the lapic to V_TPR. No, if emulation is triggered when AVIC isn't inhibited. E.g. a contrived but entirely possible situation would be if MOVBE isn't supported in hardware, KVM is emulating MOVBE for emulation, and the guest sets the TPR via MOVBE. The MOVBE #UDs, KVM emulates in response to the #UD, and Bob's your uncle. There are other scenarios where KVM would emulate, though they're even more contrived. > > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to > > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC > > except for the initial ioctl), but could again set APIC[TPR] without updating > > V_TPR. > > > > So, rather than manually do the update during state restore, my vote > > is to restore the sync logic. And if we want to optimize that code > > (seems unnecessary), then we should hook all TPR writes. > > I guess you mean apic_set_tpr()? Yep. > We will need to hook into that in addition to updating > avic_apicv_post_state_restore() since KVM_SET_LAPIC just does a memcpy of the > register state. Yeah, or explicitly call the hook, e.g. like kvm_apic_set_state() does for hwapic_isr_update(). But I don't think we should add a hook unless someone proves that unconditionally synchronizing before VMRUN affects performance. > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index d9931c6c4bc6..1bfebe40854f 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) > > struct vcpu_svm *svm = to_svm(vcpu); > > u64 cr8; > > > > - if (nested_svm_virtualize_tpr(vcpu) || > > - kvm_vcpu_apicv_active(vcpu)) > > + if (nested_svm_virtualize_tpr(vcpu)) > > return; > > > > cr8 = kvm_get_cr8(vcpu); > > I agree that this is a simpler fix, so would be good to do for backport > ease. > > The code in sync_lapic_to_cr8 ends up being a function call to > kvm_get_cr8() and ~6 instructions, which isn't that much. But if we can > gate sync'ing V_TPR to the backing page in sync_cr8_to_lapic() as well, > then it might be good to do so. > > > - Naveen > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-21 20:38 ` Sean Christopherson 2025-08-22 9:04 ` Naveen N Rao @ 2025-08-22 23:20 ` Maciej S. Szmigiero 2025-08-22 23:41 ` Sean Christopherson 1 sibling, 1 reply; 12+ messages in thread From: Maciej S. Szmigiero @ 2025-08-22 23:20 UTC (permalink / raw) To: Sean Christopherson Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel, Naveen N Rao, Radim Krčmář On 21.08.2025 22:38, Sean Christopherson wrote: > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote: >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> >> When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is >> inhibited so any changed TPR in the LAPIC state would not get copied into >> the V_TPR field of VMCB. >> >> AVIC does sync between these two fields, however it does so only on >> explicit guest writes to one of these fields, not on a bare VMRUN. >> >> This is especially true when it is the userspace setting LAPIC state via >> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest >> VMCB. >> >> Practice shows that it is the V_TPR that is actually used by the AVIC to >> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), >> so any leftover value in V_TPR will cause serious interrupt delivery issues >> in the guest when AVIC is enabled. >> >> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in >> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and >> similar code paths when AVIC is enabled. >> >> Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC") >> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> >> --- >> arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++ >> 1 file changed, 23 insertions(+) >> >> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c >> index a34c5c3b164e..877bc3db2c6e 100644 >> --- a/arch/x86/kvm/svm/avic.c >> +++ b/arch/x86/kvm/svm/avic.c >> @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm) >> >> void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu) >> { >> + struct vcpu_svm *svm = to_svm(vcpu); >> + u64 cr8; >> + >> avic_handle_dfr_update(vcpu); >> avic_handle_ldr_update(vcpu); >> + >> + /* Running nested should have inhibited AVIC. */ >> + if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu))) >> + return; > > >> + >> + /* >> + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. >> + * >> + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() >> + * is inhibited so any set TPR LAPIC state would not get reflected >> + * in V_TPR. > > Hmm, I think that code is straight up wrong. There's no justification, just a > claim: > > commit 3bbf3565f48ce3999b5a12cde946f81bd4475312 > Author: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> > AuthorDate: Wed May 4 14:09:51 2016 -0500 > Commit: Paolo Bonzini <pbonzini@redhat.com> > CommitDate: Wed May 18 18:04:31 2016 +0200 > > svm: Do not intercept CR8 when enable AVIC > > When enable AVIC: > * Do not intercept CR8 since this should be handled by AVIC HW. > * Also, we don't need to sync cr8/V_TPR and APIC backing page. <====== > > Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> > [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo] > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > > That claim assumes APIC[TPR] will _never_ be modified by anything other than > hardware. That's obviously false for state restore from userspace, and it's also > technically false at steady state, e.g. if KVM managed to trigger emulation of a > store to the APIC page, then KVM would bypass the automatic harware sync. > > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC > except for the initial ioctl), but could again set APIC[TPR] without updating > V_TPR. > > So, rather than manually do the update during state restore, my vote is to restore > the sync logic. And if we want to optimize that code (seems unnecessary), then > we should hook all TPR writes. > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index d9931c6c4bc6..1bfebe40854f 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) > struct vcpu_svm *svm = to_svm(vcpu); > u64 cr8; > > - if (nested_svm_virtualize_tpr(vcpu) || > - kvm_vcpu_apicv_active(vcpu)) > + if (nested_svm_virtualize_tpr(vcpu)) > return; > > cr8 = kvm_get_cr8(vcpu); > > So you want to just do an unconditional LAPIC -> V_TPR sync at each VMRUN and not try to patch every code flow where these possibly could get de-synced to do such sync only on demand, correct? By the way, the original Suravee's submission for the aforementioned patch did *not* inhibit that sync when AVIC is on [1]. Something similar to this sync inhibit only showed in v4 [2], probably upon Radim's comment on v3 [3] that: > I think we can exit early with svm_vcpu_avic_enabled(). But the initial sync inhibit condition in v4 was essentially nested_svm_virtualize_tpr() && svm_vcpu_avic_enabled(), which suggests there was some confusion what was exactly meant by the reviewer comment. The final sync inhibit condition only showed in v5 [4]. No further discussion happened on that point. Thanks, Maciej [1]: https://lore.kernel.org/kvm/1455285574-27892-9-git-send-email-suravee.suthikulpanit@amd.com/ [2]: https://lore.kernel.org/kvm/1460017232-17429-11-git-send-email-Suravee.Suthikulpanit@amd.com/ [3]: https://lore.kernel.org/kvm/20160318211048.GB26119@potion.brq.redhat.com/ [4]: https://lore.kernel.org/kvm/1462388992-25242-13-git-send-email-Suravee.Suthikulpanit@amd.com/ ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs 2025-08-22 23:20 ` Maciej S. Szmigiero @ 2025-08-22 23:41 ` Sean Christopherson 0 siblings, 0 replies; 12+ messages in thread From: Sean Christopherson @ 2025-08-22 23:41 UTC (permalink / raw) To: Maciej S. Szmigiero Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel, Naveen N Rao, Radim Krčmář On Sat, Aug 23, 2025, Maciej S. Szmigiero wrote: > On 21.08.2025 22:38, Sean Christopherson wrote: > > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote: > > > + /* > > > + * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB. > > > + * > > > + * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() > > > + * is inhibited so any set TPR LAPIC state would not get reflected > > > + * in V_TPR. > > > > Hmm, I think that code is straight up wrong. There's no justification, just a > > claim: > > > > commit 3bbf3565f48ce3999b5a12cde946f81bd4475312 > > Author: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com> > > AuthorDate: Wed May 4 14:09:51 2016 -0500 > > Commit: Paolo Bonzini <pbonzini@redhat.com> > > CommitDate: Wed May 18 18:04:31 2016 +0200 > > > > svm: Do not intercept CR8 when enable AVIC > > When enable AVIC: > > * Do not intercept CR8 since this should be handled by AVIC HW. > > * Also, we don't need to sync cr8/V_TPR and APIC backing page. <====== > > Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> > > [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo] > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> > > > > That claim assumes APIC[TPR] will _never_ be modified by anything other than > > hardware. That's obviously false for state restore from userspace, and it's also > > technically false at steady state, e.g. if KVM managed to trigger emulation of a > > store to the APIC page, then KVM would bypass the automatic harware sync. > > > > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to > > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC > > except for the initial ioctl), but could again set APIC[TPR] without updating > > V_TPR. > > > > So, rather than manually do the update during state restore, my vote is to restore > > the sync logic. And if we want to optimize that code (seems unnecessary), then > > we should hook all TPR writes. > > > > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > > index d9931c6c4bc6..1bfebe40854f 100644 > > --- a/arch/x86/kvm/svm/svm.c > > +++ b/arch/x86/kvm/svm/svm.c > > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu) > > struct vcpu_svm *svm = to_svm(vcpu); > > u64 cr8; > > - if (nested_svm_virtualize_tpr(vcpu) || > > - kvm_vcpu_apicv_active(vcpu)) > > + if (nested_svm_virtualize_tpr(vcpu)) > > return; > > cr8 = kvm_get_cr8(vcpu); > > > > > > So you want to just do an unconditional LAPIC -> V_TPR sync at each VMRUN > and not try to patch every code flow where these possibly could get de-synced > to do such sync only on demand, correct? Yep. For a fix, I definitely want to go with the bare minimum. If we want to optimize the sync, that can be done on top, and it can be done irrespective of AVIC. E.g. for guests that don't modify TPR, the sync is almost pure overhead too. > By the way, the original Suravee's submission for the aforementioned patch > did *not* inhibit that sync when AVIC is on [1]. > > Something similar to this sync inhibit only showed in v4 [2], > probably upon Radim's comment on v3 [3] that: > > I think we can exit early with svm_vcpu_avic_enabled(). > > But the initial sync inhibit condition in v4 was essentially > nested_svm_virtualize_tpr() && svm_vcpu_avic_enabled(), > which suggests there was some confusion what was exactly meant > by the reviewer comment. Hmm, I suspect that was just a goof. My guess is that Radim and Suravee simply forgot to consider TPR writes that aren't handled by hardware. > The final sync inhibit condition only showed in v5 [4]. > No further discussion happened on that point. > > Thanks, > Maciej > > [1]: https://lore.kernel.org/kvm/1455285574-27892-9-git-send-email-suravee.suthikulpanit@amd.com/ > [2]: https://lore.kernel.org/kvm/1460017232-17429-11-git-send-email-Suravee.Suthikulpanit@amd.com/ > [3]: https://lore.kernel.org/kvm/20160318211048.GB26119@potion.brq.redhat.com/ > [4]: https://lore.kernel.org/kvm/1462388992-25242-13-git-send-email-Suravee.Suthikulpanit@amd.com/ > ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking 2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero 2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero @ 2025-08-19 13:32 ` Maciej S. Szmigiero 2025-08-21 8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao 2 siblings, 0 replies; 12+ messages in thread From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw) To: Paolo Bonzini, Sean Christopherson Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> Add a few extra TPR / CR8 tests to x86's xapic_state_test to see if: * TPR is 0 on reset, * TPR, PPR and CR8 are equal inside the guest, * TPR and CR8 read equal by the host after a VMExit * TPR borderline values set by the host correctly mask interrupts in the guest. These hopefully will catch the most obvious cases of improper TPR sync or interrupt masking. Do these tests both in x2APIC and xAPIC modes. The x2APIC mode uses SELF_IPI register to trigger interrupts to give it a bit of exercise too. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> --- .../testing/selftests/kvm/include/x86/apic.h | 5 + .../selftests/kvm/x86/xapic_state_test.c | 265 +++++++++++++++++- 2 files changed, 267 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/kvm/include/x86/apic.h b/tools/testing/selftests/kvm/include/x86/apic.h index 80fe9f69b38d..67285e533e14 100644 --- a/tools/testing/selftests/kvm/include/x86/apic.h +++ b/tools/testing/selftests/kvm/include/x86/apic.h @@ -27,7 +27,11 @@ #define APIC_LVR 0x30 #define GET_APIC_ID_FIELD(x) (((x) >> 24) & 0xFF) #define APIC_TASKPRI 0x80 +#define APIC_TASKPRI_TP_SHIFT 4 +#define APIC_TASKPRI_TP_MASK GENMASK(7, 4) #define APIC_PROCPRI 0xA0 +#define APIC_PROCPRI_PP_SHIFT 4 +#define APIC_PROCPRI_PP_MASK GENMASK(7, 4) #define APIC_EOI 0xB0 #define APIC_SPIV 0xF0 #define APIC_SPIV_FOCUS_DISABLED (1 << 9) @@ -67,6 +71,7 @@ #define APIC_TMICT 0x380 #define APIC_TMCCT 0x390 #define APIC_TDCR 0x3E0 +#define APIC_SELF_IPI 0x3F0 void apic_disable(void); void xapic_enable(void); diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c index fdebff1165c7..968e5e539a1a 100644 --- a/tools/testing/selftests/kvm/x86/xapic_state_test.c +++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c @@ -1,9 +1,11 @@ // SPDX-License-Identifier: GPL-2.0-only #include <fcntl.h> +#include <stdatomic.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/ioctl.h> +#include <unistd.h> #include "apic.h" #include "kvm_util.h" @@ -16,6 +18,245 @@ struct xapic_vcpu { bool has_xavic_errata; }; +#define IRQ_VECTOR 0x20 + +/* See also the comment at similar assertion in memslot_perf_test.c */ +static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic int is not lockless"); + +static atomic_uint tpr_guest_irq_sync_val; + +static void tpr_guest_irq_sync_flag_reset(void) +{ + atomic_store_explicit(&tpr_guest_irq_sync_val, 0, + memory_order_release); +} + +static unsigned int tpr_guest_irq_sync_val_get(void) +{ + return atomic_load_explicit(&tpr_guest_irq_sync_val, + memory_order_acquire); +} + +static void tpr_guest_irq_sync_val_inc(void) +{ + atomic_fetch_add_explicit(&tpr_guest_irq_sync_val, 1, + memory_order_acq_rel); +} + +static void tpr_guest_irq_handler_xapic(struct ex_regs *regs) +{ + tpr_guest_irq_sync_val_inc(); + + xapic_write_reg(APIC_EOI, 0); +} + +static void tpr_guest_irq_handler_x2apic(struct ex_regs *regs) +{ + tpr_guest_irq_sync_val_inc(); + + x2apic_write_reg(APIC_EOI, 0); +} + +static void tpr_guest_irq_queue(bool x2apic) +{ + if (x2apic) { + x2apic_write_reg(APIC_SELF_IPI, IRQ_VECTOR); + } else { + uint32_t icr, icr2; + + icr = APIC_DEST_SELF | APIC_DEST_PHYSICAL | APIC_DM_FIXED | + IRQ_VECTOR; + icr2 = 0; + + xapic_write_reg(APIC_ICR2, icr2); + xapic_write_reg(APIC_ICR, icr); + } +} + +static uint8_t tpr_guest_tpr_get(bool x2apic) +{ + uint32_t taskpri; + + if (x2apic) + taskpri = x2apic_read_reg(APIC_TASKPRI); + else + taskpri = xapic_read_reg(APIC_TASKPRI); + + return (taskpri & APIC_TASKPRI_TP_MASK) >> APIC_TASKPRI_TP_SHIFT; +} + +static uint8_t tpr_guest_ppr_get(bool x2apic) +{ + uint32_t procpri; + + if (x2apic) + procpri = x2apic_read_reg(APIC_PROCPRI); + else + procpri = xapic_read_reg(APIC_PROCPRI); + + return (procpri & APIC_PROCPRI_PP_MASK) >> APIC_PROCPRI_PP_SHIFT; +} + +static uint8_t tpr_guest_cr8_get(void) +{ + uint64_t cr8; + + asm volatile ("mov %%cr8, %[cr8]\n\t" : [cr8] "=r"(cr8)); + + return cr8 & GENMASK(3, 0); +} + +static void tpr_guest_check_tpr_ppr_cr8_equal(bool x2apic) +{ + uint8_t tpr; + + tpr = tpr_guest_tpr_get(x2apic); + + GUEST_ASSERT_EQ(tpr_guest_ppr_get(x2apic), tpr); + GUEST_ASSERT_EQ(tpr_guest_cr8_get(), tpr); +} + +static void tpr_guest_code(uint64_t x2apic) +{ + cli(); + + if (x2apic) + x2apic_enable(); + else + xapic_enable(); + + tpr_guest_check_tpr_ppr_cr8_equal(x2apic); + + tpr_guest_irq_queue(x2apic); + + /* TPR = 0 but IRQ masked by IF=0, should not fire */ + udelay(1000); + GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 0); + + sti(); + + /* IF=1 now, IRQ should fire */ + while (tpr_guest_irq_sync_val_get() == 0) + cpu_relax(); + GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 1); + + GUEST_SYNC(0); + tpr_guest_check_tpr_ppr_cr8_equal(x2apic); + + tpr_guest_irq_queue(x2apic); + + /* IRQ masked by barely high enough TPR now, should not fire */ + udelay(1000); + GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 1); + + GUEST_SYNC(1); + tpr_guest_check_tpr_ppr_cr8_equal(x2apic); + + /* TPR barely low enough now to unmask IRQ, should fire */ + while (tpr_guest_irq_sync_val_get() == 1) + cpu_relax(); + GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 2); + + GUEST_DONE(); +} + +static uint8_t lapic_tpr_get(struct kvm_lapic_state *xapic) +{ + return (*((u32 *)&xapic->regs[APIC_TASKPRI]) & APIC_TASKPRI_TP_MASK) >> + APIC_TASKPRI_TP_SHIFT; +} + +static void lapic_tpr_set(struct kvm_lapic_state *xapic, uint8_t val) +{ + *((u32 *)&xapic->regs[APIC_TASKPRI]) &= ~APIC_TASKPRI_TP_MASK; + *((u32 *)&xapic->regs[APIC_TASKPRI]) |= val << APIC_TASKPRI_TP_SHIFT; +} + +static uint8_t sregs_tpr(struct kvm_sregs *sregs) +{ + return sregs->cr8 & GENMASK(3, 0); +} + +static void test_tpr_check_tpr_zero(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic_state xapic; + + vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic); + + TEST_ASSERT_EQ(lapic_tpr_get(&xapic), 0); +} + +static void test_tpr_check_tpr_cr8_equal(struct kvm_vcpu *vcpu) +{ + struct kvm_sregs sregs; + struct kvm_lapic_state xapic; + + vcpu_sregs_get(vcpu, &sregs); + vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic); + + TEST_ASSERT_EQ(sregs_tpr(&sregs), lapic_tpr_get(&xapic)); +} + +static void test_tpr_mask_irq(struct kvm_vcpu *vcpu, bool mask) +{ + struct kvm_lapic_state xapic; + uint8_t tpr; + + static_assert(IRQ_VECTOR >= 16, "invalid IRQ vector number"); + tpr = IRQ_VECTOR / 16; + if (!mask) + tpr--; + + vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic); + lapic_tpr_set(&xapic, tpr); + vcpu_ioctl(vcpu, KVM_SET_LAPIC, &xapic); +} + +static void test_tpr(struct kvm_vcpu *vcpu, bool x2apic) +{ + bool run_guest = true; + + vcpu_args_set(vcpu, 1, (uint64_t)x2apic); + + /* According to the SDM/APM the TPR value at reset is 0 */ + test_tpr_check_tpr_zero(vcpu); + test_tpr_check_tpr_cr8_equal(vcpu); + + tpr_guest_irq_sync_flag_reset(); + + while (run_guest) { + struct ucall uc; + + alarm(2); + vcpu_run(vcpu); + alarm(0); + + switch (get_ucall(vcpu, &uc)) { + case UCALL_ABORT: + REPORT_GUEST_ASSERT(uc); + break; + case UCALL_DONE: + test_tpr_check_tpr_cr8_equal(vcpu); + + run_guest = false; + break; + case UCALL_SYNC: + test_tpr_check_tpr_cr8_equal(vcpu); + + if (uc.args[1] == 0) + test_tpr_mask_irq(vcpu, true); + else if (uc.args[1] == 1) + test_tpr_mask_irq(vcpu, false); + else + TEST_FAIL("Unknown SYNC %lu", uc.args[1]); + break; + default: + TEST_FAIL("Unknown ucall result 0x%lx", uc.cmd); + break; + } + } +} + static void xapic_guest_code(void) { cli(); @@ -195,6 +436,12 @@ static void test_apic_id(void) kvm_vm_free(vm); } +static void clear_x2apic_cap_map_apic(struct kvm_vm *vm, struct kvm_vcpu *vcpu) +{ + vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_X2APIC); + virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA); +} + static void test_x2apic_id(void) { struct kvm_lapic_state lapic = {}; @@ -230,10 +477,17 @@ int main(int argc, char *argv[]) }; struct kvm_vm *vm; + /* x2APIC tests */ + vm = vm_create_with_one_vcpu(&x.vcpu, x2apic_guest_code); test_icr(&x); kvm_vm_free(vm); + vm = vm_create_with_one_vcpu(&x.vcpu, tpr_guest_code); + vm_install_exception_handler(vm, IRQ_VECTOR, tpr_guest_irq_handler_x2apic); + test_tpr(x.vcpu, true); + kvm_vm_free(vm); + /* * Use a second VM for the xAPIC test so that x2APIC can be hidden from * the guest in order to test AVIC. KVM disallows changing CPUID after @@ -251,12 +505,17 @@ int main(int argc, char *argv[]) x.has_xavic_errata = host_cpu_is_amd && get_kvm_amd_param_bool("avic"); - vcpu_clear_cpuid_feature(x.vcpu, X86_FEATURE_X2APIC); - - virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA); + clear_x2apic_cap_map_apic(vm, x.vcpu); test_icr(&x); kvm_vm_free(vm); + /* Also do a TPR non-x2APIC test */ + vm = vm_create_with_one_vcpu(&x.vcpu, tpr_guest_code); + clear_x2apic_cap_map_apic(vm, x.vcpu); + vm_install_exception_handler(vm, IRQ_VECTOR, tpr_guest_irq_handler_xapic); + test_tpr(x.vcpu, false); + kvm_vm_free(vm); + test_apic_id(); test_x2apic_id(); } ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on 2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero 2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero 2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero @ 2025-08-21 8:18 ` Naveen N Rao 2025-08-21 11:42 ` Maciej S. Szmigiero 2 siblings, 1 reply; 12+ messages in thread From: Naveen N Rao @ 2025-08-21 8:18 UTC (permalink / raw) To: Maciej S. Szmigiero Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> > > When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in > sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would > *not* get copied into the V_TPR field of VMCB. > > AVIC does sync between these two fields, however it does so only on > explicit guest writes to one of these fields, not on a bare VMRUN. > > This is especially true when it is the userspace setting LAPIC state via > KVM_SET_LAPIC ioctl() since userspace does not have access to the guest > VMCB. Dumb question: why is the VMM updating TPR? Is this related to live migration or such? I think I do see the problem described here, but when AVIC is temporarily inhibited. So, trying to understand if there are other flows involving the VMM where TPR could be updated outside of the guest. > > Practice shows that it is the V_TPR that is actually used by the AVIC to > decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), > so any leftover value in V_TPR will cause serious interrupt delivery issues > in the guest when AVIC is enabled. > > Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in > avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and > similar code paths when AVIC is enabled. > > Add also a relevant set of tests to xapic_state_test so hopefully > we'll be protected against getting such regressions in the future. Do the new tests reproduce this issue? > > > Yes, this breaks real guests when AVIC is enabled. > Specifically, the one OS that sometimes needs different handling and its > name begins with letter 'W'. Indeed, Linux does not use TPR AFAIK. - Naveen ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on 2025-08-21 8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao @ 2025-08-21 11:42 ` Maciej S. Szmigiero 2025-08-21 14:59 ` Alejandro Jimenez 0 siblings, 1 reply; 12+ messages in thread From: Maciej S. Szmigiero @ 2025-08-21 11:42 UTC (permalink / raw) To: Naveen N Rao Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel On 21.08.2025 10:18, Naveen N Rao wrote: > On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: >> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >> >> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in >> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would >> *not* get copied into the V_TPR field of VMCB. >> >> AVIC does sync between these two fields, however it does so only on >> explicit guest writes to one of these fields, not on a bare VMRUN. >> >> This is especially true when it is the userspace setting LAPIC state via >> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest >> VMCB. > > Dumb question: why is the VMM updating TPR? Is this related to live > migration or such? In this case, VMM is resetting LAPIC state on machine reset. > I think I do see the problem described here, but when AVIC is > temporarily inhibited. So, trying to understand if there are other flows > involving the VMM where TPR could be updated outside of the guest. > >> >> Practice shows that it is the V_TPR that is actually used by the AVIC to >> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI), >> so any leftover value in V_TPR will cause serious interrupt delivery issues >> in the guest when AVIC is enabled. >> >> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in >> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and >> similar code paths when AVIC is enabled. >> >> Add also a relevant set of tests to xapic_state_test so hopefully >> we'll be protected against getting such regressions in the future. > > Do the new tests reproduce this issue? Yes, and also check quite a bit more of TPR-related functionality. >> >> >> Yes, this breaks real guests when AVIC is enabled. >> Specifically, the one OS that sometimes needs different handling and its >> name begins with letter 'W'. > > Indeed, Linux does not use TPR AFAIK. > > > - Naveen > Thanks, Maciej ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on 2025-08-21 11:42 ` Maciej S. Szmigiero @ 2025-08-21 14:59 ` Alejandro Jimenez 2025-08-21 21:11 ` Sean Christopherson 0 siblings, 1 reply; 12+ messages in thread From: Alejandro Jimenez @ 2025-08-21 14:59 UTC (permalink / raw) To: Maciej S. Szmigiero, Naveen N Rao Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky, Suravee Suthikulpanit, kvm, linux-kernel On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote: > On 21.08.2025 10:18, Naveen N Rao wrote: >> On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote: >>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com> >>> >>> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR >>> sync in >>> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC >>> state would >>> *not* get copied into the V_TPR field of VMCB. >>> >>> AVIC does sync between these two fields, however it does so only on >>> explicit guest writes to one of these fields, not on a bare VMRUN. >>> >>> This is especially true when it is the userspace setting LAPIC state via >>> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest >>> VMCB. >> >> Dumb question: why is the VMM updating TPR? Is this related to live >> migration or such? > > In this case, VMM is resetting LAPIC state on machine reset. > >> I think I do see the problem described here, but when AVIC is >> temporarily inhibited. So, trying to understand if there are other flows >> involving the VMM where TPR could be updated outside of the guest. >> >>> >>> Practice shows that it is the V_TPR that is actually used by the AVIC to >>> decide whether to issue pending interrupts to the CPU (not TPR in >>> TASKPRI), >>> so any leftover value in V_TPR will cause serious interrupt delivery >>> issues >>> in the guest when AVIC is enabled. >>> >>> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in >>> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC >>> and >>> similar code paths when AVIC is enabled. >>> >>> Add also a relevant set of tests to xapic_state_test so hopefully >>> we'll be protected against getting such regressions in the future. >> >> Do the new tests reproduce this issue? > > Yes, and also check quite a bit more of TPR-related functionality. > >>> >>> >>> Yes, this breaks real guests when AVIC is enabled. >>> Specifically, the one OS that sometimes needs different handling and its >>> name begins with letter 'W'. >> >> Indeed, Linux does not use TPR AFAIK. >> >> I believe it does, during the local APIC initialization. When Maciej determined the root cause of this issue, I was wondering why we have not seen it earlier in Linux. I found that Linux takes a defensive approach and drains all pending interrupts during lapic initialization. Essentially, for each CPU, Linux will: - temporarily disable the Local APIC (via Spurious Int Vector Reg) - set the TPR to accept all "regular" interrupts i.e. tpr=0x10 - drain all pending interrupts in ISR and/or IRR - attempt the above draining step a max of 512 times - then re-enable APIC and continue initialization The relevant code is in setup_local_APIC() https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545 So without Maciej's proposed change, other OSs that are not as resilient could also be affected by this issue. Alejandro >> - Naveen >> > > Thanks, > Maciej > > ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on 2025-08-21 14:59 ` Alejandro Jimenez @ 2025-08-21 21:11 ` Sean Christopherson 0 siblings, 0 replies; 12+ messages in thread From: Sean Christopherson @ 2025-08-21 21:11 UTC (permalink / raw) To: Alejandro Jimenez Cc: Maciej S. Szmigiero, Naveen N Rao, Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit, kvm, linux-kernel On Thu, Aug 21, 2025, Alejandro Jimenez wrote: > On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote: > > On 21.08.2025 10:18, Naveen N Rao wrote: > > > > Yes, this breaks real guests when AVIC is enabled. > > > > Specifically, the one OS that sometimes needs different handling and its > > > > name begins with letter 'W'. > > > > > > Indeed, Linux does not use TPR AFAIK. > > I believe it does, Heh, yes, Linux technically "uses" the TPR in that it does a one-time write to it. But what Naveen really meant is that Linux doesn't actively use TPR to manage what IRQs are masked/allowed, whereas Windows heavily uses TPR to do exactly that. Specifically, what matters is that Linux doesn't use TPR to _mask_ IRQs, and so clobbering it to '0' on migration is largely benign. > during the local APIC initialization. When Maciej > determined the root cause of this issue, I was wondering why we have not > seen it earlier in Linux. I found that Linux takes a defensive approach and > drains all pending interrupts during lapic initialization. Essentially, for > each CPU, Linux will: > - temporarily disable the Local APIC (via Spurious Int Vector Reg) > - set the TPR to accept all "regular" interrupts i.e. tpr=0x10 > - drain all pending interrupts in ISR and/or IRR > - attempt the above draining step a max of 512 times > - then re-enable APIC and continue initialization > > The relevant code is in setup_local_APIC() > https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545 > > So without Maciej's proposed change, other OSs that are not as resilient > could also be affected by this issue. > > Alejandro > > > > - Naveen > > > > > > > Thanks, > > Maciej > > > > > ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-08-22 23:41 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero 2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero 2025-08-21 20:38 ` Sean Christopherson 2025-08-22 9:04 ` Naveen N Rao 2025-08-22 20:54 ` Sean Christopherson 2025-08-22 23:20 ` Maciej S. Szmigiero 2025-08-22 23:41 ` Sean Christopherson 2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero 2025-08-21 8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao 2025-08-21 11:42 ` Maciej S. Szmigiero 2025-08-21 14:59 ` Alejandro Jimenez 2025-08-21 21:11 ` Sean Christopherson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).