linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on
@ 2025-08-19 13:32 Maciej S. Szmigiero
  2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm,
	linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in
sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would
*not* get copied into the V_TPR field of VMCB.

AVIC does sync between these two fields, however it does so only on
explicit guest writes to one of these fields, not on a bare VMRUN.

This is especially true when it is the userspace setting LAPIC state via
KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
VMCB.

Practice shows that it is the V_TPR that is actually used by the AVIC to
decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
so any leftover value in V_TPR will cause serious interrupt delivery issues
in the guest when AVIC is enabled.

Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
similar code paths when AVIC is enabled.

Add also a relevant set of tests to xapic_state_test so hopefully
we'll be protected against getting such regressions in the future.


Yes, this breaks real guests when AVIC is enabled.
Specifically, the one OS that sometimes needs different handling and its
name begins with letter 'W'.


  KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  KVM: selftests: Test TPR / CR8 sync and interrupt masking

 arch/x86/kvm/svm/avic.c                       |  23 ++
 .../testing/selftests/kvm/include/x86/apic.h  |   5 +
 .../selftests/kvm/x86/xapic_state_test.c      | 265 +++++++++++++++++-
 3 files changed, 290 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero
@ 2025-08-19 13:32 ` Maciej S. Szmigiero
  2025-08-21 20:38   ` Sean Christopherson
  2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero
  2025-08-21  8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao
  2 siblings, 1 reply; 12+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm,
	linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is
inhibited so any changed TPR in the LAPIC state would not get copied into
the V_TPR field of VMCB.

AVIC does sync between these two fields, however it does so only on
explicit guest writes to one of these fields, not on a bare VMRUN.

This is especially true when it is the userspace setting LAPIC state via
KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
VMCB.

Practice shows that it is the V_TPR that is actually used by the AVIC to
decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
so any leftover value in V_TPR will cause serious interrupt delivery issues
in the guest when AVIC is enabled.

Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
similar code paths when AVIC is enabled.

Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a34c5c3b164e..877bc3db2c6e 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
+	u64 cr8;
+
 	avic_handle_dfr_update(vcpu);
 	avic_handle_ldr_update(vcpu);
+
+	/* Running nested should have inhibited AVIC. */
+	if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu)))
+		return;
+
+	/*
+	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
+	 *
+	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
+	 * is inhibited so any set TPR LAPIC state would not get reflected
+	 * in V_TPR.
+	 *
+	 * Practice shows that it is the V_TPR that is actually used by the
+	 * AVIC to decide whether to issue pending interrupts to the CPU, not
+	 * TPR in TASKPRI.
+	 */
+	cr8 = kvm_get_cr8(vcpu);
+	svm->vmcb->control.int_ctl &= ~V_TPR_MASK;
+	svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK;
+	WARN_ON_ONCE(!vmcb_is_dirty(svm->vmcb, VMCB_INTR));
 }
 
 static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking
  2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero
  2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero
@ 2025-08-19 13:32 ` Maciej S. Szmigiero
  2025-08-21  8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao
  2 siblings, 0 replies; 12+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-19 13:32 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Maxim Levitsky, Suravee Suthikulpanit, Alejandro Jimenez, kvm,
	linux-kernel

From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>

Add a few extra TPR / CR8 tests to x86's xapic_state_test to see if:
* TPR is 0 on reset,
* TPR, PPR and CR8 are equal inside the guest,
* TPR and CR8 read equal by the host after a VMExit
* TPR borderline values set by the host correctly mask interrupts in the
guest.

These hopefully will catch the most obvious cases of improper TPR sync or
interrupt masking.

Do these tests both in x2APIC and xAPIC modes.
The x2APIC mode uses SELF_IPI register to trigger interrupts to give it a
bit of exercise too.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
---
 .../testing/selftests/kvm/include/x86/apic.h  |   5 +
 .../selftests/kvm/x86/xapic_state_test.c      | 265 +++++++++++++++++-
 2 files changed, 267 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86/apic.h b/tools/testing/selftests/kvm/include/x86/apic.h
index 80fe9f69b38d..67285e533e14 100644
--- a/tools/testing/selftests/kvm/include/x86/apic.h
+++ b/tools/testing/selftests/kvm/include/x86/apic.h
@@ -27,7 +27,11 @@
 #define	APIC_LVR	0x30
 #define		GET_APIC_ID_FIELD(x)	(((x) >> 24) & 0xFF)
 #define	APIC_TASKPRI	0x80
+#define		APIC_TASKPRI_TP_SHIFT	4
+#define		APIC_TASKPRI_TP_MASK	GENMASK(7, 4)
 #define	APIC_PROCPRI	0xA0
+#define		APIC_PROCPRI_PP_SHIFT	4
+#define		APIC_PROCPRI_PP_MASK	GENMASK(7, 4)
 #define	APIC_EOI	0xB0
 #define	APIC_SPIV	0xF0
 #define		APIC_SPIV_FOCUS_DISABLED	(1 << 9)
@@ -67,6 +71,7 @@
 #define	APIC_TMICT	0x380
 #define	APIC_TMCCT	0x390
 #define	APIC_TDCR	0x3E0
+#define	APIC_SELF_IPI	0x3F0
 
 void apic_disable(void);
 void xapic_enable(void);
diff --git a/tools/testing/selftests/kvm/x86/xapic_state_test.c b/tools/testing/selftests/kvm/x86/xapic_state_test.c
index fdebff1165c7..968e5e539a1a 100644
--- a/tools/testing/selftests/kvm/x86/xapic_state_test.c
+++ b/tools/testing/selftests/kvm/x86/xapic_state_test.c
@@ -1,9 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include <fcntl.h>
+#include <stdatomic.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
+#include <unistd.h>
 
 #include "apic.h"
 #include "kvm_util.h"
@@ -16,6 +18,245 @@ struct xapic_vcpu {
 	bool has_xavic_errata;
 };
 
+#define IRQ_VECTOR 0x20
+
+/* See also the comment at similar assertion in memslot_perf_test.c */
+static_assert(ATOMIC_INT_LOCK_FREE == 2, "atomic int is not lockless");
+
+static atomic_uint tpr_guest_irq_sync_val;
+
+static void tpr_guest_irq_sync_flag_reset(void)
+{
+	atomic_store_explicit(&tpr_guest_irq_sync_val, 0,
+			      memory_order_release);
+}
+
+static unsigned int tpr_guest_irq_sync_val_get(void)
+{
+	return atomic_load_explicit(&tpr_guest_irq_sync_val,
+				    memory_order_acquire);
+}
+
+static void tpr_guest_irq_sync_val_inc(void)
+{
+	atomic_fetch_add_explicit(&tpr_guest_irq_sync_val, 1,
+				  memory_order_acq_rel);
+}
+
+static void tpr_guest_irq_handler_xapic(struct ex_regs *regs)
+{
+	tpr_guest_irq_sync_val_inc();
+
+	xapic_write_reg(APIC_EOI, 0);
+}
+
+static void tpr_guest_irq_handler_x2apic(struct ex_regs *regs)
+{
+	tpr_guest_irq_sync_val_inc();
+
+	x2apic_write_reg(APIC_EOI, 0);
+}
+
+static void tpr_guest_irq_queue(bool x2apic)
+{
+	if (x2apic) {
+		x2apic_write_reg(APIC_SELF_IPI, IRQ_VECTOR);
+	} else {
+		uint32_t icr, icr2;
+
+		icr = APIC_DEST_SELF | APIC_DEST_PHYSICAL | APIC_DM_FIXED |
+			IRQ_VECTOR;
+		icr2 = 0;
+
+		xapic_write_reg(APIC_ICR2, icr2);
+		xapic_write_reg(APIC_ICR, icr);
+	}
+}
+
+static uint8_t tpr_guest_tpr_get(bool x2apic)
+{
+	uint32_t taskpri;
+
+	if (x2apic)
+		taskpri = x2apic_read_reg(APIC_TASKPRI);
+	else
+		taskpri = xapic_read_reg(APIC_TASKPRI);
+
+	return (taskpri & APIC_TASKPRI_TP_MASK) >> APIC_TASKPRI_TP_SHIFT;
+}
+
+static uint8_t tpr_guest_ppr_get(bool x2apic)
+{
+	uint32_t procpri;
+
+	if (x2apic)
+		procpri = x2apic_read_reg(APIC_PROCPRI);
+	else
+		procpri = xapic_read_reg(APIC_PROCPRI);
+
+	return (procpri & APIC_PROCPRI_PP_MASK) >> APIC_PROCPRI_PP_SHIFT;
+}
+
+static uint8_t tpr_guest_cr8_get(void)
+{
+	uint64_t cr8;
+
+	asm volatile ("mov %%cr8, %[cr8]\n\t" : [cr8] "=r"(cr8));
+
+	return cr8 & GENMASK(3, 0);
+}
+
+static void tpr_guest_check_tpr_ppr_cr8_equal(bool x2apic)
+{
+	uint8_t tpr;
+
+	tpr = tpr_guest_tpr_get(x2apic);
+
+	GUEST_ASSERT_EQ(tpr_guest_ppr_get(x2apic), tpr);
+	GUEST_ASSERT_EQ(tpr_guest_cr8_get(), tpr);
+}
+
+static void tpr_guest_code(uint64_t x2apic)
+{
+	cli();
+
+	if (x2apic)
+		x2apic_enable();
+	else
+		xapic_enable();
+
+	tpr_guest_check_tpr_ppr_cr8_equal(x2apic);
+
+	tpr_guest_irq_queue(x2apic);
+
+	/* TPR = 0 but IRQ masked by IF=0, should not fire */
+	udelay(1000);
+	GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 0);
+
+	sti();
+
+	/* IF=1 now, IRQ should fire */
+	while (tpr_guest_irq_sync_val_get() == 0)
+		cpu_relax();
+	GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 1);
+
+	GUEST_SYNC(0);
+	tpr_guest_check_tpr_ppr_cr8_equal(x2apic);
+
+	tpr_guest_irq_queue(x2apic);
+
+	/* IRQ masked by barely high enough TPR now, should not fire */
+	udelay(1000);
+	GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 1);
+
+	GUEST_SYNC(1);
+	tpr_guest_check_tpr_ppr_cr8_equal(x2apic);
+
+	/* TPR barely low enough now to unmask IRQ, should fire */
+	while (tpr_guest_irq_sync_val_get() == 1)
+		cpu_relax();
+	GUEST_ASSERT_EQ(tpr_guest_irq_sync_val_get(), 2);
+
+	GUEST_DONE();
+}
+
+static uint8_t lapic_tpr_get(struct kvm_lapic_state *xapic)
+{
+	return (*((u32 *)&xapic->regs[APIC_TASKPRI]) & APIC_TASKPRI_TP_MASK) >>
+		APIC_TASKPRI_TP_SHIFT;
+}
+
+static void lapic_tpr_set(struct kvm_lapic_state *xapic, uint8_t val)
+{
+	*((u32 *)&xapic->regs[APIC_TASKPRI]) &= ~APIC_TASKPRI_TP_MASK;
+	*((u32 *)&xapic->regs[APIC_TASKPRI]) |= val << APIC_TASKPRI_TP_SHIFT;
+}
+
+static uint8_t sregs_tpr(struct kvm_sregs *sregs)
+{
+	return sregs->cr8 & GENMASK(3, 0);
+}
+
+static void test_tpr_check_tpr_zero(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic_state xapic;
+
+	vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic);
+
+	TEST_ASSERT_EQ(lapic_tpr_get(&xapic), 0);
+}
+
+static void test_tpr_check_tpr_cr8_equal(struct kvm_vcpu *vcpu)
+{
+	struct kvm_sregs sregs;
+	struct kvm_lapic_state xapic;
+
+	vcpu_sregs_get(vcpu, &sregs);
+	vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic);
+
+	TEST_ASSERT_EQ(sregs_tpr(&sregs), lapic_tpr_get(&xapic));
+}
+
+static void test_tpr_mask_irq(struct kvm_vcpu *vcpu, bool mask)
+{
+	struct kvm_lapic_state xapic;
+	uint8_t tpr;
+
+	static_assert(IRQ_VECTOR >= 16, "invalid IRQ vector number");
+	tpr = IRQ_VECTOR / 16;
+	if (!mask)
+		tpr--;
+
+	vcpu_ioctl(vcpu, KVM_GET_LAPIC, &xapic);
+	lapic_tpr_set(&xapic, tpr);
+	vcpu_ioctl(vcpu, KVM_SET_LAPIC, &xapic);
+}
+
+static void test_tpr(struct kvm_vcpu *vcpu, bool x2apic)
+{
+	bool run_guest = true;
+
+	vcpu_args_set(vcpu, 1, (uint64_t)x2apic);
+
+	/* According to the SDM/APM the TPR value at reset is 0 */
+	test_tpr_check_tpr_zero(vcpu);
+	test_tpr_check_tpr_cr8_equal(vcpu);
+
+	tpr_guest_irq_sync_flag_reset();
+
+	while (run_guest) {
+		struct ucall uc;
+
+		alarm(2);
+		vcpu_run(vcpu);
+		alarm(0);
+
+		switch (get_ucall(vcpu, &uc)) {
+		case UCALL_ABORT:
+			REPORT_GUEST_ASSERT(uc);
+			break;
+		case UCALL_DONE:
+			test_tpr_check_tpr_cr8_equal(vcpu);
+
+			run_guest = false;
+			break;
+		case UCALL_SYNC:
+			test_tpr_check_tpr_cr8_equal(vcpu);
+
+			if (uc.args[1] == 0)
+				test_tpr_mask_irq(vcpu, true);
+			else if (uc.args[1] == 1)
+				test_tpr_mask_irq(vcpu, false);
+			else
+				TEST_FAIL("Unknown SYNC %lu", uc.args[1]);
+			break;
+		default:
+			TEST_FAIL("Unknown ucall result 0x%lx", uc.cmd);
+			break;
+		}
+	}
+}
+
 static void xapic_guest_code(void)
 {
 	cli();
@@ -195,6 +436,12 @@ static void test_apic_id(void)
 	kvm_vm_free(vm);
 }
 
+static void clear_x2apic_cap_map_apic(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+	vcpu_clear_cpuid_feature(vcpu, X86_FEATURE_X2APIC);
+	virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+}
+
 static void test_x2apic_id(void)
 {
 	struct kvm_lapic_state lapic = {};
@@ -230,10 +477,17 @@ int main(int argc, char *argv[])
 	};
 	struct kvm_vm *vm;
 
+	/* x2APIC tests */
+
 	vm = vm_create_with_one_vcpu(&x.vcpu, x2apic_guest_code);
 	test_icr(&x);
 	kvm_vm_free(vm);
 
+	vm = vm_create_with_one_vcpu(&x.vcpu, tpr_guest_code);
+	vm_install_exception_handler(vm, IRQ_VECTOR, tpr_guest_irq_handler_x2apic);
+	test_tpr(x.vcpu, true);
+	kvm_vm_free(vm);
+
 	/*
 	 * Use a second VM for the xAPIC test so that x2APIC can be hidden from
 	 * the guest in order to test AVIC.  KVM disallows changing CPUID after
@@ -251,12 +505,17 @@ int main(int argc, char *argv[])
 	x.has_xavic_errata = host_cpu_is_amd &&
 			     get_kvm_amd_param_bool("avic");
 
-	vcpu_clear_cpuid_feature(x.vcpu, X86_FEATURE_X2APIC);
-
-	virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+	clear_x2apic_cap_map_apic(vm, x.vcpu);
 	test_icr(&x);
 	kvm_vm_free(vm);
 
+	/* Also do a TPR non-x2APIC test */
+	vm = vm_create_with_one_vcpu(&x.vcpu, tpr_guest_code);
+	clear_x2apic_cap_map_apic(vm, x.vcpu);
+	vm_install_exception_handler(vm, IRQ_VECTOR, tpr_guest_irq_handler_xapic);
+	test_tpr(x.vcpu, false);
+	kvm_vm_free(vm);
+
 	test_apic_id();
 	test_x2apic_id();
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on
  2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero
  2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero
  2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero
@ 2025-08-21  8:18 ` Naveen N Rao
  2025-08-21 11:42   ` Maciej S. Szmigiero
  2 siblings, 1 reply; 12+ messages in thread
From: Naveen N Rao @ 2025-08-21  8:18 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky,
	Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel

On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in
> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would
> *not* get copied into the V_TPR field of VMCB.
> 
> AVIC does sync between these two fields, however it does so only on
> explicit guest writes to one of these fields, not on a bare VMRUN.
> 
> This is especially true when it is the userspace setting LAPIC state via
> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
> VMCB.

Dumb question: why is the VMM updating TPR? Is this related to live 
migration or such?

I think I do see the problem described here, but when AVIC is 
temporarily inhibited. So, trying to understand if there are other flows 
involving the VMM where TPR could be updated outside of the guest.

> 
> Practice shows that it is the V_TPR that is actually used by the AVIC to
> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
> so any leftover value in V_TPR will cause serious interrupt delivery issues
> in the guest when AVIC is enabled.
> 
> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
> similar code paths when AVIC is enabled.
> 
> Add also a relevant set of tests to xapic_state_test so hopefully
> we'll be protected against getting such regressions in the future.

Do the new tests reproduce this issue?

> 
> 
> Yes, this breaks real guests when AVIC is enabled.
> Specifically, the one OS that sometimes needs different handling and its
> name begins with letter 'W'.

Indeed, Linux does not use TPR AFAIK.


- Naveen


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on
  2025-08-21  8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao
@ 2025-08-21 11:42   ` Maciej S. Szmigiero
  2025-08-21 14:59     ` Alejandro Jimenez
  0 siblings, 1 reply; 12+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-21 11:42 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky,
	Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel

On 21.08.2025 10:18, Naveen N Rao wrote:
> On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR sync in
>> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC state would
>> *not* get copied into the V_TPR field of VMCB.
>>
>> AVIC does sync between these two fields, however it does so only on
>> explicit guest writes to one of these fields, not on a bare VMRUN.
>>
>> This is especially true when it is the userspace setting LAPIC state via
>> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
>> VMCB.
> 
> Dumb question: why is the VMM updating TPR? Is this related to live
> migration or such?

In this case, VMM is resetting LAPIC state on machine reset.

> I think I do see the problem described here, but when AVIC is
> temporarily inhibited. So, trying to understand if there are other flows
> involving the VMM where TPR could be updated outside of the guest.
> 
>>
>> Practice shows that it is the V_TPR that is actually used by the AVIC to
>> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
>> so any leftover value in V_TPR will cause serious interrupt delivery issues
>> in the guest when AVIC is enabled.
>>
>> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
>> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
>> similar code paths when AVIC is enabled.
>>
>> Add also a relevant set of tests to xapic_state_test so hopefully
>> we'll be protected against getting such regressions in the future.
> 
> Do the new tests reproduce this issue?

Yes, and also check quite a bit more of TPR-related functionality.

>>
>>
>> Yes, this breaks real guests when AVIC is enabled.
>> Specifically, the one OS that sometimes needs different handling and its
>> name begins with letter 'W'.
> 
> Indeed, Linux does not use TPR AFAIK.
> 
> 
> - Naveen
> 

Thanks,
Maciej


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on
  2025-08-21 11:42   ` Maciej S. Szmigiero
@ 2025-08-21 14:59     ` Alejandro Jimenez
  2025-08-21 21:11       ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Alejandro Jimenez @ 2025-08-21 14:59 UTC (permalink / raw)
  To: Maciej S. Szmigiero, Naveen N Rao
  Cc: Paolo Bonzini, Sean Christopherson, Maxim Levitsky,
	Suravee Suthikulpanit, kvm, linux-kernel



On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote:
> On 21.08.2025 10:18, Naveen N Rao wrote:
>> On Tue, Aug 19, 2025 at 03:32:13PM +0200, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>>
>>> When AVIC is enabled the normal pre-VMRUN LAPIC TPR to VMCB::V_TPR 
>>> sync in
>>> sync_lapic_to_cr8() is inhibited so any changed TPR in the LAPIC 
>>> state would
>>> *not* get copied into the V_TPR field of VMCB.
>>>
>>> AVIC does sync between these two fields, however it does so only on
>>> explicit guest writes to one of these fields, not on a bare VMRUN.
>>>
>>> This is especially true when it is the userspace setting LAPIC state via
>>> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
>>> VMCB.
>>
>> Dumb question: why is the VMM updating TPR? Is this related to live
>> migration or such?
> 
> In this case, VMM is resetting LAPIC state on machine reset.
> 
>> I think I do see the problem described here, but when AVIC is
>> temporarily inhibited. So, trying to understand if there are other flows
>> involving the VMM where TPR could be updated outside of the guest.
>>
>>>
>>> Practice shows that it is the V_TPR that is actually used by the AVIC to
>>> decide whether to issue pending interrupts to the CPU (not TPR in 
>>> TASKPRI),
>>> so any leftover value in V_TPR will cause serious interrupt delivery 
>>> issues
>>> in the guest when AVIC is enabled.
>>>
>>> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
>>> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC 
>>> and
>>> similar code paths when AVIC is enabled.
>>>
>>> Add also a relevant set of tests to xapic_state_test so hopefully
>>> we'll be protected against getting such regressions in the future.
>>
>> Do the new tests reproduce this issue?
> 
> Yes, and also check quite a bit more of TPR-related functionality.
> 
>>>
>>>
>>> Yes, this breaks real guests when AVIC is enabled.
>>> Specifically, the one OS that sometimes needs different handling and its
>>> name begins with letter 'W'.
>>
>> Indeed, Linux does not use TPR AFAIK.
>>
>>

I believe it does, during the local APIC initialization. When Maciej 
determined the root cause of this issue, I was wondering why we have not 
seen it earlier in Linux. I found that Linux takes a defensive approach 
and drains all pending interrupts during lapic initialization. 
Essentially, for each CPU, Linux will:
- temporarily disable the Local APIC (via Spurious Int Vector Reg)
- set the TPR to accept all "regular" interrupts i.e. tpr=0x10
- drain all pending interrupts in ISR and/or IRR
- attempt the above draining step a max of 512 times
- then re-enable APIC and continue initialization

The relevant code is in setup_local_APIC()
https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545

So without Maciej's proposed change, other OSs that are not as resilient 
could also be affected by this issue.

Alejandro

>> - Naveen
>>
> 
> Thanks,
> Maciej
> 
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero
@ 2025-08-21 20:38   ` Sean Christopherson
  2025-08-22  9:04     ` Naveen N Rao
  2025-08-22 23:20     ` Maciej S. Szmigiero
  0 siblings, 2 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-21 20:38 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit,
	Alejandro Jimenez, kvm, linux-kernel

On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> 
> When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is
> inhibited so any changed TPR in the LAPIC state would not get copied into
> the V_TPR field of VMCB.
> 
> AVIC does sync between these two fields, however it does so only on
> explicit guest writes to one of these fields, not on a bare VMRUN.
> 
> This is especially true when it is the userspace setting LAPIC state via
> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
> VMCB.
> 
> Practice shows that it is the V_TPR that is actually used by the AVIC to
> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
> so any leftover value in V_TPR will cause serious interrupt delivery issues
> in the guest when AVIC is enabled.
> 
> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
> similar code paths when AVIC is enabled.
> 
> Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> ---
>  arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index a34c5c3b164e..877bc3db2c6e 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm)
>  
>  void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	u64 cr8;
> +
>  	avic_handle_dfr_update(vcpu);
>  	avic_handle_ldr_update(vcpu);
> +
> +	/* Running nested should have inhibited AVIC. */
> +	if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu)))
> +		return;


> +
> +	/*
> +	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
> +	 *
> +	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
> +	 * is inhibited so any set TPR LAPIC state would not get reflected
> +	 * in V_TPR.

Hmm, I think that code is straight up wrong.  There's no justification, just a
claim:

  commit 3bbf3565f48ce3999b5a12cde946f81bd4475312
  Author:     Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
  AuthorDate: Wed May 4 14:09:51 2016 -0500
  Commit:     Paolo Bonzini <pbonzini@redhat.com>
  CommitDate: Wed May 18 18:04:31 2016 +0200

    svm: Do not intercept CR8 when enable AVIC
    
    When enable AVIC:
        * Do not intercept CR8 since this should be handled by AVIC HW.
        * Also, we don't need to sync cr8/V_TPR and APIC backing page.   <======
    
    Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
    [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

That claim assumes APIC[TPR] will _never_ be modified by anything other than
hardware.  That's obviously false for state restore from userspace, and it's also
technically false at steady state, e.g. if KVM managed to trigger emulation of a
store to the APIC page, then KVM would bypass the automatic harware sync.

There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to
be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC
except for the initial ioctl), but could again set APIC[TPR] without updating
V_TPR.

So, rather than manually do the update during state restore, my vote is to restore
the sync logic.  And if we want to optimize that code (seems unnecessary), then
we should hook all TPR writes.

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d9931c6c4bc6..1bfebe40854f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
        struct vcpu_svm *svm = to_svm(vcpu);
        u64 cr8;
 
-       if (nested_svm_virtualize_tpr(vcpu) ||
-           kvm_vcpu_apicv_active(vcpu))
+       if (nested_svm_virtualize_tpr(vcpu))
                return;
 
        cr8 = kvm_get_cr8(vcpu);


> +	 *
> +	 * Practice shows that it is the V_TPR that is actually used by the
> +	 * AVIC to decide whether to issue pending interrupts to the CPU, not
> +	 * TPR in TASKPRI.

FWIW, the APM kinda sorta alludes to this:

  Enabling AVIC also affects CR8 behavior independent of V_INTR_MASKING enable
  (bit 24): writes to CR8 affect the V_TPR and update the backing page and reads
  from CR8 return V_TPR.


> +	 */
> +	cr8 = kvm_get_cr8(vcpu);
> +	svm->vmcb->control.int_ctl &= ~V_TPR_MASK;
> +	svm->vmcb->control.int_ctl |= cr8 & V_TPR_MASK;
> +	WARN_ON_ONCE(!vmcb_is_dirty(svm->vmcb, VMCB_INTR));


>  }
>  
>  static void svm_ir_list_del(struct kvm_kernel_irqfd *irqfd)

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on
  2025-08-21 14:59     ` Alejandro Jimenez
@ 2025-08-21 21:11       ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-21 21:11 UTC (permalink / raw)
  To: Alejandro Jimenez
  Cc: Maciej S. Szmigiero, Naveen N Rao, Paolo Bonzini, Maxim Levitsky,
	Suravee Suthikulpanit, kvm, linux-kernel

On Thu, Aug 21, 2025, Alejandro Jimenez wrote:
> On 8/21/25 7:42 AM, Maciej S. Szmigiero wrote:
> > On 21.08.2025 10:18, Naveen N Rao wrote:
> > > > Yes, this breaks real guests when AVIC is enabled.
> > > > Specifically, the one OS that sometimes needs different handling and its
> > > > name begins with letter 'W'.
> > > 
> > > Indeed, Linux does not use TPR AFAIK.
> 
> I believe it does, 

Heh, yes, Linux technically "uses" the TPR in that it does a one-time write to
it.  But what Naveen really meant is that Linux doesn't actively use TPR to
manage what IRQs are masked/allowed, whereas Windows heavily uses TPR to do
exactly that.  Specifically, what matters is that Linux doesn't use TPR to _mask_
IRQs, and so clobbering it to '0' on migration is largely benign.

> during the local APIC initialization. When Maciej
> determined the root cause of this issue, I was wondering why we have not
> seen it earlier in Linux. I found that Linux takes a defensive approach and
> drains all pending interrupts during lapic initialization. Essentially, for
> each CPU, Linux will:
> - temporarily disable the Local APIC (via Spurious Int Vector Reg)
> - set the TPR to accept all "regular" interrupts i.e. tpr=0x10
> - drain all pending interrupts in ISR and/or IRR
> - attempt the above draining step a max of 512 times
> - then re-enable APIC and continue initialization
> 
> The relevant code is in setup_local_APIC()
> https://elixir.bootlin.com/linux/v6.16/source/arch/x86/kernel/apic/apic.c#L1533-L1545
> 
> So without Maciej's proposed change, other OSs that are not as resilient
> could also be affected by this issue.
> 
> Alejandro
> 
> > > - Naveen
> > > 
> > 
> > Thanks,
> > Maciej
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-21 20:38   ` Sean Christopherson
@ 2025-08-22  9:04     ` Naveen N Rao
  2025-08-22 20:54       ` Sean Christopherson
  2025-08-22 23:20     ` Maciej S. Szmigiero
  1 sibling, 1 reply; 12+ messages in thread
From: Naveen N Rao @ 2025-08-22  9:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maciej S. Szmigiero, Paolo Bonzini, Maxim Levitsky,
	Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel

On Thu, Aug 21, 2025 at 01:38:17PM -0700, Sean Christopherson wrote:
> On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote:
> > From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
> > 
> > When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is
> > inhibited so any changed TPR in the LAPIC state would not get copied into
> > the V_TPR field of VMCB.
> > 
> > AVIC does sync between these two fields, however it does so only on
> > explicit guest writes to one of these fields, not on a bare VMRUN.
> > 
> > This is especially true when it is the userspace setting LAPIC state via
> > KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
> > VMCB.
> > 
> > Practice shows that it is the V_TPR that is actually used by the AVIC to
> > decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
> > so any leftover value in V_TPR will cause serious interrupt delivery issues
> > in the guest when AVIC is enabled.
> > 
> > Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
> > avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
> > similar code paths when AVIC is enabled.
> > 
> > Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
> > Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
> > ---
> >  arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> > index a34c5c3b164e..877bc3db2c6e 100644
> > --- a/arch/x86/kvm/svm/avic.c
> > +++ b/arch/x86/kvm/svm/avic.c
> > @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm)
> >  
> >  void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> >  {
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +	u64 cr8;
> > +
> >  	avic_handle_dfr_update(vcpu);
> >  	avic_handle_ldr_update(vcpu);
> > +
> > +	/* Running nested should have inhibited AVIC. */
> > +	if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu)))
> > +		return;
> 
> 
> > +
> > +	/*
> > +	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
> > +	 *
> > +	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
> > +	 * is inhibited so any set TPR LAPIC state would not get reflected
> > +	 * in V_TPR.
> 
> Hmm, I think that code is straight up wrong.  There's no justification, just a
> claim:
> 
>   commit 3bbf3565f48ce3999b5a12cde946f81bd4475312
>   Author:     Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
>   AuthorDate: Wed May 4 14:09:51 2016 -0500
>   Commit:     Paolo Bonzini <pbonzini@redhat.com>
>   CommitDate: Wed May 18 18:04:31 2016 +0200
> 
>     svm: Do not intercept CR8 when enable AVIC
>     
>     When enable AVIC:
>         * Do not intercept CR8 since this should be handled by AVIC HW.
>         * Also, we don't need to sync cr8/V_TPR and APIC backing page.   <======
>     
>     Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>     [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
>     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> That claim assumes APIC[TPR] will _never_ be modified by anything other than
> hardware. 

It also isn't clear to me why only sync_lapic_to_cr8() was gated when 
AVIC was enabled, while sync_cr8_to_lapic() continues to copy V_TRP to 
the backing page. If AVIC is enabled, then the AVIC hardware updates 
both the backing page and V_TPR on a guest write to TPR.

> That's obviously false for state restore from userspace, and it's also
> technically false at steady state, e.g. if KVM managed to trigger emulation of a
> store to the APIC page, then KVM would bypass the automatic harware sync.

Do you mean emulation due to AVIC being inhibited? I initially thought 
this could be a problem, but in this scenario, AVIC would be disabled on 
the next VMRUN, so we will end up sync'ing TPR from the lapic to V_TPR.

> 
> There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to
> be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC
> except for the initial ioctl), but could again set APIC[TPR] without updating
> V_TPR.
>
> So, rather than manually do the update during state restore, my vote 
> is to restore the sync logic.  And if we want to optimize that code 
> (seems unnecessary), then we should hook all TPR writes.

I guess you mean apic_set_tpr()? We will need to hook into that in 
addition to updating avic_apicv_post_state_restore() since KVM_SET_LAPIC 
just does a memcpy of the register state.

> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index d9931c6c4bc6..1bfebe40854f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
>         struct vcpu_svm *svm = to_svm(vcpu);
>         u64 cr8;
>  
> -       if (nested_svm_virtualize_tpr(vcpu) ||
> -           kvm_vcpu_apicv_active(vcpu))
> +       if (nested_svm_virtualize_tpr(vcpu))
>                 return;
>  
>         cr8 = kvm_get_cr8(vcpu);

I agree that this is a simpler fix, so would be good to do for backport 
ease.

The code in sync_lapic_to_cr8 ends up being a function call to 
kvm_get_cr8() and ~6 instructions, which isn't that much. But if we can 
gate sync'ing V_TPR to the backing page in sync_cr8_to_lapic() as well, 
then it might be good to do so.


- Naveen


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-22  9:04     ` Naveen N Rao
@ 2025-08-22 20:54       ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-22 20:54 UTC (permalink / raw)
  To: Naveen N Rao
  Cc: Maciej S. Szmigiero, Paolo Bonzini, Maxim Levitsky,
	Suravee Suthikulpanit, Alejandro Jimenez, kvm, linux-kernel

On Fri, Aug 22, 2025, Naveen N Rao wrote:
> On Thu, Aug 21, 2025 at 01:38:17PM -0700, Sean Christopherson wrote:
> > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote:
> > > +	/*
> > > +	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
> > > +	 *
> > > +	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
> > > +	 * is inhibited so any set TPR LAPIC state would not get reflected
> > > +	 * in V_TPR.
> > 
> > Hmm, I think that code is straight up wrong.  There's no justification, just a
> > claim:
> > 
> >   commit 3bbf3565f48ce3999b5a12cde946f81bd4475312
> >   Author:     Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
> >   AuthorDate: Wed May 4 14:09:51 2016 -0500
> >   Commit:     Paolo Bonzini <pbonzini@redhat.com>
> >   CommitDate: Wed May 18 18:04:31 2016 +0200
> > 
> >     svm: Do not intercept CR8 when enable AVIC
> >     
> >     When enable AVIC:
> >         * Do not intercept CR8 since this should be handled by AVIC HW.
> >         * Also, we don't need to sync cr8/V_TPR and APIC backing page.   <======
> >     
> >     Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> >     [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
> >     Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > That claim assumes APIC[TPR] will _never_ be modified by anything other than
> > hardware. 
> 
> It also isn't clear to me why only sync_lapic_to_cr8() was gated when 
> AVIC was enabled, while sync_cr8_to_lapic() continues to copy V_TRP to 
> the backing page. If AVIC is enabled, then the AVIC hardware updates 
> both the backing page and V_TPR on a guest write to TPR.
> 
> > That's obviously false for state restore from userspace, and it's also
> > technically false at steady state, e.g. if KVM managed to trigger emulation of a
> > store to the APIC page, then KVM would bypass the automatic harware sync.
> 
> Do you mean emulation due to AVIC being inhibited? I initially thought 
> this could be a problem, but in this scenario, AVIC would be disabled on 
> the next VMRUN, so we will end up sync'ing TPR from the lapic to V_TPR.

No, if emulation is triggered when AVIC isn't inhibited.  E.g. a contrived but
entirely possible situation would be if MOVBE isn't supported in hardware, KVM
is emulating MOVBE for emulation, and the guest sets the TPR via MOVBE.  The MOVBE
#UDs, KVM emulates in response to the #UD, and Bob's your uncle.

There are other scenarios where KVM would emulate, though they're even more
contrived.

> > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to
> > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC
> > except for the initial ioctl), but could again set APIC[TPR] without updating
> > V_TPR.
> >
> > So, rather than manually do the update during state restore, my vote 
> > is to restore the sync logic.  And if we want to optimize that code 
> > (seems unnecessary), then we should hook all TPR writes.
> 
> I guess you mean apic_set_tpr()? 

Yep.

> We will need to hook into that in addition to updating
> avic_apicv_post_state_restore() since KVM_SET_LAPIC just does a memcpy of the
> register state.

Yeah, or explicitly call the hook, e.g. like kvm_apic_set_state() does for
hwapic_isr_update().  But I don't think we should add a hook unless someone
proves that unconditionally synchronizing before VMRUN affects performance.

> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index d9931c6c4bc6..1bfebe40854f 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
> >         struct vcpu_svm *svm = to_svm(vcpu);
> >         u64 cr8;
> >  
> > -       if (nested_svm_virtualize_tpr(vcpu) ||
> > -           kvm_vcpu_apicv_active(vcpu))
> > +       if (nested_svm_virtualize_tpr(vcpu))
> >                 return;
> >  
> >         cr8 = kvm_get_cr8(vcpu);
> 
> I agree that this is a simpler fix, so would be good to do for backport 
> ease.
> 
> The code in sync_lapic_to_cr8 ends up being a function call to 
> kvm_get_cr8() and ~6 instructions, which isn't that much. But if we can 
> gate sync'ing V_TPR to the backing page in sync_cr8_to_lapic() as well, 
> then it might be good to do so.
> 
> 
> - Naveen
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-21 20:38   ` Sean Christopherson
  2025-08-22  9:04     ` Naveen N Rao
@ 2025-08-22 23:20     ` Maciej S. Szmigiero
  2025-08-22 23:41       ` Sean Christopherson
  1 sibling, 1 reply; 12+ messages in thread
From: Maciej S. Szmigiero @ 2025-08-22 23:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit,
	Alejandro Jimenez, kvm, linux-kernel, Naveen N Rao,
	Radim Krčmář

On 21.08.2025 22:38, Sean Christopherson wrote:
> On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote:
>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>>
>> When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8() is
>> inhibited so any changed TPR in the LAPIC state would not get copied into
>> the V_TPR field of VMCB.
>>
>> AVIC does sync between these two fields, however it does so only on
>> explicit guest writes to one of these fields, not on a bare VMRUN.
>>
>> This is especially true when it is the userspace setting LAPIC state via
>> KVM_SET_LAPIC ioctl() since userspace does not have access to the guest
>> VMCB.
>>
>> Practice shows that it is the V_TPR that is actually used by the AVIC to
>> decide whether to issue pending interrupts to the CPU (not TPR in TASKPRI),
>> so any leftover value in V_TPR will cause serious interrupt delivery issues
>> in the guest when AVIC is enabled.
>>
>> Fix this issue by explicitly copying LAPIC TPR to VMCB::V_TPR in
>> avic_apicv_post_state_restore(), which gets called from KVM_SET_LAPIC and
>> similar code paths when AVIC is enabled.
>>
>> Fixes: 3bbf3565f48c ("svm: Do not intercept CR8 when enable AVIC")
>> Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
>> ---
>>   arch/x86/kvm/svm/avic.c | 23 +++++++++++++++++++++++
>>   1 file changed, 23 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
>> index a34c5c3b164e..877bc3db2c6e 100644
>> --- a/arch/x86/kvm/svm/avic.c
>> +++ b/arch/x86/kvm/svm/avic.c
>> @@ -725,8 +725,31 @@ int avic_init_vcpu(struct vcpu_svm *svm)
>>   
>>   void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>>   {
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +	u64 cr8;
>> +
>>   	avic_handle_dfr_update(vcpu);
>>   	avic_handle_ldr_update(vcpu);
>> +
>> +	/* Running nested should have inhibited AVIC. */
>> +	if (WARN_ON_ONCE(nested_svm_virtualize_tpr(vcpu)))
>> +		return;
> 
> 
>> +
>> +	/*
>> +	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
>> +	 *
>> +	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
>> +	 * is inhibited so any set TPR LAPIC state would not get reflected
>> +	 * in V_TPR.
> 
> Hmm, I think that code is straight up wrong.  There's no justification, just a
> claim:
> 
>    commit 3bbf3565f48ce3999b5a12cde946f81bd4475312
>    Author:     Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
>    AuthorDate: Wed May 4 14:09:51 2016 -0500
>    Commit:     Paolo Bonzini <pbonzini@redhat.com>
>    CommitDate: Wed May 18 18:04:31 2016 +0200
> 
>      svm: Do not intercept CR8 when enable AVIC
>      
>      When enable AVIC:
>          * Do not intercept CR8 since this should be handled by AVIC HW.
>          * Also, we don't need to sync cr8/V_TPR and APIC backing page.   <======
>      
>      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
>      [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
>      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> That claim assumes APIC[TPR] will _never_ be modified by anything other than
> hardware.  That's obviously false for state restore from userspace, and it's also
> technically false at steady state, e.g. if KVM managed to trigger emulation of a
> store to the APIC page, then KVM would bypass the automatic harware sync.
> 
> There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to
> be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC
> except for the initial ioctl), but could again set APIC[TPR] without updating
> V_TPR.
> 
> So, rather than manually do the update during state restore, my vote is to restore
> the sync logic.  And if we want to optimize that code (seems unnecessary), then
> we should hook all TPR writes.
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index d9931c6c4bc6..1bfebe40854f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
>          struct vcpu_svm *svm = to_svm(vcpu);
>          u64 cr8;
>   
> -       if (nested_svm_virtualize_tpr(vcpu) ||
> -           kvm_vcpu_apicv_active(vcpu))
> +       if (nested_svm_virtualize_tpr(vcpu))
>                  return;
>   
>          cr8 = kvm_get_cr8(vcpu);
> 
> 

So you want to just do an unconditional LAPIC -> V_TPR sync at each VMRUN
and not try to patch every code flow where these possibly could get de-synced
to do such sync only on demand, correct?

By the way, the original Suravee's submission for the aforementioned patch
did *not* inhibit that sync when AVIC is on [1].

Something similar to this sync inhibit only showed in v4 [2],
probably upon Radim's comment on v3 [3] that:
> I think we can exit early with svm_vcpu_avic_enabled().

But the initial sync inhibit condition in v4 was essentially
nested_svm_virtualize_tpr() && svm_vcpu_avic_enabled(),
which suggests there was some confusion what was exactly meant
by the reviewer comment.

The final sync inhibit condition only showed in v5 [4].
No further discussion happened on that point.

Thanks,
Maciej

[1]: https://lore.kernel.org/kvm/1455285574-27892-9-git-send-email-suravee.suthikulpanit@amd.com/
[2]: https://lore.kernel.org/kvm/1460017232-17429-11-git-send-email-Suravee.Suthikulpanit@amd.com/
[3]: https://lore.kernel.org/kvm/20160318211048.GB26119@potion.brq.redhat.com/
[4]: https://lore.kernel.org/kvm/1462388992-25242-13-git-send-email-Suravee.Suthikulpanit@amd.com/


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs
  2025-08-22 23:20     ` Maciej S. Szmigiero
@ 2025-08-22 23:41       ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2025-08-22 23:41 UTC (permalink / raw)
  To: Maciej S. Szmigiero
  Cc: Paolo Bonzini, Maxim Levitsky, Suravee Suthikulpanit,
	Alejandro Jimenez, kvm, linux-kernel, Naveen N Rao,
	Radim Krčmář

On Sat, Aug 23, 2025, Maciej S. Szmigiero wrote:
> On 21.08.2025 22:38, Sean Christopherson wrote:
> > On Tue, Aug 19, 2025, Maciej S. Szmigiero wrote:
> > > +	/*
> > > +	 * Sync TPR from LAPIC TASKPRI into V_TPR field of the VMCB.
> > > +	 *
> > > +	 * When AVIC is enabled the normal pre-VMRUN sync in sync_lapic_to_cr8()
> > > +	 * is inhibited so any set TPR LAPIC state would not get reflected
> > > +	 * in V_TPR.
> > 
> > Hmm, I think that code is straight up wrong.  There's no justification, just a
> > claim:
> > 
> >    commit 3bbf3565f48ce3999b5a12cde946f81bd4475312
> >    Author:     Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
> >    AuthorDate: Wed May 4 14:09:51 2016 -0500
> >    Commit:     Paolo Bonzini <pbonzini@redhat.com>
> >    CommitDate: Wed May 18 18:04:31 2016 +0200
> > 
> >      svm: Do not intercept CR8 when enable AVIC
> >      When enable AVIC:
> >          * Do not intercept CR8 since this should be handled by AVIC HW.
> >          * Also, we don't need to sync cr8/V_TPR and APIC backing page.   <======
> >      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
> >      [Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
> >      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > 
> > That claim assumes APIC[TPR] will _never_ be modified by anything other than
> > hardware.  That's obviously false for state restore from userspace, and it's also
> > technically false at steady state, e.g. if KVM managed to trigger emulation of a
> > store to the APIC page, then KVM would bypass the automatic harware sync.
> > 
> > There's also the comically ancient KVM_SET_VAPIC_ADDR, which AFAICT appears to
> > be largely dead code with respect to vTPR (nothing sets KVM_APIC_CHECK_VAPIC
> > except for the initial ioctl), but could again set APIC[TPR] without updating
> > V_TPR.
> > 
> > So, rather than manually do the update during state restore, my vote is to restore
> > the sync logic.  And if we want to optimize that code (seems unnecessary), then
> > we should hook all TPR writes.
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index d9931c6c4bc6..1bfebe40854f 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -4046,8 +4046,7 @@ static inline void sync_lapic_to_cr8(struct kvm_vcpu *vcpu)
> >          struct vcpu_svm *svm = to_svm(vcpu);
> >          u64 cr8;
> > -       if (nested_svm_virtualize_tpr(vcpu) ||
> > -           kvm_vcpu_apicv_active(vcpu))
> > +       if (nested_svm_virtualize_tpr(vcpu))
> >                  return;
> >          cr8 = kvm_get_cr8(vcpu);
> > 
> > 
> 
> So you want to just do an unconditional LAPIC -> V_TPR sync at each VMRUN
> and not try to patch every code flow where these possibly could get de-synced
> to do such sync only on demand, correct?

Yep.  For a fix, I definitely want to go with the bare minimum.  If we want to
optimize the sync, that can be done on top, and it can be done irrespective of
AVIC.  E.g. for guests that don't modify TPR, the sync is almost pure overhead
too.

> By the way, the original Suravee's submission for the aforementioned patch
> did *not* inhibit that sync when AVIC is on [1].
> 
> Something similar to this sync inhibit only showed in v4 [2],
> probably upon Radim's comment on v3 [3] that:
> > I think we can exit early with svm_vcpu_avic_enabled().
> 
> But the initial sync inhibit condition in v4 was essentially
> nested_svm_virtualize_tpr() && svm_vcpu_avic_enabled(),
> which suggests there was some confusion what was exactly meant
> by the reviewer comment.

Hmm, I suspect that was just a goof.  My guess is that Radim and Suravee simply
forgot to consider TPR writes that aren't handled by hardware.

> The final sync inhibit condition only showed in v5 [4].
> No further discussion happened on that point.
> 
> Thanks,
> Maciej
> 
> [1]: https://lore.kernel.org/kvm/1455285574-27892-9-git-send-email-suravee.suthikulpanit@amd.com/
> [2]: https://lore.kernel.org/kvm/1460017232-17429-11-git-send-email-Suravee.Suthikulpanit@amd.com/
> [3]: https://lore.kernel.org/kvm/20160318211048.GB26119@potion.brq.redhat.com/
> [4]: https://lore.kernel.org/kvm/1462388992-25242-13-git-send-email-Suravee.Suthikulpanit@amd.com/
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-08-22 23:41 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-19 13:32 [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Maciej S. Szmigiero
2025-08-19 13:32 ` [PATCH 1/2] KVM: SVM: Sync TPR from LAPIC into VMCB::V_TPR when setting LAPIC regs Maciej S. Szmigiero
2025-08-21 20:38   ` Sean Christopherson
2025-08-22  9:04     ` Naveen N Rao
2025-08-22 20:54       ` Sean Christopherson
2025-08-22 23:20     ` Maciej S. Szmigiero
2025-08-22 23:41       ` Sean Christopherson
2025-08-19 13:32 ` [PATCH 2/2] KVM: selftests: Test TPR / CR8 sync and interrupt masking Maciej S. Szmigiero
2025-08-21  8:18 ` [PATCH 0/2] KVM: SVM: Fix missing LAPIC TPR sync into VMCB::V_TPR with AVIC on Naveen N Rao
2025-08-21 11:42   ` Maciej S. Szmigiero
2025-08-21 14:59     ` Alejandro Jimenez
2025-08-21 21:11       ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).