public inbox for kvm@vger.kernel.org
* Bug with nested PAUSE intercept on SVM
@ 2026-04-07 18:11 Kaplan, David
  2026-04-07 18:24 ` Sean Christopherson
  0 siblings, 1 reply; 3+ messages in thread
From: Kaplan, David @ 2026-04-07 18:11 UTC (permalink / raw)
  To: kvm list
  Cc: LKML, Andrew Cooper, Lendacky, Thomas, Sean Christopherson,
	Paolo Bonzini

Hi,

On AMD SVM when the L1 guest is trying to intercept every PAUSE instruction in an L2 guest, the PAUSE intercept sometimes fails to fire.  I have a theory on the source of the bug and also included a short reproducer below.

In this scenario, L1 has created a guest with the pause count and threshold set to 0, and the PAUSE intercept bit set.  I *think* the bug is that if the vCPU gets scheduled out on L0 while we're in the L2 guest, then upon resuming the vCPU KVM calls shrink_ple_window() which doesn't appear to take into account the fact that svm->vmcb might be for the L2 guest and not the L1.  As a result, it looks like it sets the pause count to the default (3000) causing many PAUSE instructions in L2 to not be intercepted.
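The suspected failure mode can be illustrated with a small user-space model.  This is only a sketch of the theory above, not KVM's code: the struct layout, the in_guest_mode flag, and all model_* names are hypothetical, and the guarded variant is just one conceivable shape for a fix rather than a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DEFAULT_PLE_WINDOW 3000 /* the default value cited in this report */

/* Hypothetical stand-in for the one VMCB field that matters here. */
struct model_vmcb {
	uint16_t pause_filter_count;
};

/* Hypothetical stand-in for the per-vCPU state L0 keeps. */
struct model_vcpu {
	struct model_vmcb vmcb01;   /* used by L0 to run L1 */
	struct model_vmcb vmcb02;   /* used by L0 to run L2 on behalf of L1 */
	struct model_vmcb *current; /* whichever VMCB is active, like svm->vmcb */
	bool in_guest_mode;         /* true while L2 is running */
};

/* Suspected behavior: reset the window on whatever VMCB is current. */
static void model_shrink_unguarded(struct model_vcpu *v)
{
	v->current->pause_filter_count = MODEL_DEFAULT_PLE_WINDOW;
}

/* One conceivable guard (an assumption): leave L2's settings alone. */
static void model_shrink_guarded(struct model_vcpu *v)
{
	if (v->in_guest_mode)
		return;
	v->current->pause_filter_count = MODEL_DEFAULT_PLE_WINDOW;
}

/* Enter L2 with the count L1 requested (0 means intercept every PAUSE). */
static void model_enter_l2(struct model_vcpu *v, uint16_t l1_requested_count)
{
	v->vmcb02.pause_filter_count = l1_requested_count;
	v->current = &v->vmcb02;
	v->in_guest_mode = true;
}
```

In this model, calling model_enter_l2(&v, 0) followed by model_shrink_unguarded(&v) leaves vmcb02 with a count of 3000 instead of the 0 that L1 asked for, while the guarded variant leaves it at 0.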

The code below (compiled as an out-of-tree module) can reproduce this.  The code runs a trivial VM which does INC EAX/PAUSE in a loop and compares the final guest value of EAX vs how many times the host saw a PAUSE intercept.  When run on bare-metal (meaning I loaded this module on a bare-metal kernel) the values always match.  But when running inside a VM, I see mismatches like so:

[  854.354997] pause_svm_test: completed 100000 PAUSE exits
[  854.355002] pause_svm_test: guest EAX=112000 (expected 100000)
[  854.355003] pause_svm_test: EAX mismatch, possible PAUSE intercept bug
[  854.369108] pause_svm_test: module unloaded
[  855.189293] pause_svm_test: completed 100000 PAUSE exits
[  855.189299] pause_svm_test: guest EAX=109000 (expected 100000)
[  855.189300] pause_svm_test: EAX mismatch, possible PAUSE intercept bug
[  855.203689] pause_svm_test: module unloaded
[  856.042187] pause_svm_test: completed 100000 PAUSE exits
[  856.042193] pause_svm_test: guest EAX=106000 (expected 100000)
[  856.042194] pause_svm_test: EAX mismatch, possible PAUSE intercept bug

The fact that the deltas are always a multiple of 3000 (KVM_SVM_DEFAULT_PLE_WINDOW) seems suspicious, but consistent with the above theory.
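To make the arithmetic concrete, here is a rough user-space model of how a clobbered filter count would inflate EAX.  It assumes a simplified counter (with a count of N, the first N PAUSEs after VMRUN are not intercepted and the next one is) and assumes the count goes back to 0 on the following VMRUN.  Both are assumptions made for illustration, and the model_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_DEFAULT_PLE_WINDOW 3000

/*
 * Simplified counter model (an assumption): with a filter count of N, the
 * first N PAUSEs after VMRUN run unintercepted and the next one exits.
 * N = 0 therefore intercepts every PAUSE.
 */
static uint32_t model_pauses_until_exit(uint32_t filter_count)
{
	return filter_count + 1;
}

/*
 * Guest EAX after `exits` PAUSE intercepts when `clobbers` of those VMRUNs
 * ran with the default window instead of the 0 that L1 asked for.
 * The guest does INC EAX once per PAUSE, intercepted or not.
 */
static uint32_t model_guest_eax(uint32_t exits, uint32_t clobbers)
{
	uint32_t eax = 0;

	for (uint32_t i = 0; i < exits; i++) {
		uint32_t count = (i < clobbers) ? MODEL_DEFAULT_PLE_WINDOW : 0;

		eax += model_pauses_until_exit(count);
	}
	return eax;
}

/* Clobbered VMRUNs implied by an observed EAX, or -1 if the delta does not fit. */
static int model_implied_clobbers(uint32_t eax, uint32_t exits)
{
	if (eax < exits)
		return -1;
	if ((eax - exits) % MODEL_DEFAULT_PLE_WINDOW)
		return -1;
	return (int)((eax - exits) / MODEL_DEFAULT_PLE_WINDOW);
}
```

Under this model, the three runs logged above (EAX of 112000, 109000, and 106000 against 100000 exits) would correspond to 4, 3, and 2 clobbered VMRUNs respectively.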

Reproduction code below (written with AI help).  Compile against desired kernel version and load it to run the test.

Tested on AMD Zen5 CPU running Fedora 6.19.10-200.fc43.x86_64 with no special kvm module parameters.

Thanks --David Kaplan

---
// SPDX-License-Identifier: GPL-2.0
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/preempt.h>
#include <linux/uaccess.h>
#include <linux/bitops.h>
#include <asm/msr.h>
#include <asm/msr-index.h>
#include <asm/processor.h>
#include <asm/processor-flags.h>
#include <asm/svm.h>
#include <uapi/asm/svm.h>

#define TEST_MAX_PAUSES 100000

#define NPT_PRESENT 0x1ULL
#define NPT_RW      0x2ULL
#define NPT_USER    0x4ULL

static __always_inline void stgi(void)
{
	asm volatile("stgi" ::: "memory");
}

static inline void svm_vmrun(u64 vmcb_pa)
{
	asm volatile("vmrun %0" : : "a"(vmcb_pa) : "memory");
	stgi();
}

static void vmcb_set_seg(struct vmcb_seg *seg, u16 selector, u32 attrib)
{
	seg->selector = selector;
	seg->attrib = attrib;
	seg->limit = 0xFFFFF;
	seg->base = 0;
}

static int __init pause_svm_test_init(void)
{
	struct vmcb *vmcb = NULL;
	void *guest_page = NULL;
	u64 vmcb_pa = 0;
	u64 npt_cr3 = 0;
	u64 *pml4 = NULL;
	u64 *pml5 = NULL;
	u64 *pdpt = NULL;
	u64 *pd = NULL;
	u64 *pt = NULL;
	u32 exit_code;
	u32 pause_count = 0;

	if (!boot_cpu_has(X86_FEATURE_SVM)) {
		pr_err("pause_svm_test: CPU does not support SVM\n");
		return -ENODEV;
	}

	guest_page = (void *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	if (!guest_page)
		return -ENOMEM;

	vmcb = (struct vmcb *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	if (!vmcb) {
		free_page((unsigned long)guest_page);
		return -ENOMEM;
	}

	pml4 = (u64 *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	if (boot_cpu_has(X86_FEATURE_LA57))
		pml5 = (u64 *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	pdpt = (u64 *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	pd = (u64 *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	pt = (u64 *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
	if (!pml4 || (boot_cpu_has(X86_FEATURE_LA57) && !pml5) || !pdpt || !pd || !pt) {
		free_page((unsigned long)pt);
		free_page((unsigned long)pd);
		free_page((unsigned long)pdpt);
		free_page((unsigned long)pml5);
		free_page((unsigned long)pml4);
		free_page((unsigned long)vmcb);
		free_page((unsigned long)guest_page);
		return -ENOMEM;
	}

	/* Guest code: inc %eax; pause; jmp 1b */
	((u8 *)guest_page)[0] = 0x40; /* inc %eax */
	((u8 *)guest_page)[1] = 0xF3; /* pause */
	((u8 *)guest_page)[2] = 0x90;
	((u8 *)guest_page)[3] = 0xEB; /* jmp short -5 */
	((u8 *)guest_page)[4] = 0xFB;

	/* Build a minimal NPT mapping for GPA 0 -> guest_page */
	pt[0] = (u64)__pa(guest_page) | NPT_PRESENT | NPT_RW | NPT_USER;
	pd[0] = (u64)__pa(pt) | NPT_PRESENT | NPT_RW | NPT_USER;
	pdpt[0] = (u64)__pa(pd) | NPT_PRESENT | NPT_RW | NPT_USER;
	pml4[0] = (u64)__pa(pdpt) | NPT_PRESENT | NPT_RW | NPT_USER;
	if (pml5)
		pml5[0] = (u64)__pa(pml4) | NPT_PRESENT | NPT_RW | NPT_USER;

	npt_cr3 = (u64)__pa(pml5 ? pml5 : pml4);
	vmcb_pa = (u64)__pa(vmcb);

	set_bit(INTERCEPT_INTR, (unsigned long *)vmcb->control.intercepts);
	set_bit(INTERCEPT_NMI, (unsigned long *)vmcb->control.intercepts);
	set_bit(INTERCEPT_SMI, (unsigned long *)vmcb->control.intercepts);
	set_bit(INTERCEPT_SHUTDOWN, (unsigned long *)vmcb->control.intercepts);
	set_bit(INTERCEPT_VMRUN, (unsigned long *)vmcb->control.intercepts);
	/* pause_filter_count/thresh stay 0 (zeroed page): intercept every PAUSE */
	set_bit(INTERCEPT_PAUSE, (unsigned long *)vmcb->control.intercepts);

	vmcb->control.asid = 1;
	vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ALL_ASID;
	vmcb->control.nested_ctl = SVM_NESTED_CTL_NP_ENABLE;
	vmcb->control.nested_cr3 = npt_cr3;
	vmcb->control.int_ctl = V_INTR_MASKING_MASK;

	/* 32-bit protected mode, no paging */
	vmcb->save.cr0 = X86_CR0_PE | X86_CR0_NE | X86_CR0_ET | X86_CR0_MP;
	vmcb->save.efer = EFER_SVME;
	vmcb->save.dr7 = 0x400;
	vmcb->save.rflags = 0x2;

	/* Flat 32-bit segment */
	vmcb_set_seg(&vmcb->save.cs, 0x8, 0xC9B);

	while (pause_count < TEST_MAX_PAUSES) {
		svm_vmrun(vmcb_pa);
		exit_code = vmcb->control.exit_code;

		switch (exit_code) {
		case SVM_EXIT_INTR:
		case SVM_EXIT_NMI:
		case SVM_EXIT_SMI:
			break;
		case SVM_EXIT_PAUSE:
			pause_count++;
			vmcb->save.rip += 2;
			break;
		default:
			pr_err("pause_svm_test: unexpected exit 0x%x info1=0x%llx info2=0x%llx\n",
			       exit_code, vmcb->control.exit_info_1,
			       vmcb->control.exit_info_2);
			goto out_restore;
		}
	}

	pr_info("pause_svm_test: completed %u PAUSE exits\n", pause_count);
	pr_info("pause_svm_test: guest EAX=%llu (expected %u)\n",
		vmcb->save.rax, pause_count);
	if ((u32)vmcb->save.rax != pause_count)
		pr_err("pause_svm_test: EAX mismatch, possible PAUSE intercept bug\n");

out_restore:
	free_page((unsigned long)pt);
	free_page((unsigned long)pd);
	free_page((unsigned long)pdpt);
	free_page((unsigned long)pml5);
	free_page((unsigned long)pml4);
	free_page((unsigned long)vmcb);
	free_page((unsigned long)guest_page);
	return 0;
}

static void __exit pause_svm_test_exit(void)
{
	pr_info("pause_svm_test: module unloaded\n");
}

module_init(pause_svm_test_init);
module_exit(pause_svm_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Codex");
MODULE_DESCRIPTION("Out-of-tree SVM PAUSE intercept test");


* Re: Bug with nested PAUSE intercept on SVM
  2026-04-07 18:11 Bug with nested PAUSE intercept on SVM Kaplan, David
@ 2026-04-07 18:24 ` Sean Christopherson
  2026-04-07 18:30   ` Kaplan, David
  0 siblings, 1 reply; 3+ messages in thread
From: Sean Christopherson @ 2026-04-07 18:24 UTC (permalink / raw)
  To: David Kaplan
  Cc: kvm list, LKML, Andrew Cooper, Thomas Lendacky, Paolo Bonzini

On Tue, Apr 07, 2026, David Kaplan wrote:
> Hi,
> 
> On AMD SVM when the L1 guest is trying to intercept every PAUSE instruction
> in an L2 guest, the PAUSE intercept sometimes fails to fire.  I have a theory
> on the source of the bug and also included a short reproducer below.
> 
> In this scenario, L1 has created a guest with the pause count and threshold
> set to 0, and the PAUSE intercept bit set.  I *think* the bug is that if the
> vCPU gets scheduled out on L0 while we're in the L2 guest, then upon resuming
> the vCPU KVM calls shrink_ple_window() which doesn't appear to take into
> account the fact that svm->vmcb might be for the L2 guest and not the L1.  As
> a result, it looks like it sets the pause count to the default (3000) causing
> many PAUSE instructions in L2 to not be intercepted.

It's probably even simpler than that: KVM is completely broken.

https://lore.kernel.org/all/20250131010601.469904-1-seanjc@google.com

Paolo, can I finally apply that patch?  I brought it up in PUCK a while back,
and IIRC you were resistant to dropping "support" for cpu_pm=on setups.


* RE: Bug with nested PAUSE intercept on SVM
  2026-04-07 18:24 ` Sean Christopherson
@ 2026-04-07 18:30   ` Kaplan, David
  0 siblings, 0 replies; 3+ messages in thread
From: Kaplan, David @ 2026-04-07 18:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm list, LKML, Andrew Cooper, Lendacky, Thomas, Paolo Bonzini

[AMD Official Use Only - AMD Internal Distribution Only]

> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: Tuesday, April 7, 2026 1:25 PM
> To: Kaplan, David <David.Kaplan@amd.com>
> Cc: kvm list <kvm@vger.kernel.org>; LKML <linux-kernel@vger.kernel.org>;
> Andrew Cooper <andrew.cooper3@citrix.com>; Lendacky, Thomas
> <Thomas.Lendacky@amd.com>; Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: Bug with nested PAUSE intercept on SVM
>
> On Tue, Apr 07, 2026, David Kaplan wrote:
> > Hi,
> >
> > On AMD SVM when the L1 guest is trying to intercept every PAUSE instruction
> > in an L2 guest, the PAUSE intercept sometimes fails to fire.  I have a theory
> > on the source of the bug and also included a short reproducer below.
> >
> > In this scenario, L1 has created a guest with the pause count and threshold
> > set to 0, and the PAUSE intercept bit set.  I *think* the bug is that if the
> > vCPU gets scheduled out on L0 while we're in the L2 guest, then upon resuming
> > the vCPU KVM calls shrink_ple_window() which doesn't appear to take into
> > account the fact that svm->vmcb might be for the L2 guest and not the L1.  As
> > a result, it looks like it sets the pause count to the default (3000) causing
> > many PAUSE instructions in L2 to not be intercepted.
>
> It's probably even simpler than that: KVM is completely broken.
>
> https://lore.kernel.org/all/20250131010601.469904-1-seanjc@google.com
>
> Paolo, can I finally apply that patch?  I brought it up in PUCK a while back,
> and IIRC you were resistant to dropping "support" for cpu_pm=on setups.

Interesting.  But does that patch solve my problem?  It looks like it would still call shrink_ple_window() even if the vCPU was scheduled out while running L2, and would still change the pause_filter_count in vmcb02, if I'm reading it correctly.

--David Kaplan
