[PATCH 0/1] KVM: x86/mmu: don't kill the VM on access to a disabled passthrough BAR

Kernel KVM virtualization development

* [PATCH 0/1] KVM: x86/mmu: don't kill the VM on access to a disabled passthrough BAR
@ 2026-06-21 13:37 mike.malyshev
  2026-06-21 13:37 ` [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, " mike.malyshev
  0 siblings, 1 reply; 5+ messages in thread
From: mike.malyshev @ 2026-06-21 13:37 UTC
  To: Sean Christopherson, Paolo Bonzini, kvm
  Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, H. Peter Anvin,
	Mikhail Malyshev

From: Mikhail Malyshev <mike.malyshev@gmail.com>

A guest with an assigned PCI device can crash its own VM by toggling
PCI_COMMAND.MEM on that device while another vCPU accesses the device's
BAR. KVM_RUN returns -EFAULT, which userspace (QEMU) treats as fatal.

This is a guest-triggerable, host-side VM kill, so I think it is worth
addressing in KVM rather than papering over it in userspace.

The window
==========

A passed-through BAR is mapped into the guest via a VM_IO/VM_PFNMAP VMA
whose fault handler (vfio_pci_mmap_fault()) refuses to install a PTE while
the device's memory space is disabled. When the guest clears
PCI_COMMAND.MEM:

  - the kernel vfio config-write path zaps the BAR's userspace mapping;
  - userspace's memory listener later removes the corresponding KVM
    memslot.

A vCPU that faults on the BAR in the window after the mapping is zapped but
before the memslot is removed lands in the page fault path with a *valid*
memslot but a backing whose fault handler declines. hva_to_pfn_remapped()
returns an error, the gfn resolves to KVM_PFN_ERR_FAULT, and
kvm_handle_error_pfn() returns -EFAULT.

On bare metal the same access is an Unsupported Request (reads all ones,
writes dropped), not a fatal error. This series makes KVM emulate the
access as MMIO in that case, matching hardware, while leaving genuine
faults (e.g. a vanished anonymous backing, vma == NULL) returning -EFAULT
as before -- consistent with what tools/testing/selftests/kvm/
mmu_stress_test.c already asserts.
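
For illustration only, the Unsupported Request behaviour described above can
be sketched in userspace C. The helper names ur_read() and ur_write() are
hypothetical and are not KVM or QEMU APIs; this only shows the data semantics
(reads complete with all ones, writes are discarded), not how KVM's emulator
implements them.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: complete a read of `len` bytes from a BAR whose
 * memory decoding is disabled. Every byte reads back as 0xff.
 */
static void ur_read(void *dst, size_t len)
{
	memset(dst, 0xff, len);
}

/*
 * Hypothetical helper: a write to a BAR whose memory decoding is disabled
 * is dropped, so the source buffer is deliberately ignored.
 */
static void ur_write(const void *src, size_t len)
{
	(void)src;
	(void)len;
}
```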

How it was found / confirmed
============================

The crash was originally hit in production on edge devices that pass an
Intel iGPU (Raptor Lake-P) through to a guest; the guest's display driver
clears PCI_COMMAND.MEM on one vCPU while another vCPU is mid-MMIO to BAR0.

To study it deterministically I built a reduced, hardware-light reproducer
(no specific guest OS is required; the race is host-side):

  - a Linux guest with any assigned PCI device whose BAR0 is mmap'd from
    userspace (/sys/.../resource0) and hammered with a tight MMIO write
    loop on one vCPU;
  - a second thread that toggles PCI_COMMAND.MEM 1->0->1 on that device
    via the VFIO config region.
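
A minimal sketch of the toggling half of such a reproducer follows. It is not
the actual reproducer code: pci_set_mem_decode() is a hypothetical helper, and
it assumes a file descriptor that exposes the device's config space at offset
0 (for example a sysfs "config" file) rather than the VFIO config region
mentioned above. PCI_COMMAND (offset 0x04) and the memory-enable bit (0x2) are
the standard PCI definitions. A little-endian host is assumed.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PCI_COMMAND		0x04	/* offset of the 16-bit command register */
#define PCI_COMMAND_MEMORY	0x2	/* memory space enable bit */

/*
 * Hypothetical helper: set or clear PCI_COMMAND.MEM via a config-space fd.
 * Returns 0 on success, -1 on a short or failed read/write.
 */
static int pci_set_mem_decode(int cfg_fd, int enable)
{
	uint16_t cmd;

	if (pread(cfg_fd, &cmd, sizeof(cmd), PCI_COMMAND) != (ssize_t)sizeof(cmd))
		return -1;

	if (enable)
		cmd |= PCI_COMMAND_MEMORY;
	else
		cmd &= (uint16_t)~PCI_COMMAND_MEMORY;

	if (pwrite(cfg_fd, &cmd, sizeof(cmd), PCI_COMMAND) != (ssize_t)sizeof(cmd))
		return -1;

	return 0;
}
```

Calling it with enable = 0 and then enable = 1 in a loop gives the 1->0->1
toggle; the MMIO write loop on the other vCPU is not shown.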

Without this patch the VM dies within ~1 s. eBPF on the fault path showed
the -EFAULT originating in the faultin path (kvm_mmu_faultin_pfn ->
kvm_handle_error_pfn) with the memslot valid (flags=0, not
KVM_MEMSLOT_INVALID), no mmu_notifier invalidation in progress, and the
pfn equal to the GUP-error value -- i.e. the VM_PFNMAP fault handler
declining, exactly the case this patch targets. (An earlier attempt to
treat it as a stale-mapping race and retry on mmu_invalidate_retry_gfn()
did not help, because by the time of the fault the invalidation has
already completed and the seq is stable; that confirmed the failure is a
steady-state "device decoding disabled" condition, not a transient
invalidation, and led to the MMIO approach here.)

With the patch the same reproducer survived 200k toggle cycles, and a
fleet of 17 devices ran 48h with no recurrence.

Open questions for reviewers
============================

 - hva_to_pfn_remapped() can in principle return an error for reasons
   other than "fault handler declined" (e.g. an OOM from
   fixup_user_fault()). Treating all of them as MMIO is what this patch
   does for simplicity; I can instead plumb the specific condition through
   if you'd prefer to narrow it.

 - I could not find a clean way to add a selftest: a faithful regression
   test needs a VM_PFNMAP backing whose fault handler can be toggled,
   which from pure userspace means /dev/mem or a real assigned device
   (neither CI-portable), or a dedicated test module (outside selftests).
   Guidance on the preferred shape would be welcome.

Mikhail Malyshev (1):
  KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled
    passthrough BAR

 arch/x86/kvm/mmu/mmu.c   | 16 +++++++++++++++-
 include/linux/kvm_host.h |  8 ++++++++
 virt/kvm/kvm_main.c      |  9 ++++++++-
 3 files changed, 31 insertions(+), 2 deletions(-)

-- 
2.43.0

* [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled passthrough BAR
  2026-06-21 13:37 [PATCH 0/1] KVM: x86/mmu: don't kill the VM on access to a disabled passthrough BAR mike.malyshev
@ 2026-06-21 13:37 ` mike.malyshev
  2026-06-22 23:23   ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: mike.malyshev @ 2026-06-21 13:37 UTC
  To: Sean Christopherson, Paolo Bonzini, kvm
  Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, H. Peter Anvin,
	Mikhail Malyshev

From: Mikhail Malyshev <mike.malyshev@gmail.com>

A passed-through PCI device's BAR is mapped into the guest through a
VM_IO/VM_PFNMAP VMA whose fault handler (e.g. vfio_pci_mmap_fault())
declines to install a PTE while the device's memory space is disabled,
such as immediately after the guest clears PCI_COMMAND.MEM. If another
vCPU accesses that BAR during the window, hva_to_pfn_remapped() fails and
the gfn resolves to an error pfn even though the memslot is still valid.
kvm_handle_error_pfn() then returns -EFAULT and KVM_RUN exits to userspace,
which typically treats this as fatal and kills the VM.

This is guest-triggerable: a guest that toggles PCI_COMMAND.MEM on an
assigned device while another vCPU touches the BAR can take down its own
VM (observed in production with an assigned Intel iGPU; the guest's display
driver clears PCI_COMMAND.MEM on one vCPU while another is mid-MMIO to the
BAR).

On bare metal an access to a BAR whose memory decoding is disabled simply
completes as an Unsupported Request: reads return all ones, writes are
dropped. KVM can present the same behaviour by treating the access as MMIO
and emulating it, which is exactly what the noslot path already does for a
gfn that has no memslot.

Distinguish the VM_IO/VM_PFNMAP fault-handler failure from other error
pfns with a new KVM_PFN_ERR_PFNMAP value (within the KVM_PFN_ERR_MASK range,
so existing is_error_pfn() checks are unaffected) and route it to
kvm_handle_noslot_fault() in the x86 TDP fault path. Genuine,
non-pfnmap faults, e.g. a vanished anonymous backing, still take the fatal
-EFAULT path, consistent with what mmu_stress_test already expects. The
MMIO mapping self-heals when the device's memory space is re-enabled and
the memslot is updated, bumping the MMIO generation.
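
As a userspace illustration of why existing is_error_pfn() callers are
unaffected, the encoding can be sketched as below. The mask values and helper
bodies are assumed to mirror include/linux/kvm_host.h and should be checked
against the tree; only KVM_PFN_ERR_PFNMAP comes from this patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t kvm_pfn_t;

/* Assumption: these mirror include/linux/kvm_host.h; verify in your tree. */
#define KVM_PFN_ERR_MASK	(0x7ffULL << 52)
#define KVM_PFN_NOSLOT		(0x1ULL << 63)

#define KVM_PFN_ERR_FAULT	(KVM_PFN_ERR_MASK)
#define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
#define KVM_PFN_ERR_PFNMAP	(KVM_PFN_ERR_MASK + 5)	/* added by this patch */

/* Any pfn with a bit set inside the error mask is an error pfn. */
static bool is_error_pfn(kvm_pfn_t pfn)
{
	return !!(pfn & KVM_PFN_ERR_MASK);
}

/* The noslot marker is a single distinct value outside the error mask. */
static bool is_noslot_pfn(kvm_pfn_t pfn)
{
	return pfn == KVM_PFN_NOSLOT;
}
```

Because the new value only changes low bits on top of the mask, it still
tests as an error pfn while remaining distinguishable from KVM_PFN_ERR_FAULT
by equality.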

Signed-off-by: Mikhail Malyshev <mike.malyshev@gmail.com>
---
 arch/x86/kvm/mmu/mmu.c   | 16 +++++++++++++++-
 include/linux/kvm_host.h |  8 ++++++++
 virt/kvm/kvm_main.c      |  9 ++++++++-
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 91843e9224d04..115e2c4db5fa0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4759,8 +4759,22 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	if (ret != RET_PF_CONTINUE)
 		return ret;
 
-	if (unlikely(is_error_pfn(fault->pfn)))
+	if (unlikely(is_error_pfn(fault->pfn))) {
+		/*
+		 * A passed-through PCI BAR is backed by a VM_IO/VM_PFNMAP
+		 * mapping whose fault handler refuses to install a PTE while the
+		 * device's memory space is disabled (e.g. the guest cleared
+		 * PCI_COMMAND.MEM). The fault then fails even though the memslot
+		 * is still valid. Treat such an access as MMIO and emulate it so
+		 * the guest observes Unsupported Request semantics, matching
+		 * bare metal, instead of killing the VM with -EFAULT. Genuine,
+		 * non-pfnmap errors still take the fatal path.
+		 */
+		if (fault->pfn == KVM_PFN_ERR_PFNMAP)
+			return kvm_handle_noslot_fault(vcpu, fault, access);
+
 		return kvm_handle_error_pfn(vcpu, fault);
+	}
 
 	if (WARN_ON_ONCE(!fault->slot || is_noslot_pfn(fault->pfn)))
 		return kvm_handle_noslot_fault(vcpu, fault, access);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4c14aee1fb063..dc5973e400721 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -99,6 +99,14 @@
 #define KVM_PFN_ERR_RO_FAULT	(KVM_PFN_ERR_MASK + 2)
 #define KVM_PFN_ERR_SIGPENDING	(KVM_PFN_ERR_MASK + 3)
 #define KVM_PFN_ERR_NEEDS_IO	(KVM_PFN_ERR_MASK + 4)
+/*
+ * Faulting in a VM_IO/VM_PFNMAP mapping failed because the owner's fault
+ * handler declined to install a PTE, e.g. a passed-through PCI BAR whose
+ * device memory is currently disabled (the guest cleared PCI_COMMAND.MEM).
+ * The memslot is valid; the access should be treated as MMIO rather than a
+ * fatal -EFAULT.
+ */
+#define KVM_PFN_ERR_PFNMAP	(KVM_PFN_ERR_MASK + 5)
 
 /*
  * error pfns indicate that the gfn is in slot but faild to
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 881f92d7a469e..f232fc2f42380 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3015,7 +3015,14 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *kfp)
 		if (r == -EAGAIN)
 			goto retry;
 		if (r < 0)
-			pfn = KVM_PFN_ERR_FAULT;
+			/*
+			 * The owner's fault handler declined to install a PTE
+			 * (e.g. a passed-through PCI BAR with device memory
+			 * disabled). Flag it distinctly so the arch fault
+			 * handler can treat the access as MMIO instead of a
+			 * fatal -EFAULT.
+			 */
+			pfn = KVM_PFN_ERR_PFNMAP;
 	} else {
 		if ((kfp->flags & FOLL_NOWAIT) &&
 		    vma_is_valid(vma, kfp->flags & FOLL_WRITE))
-- 
2.43.0


* Re: [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled passthrough BAR
  2026-06-21 13:37 ` [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, " mike.malyshev
@ 2026-06-22 23:23   ` Sean Christopherson
  2026-06-23 11:46     ` Mikhail Malyshev
  0 siblings, 1 reply; 5+ messages in thread
From: Sean Christopherson @ 2026-06-22 23:23 UTC
  To: mike.malyshev
  Cc: Paolo Bonzini, kvm, linux-kernel, tglx, mingo, bp, dave.hansen,
	x86, H. Peter Anvin

On Sun, Jun 21, 2026, mike.malyshev@gmail.com wrote:
> ---
>  arch/x86/kvm/mmu/mmu.c   | 16 +++++++++++++++-
>  include/linux/kvm_host.h |  8 ++++++++
>  virt/kvm/kvm_main.c      |  9 ++++++++-
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 91843e9224d04..115e2c4db5fa0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4759,8 +4759,22 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  	if (ret != RET_PF_CONTINUE)
>  		return ret;
>  
> -	if (unlikely(is_error_pfn(fault->pfn)))
> +	if (unlikely(is_error_pfn(fault->pfn))) {
> +		/*
> +		 * A passed-through PCI BAR is backed by a VM_IO/VM_PFNMAP
> +		 * mapping whose fault handler refuses to install a PTE while the
> +		 * device's memory space is disabled (e.g. the guest cleared
> +		 * PCI_COMMAND.MEM). The fault then fails even though the memslot
> +		 * is still valid. Treat such an access as MMIO and emulate it so
> +		 * the guest observes Unsupported Request semantics, matching
> +		 * bare metal, instead of killing the VM with -EFAULT. Genuine,
> +		 * non-pfnmap errors still take the fatal path.
> +		 */
> +		if (fault->pfn == KVM_PFN_ERR_PFNMAP)
> +			return kvm_handle_noslot_fault(vcpu, fault, access);

I really don't like this.  It's an ABI change that affects years of precedent,
and the ABI it creates is very haphazard.  E.g. why should PFNMAP memory get
emulated MMIO semantics for this case?  And a PROT_NONE VMA really shouldn't get
MMIO semantics either.

I would rather have KVM exit to userspace with KVM_EXIT_MEMORY_FAULT and fill
run->memory_fault, i.e.

diff --git arch/x86/kvm/mmu/mmu.c arch/x86/kvm/mmu/mmu.c
index 26ed97efda91..aff6b82b0fdd 100644
--- arch/x86/kvm/mmu/mmu.c
+++ arch/x86/kvm/mmu/mmu.c
@@ -3534,6 +3534,7 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
                return RET_PF_RETRY;
        }
 
+       kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
        return -EFAULT;
 }
 
Huh, knew I had a feeling of deja vu.  I even proposed this as a fix a while
back.  I don't know why it didn't go anywhere.  Maybe simply because no one cared
at the time?

https://lore.kernel.org/all/Zr-8M9rYplgN6IS3@google.com

* Re: [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled passthrough BAR
  2026-06-22 23:23   ` Sean Christopherson
@ 2026-06-23 11:46     ` Mikhail Malyshev
  2026-06-25 16:21       ` Sean Christopherson
  0 siblings, 1 reply; 5+ messages in thread
From: Mikhail Malyshev @ 2026-06-23 11:46 UTC
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, tglx, mingo, bp, dave.hansen,
	x86, H. Peter Anvin

On Mon, Jun 22, 2026, Sean Christopherson wrote:
> I really don't like this.  It's an ABI change that affects years of precedent,
> and the ABI it creates is very haphazard.  E.g. why should PFNMAP memory get
> emulated MMIO semantics for this case?  And a PROT_NONE VMA really shouldn't get
> MMIO semantics either.
>
> I would rather have KVM exit to userspace with KVM_EXIT_MEMORY_FAULT and fill
> run->memory_fault, i.e.

Agreed, thanks - that's the right layer, and the PROT_NONE case alone is reason
enough not to overload the pfn classification the way I did.

For context on v1's shape: localizing the fix in KVM was a deliberate scoping
choice - it was the smallest change that stopped the crash without updating
another major component (the VMM) at the same time. I agree that optimizing for
a single-component change put the policy in the wrong layer; KVM reporting
KVM_EXIT_MEMORY_FAULT and the VMM applying device semantics is the right split.

Plan: implement your suggestion (kvm_mmu_prepare_memory_fault_exit() in
kvm_handle_error_pfn()) and pair it with a QEMU change that recognizes a
memory-fault on a vfio-pci BAR whose memory space is currently disabled and
emulates it as an Unsupported Request (reads all ones, writes dropped) rather
than aborting. Note QEMU's current KVM_EXIT_MEMORY_FAULT handling only covers
guest_memfd shared/private conversion, so this is new vfio-pci handling on the
VMM side - I'll test the kernel + QEMU pair against the reproducer before
sending v2.
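
To make the VMM-side check concrete, here is a rough sketch of the
classification step only. None of this is QEMU code: struct mem_fault stands
in for the address and size that the exit is assumed to report in
run->memory_fault, and struct bar_region and fault_hits_disabled_bar() are
hypothetical names. How the faulting access is then completed with Unsupported
Request semantics is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the gpa/size pair assumed in run->memory_fault. */
struct mem_fault {
	uint64_t gpa;
	uint64_t size;
};

/* Hypothetical VMM-side view of one assigned BAR. */
struct bar_region {
	uint64_t gpa_base;	/* guest physical base of the BAR */
	uint64_t len;		/* BAR length in bytes */
	bool mem_enabled;	/* current state of PCI_COMMAND.MEM */
};

/*
 * Return true if the faulting range lies entirely inside a BAR whose memory
 * decoding is currently disabled. Written to avoid overflow in gpa + size.
 */
static bool fault_hits_disabled_bar(const struct mem_fault *f,
				    const struct bar_region *bar)
{
	uint64_t off;

	if (bar->mem_enabled)
		return false;
	if (f->size == 0 || f->gpa < bar->gpa_base)
		return false;

	off = f->gpa - bar->gpa_base;
	return off < bar->len && f->size <= bar->len - off;
}
```

Anything that does not match would keep the existing fatal handling, so the
new behaviour stays limited to the disabled-BAR case.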

FWIW this is the concrete user the 2024 KVM_CAP_MEMORY_FAULT_INFO work
(Zr-8M9rYplgN6IS3) was missing: an assigned Intel iGPU whose guest driver clears
PCI_COMMAND.MEM on one vCPU while another is mid-MMIO to the BAR, which kills the
VM. Reproducible on demand and seen in the field.

Two questions before I respin:

- Prefer I revive Anish's x86 patch from that series (rebased), or a fresh
  minimal change against kvm-x86/next? The tree has moved a lot since 2024
  (kvm_follow_pfn, etc.), so I was leaning fresh with a Link: to the original
  and Suggested-by: you.
- You'd asked for KVM_BUG_ON() + -EIO conversions on that x86 patch - still want
  those folded in, and should this stay x86-only or also cover arm64?

Thanks,
Mike

* Re: [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled passthrough BAR
  2026-06-23 11:46     ` Mikhail Malyshev
@ 2026-06-25 16:21       ` Sean Christopherson
  0 siblings, 0 replies; 5+ messages in thread
From: Sean Christopherson @ 2026-06-25 16:21 UTC
  To: Mikhail Malyshev
  Cc: Paolo Bonzini, kvm, linux-kernel, tglx, mingo, bp, dave.hansen,
	x86, H. Peter Anvin

On Tue, Jun 23, 2026, Mikhail Malyshev wrote:
> On Mon, Jun 22, 2026, Sean Christopherson wrote:
> > I really don't like this.  It's an ABI change that affects years of precedent,
> > and the ABI it creates is very haphazard.  E.g. why should PFNMAP memory get
> > emulated MMIO semantics for this case?  And a PROT_NONE VMA really shouldn't get
> > MMIO semantics either.
> >
> > I would rather have KVM exit to userspace with KVM_EXIT_MEMORY_FAULT and fill
> > run->memory_fault, i.e.
> 
> Agreed, thanks - that's the right layer, and the PROT_NONE case alone is reason
> enough not to overload the pfn classification the way I did.
> 
> For context on v1's shape: localizing the fix in KVM was a deliberate scoping
> choice - it was the smallest change that stopped the crash without updating
> another major component (the VMM) at the same time. I agree that optimizing for
> a single-component change put the policy in the wrong layer; KVM reporting
> KVM_EXIT_MEMORY_FAULT and the VMM applying device semantics is the right split.
> 
> Plan: implement your suggestion (kvm_mmu_prepare_memory_fault_exit() in
> kvm_handle_error_pfn()) and pair it with a QEMU change that recognizes a
> memory-fault on a vfio-pci BAR whose memory space is currently disabled and
> emulates it as an Unsupported Request (reads all ones, writes dropped) rather
> than aborting. Note QEMU's current KVM_EXIT_MEMORY_FAULT handling only covers
> guest_memfd shared/private conversion, so this is new vfio-pci handling on the
> VMM side - I'll test the kernel + QEMU pair against the reproducer before
> sending v2.
> 
> FWIW this is the concrete user the 2024 KVM_CAP_MEMORY_FAULT_INFO work
> (Zr-8M9rYplgN6IS3) was missing: an assigned Intel iGPU whose guest driver clears
> PCI_COMMAND.MEM on one vCPU while another is mid-MMIO to the BAR, which kills the
> VM. Reproducible on demand and seen in the field.
> 
> Two questions before I respin:
> 
> - Prefer I revive Anish's x86 patch from that series (rebased), or a fresh
>   minimal change against kvm-x86/next? The tree has moved a lot since 2024
>   (kvm_follow_pfn, etc.), so I was leaning fresh with a Link: to the original
>   and Suggested-by: you.

Oh, I forgot that Anish had posted the x86 MMU change as an isolated patch.

Given that patch 2 still applies cleanly, I would say take Anish's patch, rewrite
the changelog, and then give yourself a Co-developed-by.  I.e. give attribution
to both Anish and you.

> - You'd asked for KVM_BUG_ON() + -EIO conversions on that x86 patch - still want
>   those folded in,

Yes, please tack on a separate patch to convert the WARN_ON_ONCE()-protected
-EFAULTs in x86's fault path to KVM_BUG_ON() + -EIO.

> and should this stay x86-only or also cover arm64?

Definitely keep your patches x86-only.  For all intents and purposes this is a
bug fix, let someone else deal with the pain of enabling KVM_CAP_MEMORY_FAULT_INFO
on arm64 :-)
