Kernel KVM virtualization development
* [PATCH 0/1] KVM: x86/mmu: don't kill the VM on access to a disabled passthrough BAR
@ 2026-06-21 13:37 mike.malyshev
  2026-06-21 13:37 ` [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, " mike.malyshev
  0 siblings, 1 reply; 5+ messages in thread
From: mike.malyshev @ 2026-06-21 13:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, kvm
  Cc: linux-kernel, tglx, mingo, bp, dave.hansen, x86, H. Peter Anvin,
	Mikhail Malyshev

From: Mikhail Malyshev <mike.malyshev@gmail.com>

A guest with an assigned PCI device can crash its own VM by toggling
PCI_COMMAND.MEM on that device while another vCPU accesses the device's
BAR. KVM_RUN returns -EFAULT, which userspace (QEMU) treats as fatal.

This is a guest-triggerable, host-side VM kill, so I think it is worth
addressing in KVM rather than papering over it in userspace.

The window
==========

A passed-through BAR is mapped into the guest via a VM_IO/VM_PFNMAP VMA
whose fault handler (vfio_pci_mmap_fault()) refuses to install a PTE while
the device's memory space is disabled. When the guest clears
PCI_COMMAND.MEM:

  - the kernel vfio config-write path zaps the BAR's userspace mapping;
  - userspace's memory listener later removes the corresponding KVM
    memslot.

A vCPU that faults on the BAR in the window after the mapping is zapped but
before the memslot is removed lands in the page fault path with a *valid*
memslot but a backing whose fault handler declines to install a PTE.
hva_to_pfn_remapped() returns an error, the gfn resolves to
KVM_PFN_ERR_FAULT, and kvm_handle_error_pfn() returns -EFAULT.
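
The three states of the window can be sketched as a small userspace model.
This is an illustration only, not kernel code: struct bar_state, the enum
and the model_*() helpers are hypothetical names, and the "no memslot means
MMIO emulation" branch reflects my understanding of KVM's usual handling
rather than anything stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the host-side state for one passed-through BAR. */
struct bar_state {
	bool pte_installable;	/* false once PCI_COMMAND.MEM is cleared and the mapping is zapped */
	bool memslot_present;	/* false once userspace's memory listener removed the slot */
};

enum fault_result {
	FAULT_MAPPED,		/* pfn resolved, guest access proceeds */
	FAULT_MMIO_EXIT,	/* no memslot: access handled as emulated MMIO (assumption) */
	FAULT_EFAULT,		/* KVM_RUN fails with -EFAULT */
};

/* Unpatched behaviour as described in the text above. */
static enum fault_result model_vcpu_fault(const struct bar_state *s)
{
	if (!s->memslot_present)
		return FAULT_MMIO_EXIT;
	if (!s->pte_installable)
		return FAULT_EFAULT;	/* valid slot, fault handler declines */
	return FAULT_MAPPED;
}

/* Guest clears PCI_COMMAND.MEM: the vfio config-write path zaps the mapping. */
static void model_clear_command_mem(struct bar_state *s)
{
	s->pte_installable = false;
}

/* Userspace's memory listener removes the memslot some time after the zap. */
static void model_listener_removes_slot(struct bar_state *s)
{
	assert(!s->pte_installable);	/* ordering taken from the description */
	s->memslot_present = false;
}
```

Only the middle state (mapping zapped, memslot still present) produces
-EFAULT, which is why the failure depends on where the second vCPU's
access lands relative to the two steps.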

On bare metal the same access is an Unsupported Request (reads all ones,
writes dropped), not a fatal error. This series makes KVM emulate the
access as MMIO in that case, matching hardware, while leaving genuine
faults (e.g. a vanished anonymous backing, vma == NULL) returning -EFAULT
as before -- consistent with what tools/testing/selftests/kvm/
mmu_stress_test.c already asserts.
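
As a rough userspace sketch of the intended policy and of the Unsupported
Request semantics it mirrors: enum backing, classify_fault(),
ur_read_value() and ur_write() are hypothetical names used for
illustration and do not appear in the patch. The sketch also does not say
which component completes the emulated access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of what backs the faulting gfn. */
enum backing {
	BACKING_MISSING,		/* vma == NULL, e.g. vanished anonymous memory */
	BACKING_PFNMAP_DECLINED,	/* VM_IO/VM_PFNMAP vma, fault handler refused */
	BACKING_RESOLVED,		/* pfn obtained normally */
};

enum outcome { OUT_MAP_PFN, OUT_EMULATE_MMIO, OUT_EFAULT };

/* Policy described above: only the declined-PFNMAP case becomes MMIO. */
static enum outcome classify_fault(enum backing b)
{
	switch (b) {
	case BACKING_RESOLVED:
		return OUT_MAP_PFN;
	case BACKING_PFNMAP_DECLINED:
		return OUT_EMULATE_MMIO;
	default:
		return OUT_EFAULT;	/* genuine faults keep returning -EFAULT */
	}
}

/* Value a read of len bytes returns under Unsupported Request: all ones. */
static uint64_t ur_read_value(unsigned int len)
{
	assert(len == 1 || len == 2 || len == 4 || len == 8);
	return len == 8 ? UINT64_MAX : (1ULL << (len * 8)) - 1;
}

/* A write is dropped: the modelled register keeps its old value. */
static uint32_t ur_write(uint32_t old_reg, uint32_t value)
{
	(void)value;
	return old_reg;
}
```

For example, a 4-byte read in the disabled state yields 0xffffffff, and a
write leaves the modelled register untouched.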

How it was found / confirmed
============================

The crash was originally hit in production on edge devices that pass an
Intel iGPU (Raptor Lake-P) through to a guest; the guest's display driver
clears PCI_COMMAND.MEM on one vCPU while another vCPU is mid-MMIO to BAR0.

To study it deterministically I built a reduced, hardware-light reproducer
(no specific guest OS is required, since the race is host-side):

  - a Linux guest with any assigned PCI device whose BAR0 is mmap'd from
    userspace (/sys/.../resource0) and hammered with a tight MMIO write
    loop on one vCPU;
  - a second thread that toggles PCI_COMMAND.MEM 1->0->1 on that device
    via the VFIO config region.
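
The two halves of such a reproducer might look roughly like the sketch
below. It is not the actual reproducer: the helper names are made up,
cfg_fd is assumed to be an open file descriptor on the device's config
space (how that fd is obtained is left to the caller), the host is assumed
little-endian, and writing at BAR offset 0 is an arbitrary choice that may
not be safe on every device. PCI_COMMAND (0x04) and PCI_COMMAND_MEMORY
(0x2) are the standard register offset and bit.

```c
#define _XOPEN_SOURCE 700	/* for pread()/pwrite() */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PCI_COMMAND		0x04	/* 16-bit command register offset */
#define PCI_COMMAND_MEMORY	0x2	/* memory space enable bit */

/* Hypothetical helper: open a config-space file for read/write. */
static int open_config(const char *path)
{
	return open(path, O_RDWR);
}

/* Set or clear memory decoding through cfg_fd. Returns 0 on success, -1 on error. */
static int set_mem_decode(int cfg_fd, int enable)
{
	uint16_t cmd;

	assert(cfg_fd >= 0);
	if (pread(cfg_fd, &cmd, sizeof(cmd), PCI_COMMAND) != (ssize_t)sizeof(cmd))
		return -1;
	if (enable)
		cmd |= PCI_COMMAND_MEMORY;
	else
		cmd &= (uint16_t)~PCI_COMMAND_MEMORY;
	if (pwrite(cfg_fd, &cmd, sizeof(cmd), PCI_COMMAND) != (ssize_t)sizeof(cmd))
		return -1;
	return 0;
}

/* Toggler thread body: PCI_COMMAND.MEM 1->0->1, repeated cycles times. */
static int toggle_mem_decode(int cfg_fd, unsigned long cycles)
{
	for (unsigned long i = 0; i < cycles; i++) {
		if (set_mem_decode(cfg_fd, 0) || set_mem_decode(cfg_fd, 1))
			return -1;
	}
	return 0;
}

/* Hammer thread body: tight MMIO write loop on the mmap'd BAR. */
static void hammer_bar(volatile uint32_t *bar, unsigned long iters)
{
	for (unsigned long i = 0; i < iters; i++)
		bar[0] = (uint32_t)i;
}
```

In a real run, hammer_bar() and toggle_mem_decode() would execute
concurrently on separate threads so that the writes can land between the
zap and the memslot removal.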

Without this patch the VM dies within ~1 s. eBPF on the fault path showed
the -EFAULT originating in the faultin path (kvm_mmu_faultin_pfn ->
kvm_handle_error_pfn) with the memslot valid (flags=0, not
KVM_MEMSLOT_INVALID), no mmu_notifier invalidation in progress, and the
pfn equal to the GUP-error value -- i.e. the VM_PFNMAP fault handler
declining, exactly the case this patch targets. (An earlier attempt to
treat it as a stale-mapping race and retry on mmu_invalidate_retry_gfn()
did not help, because by the time of the fault the invalidation has
already completed and the seq is stable; that confirmed the failure is a
steady-state "device decoding disabled" condition, not a transient
invalidation, and led to the MMIO approach here.)

With the patch the same reproducer survived 200k toggle cycles, and a
fleet of 17 devices ran 48h with no recurrence.

Open questions for reviewers
============================

 - hva_to_pfn_remapped() can in principle return an error for reasons
   other than "fault handler declined" (e.g. an OOM from
   fixup_user_fault()). Treating all of them as MMIO is what this patch
   does for simplicity; I can instead plumb the specific condition through
   if you'd prefer to narrow it.

 - I could not find a clean way to add a selftest: a faithful regression
   test needs a VM_PFNMAP backing whose fault handler can be toggled,
   which from pure userspace means /dev/mem or a real assigned device
   (neither CI-portable), or a dedicated test module (outside selftests).
   Guidance on the preferred shape would be welcome.
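
For the first question, one hypothetical shape of a narrowed policy is
sketched below as a userspace model. It assumes the remapped-pfn helper
could surface distinct error codes, with the declined fault handler
showing up as -EFAULT and the OOM case as -ENOMEM. That mapping is an
assumption made for illustration, not something the patch establishes,
and narrowed_policy() is an invented name.

```c
#include <assert.h>
#include <errno.h>

enum action {
	ACT_EMULATE_MMIO,	/* treat the access like a disabled BAR */
	ACT_RETURN_ERROR,	/* propagate the failure to userspace */
};

/* err is a negative errno from the (hypothetical) pfn lookup. */
static enum action narrowed_policy(int err)
{
	assert(err < 0);
	switch (err) {
	case -EFAULT:
		return ACT_EMULATE_MMIO;	/* assumed: fault handler declined */
	default:
		return ACT_RETURN_ERROR;	/* e.g. -ENOMEM stays fatal */
	}
}
```

The current patch corresponds to returning ACT_EMULATE_MMIO for every
error, which is simpler but also covers failures unrelated to the device.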

Mikhail Malyshev (1):
  KVM: x86/mmu: Emulate, don't kill the VM, on access to a disabled
    passthrough BAR

 arch/x86/kvm/mmu/mmu.c   | 16 +++++++++++++++-
 include/linux/kvm_host.h |  8 ++++++++
 virt/kvm/kvm_main.c      |  9 ++++++++-
 3 files changed, 31 insertions(+), 2 deletions(-)

-- 
2.43.0



Thread overview: 5+ messages
2026-06-21 13:37 [PATCH 0/1] KVM: x86/mmu: don't kill the VM on access to a disabled passthrough BAR mike.malyshev
2026-06-21 13:37 ` [PATCH 1/1] KVM: x86/mmu: Emulate, don't kill the VM, " mike.malyshev
2026-06-22 23:23   ` Sean Christopherson
2026-06-23 11:46     ` Mikhail Malyshev
2026-06-25 16:21       ` Sean Christopherson
