[PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults

From: Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	 Tianrui Zhao <zhaotianrui@loongson.cn>,
	Bibo Mao <maobibo@loongson.cn>,
	 Huacai Chen <chenhuacai@kernel.org>,
	WANG Xuerui <kernel@xen0n.name>,
	 Sean Christopherson <seanjc@google.com>,
	Thomas Gleixner <tglx@kernel.org>,
	 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Fuad Tabba <tabba@google.com>,
	 vannapurve@google.com, x86@kernel.org,
	"H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, loongarch@lists.linux.dev,
	 Ackerley Tng <ackerleytng@google.com>
Subject: [PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults
Date: Wed, 27 May 2026 08:33:21 -0700
Message-ID: <20260527-kvm-locking-docs-v1-9-4fe8b602ff47@google.com>
In-Reply-To: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Document, in one central place, how synchronization is used while managing
guest faults, so that code comments can point readers there.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
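Note for reviewers (not part of the commit): the retry check described in
sections 4.1 and 4.2 can be sketched as a single-threaded userspace model.
This is only an illustration under stated assumptions, not KVM's
implementation: the model_* names are hypothetical, range tracking is
simplified, and the locking and memory barriers the kernel relies on are
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; fields loosely mirror kvm->mmu_invalidate_*. */
struct model_vm {
	unsigned long invalidate_seq;	/* bumped whenever an invalidation ends */
	long invalidate_in_progress;	/* number of invalidations in flight */
	uint64_t range_start;		/* union of in-flight GFN ranges */
	uint64_t range_end;		/* exclusive */
};

/* Record the start of an invalidation covering GFNs [start, end). */
static void model_invalidate_begin(struct model_vm *vm, uint64_t start,
				   uint64_t end)
{
	if (vm->invalidate_in_progress++ == 0) {
		vm->range_start = start;
		vm->range_end = end;
	} else {
		/* Overlapping invalidations: track the union of the ranges. */
		if (start < vm->range_start)
			vm->range_start = start;
		if (end > vm->range_end)
			vm->range_end = end;
	}
}

/* Record the end of an invalidation; begin/end calls must be balanced. */
static void model_invalidate_end(struct model_vm *vm)
{
	assert(vm->invalidate_in_progress > 0);
	vm->invalidate_seq++;
	vm->invalidate_in_progress--;
}

/*
 * A fault is stale if an in-flight invalidation covers the GFN, or if any
 * invalidation completed after the fault path took its sequence snapshot.
 */
static bool model_fault_is_stale(const struct model_vm *vm,
				 unsigned long seq_snapshot, uint64_t gfn)
{
	if (vm->invalidate_in_progress &&
	    gfn >= vm->range_start && gfn < vm->range_end)
		return true;
	return vm->invalidate_seq != seq_snapshot;
}
```

In this model, a fault path would snapshot invalidate_seq before the lookup
and call model_fault_is_stale() just before installing the mapping, retrying
whenever it returns true.
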
 Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
     cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
     operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
     updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling a guest page fault, KVM must see a consistent view of the active
+memslots container, so KVM calls ``srcu_read_lock(&kvm->srcu)``.
+
+Guest page fault handling can race with requests to invalidate shadow page
+tables. Such requests, typically triggered by host userspace, originate from a
+few places, such as:
+
+1. MMU Notifiers: KVM registers callbacks with the kernel's memory management
+   subsystem to know when there are changes to mappings in the host userspace
+   page tables.
+2. Memslot Updates: A host userspace VMM, such as QEMU, may use the
+   ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+   must zap the affected shadow page tables to ensure the guest doesn't access
+   stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+   userspace to change attributes for a range of guest memory (e.g., setting a
+   range as "private" for Confidential Computing). This also requires
+   invalidating existing shadow mappings.
+
+KVM handles such races optimistically: the fault handler proceeds without
+blocking invalidations, and just before committing the mapping, KVM checks
+whether an invalidation affecting the faulting GFN is pending or has completed
+since the fault started. If so, KVM retries the fault.
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate_*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A "pending invalidation" is determined using a combination of:
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+The fields above are protected by a single lock,
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, any information derived from a GFN lookup, e.g. page attributes or
+other page metadata, may race with invalidations. Here is an example of such a
+lookup:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+  level of pages in host userspace page tables. If an invalidation occurs, the
+  pages that were mapped may no longer be mapped, and hence the returned mapping
+  level would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after obtaining the result and before
+  consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+  lookup, but it does need to be held while performing the check. KVM's guest
+  page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+  event for the hva. This can be done by explicitly checking for a pending
+  invalidation or by ensuring that KVM already has a valid mapping that covers
+  the hva. ``kvm_mmu_recover_huge_pages()`` uses this option.
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only required when the invalidated memory range overlaps some
+memslot. Without a matching memslot, the invalidation request can simply be
+ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*`` fields and takes
+``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated during an invalidation,
+its start and end may see different memslots, leaving the fields imbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. A memslot exists, so a pending invalidation request is recorded in the
+   ``kvm->mmu_invalidate_*`` fields.
+2. The memslot is removed before the invalidation ends, so the request is never
+   cleared from the ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` enforces that changes to memslots
+are complete before calling ``synchronize_srcu(&kvm->srcu)``, which ensures
+that readers of the old memslots container are done before it is freed.

-- 
2.54.0.823.g6e5bcc1fc9-goog
