[PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Ackerley Tng <ackerleytng@google.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	 Tianrui Zhao <zhaotianrui@loongson.cn>,
	Bibo Mao <maobibo@loongson.cn>,
	 Huacai Chen <chenhuacai@kernel.org>,
	WANG Xuerui <kernel@xen0n.name>,
	 Sean Christopherson <seanjc@google.com>,
	Thomas Gleixner <tglx@kernel.org>,
	 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Fuad Tabba <tabba@google.com>,
	 vannapurve@google.com, x86@kernel.org,
	"H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, loongarch@lists.linux.dev,
	 Ackerley Tng <ackerleytng@google.com>
Subject: [PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults
Date: Wed, 27 May 2026 08:33:21 -0700	[thread overview]
Message-ID: <20260527-kvm-locking-docs-v1-9-4fe8b602ff47@google.com> (raw)
In-Reply-To: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>

Document how synchronization is used while managing guest faults centrally
so code comments can point users at a central place.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
     cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
     operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
     updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling the guest page fault, KVM must ensure a consistent view of the
+active memslots container, so KVM takes ``srcu_read_lock(&kvm->srcu);``.
+
+Guest page fault handling can race with some request from host userspace to
+invalidate shadow page tables. These requests originate from a few places, such
+as
+
+1. MMU Notifiers: KVM registers callbacks with the kernel’s memory management
+   subsystem to know when there are changes to mappings in the host userspace
+   page tables.
+2. Memslot Updates: The host userspace VMM, such as QEMU may use the
+   ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+   must zap the affected shadow page tables to ensure the guest doesn't access
+   stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+   userspace to change attributes for a range of guest memory (e.g., setting a
+   range as "private" for Confidential Computing). This also requires
+   invalidating existing shadow mappings.
+
+When such a race occurs, KVM optimistically allows the faulting logic to
+proceed, but just before committing the fault, KVM will check for a pending
+invalidation, and retry the fault process if there is a pending invalidation
+affecting the GFN where the fault occurred.
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A "pending invalidation" is determined using a combination of
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+To protect the above combination of fields, a lock is used, which is the
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, the result of any information derived from GFN aka page
+attribute/page metadata lookups may race with invalidations. Here are some
+examples of lookups:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+  level of pages in host userspace page tables. If there's an invalidation, the
+  pages that were mapped would no longer be mapped and hence the mapping level
+  result would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after grabbing the result, before
+  consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+  lookup, but it does need to be held while checking the MMU notifier. KVM's
+  guest page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+  event for the hva. This can be done by explicit checking the MMU notifier or
+  by ensuring that KVM already has a valid mapping that covers the
+  hva. ``kvm_mmu_recover_huge_pages()`` uses this option.
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only really required when the invalidated memory range overlaps
+with some memslot. Without a matching memslot, the invalidation request could
+actually just be ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*``
+fields and takes ``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated while there is an ongoing
+invalidation, then the updates to the fields and the lock would be imbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. Some memslot existed, causing a pending invalidation request to be recorded
+   in the ``kvm->mmu_invalidate_*`` fields
+2. Memslot got removed, so the invalidation request was never removed from the
+   ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` also enforces that changes to
+memslots are complete before doing ``synchronize_srcu(&kvm->srcu)`` to make sure
+running readers of the old memslots container are done before freeing it.

-- 
2.54.0.823.g6e5bcc1fc9-goog

WARNING: multiple messages have this Message-ID (diff)

From: Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
To: Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	 Tianrui Zhao <zhaotianrui@loongson.cn>,
	Bibo Mao <maobibo@loongson.cn>,
	 Huacai Chen <chenhuacai@kernel.org>,
	WANG Xuerui <kernel@xen0n.name>,
	 Sean Christopherson <seanjc@google.com>,
	Thomas Gleixner <tglx@kernel.org>,
	 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	 Dave Hansen <dave.hansen@linux.intel.com>,
	Fuad Tabba <tabba@google.com>,
	 vannapurve@google.com, x86@kernel.org,
	"H. Peter Anvin" <hpa@zytor.com>
Cc: kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, loongarch@lists.linux.dev,
	 Ackerley Tng <ackerleytng@google.com>
Subject: [PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults
Date: Wed, 27 May 2026 08:33:21 -0700	[thread overview]
Message-ID: <20260527-kvm-locking-docs-v1-9-4fe8b602ff47@google.com> (raw)
In-Reply-To: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>

From: Ackerley Tng <ackerleytng@google.com>

Document how synchronization is used while managing guest faults centrally
so code comments can point users at a central place.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
     cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
     operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
     updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling the guest page fault, KVM must ensure a consistent view of the
+active memslots container, so KVM takes ``srcu_read_lock(&kvm->srcu);``.
+
+Guest page fault handling can race with some request from host userspace to
+invalidate shadow page tables. These requests originate from a few places, such
+as
+
+1. MMU Notifiers: KVM registers callbacks with the kernel’s memory management
+   subsystem to know when there are changes to mappings in the host userspace
+   page tables.
+2. Memslot Updates: The host userspace VMM, such as QEMU may use the
+   ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+   must zap the affected shadow page tables to ensure the guest doesn't access
+   stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+   userspace to change attributes for a range of guest memory (e.g., setting a
+   range as "private" for Confidential Computing). This also requires
+   invalidating existing shadow mappings.
+
+When such a race occurs, KVM optimistically allows the faulting logic to
+proceed, but just before committing the fault, KVM will check for a pending
+invalidation, and retry the fault process if there is a pending invalidation
+affecting the GFN where the fault occurred.
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A "pending invalidation" is determined using a combination of
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+To protect the above combination of fields, a lock is used, which is the
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, the result of any information derived from GFN aka page
+attribute/page metadata lookups may race with invalidations. Here are some
+examples of lookups:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+  level of pages in host userspace page tables. If there's an invalidation, the
+  pages that were mapped would no longer be mapped and hence the mapping level
+  result would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after grabbing the result, before
+  consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+  lookup, but it does need to be held while checking the MMU notifier. KVM's
+  guest page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+  event for the hva. This can be done by explicit checking the MMU notifier or
+  by ensuring that KVM already has a valid mapping that covers the
+  hva. ``kvm_mmu_recover_huge_pages()`` uses this option.
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only really required when the invalidated memory range overlaps
+with some memslot. Without a matching memslot, the invalidation request could
+actually just be ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*``
+fields and takes ``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated while there is an ongoing
+invalidation, then the updates to the fields and the lock would be imbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. Some memslot existed, causing a pending invalidation request to be recorded
+   in the ``kvm->mmu_invalidate_*`` fields
+2. Memslot got removed, so the invalidation request was never removed from the
+   ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` also enforces that changes to
+memslots are complete before doing ``synchronize_srcu(&kvm->srcu)`` to make sure
+running readers of the old memslots container are done before freeing it.

-- 
2.54.0.823.g6e5bcc1fc9-goog

next prev parent reply	other threads:[~2026-05-27 15:33 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-27 15:33 [PATCH RFC 00/12] Document synchronization used in managing guest faults Ackerley Tng
2026-05-27 15:33 ` Ackerley Tng via B4 Relay
2026-05-27 15:33 ` [PATCH RFC 01/12] Documentation: KVM: Elaborate comment on kvm_usage_lock Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:12   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 02/12] Documentation: KVM: Consolidate notes about cpu_read_lock() and kvm_lock Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:12   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 03/12] Documentation: KVM: Consolidate notes about kvm->slots_lock and irq_lock Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:12   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 04/12] Documentation: KVM: Turn - into bullet point Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-05-27 15:33 ` [PATCH RFC 05/12] Documentation: KVM: Explain what rule the exception section is meant for Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-05-27 15:33 ` [PATCH RFC 06/12] Documentation: KVM: Have actual headings for exceptions Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-05-27 15:40   ` sashiko-bot
2026-05-27 15:33 ` [PATCH RFC 07/12] Documentation: KVM: Drop mention of kvm->lock in SRCU documentation Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:35   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 08/12] Documentation: KVM: Add example for kvm->srcu in relation to mutex/lock Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:17   ` Sean Christopherson
2026-05-27 15:33 ` Ackerley Tng [this message]
2026-05-27 15:33   ` [PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults Ackerley Tng via B4 Relay
2026-05-27 15:33 ` [PATCH RFC 10/12] KVM: guest_memfd: Clarify comment about gmem.file vs kvm->srcu Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:19   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 11/12] KVM: mmu: Point users of host_pfn_mapping_level() to docs Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:29   ` Sean Christopherson
2026-05-27 15:33 ` [PATCH RFC 12/12] Documentation: KVM: Focus acquisition order section on preventing deadlocks Ackerley Tng
2026-05-27 15:33   ` Ackerley Tng via B4 Relay
2026-06-25 18:25   ` Sean Christopherson
2026-06-25 18:37 ` [PATCH RFC 00/12] Document synchronization used in managing guest faults Sean Christopherson

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:f12664443e91 dfblob:0663ccfe0633 dfblob:f12664443e91
dfblob:0663ccfe0633 )
 OR (
bs:"[PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260527-kvm-locking-docs-v1-9-4fe8b602ff47@google.com \
    --to=ackerleytng@google.com \
    --cc=bp@alien8.de \
    --cc=chenhuacai@kernel.org \
    --cc=corbet@lwn.net \
    --cc=dave.hansen@linux.intel.com \
    --cc=hpa@zytor.com \
    --cc=kernel@xen0n.name \
    --cc=kvm@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=loongarch@lists.linux.dev \
    --cc=maobibo@loongson.cn \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=skhan@linuxfoundation.org \
    --cc=tabba@google.com \
    --cc=tglx@kernel.org \
    --cc=vannapurve@google.com \
    --cc=x86@kernel.org \
    --cc=zhaotianrui@loongson.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.