From: Ackerley Tng via B4 Relay
Date: Wed, 27 May 2026 08:33:21 -0700
Subject: [PATCH RFC 09/12] Documentation: KVM: Document synchronization for managing guest faults
X-Mailing-List: kvm@vger.kernel.org
Message-Id: <20260527-kvm-locking-docs-v1-9-4fe8b602ff47@google.com>
References: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>
In-Reply-To: <20260527-kvm-locking-docs-v1-0-4fe8b602ff47@google.com>
To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Tianrui Zhao, Bibo Mao, Huacai Chen, WANG Xuerui, Sean Christopherson, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, Fuad Tabba, vannapurve@google.com, x86@kernel.org, "H. Peter Anvin"
Cc: kvm@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, loongarch@lists.linux.dev, Ackerley Tng
Reply-To: ackerleytng@google.com

From: Ackerley Tng

Document how synchronization is used while managing guest faults in one central
place, so that code comments can point readers at it.

Signed-off-by: Ackerley Tng
---
 Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
 cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
 operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
 updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling the guest page fault, KVM must ensure a consistent view of the
+active memslots container, so KVM calls ``srcu_read_lock(&kvm->srcu)``.
+
+Guest page fault handling can race with requests from host userspace to
+invalidate shadow page tables. These requests originate from a few places, such
+as:
+
+1. MMU Notifiers: KVM registers callbacks with the kernel's memory management
+   subsystem to know when there are changes to mappings in the host userspace
+   page tables.
+2. Memslot Updates: The host userspace VMM, such as QEMU, may use the
+   ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+   must zap the affected shadow page tables to ensure the guest doesn't access
+   stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+   userspace to change attributes for a range of guest memory (e.g., setting a
+   range as "private" for Confidential Computing). This also requires
+   invalidating existing shadow mappings.
+
+When such a race occurs, KVM optimistically allows the faulting logic to
+proceed. Just before committing the fault, KVM checks for a pending
+invalidation affecting the GFN where the fault occurred and, if there is one,
+retries the fault.
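The optimistic retry described in 4.1 can be sketched as a small userspace model. This is a single-threaded illustration under stated assumptions, not KVM code: ``struct vm_model``, ``model_invalidate()``, ``model_handle_fault()`` and the race hook are hypothetical names, the GFN to PFN lookup is a placeholder, and the locking that KVM needs around the final check is only noted in a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kvm->mmu_invalidate_* state. */
struct vm_model {
	unsigned long invalidate_seq;	/* bumped when an invalidation ends */
	long invalidate_in_progress;	/* number of invalidations in flight */
	uint64_t last_committed_pfn;	/* result of the last committed fault */
};

/* Hook used to simulate another context racing with the fault. */
typedef void (*race_hook_t)(struct vm_model *vm, int attempt);

/* Model of one complete invalidation: begin, zap, end. */
void model_invalidate(struct vm_model *vm)
{
	vm->invalidate_in_progress++;
	/* ... shadow page tables would be zapped here ... */
	vm->invalidate_seq++;
	vm->invalidate_in_progress--;
	assert(vm->invalidate_in_progress >= 0);
}

/* Returns the number of attempts needed to commit the fault. */
int model_handle_fault(struct vm_model *vm, uint64_t gfn, race_hook_t race)
{
	for (int attempt = 1; ; attempt++) {
		/* 1. Snapshot the sequence count before any lookup. */
		unsigned long seq = vm->invalidate_seq;

		/* 2. Placeholder for the GFN -> PFN translation. */
		uint64_t pfn = gfn + 0x1000;

		/* An invalidation may run at any point during the lookup. */
		if (race)
			race(vm, attempt);

		/*
		 * 3. Just before committing, check whether an invalidation
		 * is pending or has completed since the snapshot. In KVM
		 * this check happens with mmu_lock held.
		 */
		if (vm->invalidate_in_progress > 0 ||
		    vm->invalidate_seq != seq)
			continue;	/* stale: retry the whole fault */

		/* 4. Commit the mapping. */
		vm->last_committed_pfn = pfn;
		return attempt;
	}
}

/* Example hook: an invalidation races with the first attempt only. */
void race_on_first_attempt(struct vm_model *vm, int attempt)
{
	if (attempt == 1)
		model_invalidate(vm);
}
```

Calling ``model_handle_fault()`` with a ``NULL`` hook commits on the first attempt. Passing ``race_on_first_attempt`` forces one retry, because the sequence count no longer matches the snapshot.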
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate_*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Whether an invalidation is pending is determined using a combination of:
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+The above combination of fields is protected by a single lock,
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, any information derived from a lookup keyed by GFN (page attributes
+or other page metadata) may race with invalidations. One example of such a
+lookup:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+  level of pages in host userspace page tables. If an invalidation occurs, the
+  pages that were mapped may no longer be mapped, and hence the mapping level
+  result would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after grabbing the result, before
+  consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+  lookup, but it does need to be held while performing the check. KVM's
+  guest page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+  event for the hva. This can be done by explicitly checking the MMU notifier
+  or by ensuring that KVM already has a valid mapping that covers the hva.
+  ``kvm_mmu_recover_huge_pages()`` uses this option.
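As a rough illustration of how the fields listed in 4.2 might combine into a staleness check, here is a minimal single-threaded sketch. It is modeled loosely on the description above and is not copied from ``is_page_fault_stale()`` or ``mmu_invalidate_retry_gfn()``. The names ``struct inval_state``, ``model_invalidate_begin()``, ``model_invalidate_end()`` and ``model_fault_is_stale()`` are hypothetical. The sketch also assumes a half-open range, that overlapping invalidations are tracked as the union of their ranges, and that the sequence count is bumped when an invalidation ends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kvm->mmu_invalidate_* fields. */
struct inval_state {
	long in_progress;	/* invalidations currently in flight */
	uint64_t range_start;	/* first GFN being invalidated (inclusive) */
	uint64_t range_end;	/* end of the invalidated range (exclusive) */
	unsigned long seq;	/* bumped each time an invalidation ends */
};

/* Record the start of an invalidation covering [start, end). */
void model_invalidate_begin(struct inval_state *s, uint64_t start,
			    uint64_t end)
{
	if (s->in_progress++ == 0) {
		s->range_start = start;
		s->range_end = end;
		return;
	}
	/* Assumption: concurrent invalidations widen the tracked range. */
	if (start < s->range_start)
		s->range_start = start;
	if (end > s->range_end)
		s->range_end = end;
}

/* Record the end of an invalidation. */
void model_invalidate_end(struct inval_state *s)
{
	s->seq++;
	s->in_progress--;
	assert(s->in_progress >= 0);
}

/*
 * Decide whether a fault on gfn, whose lookups started when the sequence
 * count was snap_seq, must be retried. In KVM the equivalent check is made
 * with mmu_lock held.
 */
bool model_fault_is_stale(const struct inval_state *s, unsigned long snap_seq,
			  uint64_t gfn)
{
	/* An in-flight invalidation covers this GFN. */
	if (s->in_progress > 0 &&
	    gfn >= s->range_start && gfn < s->range_end)
		return true;

	/* An invalidation completed after the snapshot was taken. */
	return s->seq != snap_seq;
}
```

In this model a fault on a GFN outside the in-flight range is not treated as stale while the sequence count still matches the snapshot. Once any invalidation ends, every fault that snapshotted the older count must retry.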
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only required when the invalidated memory range overlaps with
+some memslot. Without a matching memslot, the invalidation request can simply
+be ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*`` fields and
+takes ``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated while there is an ongoing
+invalidation, the updates to the fields and the lock could become unbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. A memslot exists, causing a pending invalidation request to be recorded in
+   the ``kvm->mmu_invalidate_*`` fields.
+2. The memslot is removed before the invalidation ends, so the invalidation
+   request is never removed from the ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` enforces that changes to memslots
+are complete before calling ``synchronize_srcu(&kvm->srcu)``, which makes sure
+that running readers of the old memslots container are done before it is freed.
-- 
2.54.0.823.g6e5bcc1fc9-goog
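The balancing problem from 4.3 and the rule from 4.4 can be illustrated with a minimal single-threaded sketch. Everything here is hypothetical: ``struct vm_model``, its field names, and the three functions are stand-ins, not the real ``kvm->mn_*`` members or ``kvm_swap_active_memslots()``. The model assumes that both the start and the end of an invalidation independently check for a matching memslot, and it returns ``false`` where a real implementation would presumably wait for in-flight invalidations to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the real struct kvm layout. */
struct vm_model {
	int active_invalidations;	/* stand-in for the kvm->mn_* tracking */
	long invalidate_in_progress;	/* stand-in for mmu_invalidate_in_progress */
	bool has_matching_memslot;	/* does a memslot overlap the range? */
};

/* Start of an invalidation: only record it if a memslot matches. */
void model_invalidate_start(struct vm_model *vm)
{
	vm->active_invalidations++;
	if (vm->has_matching_memslot)
		vm->invalidate_in_progress++;
}

/*
 * End of an invalidation: the memslot match is evaluated again, so the
 * result must agree with what the start saw for the counts to balance.
 */
void model_invalidate_end(struct vm_model *vm)
{
	if (vm->has_matching_memslot)
		vm->invalidate_in_progress--;
	vm->active_invalidations--;
	assert(vm->active_invalidations >= 0);
}

/*
 * Commit a memslot change only when no invalidation is in flight.
 * Returns false where a real implementation would wait instead.
 */
bool model_try_swap_memslots(struct vm_model *vm, bool new_has_matching)
{
	if (vm->active_invalidations > 0)
		return false;
	vm->has_matching_memslot = new_has_matching;
	return true;
}
```

If the swap were allowed between ``model_invalidate_start()`` and ``model_invalidate_end()``, the end would skip its decrement and ``invalidate_in_progress`` would stay at 1 forever, which is the imbalance that the extra layer of synchronization is meant to rule out.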