* [RFC PATCH 1/2] KVM: pfncache: Use kvm_gmem_get_pfn() for guest_memfd-backed memslots
2025-12-03 14:41 [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map Takahiro Itazuri
@ 2025-12-03 14:41 ` Takahiro Itazuri
2026-01-19 12:34 ` David Hildenbrand (Red Hat)
2025-12-03 14:41 ` [RFC PATCH 2/2] KVM: pfncache: Use vmap() for guest_memfd pages without direct map Takahiro Itazuri
2025-12-03 16:01 ` [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd " Brendan Jackman
2 siblings, 1 reply; 9+ messages in thread
From: Takahiro Itazuri @ 2025-12-03 14:41 UTC (permalink / raw)
To: kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
Brendan Jackman, David Hildenbrand, David Woodhouse, Paul Durrant,
Nikita Kalyazin, Patrick Roy, Takahiro Itazuri
gfn_to_pfn_cache currently relies on hva_to_pfn(), which resolves PFNs
through GUP. GUP assumes that the page has a valid direct-map PTE, which
is not true for pages of guest_memfd created with
GUEST_MEMFD_FLAG_NO_DIRECT_MAP, whose direct-map PTEs are explicitly
invalidated via set_direct_map_valid_noflush().
Introduce a helper function, gpc_to_pfn(), that routes PFN lookup to
kvm_gmem_get_pfn() for guest_memfd-backed memslots (regardless of
whether GUEST_MEMFD_FLAG_NO_DIRECT_MAP is set), and otherwise falls
back to the existing hva_to_pfn() path. Rename hva_to_pfn_retry() to
gpc_to_pfn_retry() accordingly.
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
virt/kvm/pfncache.c | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 728d2c1b488a..bf8d6090e283 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -152,22 +152,34 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
return kvm->mmu_invalidate_seq != mmu_seq;
}
-static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+static kvm_pfn_t gpc_to_pfn(struct gfn_to_pfn_cache *gpc, struct page **page)
{
- /* Note, the new page offset may be different than the old! */
- void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
- kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
- void *new_khva = NULL;
- unsigned long mmu_seq;
- struct page *page;
+ if (kvm_slot_has_gmem(gpc->memslot)) {
+ kvm_pfn_t pfn;
+
+ kvm_gmem_get_pfn(gpc->kvm, gpc->memslot, gpa_to_gfn(gpc->gpa),
+ &pfn, page, NULL);
+ return pfn;
+ }
struct kvm_follow_pfn kfp = {
.slot = gpc->memslot,
.gfn = gpa_to_gfn(gpc->gpa),
.flags = FOLL_WRITE,
.hva = gpc->uhva,
- .refcounted_page = &page,
+ .refcounted_page = page,
};
+ return hva_to_pfn(&kfp);
+}
+
+static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+{
+ /* Note, the new page offset may be different than the old! */
+ void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
+ kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+ void *new_khva = NULL;
+ unsigned long mmu_seq;
+ struct page *page;
lockdep_assert_held(&gpc->refresh_lock);
@@ -206,7 +218,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
cond_resched();
}
- new_pfn = hva_to_pfn(&kfp);
+ new_pfn = gpc_to_pfn(gpc, &page);
if (is_error_noslot_pfn(new_pfn))
goto out_error;
@@ -319,7 +331,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
}
}
- /* Note: the offset must be correct before calling hva_to_pfn_retry() */
+ /* Note: the offset must be correct before calling gpc_to_pfn_retry() */
gpc->uhva += page_offset;
/*
@@ -327,7 +339,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
* drop the lock and do the HVA to PFN lookup again.
*/
if (!gpc->valid || hva_change) {
- ret = hva_to_pfn_retry(gpc);
+ ret = gpc_to_pfn_retry(gpc);
} else {
/*
* If the HVA→PFN mapping was already valid, don't unmap it.
--
2.50.1
^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 1/2] KVM: pfncache: Use kvm_gmem_get_pfn() for guest_memfd-backed memslots
2025-12-03 14:41 ` [RFC PATCH 1/2] KVM: pfncache: Use kvm_gmem_get_pfn() for guest_memfd-backed memslots Takahiro Itazuri
@ 2026-01-19 12:34 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-19 12:34 UTC (permalink / raw)
To: Takahiro Itazuri, kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
Brendan Jackman, David Woodhouse, Paul Durrant, Nikita Kalyazin,
Patrick Roy, Takahiro Itazuri
On 12/3/25 15:41, Takahiro Itazuri wrote:
> gfn_to_pfn_cache currently relies on hva_to_pfn(), which resolves PFNs
> through GUP. GUP assumes that the page has a valid direct-map PTE,
> which is not true for guest_memfd created with
> GUEST_MEMFD_FLAG_NO_DIRECT_MAP, because their direct-map PTEs are
> explicitly invalidated via set_direct_map_valid_noflush().
>
> Introduce a helper function, gpc_to_pfn(), that routes PFN lookup to
> kvm_gmem_get_pfn() for guest_memfd-backed memslots (regardless of
> whether GUEST_MEMFD_FLAG_NO_DIRECT_MAP is set), and otherwise falls
> back to the existing hva_to_pfn() path. Rename hva_to_pfn_retry() to
> gpc_to_pfn_retry() accordingly.
Let's look into some details:
The pfncache looks up a page from the page tables through GUP.
To make sure that the looked-up PFN can be safely used, it must be very
careful: after looking up the page through hva_to_pfn(), it marks the
entry as "valid" and drops the folio reference obtained through
hva_to_pfn().
At this point, nothing stops the page from getting unmapped from the
page tables to be freed etc.
Of course, that sounds very dangerous.
That's why the pfncache uses the (KVM) mmu_notifier framework to get
notified when the page was just unmapped from the KVM mmu while it
prepared the cache entry (see mmu_notifier_retry_cache()).
But it also has to deal with the page getting removed (+possibly freed)
from the KVM MMU later, after we already have a valid entry in the cache.
For this reason, gfn_to_pfn_cache_invalidate_start() is used to
invalidate any entries as they get unmapped from page tables.
Now the big question: how is this supposed to work with gmem? I would
have expected that we would need similar invalidations etc. from gmem code?
Imagine ftruncate() targets the gmem folio we just looked up: would we
get an appropriate invalidation notification?
> [...]
--
Cheers
David
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH 2/2] KVM: pfncache: Use vmap() for guest_memfd pages without direct map
2025-12-03 14:41 [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map Takahiro Itazuri
2025-12-03 14:41 ` [RFC PATCH 1/2] KVM: pfncache: Use kvm_gmem_get_pfn() for guest_memfd-backed memslots Takahiro Itazuri
@ 2025-12-03 14:41 ` Takahiro Itazuri
2025-12-03 16:01 ` [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd " Brendan Jackman
2 siblings, 0 replies; 9+ messages in thread
From: Takahiro Itazuri @ 2025-12-03 14:41 UTC (permalink / raw)
To: kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
Brendan Jackman, David Hildenbrand, David Woodhouse, Paul Durrant,
Nikita Kalyazin, Patrick Roy, Takahiro Itazuri
gfn_to_pfn_cache currently maps RAM PFNs with kmap(), which relies on
the direct map. Pages of guest_memfd created with
GUEST_MEMFD_FLAG_NO_DIRECT_MAP have their direct-map PTEs invalidated
via set_direct_map_valid_noflush(), so the linear address returned by
kmap()/page_address() will fault if dereferenced.
In some cases, gfn_to_pfn_cache dereferences the cached kernel host
virtual address (khva) from atomic contexts where page faults cannot be
tolerated. Therefore khva must always refer to a fault-free kernel
mapping. Since mapping and unmapping happen exclusively in the refresh
path, which may sleep, using vmap()/vunmap() for these pages is safe and
sufficient.
Introduce kvm_slot_no_direct_map() to detect guest_memfd slots without
the direct map, and make gpc_map()/gpc_unmap() use vmap()/vunmap() for
such pages.
This allows features based on gfn_to_pfn_cache (e.g. kvm-clock) to
work correctly with guest_memfd regardless of whether its direct-map
PTEs are valid.
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
---
include/linux/kvm_host.h | 7 +++++++
virt/kvm/pfncache.c | 26 ++++++++++++++++++++------
2 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 70e6a5210ceb..793d98f97928 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -15,6 +15,7 @@
#include <linux/minmax.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
+#include <linux/pagemap.h>
#include <linux/preempt.h>
#include <linux/msi.h>
#include <linux/slab.h>
@@ -628,6 +629,12 @@ static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *sl
return slot->flags & KVM_MEM_LOG_DIRTY_PAGES;
}
+static inline bool kvm_slot_no_direct_map(const struct kvm_memory_slot *slot)
+{
+ return slot && kvm_slot_has_gmem(slot) &&
+ mapping_no_direct_map(slot->gmem.file->f_mapping);
+}
+
static inline unsigned long kvm_dirty_bitmap_bytes(struct kvm_memory_slot *memslot)
{
return ALIGN(memslot->npages, BITS_PER_LONG) / 8;
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index bf8d6090e283..87167d7f3feb 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -96,10 +96,16 @@ bool kvm_gpc_check(struct gfn_to_pfn_cache *gpc, unsigned long len)
return true;
}
-static void *gpc_map(kvm_pfn_t pfn)
+static void *gpc_map(struct gfn_to_pfn_cache *gpc, kvm_pfn_t pfn)
{
- if (pfn_valid(pfn))
- return kmap(pfn_to_page(pfn));
+ if (pfn_valid(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+
+ if (kvm_slot_no_direct_map(gpc->memslot))
+ return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
+
+ return kmap(page);
+ }
#ifdef CONFIG_HAS_IOMEM
return memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
@@ -115,6 +121,11 @@ static void gpc_unmap(kvm_pfn_t pfn, void *khva)
return;
if (pfn_valid(pfn)) {
+ if (is_vmalloc_addr(khva)) {
+ vunmap(khva);
+ return;
+ }
+
kunmap(pfn_to_page(pfn));
return;
}
@@ -224,13 +235,16 @@ static kvm_pfn_t gpc_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
/*
* Obtain a new kernel mapping if KVM itself will access the
- * pfn. Note, kmap() and memremap() can both sleep, so this
- * too must be done outside of gpc->lock!
+ * pfn. Note, kmap(), vmap() and memremap() can sleep, so this
+ * too must be done outside of gpc->lock! Note that even though
+ * the rwlock is dropped, it's still fine to read gpc->pfn and
+ * other fields because the gpc->refresh_lock mutex prevents those
+ * from being changed.
*/
if (new_pfn == gpc->pfn)
new_khva = old_khva;
else
- new_khva = gpc_map(new_pfn);
+ new_khva = gpc_map(gpc, new_pfn);
if (!new_khva) {
kvm_release_page_unused(page);
--
2.50.1
^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map
2025-12-03 14:41 [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map Takahiro Itazuri
2025-12-03 14:41 ` [RFC PATCH 1/2] KVM: pfncache: Use kvm_gmem_get_pfn() for guest_memfd-backed memslots Takahiro Itazuri
2025-12-03 14:41 ` [RFC PATCH 2/2] KVM: pfncache: Use vmap() for guest_memfd pages without direct map Takahiro Itazuri
@ 2025-12-03 16:01 ` Brendan Jackman
2025-12-03 16:35 ` David Woodhouse
2 siblings, 1 reply; 9+ messages in thread
From: Brendan Jackman @ 2025-12-03 16:01 UTC (permalink / raw)
To: Takahiro Itazuri, kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
Brendan Jackman, David Hildenbrand, David Woodhouse, Paul Durrant,
Nikita Kalyazin, Patrick Roy, Takahiro Itazuri
On Wed Dec 3, 2025 at 2:41 PM UTC, Takahiro Itazuri wrote:
> [ based on kvm/next with [1] ]
>
> Recent work on guest_memfd [1] is introducing support for removing guest
> memory from the kernel direct map (Note that this work has not yet been
> merged, which is why this patch series is labelled RFC). The feature is
> useful for non-CoCo VMs to prevent the host kernel from accidentally or
> speculatively accessing guest memory as a general safety improvement.
> Pages for guest_memfd created with GUEST_MEMFD_FLAG_NO_DIRECT_MAP have
> their direct-map PTEs explicitly disabled, and thus cannot rely on the
> direct map.
>
> This breaks the features that use gfn_to_pfn_cache, including kvm-clock.
> gfn_to_pfn_cache caches the pfn and kernel host virtual address (khva)
> for a given gfn so that KVM can repeatedly access the corresponding
> guest page. The cached khva may later be dereferenced from atomic
> contexts in some cases. Such contexts cannot tolerate sleep or page
> faults, and therefore cannot use the userspace mapping (uhva), as those
> mappings may fault at any time. As a result, gfn_to_pfn_cache requires
> a stable, fault-free kernel virtual address for the backing pages,
> independent of the userspace mapping.
>
> This small patch series enables gfn_to_pfn_cache to work correctly when
> a memslot is backed by guest_memfd with GUEST_MEMFD_FLAG_NO_DIRECT_MAP.
> The first patch teaches gfn_to_pfn_cache to obtain pfn for guest_memfd-
> backed memslots via kvm_gmem_get_pfn() instead of GUP (hva_to_pfn()).
> The second patch makes gfn_to_pfn_cache use vmap()/vunmap() to create a
> fault-free kernel address for such pages. We believe that establishing
> such mapping for paravirtual guest/host communication is acceptable as
> such pages do not contain sensitive data.
>
> Another considered idea was to use memremap() instead of vmap(), since
> gpc_map() already falls back to memremap() if pfn_valid() is false.
> However, vmap() was chosen for the following reason. memremap() with
> MEMREMAP_WB first attempts to use the direct map via try_ram_remap(),
> and then falls back to arch_memremap_wb(), which explicitly refuses to
> map system RAM. It would be possible to relax this restriction, but the
> side effects are unclear because memremap() is widely used throughout
> the kernel. Changing memremap() to support system RAM without the
> direct map solely for gfn_to_pfn_cache feels disproportionate. If
> additional users appear that need to map system RAM without the direct
> map, revisiting and generalizing memremap() might make sense. For now,
> vmap()/vunmap() provides a contained and predictable solution.
>
> A possible approach in the future is to use the "ephmap" (or proclocal)
> proposed in [2], but it is not yet clear when that work will be merged.
(Nobody knows how to pronounce "ephmap" aloud, and when you do know how
to say it, it sounds like you are saying "fmap", which is very
confusing. So next time I post it I plan to call it "mermap" instead:
EPHemeral -> epheMERal.)
Apologies for my ignorance of the context here, I may be missing
insights that are obvious, but with that caveat...
The point of the mermap (formerly "ephmap") is to be able to efficiently
map on demand then immediately unmap without the cost of a TLB
shootdown. Is there any reason we'd need to do that here? If we can get
away with a stable vmapping then that seems superior to the mermap
anyway.
Putting it in an mm-local region would be nice (you say there shouldn't
be sensitive data in there, but I guess there's still some potential for
risk? Bounding that to the VMM process seems like a good idea to me)
but that seems nonblocking and could easily be added later. Also note it
doesn't depend on mermap; we could just have an mm-local region of the
vmalloc area. Mermap requires mm-local, but not the other way around.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map
2025-12-03 16:01 ` [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd " Brendan Jackman
@ 2025-12-03 16:35 ` David Woodhouse
2025-12-03 17:06 ` Brendan Jackman
0 siblings, 1 reply; 9+ messages in thread
From: David Woodhouse @ 2025-12-03 16:35 UTC (permalink / raw)
To: Brendan Jackman, Takahiro Itazuri, kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
David Hildenbrand, Paul Durrant, Nikita Kalyazin, Patrick Roy,
Takahiro Itazuri
On Wed, 2025-12-03 at 16:01 +0000, Brendan Jackman wrote:
> On Wed Dec 3, 2025 at 2:41 PM UTC, Takahiro Itazuri wrote:
> > [...]
>
> Apologies for my ignorance of the context here, I may be missing
> insights that are obvious, but with that caveat...
>
> The point of the mermap (formerly "ephmap") is to be able to efficiently
> map on demand then immediately unmap without the cost of a TLB
> shootdown. Is there any reason we'd need to do that here? If we can get
> away with a stable vmapping then that seems superior to the mermap
> anyway.
>
> Putting it in an mm-local region would be nice (you say there shouldn't
> be sensitive data in there, but I guess there's still some potential for
> risk? Bounding that to the VMM process seems like a good idea to me)
> but that seems nonblocking, could easily be added later. Also note it
> doesn't depend on mermap, we could just have an mm-local region of the
> vmalloc area. Mermap requires mm-local but not the other-way around.
Right. It's really the mm-local part which we might want to support in
the gfn_to_pfn_cache, not ephmap/mermap per se.
As things stand, we're taking guest pages which were taken out of the
global directmap for a *reason*... and mapping them right back in
globally. Making the new mapping of those pages mm-local where possible
is going to be very desirable.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map
2025-12-03 16:35 ` David Woodhouse
@ 2025-12-03 17:06 ` Brendan Jackman
2025-12-04 22:31 ` David Woodhouse
0 siblings, 1 reply; 9+ messages in thread
From: Brendan Jackman @ 2025-12-03 17:06 UTC (permalink / raw)
To: David Woodhouse, Brendan Jackman, Takahiro Itazuri, kvm,
Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
David Hildenbrand, Paul Durrant, Nikita Kalyazin, Patrick Roy,
Takahiro Itazuri
On Wed Dec 3, 2025 at 4:35 PM UTC, David Woodhouse wrote:
> On Wed, 2025-12-03 at 16:01 +0000, Brendan Jackman wrote:
>> On Wed Dec 3, 2025 at 2:41 PM UTC, Takahiro Itazuri wrote:
>> > [...]
>>
>> Apologies for my ignorance of the context here, I may be missing
>> insights that are obvious, but with that caveat...
>>
>> The point of the mermap (formerly "ephmap") is to be able to efficiently
>> map on demand then immediately unmap without the cost of a TLB
>> shootdown. Is there any reason we'd need to do that here? If we can get
>> away with a stable vmapping then that seems superior to the mermap
>> anyway.
>>
>> Putting it in an mm-local region would be nice (you say there shouldn't
>> be sensitive data in there, but I guess there's still some potential for
>> risk? Bounding that to the VMM process seems like a good idea to me)
>> but that seems nonblocking, could easily be added later. Also note it
>> doesn't depend on mermap, we could just have an mm-local region of the
>> vmalloc area. Mermap requires mm-local but not the other-way around.
>
> Right. It's really the mm-local part which we might want to support in
> the gfn_to_pfn_cache, not ephmap/mermap per se.
>
> As things stand, we're taking guest pages which were taken out of the
> global directmap for a *reason*... and mapping them right back in
> globally. Making the new mapping of those pages mm-local where possible
> is going to be very desirable.
Makes sense. I didn't properly explore if there are any challenges with
making vmalloc aware of it, but assuming there are no issues there I
don't think setting up an mm-local region is very challenging [1]. I
have the impression the main reason there isn't already an mm-local
region is just that the right usecase hasn't come along yet? So maybe
that could just be included in this series (assuming the mermap doesn't
get merged first).
Aside from vmalloc integration, the topic I just ignored when
prototyping [0] was that it obviously has some per-arch element. So I
guess for users of it we need to look at whether we are OK gating the
dependent feature on arch support.
[0] https://github.com/torvalds/linux/commit/4290b4ffb35bc73ce0ac9ae590f3e9d4d27b6397
[1] https://xcancel.com/pinboard/status/761656824202276864
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map
2025-12-03 17:06 ` Brendan Jackman
@ 2025-12-04 22:31 ` David Woodhouse
2025-12-05 7:15 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 9+ messages in thread
From: David Woodhouse @ 2025-12-04 22:31 UTC (permalink / raw)
To: Brendan Jackman, Takahiro Itazuri, kvm, Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba,
David Hildenbrand, Paul Durrant, Nikita Kalyazin, Patrick Roy,
Takahiro Itazuri
On Wed, 2025-12-03 at 17:06 +0000, Brendan Jackman wrote:
> Makes sense. I didn't properly explore if there are any challenges with
> making vmalloc aware of it, but assuming there are no issues there I
> don't think setting up an mm-local region is very challenging [1]. I
> have the impression the main reason there isn't already an mm-local
> region is just that the right usecase hasn't come along yet?
I'm fairly sure we have a *usecase* for mm-local.
And since researchers dusted off our XSA-289 advisory from 2019,
rediscovered it and called it 'L1TF reloaded' and then expressed
surprise that environments which have been using mm-local ever since
those days don't actually leak secrets from one guest to another... I'd
kind of hope that everyone else has come round to our way of thinking
that we have a usecase for mm-local too? :)
Kind of a 'hey-ho, screw Skylake' moment...
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH 0/2] KVM: pfncache: Support guest_memfd without direct map
2025-12-04 22:31 ` David Woodhouse
@ 2025-12-05 7:15 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 7:15 UTC (permalink / raw)
To: David Woodhouse, Brendan Jackman, Takahiro Itazuri, kvm,
Paolo Bonzini
Cc: Sean Christopherson, Vitaly Kuznetsov, Fuad Tabba, Paul Durrant,
Nikita Kalyazin, Patrick Roy, Takahiro Itazuri
On 12/4/25 23:31, David Woodhouse wrote:
> On Wed, 2025-12-03 at 17:06 +0000, Brendan Jackman wrote:
>> Makes sense. I didn't properly explore if there are any challenges with
>> making vmalloc aware of it, but assuming there are no issues there I
>> don't think setting up an mm-local region is very challinging [1]. I
>> have the impression the main reason there isn't already an mm-local
>> region is just that the right usecase hasn't come along yet?
>
> I'm fairly sure we have a *usecase* for mm-local.
Haha, I just skimmed over this patch and wondered "is mm-local a new mm
branch we want to have" :)
>
> And since researchers dusted off our XSA-289 advisory from 2019,
> rediscovered it and called it 'L1TF reloaded' and then expressed
> surprise that environments which have been using mm-local ever since
> those days don't actually leak secrets from one guest to another... I'd
> kind of hope that everyone else has come round to our way of thinking
> that we have a usecase for mm-local too? :)
Yeah, I would assume that we have such use cases indeed.
--
Cheers
David
^ permalink raw reply [flat|nested] 9+ messages in thread