linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
@ 2025-09-24 15:10 Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 01/12] arch: export set_direct_map_valid_noflush to KVM module Patrick Roy
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Patrick Roy @ 2025-09-24 15:10 UTC (permalink / raw)
  Cc: Patrick Roy, pbonzini, corbet, maz, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, willy, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, song,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jgg, jhubbard, peterx,
	jannh, pfalcato, shuah, seanjc, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-fsdevel, linux-mm, bpf,
	linux-kselftest, xmarcalx, kalyazin, jackabt, derekmn, tabba,
	ackerleytng

From: Patrick Roy <roypat@amazon.co.uk>

[ based on kvm/next ]

Unmapping virtual machine guest memory from the host kernel's direct map is a
successful mitigation against Spectre-style transient execution issues: If the
kernel page tables do not contain entries pointing to guest memory, then any
attempted speculative read through the direct map will necessarily be blocked
by the MMU before any observable microarchitectural side-effects happen. This
means that Spectre gadgets and similar techniques cannot be used to target
virtual machine memory. Roughly 60% of speculative execution issues fall into
this category [1, Table 1].

This patch series extends guest_memfd with the ability to remove its memory
from the host kernel's direct map, so that the above protection can be
attained for KVM guests whose memory is backed by guest_memfd.

Additionally, a Firecracker branch with support for these VMs can be found on
GitHub [2].

For more details, please refer to the v5 cover letter [v5]. No
substantial changes in design have taken place since.

=== Changes Since v6 ===

- Drop patch for passing struct address_space to ->free_folio(), due to
  possible races with freeing of the address_space. (Hugh)
- Stop using PG_uptodate / gmem preparedness tracking to keep track of
  direct map state.  Instead, use the lowest bit of folio->private. (Mike, David)
- Do direct map removal when establishing mapping of gmem folio instead
  of at allocation time, due to impossibility of handling direct map
  removal errors in kvm_gmem_populate(). (Patrick)
- Do TLB flushes after direct map removal, and provide a module
  parameter to opt out from them, and a new patch to export
  flush_tlb_kernel_range() to KVM. (Will)

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[2]: https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[RFCv1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFCv2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/
[RFCv3]: https://lore.kernel.org/kvm/20241030134912.515725-1-roypat@amazon.co.uk/
[v4]: https://lore.kernel.org/kvm/20250221160728.1584559-1-roypat@amazon.co.uk/
[v5]: https://lore.kernel.org/kvm/20250828093902.2719-1-roypat@amazon.co.uk/
[v6]: https://lore.kernel.org/kvm/20250912091708.17502-1-roypat@amazon.co.uk/


Patrick Roy (12):
  arch: export set_direct_map_valid_noflush to KVM module
  x86/tlb: export flush_tlb_kernel_range to KVM module
  mm: introduce AS_NO_DIRECT_MAP
  KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
  KVM: guest_memfd: Add flag to remove from direct map
  KVM: guest_memfd: add module param for disabling TLB flushing
  KVM: selftests: load elf via bounce buffer
  KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd
    != -1
  KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing
    selftests
  KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  KVM: selftests: Test guest execution from direct map removed gmem

 Documentation/virt/kvm/api.rst                |  5 ++
 arch/arm64/include/asm/kvm_host.h             | 12 ++++
 arch/arm64/mm/pageattr.c                      |  1 +
 arch/loongarch/mm/pageattr.c                  |  1 +
 arch/riscv/mm/pageattr.c                      |  1 +
 arch/s390/mm/pageattr.c                       |  1 +
 arch/x86/include/asm/tlbflush.h               |  3 +-
 arch/x86/mm/pat/set_memory.c                  |  1 +
 arch/x86/mm/tlb.c                             |  1 +
 include/linux/kvm_host.h                      |  9 +++
 include/linux/pagemap.h                       | 16 +++++
 include/linux/secretmem.h                     | 18 -----
 include/uapi/linux/kvm.h                      |  2 +
 lib/buildid.c                                 |  4 +-
 mm/gup.c                                      | 19 ++----
 mm/mlock.c                                    |  2 +-
 mm/secretmem.c                                |  8 +--
 .../testing/selftests/kvm/guest_memfd_test.c  |  2 +
 .../testing/selftests/kvm/include/kvm_util.h  | 37 ++++++++---
 .../testing/selftests/kvm/include/test_util.h |  8 +++
 tools/testing/selftests/kvm/lib/elf.c         |  8 +--
 tools/testing/selftests/kvm/lib/io.c          | 23 +++++++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 61 +++++++++--------
 tools/testing/selftests/kvm/lib/test_util.c   |  8 +++
 tools/testing/selftests/kvm/lib/x86/sev.c     |  1 +
 .../selftests/kvm/pre_fault_memory_test.c     |  1 +
 .../selftests/kvm/set_memory_region_test.c    | 50 ++++++++++++--
 .../kvm/x86/private_mem_conversions_test.c    |  7 +-
 virt/kvm/guest_memfd.c                        | 66 +++++++++++++++++--
 virt/kvm/kvm_main.c                           |  8 +++
 30 files changed, 290 insertions(+), 94 deletions(-)


base-commit: a6ad54137af92535cfe32e19e5f3bc1bb7dbd383
-- 
2.51.0



* [PATCH v7 01/12] arch: export set_direct_map_valid_noflush to KVM module
  2025-09-24 15:10 [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Patrick Roy
@ 2025-09-24 15:10 ` Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 02/12] x86/tlb: export flush_tlb_kernel_range " Patrick Roy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2025-09-24 15:10 UTC (permalink / raw)
  Cc: Patrick Roy, pbonzini, corbet, maz, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, willy, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, song,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jgg, jhubbard, peterx,
	jannh, pfalcato, shuah, seanjc, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-fsdevel, linux-mm, bpf,
	linux-kselftest, xmarcalx, kalyazin, jackabt, derekmn, tabba,
	ackerleytng, loongarch, linux-riscv, linux-s390

From: Patrick Roy <roypat@amazon.co.uk>

Use the new per-module export functionality to allow KVM (and only KVM)
access to set_direct_map_valid_noflush(). This allows guest_memfd to
remove its memory from the direct map, even if KVM is built as a module.

Direct map removal gives guest_memfd the same protection that
memfd_secret enjoys, such as hardening against Spectre-like attacks
through in-kernel gadgets.

Cc: linux-arm-kernel@lists.infradead.org
Cc: loongarch@lists.linux.dev
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Reviewed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/arm64/mm/pageattr.c     | 1 +
 arch/loongarch/mm/pageattr.c | 1 +
 arch/riscv/mm/pageattr.c     | 1 +
 arch/s390/mm/pageattr.c      | 1 +
 arch/x86/mm/pat/set_memory.c | 1 +
 5 files changed, 5 insertions(+)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 04d4a8f676db..4f3cddfab9b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -291,6 +291,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 
 	return set_memory_valid(addr, nr, valid);
 }
+EXPORT_SYMBOL_FOR_MODULES(set_direct_map_valid_noflush, "kvm");
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 /*
diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index f5e910b68229..458f5ae6a89b 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -236,3 +236,4 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 
 	return __set_memory(addr, 1, set, clear);
 }
+EXPORT_SYMBOL_FOR_MODULES(set_direct_map_valid_noflush, "kvm");
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 3f76db3d2769..6db31040cd66 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -400,6 +400,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 
 	return __set_memory((unsigned long)page_address(page), nr, set, clear);
 }
+EXPORT_SYMBOL_FOR_MODULES(set_direct_map_valid_noflush, "kvm");
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static int debug_pagealloc_set_page(pte_t *pte, unsigned long addr, void *data)
diff --git a/arch/s390/mm/pageattr.c b/arch/s390/mm/pageattr.c
index 348e759840e7..8ffd9ef09bc6 100644
--- a/arch/s390/mm/pageattr.c
+++ b/arch/s390/mm/pageattr.c
@@ -413,6 +413,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 
 	return __set_memory((unsigned long)page_to_virt(page), nr, flags);
 }
+EXPORT_SYMBOL_FOR_MODULES(set_direct_map_valid_noflush, "kvm");
 
 bool kernel_page_present(struct page *page)
 {
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 8834c76f91c9..87e9c7d2dcdc 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2661,6 +2661,7 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 
 	return __set_pages_np(page, nr);
 }
+EXPORT_SYMBOL_FOR_MODULES(set_direct_map_valid_noflush, "kvm");
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
-- 
2.51.0



* [PATCH v7 02/12] x86/tlb: export flush_tlb_kernel_range to KVM module
  2025-09-24 15:10 [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 01/12] arch: export set_direct_map_valid_noflush to KVM module Patrick Roy
@ 2025-09-24 15:10 ` Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP Patrick Roy
  2025-09-24 15:29 ` [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Roy, Patrick
  3 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2025-09-24 15:10 UTC (permalink / raw)
  Cc: Patrick Roy, pbonzini, corbet, maz, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, willy, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, song,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jgg, jhubbard, peterx,
	jannh, pfalcato, shuah, seanjc, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-fsdevel, linux-mm, bpf,
	linux-kselftest, xmarcalx, kalyazin, jackabt, derekmn, tabba,
	ackerleytng

From: Patrick Roy <roypat@amazon.co.uk>

After direct map removal, a TLB flush can be done to ensure that the
just-unmapped memory cannot be accessed through stale TLB entries. This is
particularly important on modern hardware, where one cannot rely on timely
TLB eviction to ensure these entries go away.

This export is only needed on x86, as arm64 (the only other architecture
supporting guest_memfd currently) does not allow building KVM as a
module.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 arch/x86/include/asm/tlbflush.h | 3 ++-
 arch/x86/mm/tlb.c               | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 00daedfefc1b..6f57f7eb621b 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -317,7 +317,6 @@ extern void flush_tlb_all(void);
 extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
-extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
@@ -483,6 +482,8 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 39f80111e6f1..dee5018bceeb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1541,6 +1541,7 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 
 	put_flush_tlb_info();
 }
+EXPORT_SYMBOL_FOR_MODULES(flush_tlb_kernel_range, "kvm");
 
 /*
  * This can be used from process context to figure out what the value of
-- 
2.51.0



* [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP
  2025-09-24 15:10 [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 01/12] arch: export set_direct_map_valid_noflush to KVM module Patrick Roy
  2025-09-24 15:10 ` [PATCH v7 02/12] x86/tlb: export flush_tlb_kernel_range " Patrick Roy
@ 2025-09-24 15:10 ` Patrick Roy
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
  2025-09-25 10:25   ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP David Hildenbrand
  2025-09-24 15:29 ` [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Roy, Patrick
  3 siblings, 2 replies; 34+ messages in thread
From: Patrick Roy @ 2025-09-24 15:10 UTC (permalink / raw)
  Cc: Patrick Roy, pbonzini, corbet, maz, oliver.upton, joey.gouly,
	suzuki.poulose, yuzenghui, catalin.marinas, will, tglx, mingo, bp,
	dave.hansen, x86, hpa, luto, peterz, willy, akpm, david,
	lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko, song,
	jolsa, ast, daniel, andrii, martin.lau, eddyz87, yonghong.song,
	john.fastabend, kpsingh, sdf, haoluo, jgg, jhubbard, peterx,
	jannh, pfalcato, shuah, seanjc, kvm, linux-doc, linux-kernel,
	linux-arm-kernel, kvmarm, linux-fsdevel, linux-mm, bpf,
	linux-kselftest, xmarcalx, kalyazin, jackabt, derekmn, tabba,
	ackerleytng

From: Patrick Roy <roypat@amazon.co.uk>

Add AS_NO_DIRECT_MAP for mappings whose folios have their direct map entries
set to not-present. Currently, the mappings that match this description are
secretmem mappings (memfd_secret()). Later, some guest_memfd configurations
will also fall into this category.

Reject this new type of mapping in all locations that currently reject
secretmem mappings, on the assumption that if secretmem mappings are rejected
somewhere, it is precisely because of an inability to deal with folios that
lack direct map entries. Then make memfd_secret() set AS_NO_DIRECT_MAP on its
address_space, which lets it drop its special
vma_is_secretmem()/secretmem_mapping() checks.

This drops an optimization in gup_fast_folio_allowed() where
secretmem_mapping() was only called if CONFIG_SECRETMEM=y. secretmem has been
enabled by default since commit b758fe6df50d ("mm/secretmem: make it on by
default"), so in most configurations the secretmem check was not actually
elided anyway.

Use a new flag instead of overloading AS_INACCESSIBLE (which is already set
by guest_memfd) because not all guest_memfd mappings will end up with their
direct map entries removed (e.g. in pKVM setups, parts of guest_memfd that
can be mapped to userspace should also be GUP-able, and generally not have
restrictions on who can access them).

Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/pagemap.h   | 16 ++++++++++++++++
 include/linux/secretmem.h | 18 ------------------
 lib/buildid.c             |  4 ++--
 mm/gup.c                  | 19 +++++--------------
 mm/mlock.c                |  2 +-
 mm/secretmem.c            |  8 ++------
 6 files changed, 26 insertions(+), 41 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 12a12dae727d..1f5739f6a9f5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -211,6 +211,7 @@ enum mapping_flags {
 				   folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
 	AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9,
+	AS_NO_DIRECT_MAP = 10,	/* Folios in the mapping are not in the direct map */
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -346,6 +347,21 @@ static inline bool mapping_writeback_may_deadlock_on_reclaim(struct address_spac
 	return test_bit(AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM, &mapping->flags);
 }
 
+static inline void mapping_set_no_direct_map(struct address_space *mapping)
+{
+	set_bit(AS_NO_DIRECT_MAP, &mapping->flags);
+}
+
+static inline bool mapping_no_direct_map(const struct address_space *mapping)
+{
+	return test_bit(AS_NO_DIRECT_MAP, &mapping->flags);
+}
+
+static inline bool vma_has_no_direct_map(const struct vm_area_struct *vma)
+{
+	return vma->vm_file && mapping_no_direct_map(vma->vm_file->f_mapping);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index e918f96881f5..0ae1fb057b3d 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,28 +4,10 @@
 
 #ifdef CONFIG_SECRETMEM
 
-extern const struct address_space_operations secretmem_aops;
-
-static inline bool secretmem_mapping(struct address_space *mapping)
-{
-	return mapping->a_ops == &secretmem_aops;
-}
-
-bool vma_is_secretmem(struct vm_area_struct *vma);
 bool secretmem_active(void);
 
 #else
 
-static inline bool vma_is_secretmem(struct vm_area_struct *vma)
-{
-	return false;
-}
-
-static inline bool secretmem_mapping(struct address_space *mapping)
-{
-	return false;
-}
-
 static inline bool secretmem_active(void)
 {
 	return false;
diff --git a/lib/buildid.c b/lib/buildid.c
index c4b0f376fb34..89e567954284 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -65,8 +65,8 @@ static int freader_get_folio(struct freader *r, loff_t file_off)
 
 	freader_put_folio(r);
 
-	/* reject secretmem folios created with memfd_secret() */
-	if (secretmem_mapping(r->file->f_mapping))
+	/* reject folios without direct map entries (e.g. from memfd_secret() or guest_memfd()) */
+	if (mapping_no_direct_map(r->file->f_mapping))
 		return -EFAULT;
 
 	r->folio = filemap_get_folio(r->file->f_mapping, file_off >> PAGE_SHIFT);
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..75a0cffdf37d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -11,7 +11,6 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
-#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -1234,7 +1233,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_SPLIT_PMD) && is_vm_hugetlb_page(vma))
 		return -EOPNOTSUPP;
 
-	if (vma_is_secretmem(vma))
+	if (vma_has_no_direct_map(vma))
 		return -EFAULT;
 
 	if (write) {
@@ -2736,7 +2735,7 @@ EXPORT_SYMBOL(get_user_pages_unlocked);
  * This call assumes the caller has pinned the folio, that the lowest page table
  * level still points to this folio, and that interrupts have been disabled.
  *
- * GUP-fast must reject all secretmem folios.
+ * GUP-fast must reject all folios without direct map entries (such as secretmem).
  *
  * Writing to pinned file-backed dirty tracked folios is inherently problematic
  * (see comment describing the writable_file_mapping_allowed() function). We
@@ -2751,7 +2750,6 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
 {
 	bool reject_file_backed = false;
 	struct address_space *mapping;
-	bool check_secretmem = false;
 	unsigned long mapping_flags;
 
 	/*
@@ -2763,18 +2761,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
 		reject_file_backed = true;
 
 	/* We hold a folio reference, so we can safely access folio fields. */
-
-	/* secretmem folios are always order-0 folios. */
-	if (IS_ENABLED(CONFIG_SECRETMEM) && !folio_test_large(folio))
-		check_secretmem = true;
-
-	if (!reject_file_backed && !check_secretmem)
-		return true;
-
 	if (WARN_ON_ONCE(folio_test_slab(folio)))
 		return false;
 
-	/* hugetlb neither requires dirty-tracking nor can be secretmem. */
+	/* hugetlb neither requires dirty-tracking nor can be without direct map. */
 	if (folio_test_hugetlb(folio))
 		return true;
 
@@ -2812,8 +2802,9 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
 	 * At this point, we know the mapping is non-null and points to an
 	 * address_space object.
 	 */
-	if (check_secretmem && secretmem_mapping(mapping))
+	if (mapping_no_direct_map(mapping))
 		return false;
+
 	/* The only remaining allowed file system is shmem. */
 	return !reject_file_backed || shmem_mapping(mapping);
 }
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..36f5e70faeb0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -474,7 +474,7 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 
 	if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma) || vma_is_secretmem(vma) || (oldflags & VM_DROPPABLE))
+	    vma_is_dax(vma) || vma_has_no_direct_map(vma) || (oldflags & VM_DROPPABLE))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 60137305bc20..f4d767c3fe2e 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -134,11 +134,6 @@ static int secretmem_mmap_prepare(struct vm_area_desc *desc)
 	return 0;
 }
 
-bool vma_is_secretmem(struct vm_area_struct *vma)
-{
-	return vma->vm_ops == &secretmem_vm_ops;
-}
-
 static const struct file_operations secretmem_fops = {
 	.release	= secretmem_release,
 	.mmap_prepare	= secretmem_mmap_prepare,
@@ -156,7 +151,7 @@ static void secretmem_free_folio(struct folio *folio)
 	folio_zero_segment(folio, 0, folio_size(folio));
 }
 
-const struct address_space_operations secretmem_aops = {
+static const struct address_space_operations secretmem_aops = {
 	.dirty_folio	= noop_dirty_folio,
 	.free_folio	= secretmem_free_folio,
 	.migrate_folio	= secretmem_migrate_folio,
@@ -205,6 +200,7 @@ static struct file *secretmem_file_create(unsigned long flags)
 
 	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
 	mapping_set_unevictable(inode->i_mapping);
+	mapping_set_no_direct_map(inode->i_mapping);
 
 	inode->i_op = &secretmem_iops;
 	inode->i_mapping->a_ops = &secretmem_aops;
-- 
2.51.0



* [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
  2025-09-24 15:10 ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP Patrick Roy
@ 2025-09-24 15:22   ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
                       ` (8 more replies)
  2025-09-25 10:25   ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP David Hildenbrand
  1 sibling, 9 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Add a no-op stub for kvm_arch_gmem_invalidate if
CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE=n. This allows defining
kvm_gmem_free_folio without ifdef-ery, which allows more cleanly using
guest_memfd's free_folio callback for non-arch-invalidation related
code.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_host.h | 2 ++
 virt/kvm/guest_memfd.c   | 4 ----
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8b47891adca1..1d0585616aa3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2573,6 +2573,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
+#else
+static inline void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) { }
 #endif
 
 #ifdef CONFIG_KVM_GENERIC_PRE_FAULT_MEMORY
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 08a6bc7d25b6..55b8d739779f 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -429,7 +429,6 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 	return MF_DELAYED;
 }
 
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 static void kvm_gmem_free_folio(struct folio *folio)
 {
 	struct page *page = folio_page(folio, 0);
@@ -438,15 +437,12 @@ static void kvm_gmem_free_folio(struct folio *folio)
 
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
 }
-#endif
 
 static const struct address_space_operations kvm_gmem_aops = {
 	.dirty_folio = noop_dirty_folio,
 	.migrate_folio	= kvm_gmem_migrate_folio,
 	.error_remove_folio = kvm_gmem_error_folio,
-#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
 	.free_folio = kvm_gmem_free_folio,
-#endif
 };
 
 static int kvm_gmem_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
-- 
2.51.0



* [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-25 11:00       ` David Hildenbrand
  2025-09-26 14:49       ` Patrick Roy
  2025-09-24 15:22     ` [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing Roy, Patrick
                       ` (7 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
ioctl. When set, guest_memfd folios will be removed from the direct map
after preparation, with direct map entries only restored when the folios
are freed.

To ensure these folios do not end up in places where the kernel cannot
deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.

Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
guest_memfd itself being supported, but also on whether Linux supports
manipulating the direct map at page granularity at all. This is possible most
of the time; the outliers are arm64, where it is impossible if the direct map
has been set up using hugepages (arm64 cannot break these apart due to
break-before-make semantics), and powerpc, which does not select
ARCH_HAS_SET_DIRECT_MAP (though it does not support guest_memfd anyway).

Note that this flag causes removal of direct map entries for all guest_memfd
folios, independent of whether they are "shared" or "private" (although
current guest_memfd only supports either all folios in the "shared" state, or
all folios in the "private" state if GUEST_MEMFD_FLAG_MMAP is not set). The
use case for removing direct map entries of even the shared parts of
guest_memfd is a special type of non-CoCo VM where host userspace is trusted
to have access to all of guest memory, but where Spectre-style transient
execution attacks through the host kernel's direct map should still be
mitigated. In this setup, KVM retains access to guest memory via userspace
mappings of guest_memfd, which are reflected back into KVM's memslots via
userspace_addr. This is needed for things like MMIO emulation on x86_64
to work.

Direct map entries are zapped right before guest or userspace mappings
of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
a gmem folio can be allocated without being mapped anywhere is
kvm_gmem_populate(), where handling potential failures of direct map
removal is not possible (by the time direct map removal is attempted,
the folio is already marked as prepared, meaning attempting to re-try
kvm_gmem_populate() would just result in -EEXIST without fixing up the
direct map state). These folios are then removed from the direct map
upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 Documentation/virt/kvm/api.rst    |  5 +++
 arch/arm64/include/asm/kvm_host.h | 12 ++++++
 include/linux/kvm_host.h          |  6 +++
 include/uapi/linux/kvm.h          |  2 +
 virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
 virt/kvm/kvm_main.c               |  5 +++
 6 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index c17a87a0a5ac..b52c14d58798 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
 supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
 enables mmap() and faulting of guest_memfd memory to host userspace.
 
+When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
+supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
+instance behave similarly to memfd_secret, and unmaps the memory backing it from
+the kernel's address space after allocation.
+
 When the KVM MMU performs a PFN lookup to service a guest fault and the backing
 guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
 consumed from guest_memfd, regardless of whether it is a shared or a private
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2f2394cce24e..0bfd8e5fd9de 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -19,6 +19,7 @@
 #include <linux/maple_tree.h>
 #include <linux/percpu.h>
 #include <linux/psci.h>
+#include <linux/set_memory.h>
 #include <asm/arch_gicv3.h>
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
@@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
 void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
 void check_feature_map(void);
 
+#ifdef CONFIG_KVM_GUEST_MEMFD
+static inline bool kvm_arch_gmem_supports_no_direct_map(void)
+{
+	/*
+	 * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
+	 * as it calls dcache_clean_inval_poc().
+	 */
+	return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
+}
+#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
+#endif /* CONFIG_KVM_GUEST_MEMFD */
 
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1d0585616aa3..73a15cade54a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
 #endif
 
+#ifdef CONFIG_KVM_GUEST_MEMFD
+#ifndef kvm_arch_gmem_supports_no_direct_map
+#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
+#endif
+#endif /* CONFIG_KVM_GUEST_MEMFD */
+
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6efa98a57ec1..33c8e8946019 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -963,6 +963,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
 #define KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED 243
 #define KVM_CAP_GUEST_MEMFD_MMAP 244
+#define KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP 245
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1600,6 +1601,7 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 #define GUEST_MEMFD_FLAG_MMAP	(1ULL << 0)
+#define GUEST_MEMFD_FLAG_NO_DIRECT_MAP (1ULL << 1)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 55b8d739779f..b7129c4868c5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,9 @@
 #include <linux/kvm_host.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/set_memory.h>
+
+#include <asm/tlbflush.h>
 
 #include "kvm_mm.h"
 
@@ -42,6 +45,44 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
+#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
+
+static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
+{
+	return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
+}
+
+static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
+{
+	if (kvm_gmem_folio_no_direct_map(folio))
+		return 0;
+
+	int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+					 false);
+
+	if (!r) {
+		unsigned long addr = (unsigned long) folio_address(folio);
+		folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
+		flush_tlb_kernel_range(addr, addr + folio_size(folio));
+	}
+
+	return r;
+}
+
+static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
+{
+	/*
+	 * Direct map restoration cannot fail, as the only error condition
+	 * for direct map manipulation is failure to allocate page tables
+	 * when splitting huge pages, but this split would have already
+	 * happened in set_direct_map_valid_noflush() in kvm_gmem_folio_zap_direct_map().
+	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
+	 */
+	if (kvm_gmem_folio_no_direct_map(folio))
+		set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
+					 true);
+}
+
 static inline void kvm_gmem_mark_prepared(struct folio *folio)
 {
 	folio_mark_uptodate(folio);
@@ -324,13 +365,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct folio *folio;
 	vm_fault_t ret = VM_FAULT_LOCKED;
+	int err;
 
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
-		int err = PTR_ERR(folio);
+		err = PTR_ERR(folio);
 
 		if (err == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -348,6 +390,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 		kvm_gmem_mark_prepared(folio);
 	}
 
+	err = kvm_gmem_folio_zap_direct_map(folio);
+
+	if (err) {
+		ret = vmf_error(err);
+		goto out_folio;
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 
 out_folio:
@@ -435,6 +484,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
 	kvm_pfn_t pfn = page_to_pfn(page);
 	int order = folio_order(folio);
 
+	kvm_gmem_folio_restore_direct_map(folio);
+
 	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
 }
 
@@ -499,6 +550,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	/* Unmovable mappings are supposed to be marked unevictable as well. */
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
+	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
+		mapping_set_no_direct_map(inode->i_mapping);
+
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
@@ -523,6 +577,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	if (kvm_arch_supports_gmem_mmap(kvm))
 		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
 
+	if (kvm_arch_gmem_supports_no_direct_map())
+		valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
@@ -687,6 +744,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!is_prepared)
 		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
 
+	kvm_gmem_folio_zap_direct_map(folio);
+
 	folio_unlock(folio);
 
 	if (!r)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 18f29ef93543..b5e702d95230 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -65,6 +65,7 @@
 #include <trace/events/kvm.h>
 
 #include <linux/kvm_dirty_ring.h>
+#include <linux/set_memory.h>
 
 
 /* Worst case buffer size needed for holding an integer. */
@@ -4916,6 +4917,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return kvm_supported_mem_attributes(kvm);
 #endif
 #ifdef CONFIG_KVM_GUEST_MEMFD
+	case KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP:
+		if (!kvm_arch_gmem_supports_no_direct_map())
+			return 0;
+		fallthrough;
 	case KVM_CAP_GUEST_MEMFD:
 		return 1;
 	case KVM_CAP_GUEST_MEMFD_MMAP:
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-25 11:02       ` David Hildenbrand
  2025-09-25 18:27       ` Dave Hansen
  2025-09-24 15:22     ` [PATCH v7 07/12] KVM: selftests: load elf via bounce buffer Roy, Patrick
                       ` (6 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Add an option to skip TLB flushes after direct map manipulations. TLB
flushes cause up to a 40x elongation of page faults in guest_memfd
(scaling with the number of CPU cores), or a 5x elongation of memory
population, which is unacceptable when direct-map-removed guest_memfd is
meant to serve as a drop-in replacement for existing workloads.

TLB flushes are not needed for functional correctness (the virt->phys
mapping technically stays "correct"; the kernel simply must not use it
for a while), so we can skip them to keep performance in line with
"traditional" VMs.

Enabling this option means that the desired protection from
Spectre-style attacks is not perfect: an attacker could try to prevent a
stale TLB entry from being evicted, keeping it alive until the page it
refers to is used by the guest for some sensitive data, and then target
it using a Spectre gadget.
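A sketch of how this parameter would be set, assuming a kernel carrying this series with KVM built as a module (the parameter name is taken from this patch; whether KVM is modular or built-in varies by kernel config):

```shell
# The parameter is 0444 (read-only at runtime), so it must be set at
# module load time:
modprobe kvm guest_memfd_tlb_flush=0

# or, if KVM is built in, on the kernel command line:
#   kvm.guest_memfd_tlb_flush=0

# Verify the current value:
cat /sys/module/kvm/parameters/guest_memfd_tlb_flush
```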

Cc: Will Deacon <will@kernel.org>
Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 include/linux/kvm_host.h | 1 +
 virt/kvm/guest_memfd.c   | 3 ++-
 virt/kvm/kvm_main.c      | 3 +++
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 73a15cade54a..4d2bc18860fc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2298,6 +2298,7 @@ extern unsigned int halt_poll_ns;
 extern unsigned int halt_poll_ns_grow;
 extern unsigned int halt_poll_ns_grow_start;
 extern unsigned int halt_poll_ns_shrink;
+extern bool guest_memfd_tlb_flush;
 
 struct kvm_device {
 	const struct kvm_device_ops *ops;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b7129c4868c5..d8dd24459f0d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -63,7 +63,8 @@ static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
 	if (!r) {
 		unsigned long addr = (unsigned long) folio_address(folio);
 		folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
-		flush_tlb_kernel_range(addr, addr + folio_size(folio));
+		if (guest_memfd_tlb_flush)
+			flush_tlb_kernel_range(addr, addr + folio_size(folio));
 	}
 
 	return r;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b5e702d95230..753c06ebba7f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -95,6 +95,9 @@ unsigned int halt_poll_ns_shrink = 2;
 module_param(halt_poll_ns_shrink, uint, 0644);
 EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
 
+bool guest_memfd_tlb_flush = true;
+module_param(guest_memfd_tlb_flush, bool, 0444);
+
 /*
  * Allow direct access (from KVM or the CPU) without MMU notifier protection
  * to unpinned pages.
-- 
2.51.0



* [PATCH v7 07/12] KVM: selftests: load elf via bounce buffer
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 08/12] KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1 Roy, Patrick
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

If guest memory is backed by a VMA that does not allow GUP (e.g. a
userspace mapping of guest_memfd when the fd was allocated using
GUEST_MEMFD_FLAG_NO_DIRECT_MAP), then directly loading the test ELF
binary into it via read(2) potentially does not work. To nevertheless
support loading binaries in these cases, do the read(2) syscall using a
bounce buffer, and then memcpy from the bounce buffer into guest memory.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 .../testing/selftests/kvm/include/test_util.h |  1 +
 tools/testing/selftests/kvm/lib/elf.c         |  8 +++----
 tools/testing/selftests/kvm/lib/io.c          | 23 +++++++++++++++++++
 3 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index c6ef895fbd9a..0409b7b96c94 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -46,6 +46,7 @@ do {								\
 
 ssize_t test_write(int fd, const void *buf, size_t count);
 ssize_t test_read(int fd, void *buf, size_t count);
+ssize_t test_read_bounce(int fd, void *buf, size_t count);
 int test_seq_read(const char *path, char **bufp, size_t *sizep);
 
 void __printf(5, 6) test_assert(bool exp, const char *exp_str,
diff --git a/tools/testing/selftests/kvm/lib/elf.c b/tools/testing/selftests/kvm/lib/elf.c
index f34d926d9735..e829fbe0a11e 100644
--- a/tools/testing/selftests/kvm/lib/elf.c
+++ b/tools/testing/selftests/kvm/lib/elf.c
@@ -31,7 +31,7 @@ static void elfhdr_get(const char *filename, Elf64_Ehdr *hdrp)
 	 * the real size of the ELF header.
 	 */
 	unsigned char ident[EI_NIDENT];
-	test_read(fd, ident, sizeof(ident));
+	test_read_bounce(fd, ident, sizeof(ident));
 	TEST_ASSERT((ident[EI_MAG0] == ELFMAG0) && (ident[EI_MAG1] == ELFMAG1)
 		&& (ident[EI_MAG2] == ELFMAG2) && (ident[EI_MAG3] == ELFMAG3),
 		"ELF MAGIC Mismatch,\n"
@@ -79,7 +79,7 @@ static void elfhdr_get(const char *filename, Elf64_Ehdr *hdrp)
 	offset_rv = lseek(fd, 0, SEEK_SET);
 	TEST_ASSERT(offset_rv == 0, "Seek to ELF header failed,\n"
 		"  rv: %zi expected: %i", offset_rv, 0);
-	test_read(fd, hdrp, sizeof(*hdrp));
+	test_read_bounce(fd, hdrp, sizeof(*hdrp));
 	TEST_ASSERT(hdrp->e_phentsize == sizeof(Elf64_Phdr),
 		"Unexpected physical header size,\n"
 		"  hdrp->e_phentsize: %x\n"
@@ -146,7 +146,7 @@ void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename)
 
 		/* Read in the program header. */
 		Elf64_Phdr phdr;
-		test_read(fd, &phdr, sizeof(phdr));
+		test_read_bounce(fd, &phdr, sizeof(phdr));
 
 		/* Skip if this header doesn't describe a loadable segment. */
 		if (phdr.p_type != PT_LOAD)
@@ -187,7 +187,7 @@ void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename)
 				"  expected: 0x%jx",
 				n1, errno, (intmax_t) offset_rv,
 				(intmax_t) phdr.p_offset);
-			test_read(fd, addr_gva2hva(vm, phdr.p_vaddr),
+			test_read_bounce(fd, addr_gva2hva(vm, phdr.p_vaddr),
 				phdr.p_filesz);
 		}
 	}
diff --git a/tools/testing/selftests/kvm/lib/io.c b/tools/testing/selftests/kvm/lib/io.c
index fedb2a741f0b..74419becc8bc 100644
--- a/tools/testing/selftests/kvm/lib/io.c
+++ b/tools/testing/selftests/kvm/lib/io.c
@@ -155,3 +155,26 @@ ssize_t test_read(int fd, void *buf, size_t count)
 
 	return num_read;
 }
+
+/* Test read via intermediary buffer
+ *
+ * Same as test_read, except read(2)s happen into a bounce buffer that is memcpy'd
+ * to buf. For use with buffers that cannot be GUP'd (e.g. guest_memfd VMAs if
+ * guest_memfd was created with GUEST_MEMFD_FLAG_NO_DIRECT_MAP).
+ */
+ssize_t test_read_bounce(int fd, void *buf, size_t count)
+{
+	void *bounce_buffer;
+	ssize_t num_read;
+
+	TEST_ASSERT(count >= 0, "Unexpected count, count: %li", count);
+
+	bounce_buffer = malloc(count);
+	TEST_ASSERT(bounce_buffer != NULL, "Failed to allocate bounce buffer");
+
+	num_read = test_read(fd, bounce_buffer, count);
+	memcpy(buf, bounce_buffer, num_read);
+	free(bounce_buffer);
+
+	return num_read;
+}
-- 
2.51.0



* [PATCH v7 08/12] KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (2 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 07/12] KVM: selftests: load elf via bounce buffer Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 09/12] KVM: selftests: Add guest_memfd based vm_mem_backing_src_types Roy, Patrick
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Have vm_mem_add() always set KVM_MEM_GUEST_MEMFD in the memslot flags if
a guest_memfd is passed in as an argument. This eliminates the
possibility where a guest_memfd instance is passed to vm_mem_add(), but
it ends up being ignored because the flags argument does not specify
KVM_MEM_GUEST_MEMFD at the same time.

This makes it easier to support more scenarios in which vm_mem_add() is
not passed a guest_memfd instance but is expected to allocate one.
Currently, this only happens if guest_memfd == -1 while flags &
KVM_MEM_GUEST_MEMFD != 0, but later vm_mem_add() will gain support for
loading the test code itself into guest_memfd (via
GUEST_MEMFD_FLAG_MMAP) if requested via a special
vm_mem_backing_src_type, at which point having to keep the src_type and
flags in sync becomes cumbersome.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index c3f5142b0a54..cc67dfecbf65 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1107,22 +1107,26 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 
 	region->backing_src_type = src_type;
 
-	if (flags & KVM_MEM_GUEST_MEMFD) {
-		if (guest_memfd < 0) {
+	if (guest_memfd < 0) {
+		if (flags & KVM_MEM_GUEST_MEMFD) {
 			uint32_t guest_memfd_flags = 0;
 			TEST_ASSERT(!guest_memfd_offset,
 				    "Offset must be zero when creating new guest_memfd");
 			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
-		} else {
-			/*
-			 * Install a unique fd for each memslot so that the fd
-			 * can be closed when the region is deleted without
-			 * needing to track if the fd is owned by the framework
-			 * or by the caller.
-			 */
-			guest_memfd = dup(guest_memfd);
-			TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
 		}
+	} else {
+		/*
+		 * Install a unique fd for each memslot so that the fd
+		 * can be closed when the region is deleted without
+		 * needing to track if the fd is owned by the framework
+		 * or by the caller.
+		 */
+		guest_memfd = dup(guest_memfd);
+		TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+	}
+
+	if (guest_memfd > 0) {
+		flags |= KVM_MEM_GUEST_MEMFD;
 
 		region->region.guest_memfd = guest_memfd;
 		region->region.guest_memfd_offset = guest_memfd_offset;
-- 
2.51.0



* [PATCH v7 09/12] KVM: selftests: Add guest_memfd based vm_mem_backing_src_types
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (3 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 08/12] KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1 Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 10/12] KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing selftests Roy, Patrick
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Allow selftests to configure their memslots such that userspace_addr is
set to a MAP_SHARED mapping of the guest_memfd that's associated with
the memslot. This setup is the configuration for non-CoCo VMs, where all
guest memory is backed by a guest_memfd whose folios are all marked
shared, but KVM is still able to access guest memory to provide
functionality such as MMIO emulation on x86.

Add backing types for normal guest_memfd, as well as direct map removed
guest_memfd.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 18 ++++++
 .../testing/selftests/kvm/include/test_util.h |  7 +++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 63 ++++++++++---------
 tools/testing/selftests/kvm/lib/test_util.c   |  8 +++
 4 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 23a506d7eca3..5204a0a18a7f 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -635,6 +635,24 @@ static inline bool is_smt_on(void)
 
 void vm_create_irqchip(struct kvm_vm *vm);
 
+static inline uint32_t backing_src_guest_memfd_flags(enum vm_mem_backing_src_type t)
+{
+	uint32_t flags = 0;
+
+	switch (t) {
+	case VM_MEM_SRC_GUEST_MEMFD_NO_DIRECT_MAP:
+		flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
+		fallthrough;
+	case VM_MEM_SRC_GUEST_MEMFD:
+		flags |= GUEST_MEMFD_FLAG_MMAP;
+		break;
+	default:
+		break;
+	}
+
+	return flags;
+}
+
 static inline int __vm_create_guest_memfd(struct kvm_vm *vm, uint64_t size,
 					uint64_t flags)
 {
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 0409b7b96c94..a56e53fc7b39 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -133,6 +133,8 @@ enum vm_mem_backing_src_type {
 	VM_MEM_SRC_ANONYMOUS_HUGETLB_16GB,
 	VM_MEM_SRC_SHMEM,
 	VM_MEM_SRC_SHARED_HUGETLB,
+	VM_MEM_SRC_GUEST_MEMFD,
+	VM_MEM_SRC_GUEST_MEMFD_NO_DIRECT_MAP,
 	NUM_SRC_TYPES,
 };
 
@@ -165,6 +167,11 @@ static inline bool backing_src_is_shared(enum vm_mem_backing_src_type t)
 	return vm_mem_backing_src_alias(t)->flag & MAP_SHARED;
 }
 
+static inline bool backing_src_is_guest_memfd(enum vm_mem_backing_src_type t)
+{
+	return t == VM_MEM_SRC_GUEST_MEMFD || t == VM_MEM_SRC_GUEST_MEMFD_NO_DIRECT_MAP;
+}
+
 static inline bool backing_src_can_be_huge(enum vm_mem_backing_src_type t)
 {
 	return t != VM_MEM_SRC_ANONYMOUS && t != VM_MEM_SRC_SHMEM;
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index cc67dfecbf65..a81089f7c83f 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1060,6 +1060,34 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	alignment = 1;
 #endif
 
+	if (guest_memfd < 0) {
+		if ((flags & KVM_MEM_GUEST_MEMFD) || backing_src_is_guest_memfd(src_type)) {
+			uint32_t guest_memfd_flags = backing_src_guest_memfd_flags(src_type);
+
+			TEST_ASSERT(!guest_memfd_offset,
+				    "Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+		}
+	} else {
+		/*
+		 * Install a unique fd for each memslot so that the fd
+		 * can be closed when the region is deleted without
+		 * needing to track if the fd is owned by the framework
+		 * or by the caller.
+		 */
+		guest_memfd = dup(guest_memfd);
+		TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+	}
+
+	if (guest_memfd > 0) {
+		flags |= KVM_MEM_GUEST_MEMFD;
+
+		region->region.guest_memfd = guest_memfd;
+		region->region.guest_memfd_offset = guest_memfd_offset;
+	} else {
+		region->region.guest_memfd = -1;
+	}
+
 	/*
 	 * When using THP mmap is not guaranteed to returned a hugepage aligned
 	 * address so we have to pad the mmap. Padding is not needed for HugeTLB
@@ -1075,10 +1103,13 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	if (alignment > 1)
 		region->mmap_size += alignment;
 
-	region->fd = -1;
-	if (backing_src_is_shared(src_type))
+	if (backing_src_is_guest_memfd(src_type))
+		region->fd = guest_memfd;
+	else if (backing_src_is_shared(src_type))
 		region->fd = kvm_memfd_alloc(region->mmap_size,
 					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
+	else
+		region->fd = -1;
 
 	region->mmap_start = mmap(NULL, region->mmap_size,
 				  PROT_READ | PROT_WRITE,
@@ -1106,34 +1137,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	}
 
 	region->backing_src_type = src_type;
-
-	if (guest_memfd < 0) {
-		if (flags & KVM_MEM_GUEST_MEMFD) {
-			uint32_t guest_memfd_flags = 0;
-			TEST_ASSERT(!guest_memfd_offset,
-				    "Offset must be zero when creating new guest_memfd");
-			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
-		}
-	} else {
-		/*
-		 * Install a unique fd for each memslot so that the fd
-		 * can be closed when the region is deleted without
-		 * needing to track if the fd is owned by the framework
-		 * or by the caller.
-		 */
-		guest_memfd = dup(guest_memfd);
-		TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
-	}
-
-	if (guest_memfd > 0) {
-		flags |= KVM_MEM_GUEST_MEMFD;
-
-		region->region.guest_memfd = guest_memfd;
-		region->region.guest_memfd_offset = guest_memfd_offset;
-	} else {
-		region->region.guest_memfd = -1;
-	}
-
 	region->unused_phy_pages = sparsebit_alloc();
 	if (vm_arch_has_protected_memory(vm))
 		region->protected_phy_pages = sparsebit_alloc();
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 03eb99af9b8d..b2baee680083 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -299,6 +299,14 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 			 */
 			.flag = MAP_SHARED,
 		},
+		[VM_MEM_SRC_GUEST_MEMFD] = {
+			.name = "guest_memfd",
+			.flag = MAP_SHARED,
+		},
+		[VM_MEM_SRC_GUEST_MEMFD_NO_DIRECT_MAP] = {
+			.name = "guest_memfd_no_direct_map",
+			.flag = MAP_SHARED,
+		}
 	};
 	_Static_assert(ARRAY_SIZE(aliases) == NUM_SRC_TYPES,
 		       "Missing new backing src types?");
-- 
2.51.0



* [PATCH v7 10/12] KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing selftests
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (4 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 09/12] KVM: selftests: Add guest_memfd based vm_mem_backing_src_types Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 11/12] KVM: selftests: stuff vm_mem_backing_src_type into vm_shape Roy, Patrick
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)
  Cc: Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Extend the memory conversion selftests to cover the scenario where the
guest can fault in and write gmem-backed guest memory even after its
direct map entries have been removed. Also cover the new flag in the
guest_memfd_test.c tests.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 tools/testing/selftests/kvm/guest_memfd_test.c             | 2 ++
 .../selftests/kvm/x86/private_mem_conversions_test.c       | 7 ++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index b3ca6737f304..1187438b6831 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -275,6 +275,8 @@ static void test_guest_memfd(unsigned long vm_type)
 
 	if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_MMAP))
 		flags |= GUEST_MEMFD_FLAG_MMAP;
+	if (vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP))
+		flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
 
 	test_create_guest_memfd_multiple(vm);
 	test_create_guest_memfd_invalid_sizes(vm, flags, page_size);
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 82a8d88b5338..8427d9fbdb23 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -367,7 +367,7 @@ static void *__test_mem_conversions(void *__vcpu)
 }
 
 static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
-				 uint32_t nr_memslots)
+				 uint32_t nr_memslots, uint64_t gmem_flags)
 {
 	/*
 	 * Allocate enough memory so that each vCPU's chunk of memory can be
@@ -394,7 +394,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
 
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
@@ -477,7 +477,8 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots, 0);
+	test_mem_conversions(src_type, nr_vcpus, nr_memslots, GUEST_MEMFD_FLAG_NO_DIRECT_MAP);
 
 	return 0;
 }
-- 
2.51.0



* [PATCH v7 11/12] KVM: selftests: stuff vm_mem_backing_src_type into vm_shape
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (5 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 10/12] KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing selftests Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-24 15:22     ` [PATCH v7 12/12] KVM: selftests: Test guest execution from direct map removed gmem Roy, Patrick
  2025-09-25 10:26     ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate David Hildenbrand
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)

Use one of the padding fields in struct vm_shape to carry an enum
vm_mem_backing_src_type value, to give the option to override the
default of VM_MEM_SRC_ANONYMOUS in __vm_create().

Overriding this default will allow tests to create VMs where the test
code is backed by mmap'd guest_memfd instead of anonymous memory.

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 19 ++++++++++---------
 tools/testing/selftests/kvm/lib/kvm_util.c    |  2 +-
 tools/testing/selftests/kvm/lib/x86/sev.c     |  1 +
 .../selftests/kvm/pre_fault_memory_test.c     |  1 +
 4 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 5204a0a18a7f..8baa0bbacd09 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -188,7 +188,7 @@ enum vm_guest_mode {
 struct vm_shape {
 	uint32_t type;
 	uint8_t  mode;
-	uint8_t  pad0;
+	uint8_t  src_type;
 	uint16_t pad1;
 };
 
@@ -196,14 +196,15 @@ kvm_static_assert(sizeof(struct vm_shape) == sizeof(uint64_t));
 
 #define VM_TYPE_DEFAULT			0
 
-#define VM_SHAPE(__mode)			\
-({						\
-	struct vm_shape shape = {		\
-		.mode = (__mode),		\
-		.type = VM_TYPE_DEFAULT		\
-	};					\
-						\
-	shape;					\
+#define VM_SHAPE(__mode)				\
+({							\
+	struct vm_shape shape = {			\
+		.mode	  = (__mode),			\
+		.type	  = VM_TYPE_DEFAULT,		\
+		.src_type = VM_MEM_SRC_ANONYMOUS	\
+	};						\
+							\
+	shape;						\
 })
 
 #if defined(__aarch64__)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index a81089f7c83f..3a22794bd959 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -495,7 +495,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 	if (is_guest_memfd_required(shape))
 		flags |= KVM_MEM_GUEST_MEMFD;
 
-	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, flags);
+	vm_userspace_mem_region_add(vm, shape.src_type, 0, 0, nr_pages, flags);
 	for (i = 0; i < NR_MEM_REGIONS; i++)
 		vm->memslots[i] = 0;
 
diff --git a/tools/testing/selftests/kvm/lib/x86/sev.c b/tools/testing/selftests/kvm/lib/x86/sev.c
index c3a9838f4806..d920880e4fc0 100644
--- a/tools/testing/selftests/kvm/lib/x86/sev.c
+++ b/tools/testing/selftests/kvm/lib/x86/sev.c
@@ -164,6 +164,7 @@ struct kvm_vm *vm_sev_create_with_one_vcpu(uint32_t type, void *guest_code,
 	struct vm_shape shape = {
 		.mode = VM_MODE_DEFAULT,
 		.type = type,
+		.src_type = VM_MEM_SRC_ANONYMOUS,
 	};
 	struct kvm_vm *vm;
 	struct kvm_vcpu *cpus[1];
diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 0350a8896a2f..d403f8d2f26f 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -68,6 +68,7 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 	const struct vm_shape shape = {
 		.mode = VM_MODE_DEFAULT,
 		.type = vm_type,
+		.src_type = VM_MEM_SRC_ANONYMOUS,
 	};
 	struct kvm_vcpu *vcpu;
 	struct kvm_run *run;
-- 
2.51.0



* [PATCH v7 12/12] KVM: selftests: Test guest execution from direct map removed gmem
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (6 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 11/12] KVM: selftests: stuff vm_mem_backing_src_type into vm_shape Roy, Patrick
@ 2025-09-24 15:22     ` Roy, Patrick
  2025-09-25 10:26     ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate David Hildenbrand
  8 siblings, 0 replies; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:22 UTC (permalink / raw)

Add a selftest that loads itself into guest_memfd (via
GUEST_MEMFD_FLAG_MMAP) and triggers an MMIO exit when executed. This
exercises x86 MMIO emulation code inside KVM for guest_memfd-backed
memslots where the guest_memfd folios are direct map removed.
In particular, it validates that the x86 MMIO emulation code (guest page
table walks + instruction fetch) correctly accesses gmem through the VMA
that has been reflected into the memslot's userspace_addr field (instead
of attempting direct map accesses).

Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
---
 .../selftests/kvm/set_memory_region_test.c    | 50 +++++++++++++++++--
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c
index ce3ac0fd6dfb..cb3bc642d376 100644
--- a/tools/testing/selftests/kvm/set_memory_region_test.c
+++ b/tools/testing/selftests/kvm/set_memory_region_test.c
@@ -603,6 +603,41 @@ static void test_mmio_during_vectoring(void)
 
 	kvm_vm_free(vm);
 }
+
+static void guest_code_trigger_mmio(void)
+{
+	/*
+	 * Read some GPA that is not backed by a memslot. KVM considers this
+	 * MMIO and tells userspace to emulate the read.
+	 */
+	READ_ONCE(*((uint64_t *)MEM_REGION_GPA));
+
+	GUEST_DONE();
+}
+
+static void test_guest_memfd_mmio(void)
+{
+	struct kvm_vm *vm;
+	struct kvm_vcpu *vcpu;
+	struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.src_type = VM_MEM_SRC_GUEST_MEMFD_NO_DIRECT_MAP,
+	};
+	pthread_t vcpu_thread;
+
+	pr_info("Testing MMIO emulation for instructions in gmem\n");
+
+	vm = __vm_create_shape_with_one_vcpu(shape, &vcpu, 0, guest_code_trigger_mmio);
+
+	virt_map(vm, MEM_REGION_GPA, MEM_REGION_GPA, 1);
+
+	pthread_create(&vcpu_thread, NULL, vcpu_worker, vcpu);
+
+	/* If the MMIO read was successfully emulated, the vcpu thread will exit */
+	pthread_join(vcpu_thread, NULL);
+
+	kvm_vm_free(vm);
+}
 #endif
 
 int main(int argc, char *argv[])
@@ -626,10 +661,17 @@ int main(int argc, char *argv[])
 	test_add_max_memory_regions();
 
 #ifdef __x86_64__
-	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD) &&
-	    (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
-		test_add_private_memory_region();
-		test_add_overlapping_private_memory_regions();
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD)) {
+		if (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)) {
+			test_add_private_memory_region();
+			test_add_overlapping_private_memory_regions();
+		}
+
+		if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_MMAP) &&
+			kvm_has_cap(KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP))
+			test_guest_memfd_mmio();
+		else
+			pr_info("Skipping tests requiring KVM_CAP_GUEST_MEMFD_MMAP | KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP\n");
 	} else {
 		pr_info("Skipping tests for KVM_MEM_GUEST_MEMFD memory regions\n");
 	}
-- 
2.51.0



* Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
  2025-09-24 15:10 [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Patrick Roy
                   ` (2 preceding siblings ...)
  2025-09-24 15:10 ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP Patrick Roy
@ 2025-09-24 15:29 ` Roy, Patrick
  2025-09-24 15:38   ` David Hildenbrand
  3 siblings, 1 reply; 34+ messages in thread
From: Roy, Patrick @ 2025-09-24 15:29 UTC (permalink / raw)
  To: patrick.roy@campus.lmu.de

_sigh_

I tried to submit this iteration from a personal email, because Amazon's mail
server was scrambling the "From" header and I couldn't figure out why (and also
because I am leaving Amazon next month and wanted replies to go into an inbox
to which I'll continue to have access). And then after posting the first 4
emails I hit "daily mail quota exceeded", and had to submit the rest of the
patch series from the Amazon email anyway. Sorry about the resulting mess (I
think the threading got slightly messed up as a result of this). I'll try
something else out for the next iteration.



* Re: [PATCH v7 00/12] Direct Map Removal Support for guest_memfd
  2025-09-24 15:29 ` [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Roy, Patrick
@ 2025-09-24 15:38   ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-09-24 15:38 UTC (permalink / raw)
  To: Roy, Patrick, patrick.roy@campus.lmu.de

On 24.09.25 17:29, Roy, Patrick wrote:
> _sigh_

Happens to the best of us :)

> 
> I tried to submit this iteration from a personal email, because amazon's mail
> server was scrambling the "From" header and I couldn't figure out why (and also
> because I am leaving Amazon next month and wanted replies to go into an inbox
> to which I'll continue to have access). And then after posting the first 4
> emails I hit "daily mail quota exceeded", and had to submit the rest of the
> patch series from the amazon email anyway. Sorry about the resulting mess (i
> think the threading got slightly messed up as a result of this). I'll something
> else out for the next iteration.

I had luck recovering from temporary mail server issues in the past by 
sending the remainder as "--in-reply-to=" with message-id of cover 
letter and using "--no-thread" IIRC.

-- 
Cheers

David / dhildenb



* Re: [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP
  2025-09-24 15:10 ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP Patrick Roy
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
@ 2025-09-25 10:25   ` David Hildenbrand
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 10:25 UTC (permalink / raw)
  To: Patrick Roy

On 24.09.25 17:10, Patrick Roy wrote:
> From: Patrick Roy <roypat@amazon.co.uk>
> 
> Add AS_NO_DIRECT_MAP for mappings where direct map entries of folios are
> set to not present . Currently, mappings that match this description are
> secretmem mappings (memfd_secret()). Later, some guest_memfd
> configurations will also fall into this category.
> 
> Reject this new type of mappings in all locations that currently reject
> secretmem mappings, on the assumption that if secretmem mappings are
> rejected somewhere, it is precisely because of an inability to deal with
> folios without direct map entries, and then make memfd_secret() use
> AS_NO_DIRECT_MAP on its address_space to drop its special
> vma_is_secretmem()/secretmem_mapping() checks.
> 
> This drops an optimization in gup_fast_folio_allowed() where
> secretmem_mapping() was only called if CONFIG_SECRETMEM=y. secretmem is
> enabled by default since commit b758fe6df50d ("mm/secretmem: make it on
> by default"), so the secretmem check did not actually end up elided in
> most cases anymore anyway.
> 
> Use a new flag instead of overloading AS_INACCESSIBLE (which is already
> set by guest_memfd) because not all guest_memfd mappings will end up
> being direct map removed (e.g. in pKVM setups, parts of guest_memfd that
> can be mapped to userspace should also be GUP-able, and generally not
> have restrictions on who can access it).
> 
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---

I enjoy seeing secretmem special-casing in common code go away.

[...]

>   
>   	/*
> @@ -2763,18 +2761,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
>   		reject_file_backed = true;
>   
>   	/* We hold a folio reference, so we can safely access folio fields. */
> -
> -	/* secretmem folios are always order-0 folios. */
> -	if (IS_ENABLED(CONFIG_SECRETMEM) && !folio_test_large(folio))
> -		check_secretmem = true;
> -
> -	if (!reject_file_backed && !check_secretmem)
> -		return true;
> -

Losing that optimization is not too bad I guess.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



* Re: [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate
  2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
                       ` (7 preceding siblings ...)
  2025-09-24 15:22     ` [PATCH v7 12/12] KVM: selftests: Test guest execution from direct map removed gmem Roy, Patrick
@ 2025-09-25 10:26     ` David Hildenbrand
  8 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 10:26 UTC (permalink / raw)
  To: Roy, Patrick

On 24.09.25 17:22, Roy, Patrick wrote:
> Add a no-op stub for kvm_arch_gmem_invalidate if
> CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE=n. This allows defining
> kvm_gmem_free_folio without ifdef-ery, so that non-arch-invalidation
> code can use guest_memfd's free_folio callback more cleanly.
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---

We'll now always perform a callback from the core, but I guess that's 
tolerable.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb



* Re: [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map
  2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
@ 2025-09-25 11:00       ` David Hildenbrand
  2025-09-25 15:52         ` Roy, Patrick
  2025-09-26 14:49       ` Patrick Roy
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 11:00 UTC (permalink / raw)
  To: Roy, Patrick

On 24.09.25 17:22, Roy, Patrick wrote:
> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
> ioctl. When set, guest_memfd folios will be removed from the direct map
> after preparation, with direct map entries only restored when the folios
> are freed.
> 
> To ensure these folios do not end up in places where the kernel cannot
> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
> 
> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
> guest_memfd itself being supported, but also on whether linux supports
> manipulatomg the direct map at page granularity at all (possible most of
> the time, outliers being arm64 where its impossible if the direct map
> has been setup using hugepages, as arm64 cannot break these apart due to
> break-before-make semantics, and powerpc, which does not select
> ARCH_HAS_SET_DIRECT_MAP, though also doesn't support guest_memfd
> anyway).
> 
> Note that this flag causes removal of direct map entries for all
> guest_memfd folios independent of whether they are "shared" or "private"
> (although current guest_memfd only supports either all folios in the
> "shared" state, or all folios in the "private" state if
> GUEST_MEMFD_FLAG_MMAP is not set). The usecase for removing direct map
> entries of also the shared parts of guest_memfd are a special type of
> non-CoCo VM where, host userspace is trusted to have access to all of
> guest memory, but where Spectre-style transient execution attacks
> through the host kernel's direct map should still be mitigated.  In this
> setup, KVM retains access to guest memory via userspace mappings of
> guest_memfd, which are reflected back into KVM's memslots via
> userspace_addr. This is needed for things like MMIO emulation on x86_64
> to work.
> 
> Direct map entries are zapped right before guest or userspace mappings
> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
> a gmem folio can be allocated without being mapped anywhere is
> kvm_gmem_populate(), where handling potential failures of direct map
> removal is not possible (by the time direct map removal is attempted,
> the folio is already marked as prepared, meaning attempting to re-try
> kvm_gmem_populate() would just result in -EEXIST without fixing up the
> direct map state). These folios are then removed form the direct map
> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
> 
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>   Documentation/virt/kvm/api.rst    |  5 +++
>   arch/arm64/include/asm/kvm_host.h | 12 ++++++
>   include/linux/kvm_host.h          |  6 +++
>   include/uapi/linux/kvm.h          |  2 +
>   virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
>   virt/kvm/kvm_main.c               |  5 +++
>   6 files changed, 90 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index c17a87a0a5ac..b52c14d58798 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>   supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>   enables mmap() and faulting of guest_memfd memory to host userspace.
>   
> +When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
> +supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
> +instance behave similarly to memfd_secret, and unmaps the memory backing it from
> +the kernel's address space after allocation.
> +

Do we want to document what the implication of that is? Meaning, 
limitations etc. I recall that we would need the user mapping for gmem 
slots to be properly set up.

Is that still the case in this patch set?

>   When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>   guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>   consumed from guest_memfd, regardless of whether it is a shared or a private
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 2f2394cce24e..0bfd8e5fd9de 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -19,6 +19,7 @@
>   #include <linux/maple_tree.h>
>   #include <linux/percpu.h>
>   #include <linux/psci.h>
> +#include <linux/set_memory.h>
>   #include <asm/arch_gicv3.h>
>   #include <asm/barrier.h>
>   #include <asm/cpufeature.h>
> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>   void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>   void check_feature_map(void);
>   
> +#ifdef CONFIG_KVM_GUEST_MEMFD
> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
> +{
> +	/*
> +	 * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
> +	 * as it calls dcache_clean_inval_poc().
> +	 */
> +	return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
> +}
> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>   

I strongly assume that the aarch64 support should be moved to a separate 
patch -- if possible, see below.

>   #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 1d0585616aa3..73a15cade54a 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>   bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>   #endif
>   
> +#ifdef CONFIG_KVM_GUEST_MEMFD
> +#ifndef kvm_arch_gmem_supports_no_direct_map
> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
> +#endif

Hm, wouldn't it be better to have an opt-in per arch, and really only 
unlock the ones we know work (tested etc), explicitly in separate patches.


[...]

>   
>   #include "kvm_mm.h"
>   
> @@ -42,6 +45,44 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>   	return 0;
>   }
>   
> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
> +
> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
> +{
> +	return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
> +}
> +
> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
> +{
> +	if (kvm_gmem_folio_no_direct_map(folio))
> +		return 0;
> +
> +	int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +					 false);
> +
> +	if (!r) {
> +		unsigned long addr = (unsigned long) folio_address(folio);

empty line missing.

> +		folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> +		flush_tlb_kernel_range(addr, addr + folio_size(folio));
> +	}
> +
> +	return r;
> +}
> +
> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
> +{
> +	/*
> +	 * Direct map restoration cannot fail, as the only error condition
> +	 * for direct map manipulation is failure to allocate page tables
> +	 * when splitting huge pages, but this split would have already
> +	 * happened in set_direct_map_invalid_noflush() in kvm_gmem_folio_zap_direct_map().
> +	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
> +	 */
> +	if (kvm_gmem_folio_no_direct_map(folio))
> +		set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +					 true);
> +}
> +
>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   {
>   	folio_mark_uptodate(folio);
> @@ -324,13 +365,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>   	struct inode *inode = file_inode(vmf->vma->vm_file);
>   	struct folio *folio;
>   	vm_fault_t ret = VM_FAULT_LOCKED;
> +	int err;
>   
>   	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>   		return VM_FAULT_SIGBUS;
>   
>   	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>   	if (IS_ERR(folio)) {
> -		int err = PTR_ERR(folio);
> +		err = PTR_ERR(folio);
>   
>   		if (err == -EAGAIN)
>   			return VM_FAULT_RETRY;
> @@ -348,6 +390,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>   		kvm_gmem_mark_prepared(folio);
>   	}
>   
> +	err = kvm_gmem_folio_zap_direct_map(folio);
> +

I'd drop this empty line here.

> +	if (err) {
> +		ret = vmf_error(err);
> +		goto out_folio;
> +	}
> +
>   	vmf->page = folio_file_page(folio, vmf->pgoff);
>   
>   out_folio:
> @@ -435,6 +484,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>   	kvm_pfn_t pfn = page_to_pfn(page);
>   	int order = folio_order(folio);
>   
> +	kvm_gmem_folio_restore_direct_map(folio);
> +
>   	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>   }
>   
> @@ -499,6 +550,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>   	/* Unmovable mappings are supposed to be marked unevictable as well. */
>   	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>   
> +	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
> +		mapping_set_no_direct_map(inode->i_mapping);
> +
>   	kvm_get_kvm(kvm);
>   	gmem->kvm = kvm;
>   	xa_init(&gmem->bindings);
> @@ -523,6 +577,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>   	if (kvm_arch_supports_gmem_mmap(kvm))
>   		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>   
> +	if (kvm_arch_gmem_supports_no_direct_map())
> +		valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
> +
>   	if (flags & ~valid_flags)
>   		return -EINVAL;
>   
> @@ -687,6 +744,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>   	if (!is_prepared)
>   		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>   
> +	kvm_gmem_folio_zap_direct_map(folio);
> +
>   	folio_unlock(folio);
>   
>   	if (!r)
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 18f29ef93543..b5e702d95230 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -65,6 +65,7 @@
>   #include <trace/events/kvm.h>
>   
>   #include <linux/kvm_dirty_ring.h>
> +#include <linux/set_memory.h>

Likely not required here.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-24 15:22     ` [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing Roy, Patrick
@ 2025-09-25 11:02       ` David Hildenbrand
  2025-09-25 15:50         ` Roy, Patrick
  2025-09-25 18:27       ` Dave Hansen
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 11:02 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, song@kernel.org,
	jolsa@kernel.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@fomichev.me, haoluo@google.com,
	jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com,
	jannh@google.com, pfalcato@suse.de, shuah@kernel.org,
	seanjc@google.com, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 24.09.25 17:22, Roy, Patrick wrote:
> Add an option to not perform TLB flushes after direct map manipulations.
> TLB flushes result in up to a 40x elongation of page faults in
> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
> of memory population, which is unacceptable when wanting to use direct
> map removed guest_memfd as a drop-in replacement for existing workloads.
> 
> TLB flushes are not needed for functional correctness (the virt->phys
> mapping technically stays "correct", the kernel should simply not use it
> for a while), so we can skip them to keep performance in line with
> "traditional" VMs.
> 
> Enabling this option means that the desired protection from
> Spectre-style attacks is not perfect, as an attacker could try to
> prevent a stale TLB entry from getting evicted, keeping it alive until
> the page it refers to is used by the guest for some sensitive data, and
> then targeting it using a spectre-gadget.
> 
> Cc: Will Deacon <will@kernel.org>
> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
> ---
>   include/linux/kvm_host.h | 1 +
>   virt/kvm/guest_memfd.c   | 3 ++-
>   virt/kvm/kvm_main.c      | 3 +++
>   3 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 73a15cade54a..4d2bc18860fc 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2298,6 +2298,7 @@ extern unsigned int halt_poll_ns;
>   extern unsigned int halt_poll_ns_grow;
>   extern unsigned int halt_poll_ns_grow_start;
>   extern unsigned int halt_poll_ns_shrink;
> +extern bool guest_memfd_tlb_flush;
>   
>   struct kvm_device {
>   	const struct kvm_device_ops *ops;
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b7129c4868c5..d8dd24459f0d 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -63,7 +63,8 @@ static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>   	if (!r) {
>   		unsigned long addr = (unsigned long) folio_address(folio);
>   		folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> -		flush_tlb_kernel_range(addr, addr + folio_size(folio));
> +		if (guest_memfd_tlb_flush)
> +			flush_tlb_kernel_range(addr, addr + folio_size(folio));
>   	}
>   
>   	return r;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b5e702d95230..753c06ebba7f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -95,6 +95,9 @@ unsigned int halt_poll_ns_shrink = 2;
>   module_param(halt_poll_ns_shrink, uint, 0644);
>   EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
>   
> +bool guest_memfd_tlb_flush = true;
> +module_param(guest_memfd_tlb_flush, bool, 0444);

The parameter name is a bit too generic. I think you somehow have to 
incorporate the "direct_map" aspects.

Also, I wonder if this could be a capability per vm/guest_memfd?

Then, you could also nicely document the semantics, considerations, 
impact etc :)

-- 
Cheers

David / dhildenb



* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 11:02       ` David Hildenbrand
@ 2025-09-25 15:50         ` Roy, Patrick
  2025-09-25 19:32           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Roy, Patrick @ 2025-09-25 15:50 UTC (permalink / raw)
  To: david@redhat.com
  Cc: Liam.Howlett@oracle.com, ackerleytng@google.com,
	akpm@linux-foundation.org, andrii@kernel.org, ast@kernel.org,
	bp@alien8.de, bpf@vger.kernel.org, catalin.marinas@arm.com,
	corbet@lwn.net, daniel@iogearbox.net, dave.hansen@linux.intel.com,
	derekmn@amazon.co.uk, eddyz87@gmail.com, haoluo@google.com,
	hpa@zytor.com, Thomson, Jack, jannh@google.com, jgg@ziepe.ca,
	jhubbard@nvidia.com, joey.gouly@arm.com, john.fastabend@gmail.com,
	jolsa@kernel.org, Kalyazin, Nikita, kpsingh@kernel.org,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-mm@kvack.org,
	lorenzo.stoakes@oracle.com, luto@kernel.org, martin.lau@linux.dev,
	maz@kernel.org, mhocko@suse.com, mingo@redhat.com,
	oliver.upton@linux.dev, pbonzini@redhat.com, peterx@redhat.com,
	peterz@infradead.org, pfalcato@suse.de, Roy, Patrick,
	rppt@kernel.org, sdf@fomichev.me, seanjc@google.com,
	shuah@kernel.org, song@kernel.org, surenb@google.com,
	suzuki.poulose@arm.com, tabba@google.com, tglx@linutronix.de,
	vbabka@suse.cz, will@kernel.org, willy@infradead.org,
	x86@kernel.org, Cali, Marco, yonghong.song@linux.dev,
	yuzenghui@huawei.com

On Thu, 2025-09-25 at 12:02 +0100, David Hildenbrand wrote:
> On 24.09.25 17:22, Roy, Patrick wrote:
>> Add an option to not perform TLB flushes after direct map manipulations.
>> TLB flushes result in up to a 40x elongation of page faults in
>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
>> of memory population, which is unacceptable when wanting to use direct
>> map removed guest_memfd as a drop-in replacement for existing workloads.
>>
>> TLB flushes are not needed for functional correctness (the virt->phys
>> mapping technically stays "correct", the kernel should simply not use it
>> for a while), so we can skip them to keep performance in line with
>> "traditional" VMs.
>>
>> Enabling this option means that the desired protection from
>> Spectre-style attacks is not perfect, as an attacker could try to
>> prevent a stale TLB entry from getting evicted, keeping it alive until
>> the page it refers to is used by the guest for some sensitive data, and
>> then targeting it using a spectre-gadget.
>>
>> Cc: Will Deacon <will@kernel.org>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>> ---
>>   include/linux/kvm_host.h | 1 +
>>   virt/kvm/guest_memfd.c   | 3 ++-
>>   virt/kvm/kvm_main.c      | 3 +++
>>   3 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 73a15cade54a..4d2bc18860fc 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -2298,6 +2298,7 @@ extern unsigned int halt_poll_ns;
>>   extern unsigned int halt_poll_ns_grow;
>>   extern unsigned int halt_poll_ns_grow_start;
>>   extern unsigned int halt_poll_ns_shrink;
>> +extern bool guest_memfd_tlb_flush;
>>
>>   struct kvm_device {
>>       const struct kvm_device_ops *ops;
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index b7129c4868c5..d8dd24459f0d 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -63,7 +63,8 @@ static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>>       if (!r) {
>>               unsigned long addr = (unsigned long) folio_address(folio);
>>               folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> -             flush_tlb_kernel_range(addr, addr + folio_size(folio));
>> +             if (guest_memfd_tlb_flush)
>> +                     flush_tlb_kernel_range(addr, addr + folio_size(folio));
>>       }
>>
>>       return r;
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index b5e702d95230..753c06ebba7f 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -95,6 +95,9 @@ unsigned int halt_poll_ns_shrink = 2;
>>   module_param(halt_poll_ns_shrink, uint, 0644);
>>   EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
>>
>> +bool guest_memfd_tlb_flush = true;
>> +module_param(guest_memfd_tlb_flush, bool, 0444);
> 
> The parameter name is a bit too generic. I think you somehow have to
> incorporate the "direct_map" aspects.

Fair :)

> Also, I wonder if this could be a capability per vm/guest_memfd?

I don't really have any opinions on how to expose this knob, but I
thought capabilities should be additive? (e.g. we only have
KVM_ENABLE_CAP(), and then having a capability with a negative
polarity "enable to _not_ do TLB flushes" is a bit weird in my head).
Then again, if people are fine having TLB flushes be opt-in instead of
opt-out (Will's comment on v6 makes me believe that the opt-out itself
might already be controversial for arm64), a capability would work.

> Then, you could also nicely document the semantics, considerations,
> impact etc :)

Yup, I got so lost in trying to figure out why flush_tlb_kernel_range()
refused to let itself be exported that docs slipped my mind, haha.

> -- 
> Cheers
> 
> David / dhildenb
> 

Best,
Patrick



* Re: [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map
  2025-09-25 11:00       ` David Hildenbrand
@ 2025-09-25 15:52         ` Roy, Patrick
  2025-09-25 19:28           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Roy, Patrick @ 2025-09-25 15:52 UTC (permalink / raw)
  To: david@redhat.com
  Cc: Liam.Howlett@oracle.com, ackerleytng@google.com,
	akpm@linux-foundation.org, andrii@kernel.org, ast@kernel.org,
	bp@alien8.de, bpf@vger.kernel.org, catalin.marinas@arm.com,
	corbet@lwn.net, daniel@iogearbox.net, dave.hansen@linux.intel.com,
	derekmn@amazon.co.uk, eddyz87@gmail.com, haoluo@google.com,
	hpa@zytor.com, Thomson, Jack, jannh@google.com, jgg@ziepe.ca,
	jhubbard@nvidia.com, joey.gouly@arm.com, john.fastabend@gmail.com,
	jolsa@kernel.org, Kalyazin, Nikita, kpsingh@kernel.org,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-mm@kvack.org,
	lorenzo.stoakes@oracle.com, luto@kernel.org, martin.lau@linux.dev,
	maz@kernel.org, mhocko@suse.com, mingo@redhat.com,
	oliver.upton@linux.dev, pbonzini@redhat.com, peterx@redhat.com,
	peterz@infradead.org, pfalcato@suse.de, Roy, Patrick,
	rppt@kernel.org, sdf@fomichev.me, seanjc@google.com,
	shuah@kernel.org, song@kernel.org, surenb@google.com,
	suzuki.poulose@arm.com, tabba@google.com, tglx@linutronix.de,
	vbabka@suse.cz, will@kernel.org, willy@infradead.org,
	x86@kernel.org, Cali, Marco, yonghong.song@linux.dev,
	yuzenghui@huawei.com

On Thu, 2025-09-25 at 12:00 +0100, David Hildenbrand wrote:
> On 24.09.25 17:22, Roy, Patrick wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>> guest_memfd itself being supported, but also on whether Linux supports
>> manipulating the direct map at page granularity at all (possible most of
>> the time, outliers being arm64 where it's impossible if the direct map
>> has been set up using hugepages, as arm64 cannot break these apart due to
>> break-before-make semantics, and powerpc, which does not select
>> ARCH_HAS_SET_DIRECT_MAP, though also doesn't support guest_memfd
>> anyway).
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
>> entries for also the shared parts of guest_memfd is a special type of
>> non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated.  In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>>
>> Direct map entries are zapped right before guest or userspace mappings
>> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
>> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
>> a gmem folio can be allocated without being mapped anywhere is
>> kvm_gmem_populate(), where handling potential failures of direct map
>> removal is not possible (by the time direct map removal is attempted,
>> the folio is already marked as prepared, meaning attempting to re-try
>> kvm_gmem_populate() would just result in -EEXIST without fixing up the
>> direct map state). These folios are then removed from the direct map
>> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
>>
>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>> ---
>>   Documentation/virt/kvm/api.rst    |  5 +++
>>   arch/arm64/include/asm/kvm_host.h | 12 ++++++
>>   include/linux/kvm_host.h          |  6 +++
>>   include/uapi/linux/kvm.h          |  2 +
>>   virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
>>   virt/kvm/kvm_main.c               |  5 +++
>>   6 files changed, 90 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index c17a87a0a5ac..b52c14d58798 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>>   supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>>   enables mmap() and faulting of guest_memfd memory to host userspace.
>>
>> +When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
>> +supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
>> +instance behave similarly to memfd_secret, and unmaps the memory backing it from
>> +the kernel's address space after allocation.
>> +
> 
> Do we want to document what the implication of that is? Meaning,
> limitations etc. I recall that we would need the user mapping for gmem
> slots to be properly set up.
> 
> Is that still the case in this patch set?

The ->userspace_addr thing is the general requirement for non-CoCo VMs,
and not specific to direct map removal (e.g. I expect direct map
removal to just work out of the box for CoCo setups, where KVM already
cannot access guest memory, ignoring the question of whether direct map
removal is even useful for CoCo VMs). So I don't think it should be
documented as part of
KVM_CAP_GMEM_NO_DIRECT_MAP/GUEST_MEMFG_FLAG_NO_DIRECT_MAP (heh, there's
a typo I just noticed. "MEMFG". Also "GMEM" needs to be "GUEST_MEMFD".
Will fix that), but rather as part of GUEST_MEMFD_FLAG_MMAP. I can add a
patch for it there (or maybe send it separately, since FLAG_MMAP is already
in -next?).

>>   When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>>   guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>>   consumed from guest_memfd, regardless of whether it is a shared or a private
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 2f2394cce24e..0bfd8e5fd9de 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -19,6 +19,7 @@
>>   #include <linux/maple_tree.h>
>>   #include <linux/percpu.h>
>>   #include <linux/psci.h>
>> +#include <linux/set_memory.h>
>>   #include <asm/arch_gicv3.h>
>>   #include <asm/barrier.h>
>>   #include <asm/cpufeature.h>
>> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>>   void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>>   void check_feature_map(void);
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
>> +{
>> +     /*
>> +      * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
>> +      * as it calls dcache_clean_inval_poc().
>> +      */
>> +     return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
>> +}
>> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>>
> 
> I strongly assume that the aarch64 support should be moved to a separate
> patch -- if possible, see below.
> 
>>   #endif /* __ARM64_KVM_HOST_H__ */
>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>> index 1d0585616aa3..73a15cade54a 100644
>> --- a/include/linux/kvm_host.h
>> +++ b/include/linux/kvm_host.h
>> @@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>>   bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>>   #endif
>>
>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>> +#ifndef kvm_arch_gmem_supports_no_direct_map
>> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
>> +#endif
> 
> Hm, wouldn't it be better to have an opt-in per arch, and really only
> unlock the ones we know work (tested etc), explicitly in separate patches.
> 

Ack, can definitely do that. Something like 

#ifndef kvm_arch_gmem_supports_no_direct_map
static inline bool kvm_arch_gmem_supports_no_direct_map(void)
{
	return false;
}
#endif

and then actual definitions (in separate patches) in the arm64 and x86
headers?

On a related note, maybe PATCH 2 should only export
set_direct_map_valid_noflush() for the architectures on which we
actually need it? Which would only be x86, since arm64 doesn't allow
building KVM as a module, and nothing else supports guest_memfd right
now.

> [...]
> 
>>
>>   #include "kvm_mm.h"
>>
>> @@ -42,6 +45,44 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>>       return 0;
>>   }
>>
>> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
>> +
>> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
>> +{
>> +     return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
>> +}
>> +
>> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>> +{
>> +     if (kvm_gmem_folio_no_direct_map(folio))
>> +             return 0;
>> +
>> +     int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                      false);
>> +
>> +     if (!r) {
>> +             unsigned long addr = (unsigned long) folio_address(folio);
> 
> empty line missing.
> 

Ack

>> +             folio->private = (void *) ((u64) folio->private | KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>> +             flush_tlb_kernel_range(addr, addr + folio_size(folio));
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
>> +{
>> +     /*
>> +      * Direct map restoration cannot fail, as the only error condition
>> +      * for direct map manipulation is failure to allocate page tables
>> +      * when splitting huge pages, but this split would have already
>> +      * happened in set_direct_map_invalid_noflush() in kvm_gmem_folio_zap_direct_map().
>> +      * Thus set_direct_map_valid_noflush() here only updates prot bits.
>> +      */
>> +     if (kvm_gmem_folio_no_direct_map(folio))
>> +             set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
>> +                                      true);
>> +}
>> +
>>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>   {
>>       folio_mark_uptodate(folio);
>> @@ -324,13 +365,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>       struct inode *inode = file_inode(vmf->vma->vm_file);
>>       struct folio *folio;
>>       vm_fault_t ret = VM_FAULT_LOCKED;
>> +     int err;
>>
>>       if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>>               return VM_FAULT_SIGBUS;
>>
>>       folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>>       if (IS_ERR(folio)) {
>> -             int err = PTR_ERR(folio);
>> +             err = PTR_ERR(folio);
>>
>>               if (err == -EAGAIN)
>>                       return VM_FAULT_RETRY;
>> @@ -348,6 +390,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>>               kvm_gmem_mark_prepared(folio);
>>       }
>>
>> +     err = kvm_gmem_folio_zap_direct_map(folio);
>> +
> 
> I'd drop this empty line here.
>

Ack

>> +     if (err) {
>> +             ret = vmf_error(err);
>> +             goto out_folio;
>> +     }
>> +
>>       vmf->page = folio_file_page(folio, vmf->pgoff);
>>
>>   out_folio:
>> @@ -435,6 +484,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>>       kvm_pfn_t pfn = page_to_pfn(page);
>>       int order = folio_order(folio);
>>
>> +     kvm_gmem_folio_restore_direct_map(folio);
>> +
>>       kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>>   }
>>
>> @@ -499,6 +550,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>>       /* Unmovable mappings are supposed to be marked unevictable as well. */
>>       WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>>
>> +     if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
>> +             mapping_set_no_direct_map(inode->i_mapping);
>> +
>>       kvm_get_kvm(kvm);
>>       gmem->kvm = kvm;
>>       xa_init(&gmem->bindings);
>> @@ -523,6 +577,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>>       if (kvm_arch_supports_gmem_mmap(kvm))
>>               valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>>
>> +     if (kvm_arch_gmem_supports_no_direct_map())
>> +             valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
>> +
>>       if (flags & ~valid_flags)
>>               return -EINVAL;
>>
>> @@ -687,6 +744,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>>       if (!is_prepared)
>>               r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>>
>> +     kvm_gmem_folio_zap_direct_map(folio);
>> +
>>       folio_unlock(folio);
>>
>>       if (!r)
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 18f29ef93543..b5e702d95230 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -65,6 +65,7 @@
>>   #include <trace/events/kvm.h>
>>
>>   #include <linux/kvm_dirty_ring.h>
>> +#include <linux/set_memory.h>
> 
> Likely not required here.
> 

Seems for now it is because of how
kvm_arch_gmem_supports_no_direct_map() is defined, but I suspect the
need will disappear once that is changed as you suggested above :)

> -- 
> Cheers
> 
> David / dhildenb

Best,
Patrick




* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-24 15:22     ` [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing Roy, Patrick
  2025-09-25 11:02       ` David Hildenbrand
@ 2025-09-25 18:27       ` Dave Hansen
  2025-09-25 19:20         ` David Hildenbrand
  1 sibling, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2025-09-25 18:27 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 9/24/25 08:22, Roy, Patrick wrote:
> Add an option to not perform TLB flushes after direct map manipulations.

I'd really prefer this be left out for now. It's a massive can of worms.
Let's agree on something that works and has well-defined behavior before
we go breaking it on purpose.


* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 18:27       ` Dave Hansen
@ 2025-09-25 19:20         ` David Hildenbrand
  2025-09-25 19:59           ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 19:20 UTC (permalink / raw)
  To: Dave Hansen, Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, song@kernel.org,
	jolsa@kernel.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@fomichev.me, haoluo@google.com,
	jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com,
	jannh@google.com, pfalcato@suse.de, shuah@kernel.org,
	seanjc@google.com, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 25.09.25 20:27, Dave Hansen wrote:
> On 9/24/25 08:22, Roy, Patrick wrote:
>> Add an option to not perform TLB flushes after direct map manipulations.
> 
> I'd really prefer this be left out for now. It's a massive can of worms.
> Let's agree on something that works and has well-defined behavior before
> we go breaking it on purpose.

May I ask what the big concern here is? Not to challenge your position 
but to understand the involved problems and what would have to be 
documented at some point in a patch.

Essentially we're removing the direct map entries for some memory we 
allocated through the buddy, and reinstalling them before we free the 
memory back to the buddy.

So from the buddy POV whether we flush or don't flush the TLB shouldn't 
matter, right?

Where the missing TLB flush would be relevant is to the workload (VM) 
where some (speculative) access through the direct map would be possible 
until the TLB was flushed.

So until flushed, it's not as secure as you think. A flush after some 
time (batched over multiple page allocations?) could make it deterministic.

Is there something else that's problematic?

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map
  2025-09-25 15:52         ` Roy, Patrick
@ 2025-09-25 19:28           ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 19:28 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: Liam.Howlett@oracle.com, ackerleytng@google.com,
	akpm@linux-foundation.org, andrii@kernel.org, ast@kernel.org,
	bp@alien8.de, bpf@vger.kernel.org, catalin.marinas@arm.com,
	corbet@lwn.net, daniel@iogearbox.net, dave.hansen@linux.intel.com,
	derekmn@amazon.co.uk, eddyz87@gmail.com, haoluo@google.com,
	hpa@zytor.com, Thomson, Jack, jannh@google.com, jgg@ziepe.ca,
	jhubbard@nvidia.com, joey.gouly@arm.com, john.fastabend@gmail.com,
	jolsa@kernel.org, Kalyazin, Nikita, kpsingh@kernel.org,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-mm@kvack.org,
	lorenzo.stoakes@oracle.com, luto@kernel.org, martin.lau@linux.dev,
	maz@kernel.org, mhocko@suse.com, mingo@redhat.com,
	oliver.upton@linux.dev, pbonzini@redhat.com, peterx@redhat.com,
	peterz@infradead.org, pfalcato@suse.de, rppt@kernel.org,
	sdf@fomichev.me, seanjc@google.com, shuah@kernel.org,
	song@kernel.org, surenb@google.com, suzuki.poulose@arm.com,
	tabba@google.com, tglx@linutronix.de, vbabka@suse.cz,
	will@kernel.org, willy@infradead.org, x86@kernel.org, Cali, Marco,
	yonghong.song@linux.dev, yuzenghui@huawei.com

On 25.09.25 17:52, Roy, Patrick wrote:
> On Thu, 2025-09-25 at 12:00 +0100, David Hildenbrand wrote:
>> On 24.09.25 17:22, Roy, Patrick wrote:
>>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>>> ioctl. When set, guest_memfd folios will be removed from the direct map
>>> after preparation, with direct map entries only restored when the folios
>>> are freed.
>>>
>>> To ensure these folios do not end up in places where the kernel cannot
>>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>>
>>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>>> guest_memfd itself being supported, but also on whether Linux supports
>>> manipulating the direct map at page granularity at all (possible most of
>>> the time, outliers being arm64, where it's impossible if the direct map
>>> has been set up using hugepages, as arm64 cannot break these apart due to
>>> break-before-make semantics, and powerpc, which does not select
>>> ARCH_HAS_SET_DIRECT_MAP, though it also doesn't support guest_memfd
>>> anyway).
>>>
>>> Note that this flag causes removal of direct map entries for all
>>> guest_memfd folios independent of whether they are "shared" or "private"
>>> (although current guest_memfd only supports either all folios in the
>>> "shared" state, or all folios in the "private" state if
>>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
>>> entries of even the shared parts of guest_memfd is a special type of
>>> non-CoCo VM where host userspace is trusted to have access to all of
>>> guest memory, but where Spectre-style transient execution attacks
>>> through the host kernel's direct map should still be mitigated.  In this
>>> setup, KVM retains access to guest memory via userspace mappings of
>>> guest_memfd, which are reflected back into KVM's memslots via
>>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>>> to work.
>>>
>>> Direct map entries are zapped right before guest or userspace mappings
>>> of gmem folios are set up, e.g. in kvm_gmem_fault_user_mapping() or
>>> kvm_gmem_get_pfn() [called from the KVM MMU code]. The only place where
>>> a gmem folio can be allocated without being mapped anywhere is
>>> kvm_gmem_populate(), where handling potential failures of direct map
>>> removal is not possible (by the time direct map removal is attempted,
>>> the folio is already marked as prepared, meaning attempting to re-try
>>> kvm_gmem_populate() would just result in -EEXIST without fixing up the
>>> direct map state). These folios are then removed from the direct map
>>> upon kvm_gmem_get_pfn(), e.g. when they are mapped into the guest later.
>>>
>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>> ---
>>>    Documentation/virt/kvm/api.rst    |  5 +++
>>>    arch/arm64/include/asm/kvm_host.h | 12 ++++++
>>>    include/linux/kvm_host.h          |  6 +++
>>>    include/uapi/linux/kvm.h          |  2 +
>>>    virt/kvm/guest_memfd.c            | 61 ++++++++++++++++++++++++++++++-
>>>    virt/kvm/kvm_main.c               |  5 +++
>>>    6 files changed, 90 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>> index c17a87a0a5ac..b52c14d58798 100644
>>> --- a/Documentation/virt/kvm/api.rst
>>> +++ b/Documentation/virt/kvm/api.rst
>>> @@ -6418,6 +6418,11 @@ When the capability KVM_CAP_GUEST_MEMFD_MMAP is supported, the 'flags' field
>>>    supports GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation
>>>    enables mmap() and faulting of guest_memfd memory to host userspace.
>>>
>>> +When the capability KVM_CAP_GMEM_NO_DIRECT_MAP is supported, the 'flags' field
>>> +supports GUEST_MEMFG_FLAG_NO_DIRECT_MAP. Setting this flag makes the guest_memfd
>>> +instance behave similarly to memfd_secret, and unmaps the memory backing it from
>>> +the kernel's address space after allocation.
>>> +
>>
>> Do we want to document what the implication of that is? Meaning,
>> limitations etc. I recall that we would need the user mapping for gmem
>> slots to be properly set up.
>>
>> Is that still the case in this patch set?
> 
> The ->userspace_addr thing is the general requirement for non-CoCo VMs,
> and not specific for direct map removal (e.g. I expect direct map
> removal to just work out of the box for CoCo setups, where KVM already
> cannot access guest memory, ignoring the question of whether direct map
> removal is even useful for CoCo VMs). So I don't think it should be
> documented as part of
> KVM_CAP_GMEM_NO_DIRECT_MAP/GUEST_MEMFG_FLAG_NO_DIRECT_MAP (heh, there's
> a typo I just noticed.

Okay, I was rather wondering whether this will be the first patch set 
where it is actually required to be set. In the basic mmap series, I am 
not sure yet if we really depend on it (but IIRC we did document it 
without doing any sanity checks etc.).

"MEMFG". Also "GMEM" needs to be "GUEST_MEMFD".
> Will fix that), but rather as part of GUEST_MEMFD_FLAG_MMAP. I can add a
> patch it there (or maybe send it separately, since FLAG_MMAP is already
> in -next?).

Yes, it's in kvm/next and will go upstream soon.

> 
>>>    When the KVM MMU performs a PFN lookup to service a guest fault and the backing
>>>    guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
>>>    consumed from guest_memfd, regardless of whether it is a shared or a private
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 2f2394cce24e..0bfd8e5fd9de 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -19,6 +19,7 @@
>>>    #include <linux/maple_tree.h>
>>>    #include <linux/percpu.h>
>>>    #include <linux/psci.h>
>>> +#include <linux/set_memory.h>
>>>    #include <asm/arch_gicv3.h>
>>>    #include <asm/barrier.h>
>>>    #include <asm/cpufeature.h>
>>> @@ -1706,5 +1707,16 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>>>    void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>>>    void check_feature_map(void);
>>>
>>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>>> +static inline bool kvm_arch_gmem_supports_no_direct_map(void)
>>> +{
>>> +     /*
>>> +      * Without FWB, direct map access is needed in kvm_pgtable_stage2_map(),
>>> +      * as it calls dcache_clean_inval_poc().
>>> +      */
>>> +     return can_set_direct_map() && cpus_have_final_cap(ARM64_HAS_STAGE2_FWB);
>>> +}
>>> +#define kvm_arch_gmem_supports_no_direct_map kvm_arch_gmem_supports_no_direct_map
>>> +#endif /* CONFIG_KVM_GUEST_MEMFD */
>>>
>>
>> I strongly assume that the aarch64 support should be moved to a separate
>> patch -- if possible, see below.
>>
>>>    #endif /* __ARM64_KVM_HOST_H__ */
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index 1d0585616aa3..73a15cade54a 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -731,6 +731,12 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
>>>    bool kvm_arch_supports_gmem_mmap(struct kvm *kvm);
>>>    #endif
>>>
>>> +#ifdef CONFIG_KVM_GUEST_MEMFD
>>> +#ifndef kvm_arch_gmem_supports_no_direct_map
>>> +#define kvm_arch_gmem_supports_no_direct_map can_set_direct_map
>>> +#endif
>>
>> Hm, wouldn't it be better to have an opt-in per arch, and really only
>> unlock the ones we know work (tested etc), explicitly in separate patches.
>>
> 
> Ack, can definitely do that. Something like
> 
> #ifndef kvm_arch_gmem_supports_no_direct_map
> static inline bool kvm_arch_gmem_supports_no_direct_map()
> {
> 	return false;
> }
> #endif
> 
> and then actual definitions (in separate patches) in the arm64 and x86
> headers?
> 
> On a related note, maybe PATCH 2 should only export
> set_direct_map_valid_noflush() for the architectures on which we
> actually need it? That would only be x86, since arm64 doesn't allow
> building KVM as a module, and nothing else supports guest_memfd right
> now.

Yes, that's probably best. Could be done in the same arch patch then.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 15:50         ` Roy, Patrick
@ 2025-09-25 19:32           ` David Hildenbrand
  0 siblings, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 19:32 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: Liam.Howlett@oracle.com, ackerleytng@google.com,
	akpm@linux-foundation.org, andrii@kernel.org, ast@kernel.org,
	bp@alien8.de, bpf@vger.kernel.org, catalin.marinas@arm.com,
	corbet@lwn.net, daniel@iogearbox.net, dave.hansen@linux.intel.com,
	derekmn@amazon.co.uk, eddyz87@gmail.com, haoluo@google.com,
	hpa@zytor.com, Thomson, Jack, jannh@google.com, jgg@ziepe.ca,
	jhubbard@nvidia.com, joey.gouly@arm.com, john.fastabend@gmail.com,
	jolsa@kernel.org, Kalyazin, Nikita, kpsingh@kernel.org,
	kvm@vger.kernel.org, kvmarm@lists.linux.dev,
	linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-mm@kvack.org,
	lorenzo.stoakes@oracle.com, luto@kernel.org, martin.lau@linux.dev,
	maz@kernel.org, mhocko@suse.com, mingo@redhat.com,
	oliver.upton@linux.dev, pbonzini@redhat.com, peterx@redhat.com,
	peterz@infradead.org, pfalcato@suse.de, rppt@kernel.org,
	sdf@fomichev.me, seanjc@google.com, shuah@kernel.org,
	song@kernel.org, surenb@google.com, suzuki.poulose@arm.com,
	tabba@google.com, tglx@linutronix.de, vbabka@suse.cz,
	will@kernel.org, willy@infradead.org, x86@kernel.org, Cali, Marco,
	yonghong.song@linux.dev, yuzenghui@huawei.com

On 25.09.25 17:50, Roy, Patrick wrote:
> On Thu, 2025-09-25 at 12:02 +0100, David Hildenbrand wrote:
>> On 24.09.25 17:22, Roy, Patrick wrote:
>>> Add an option to not perform TLB flushes after direct map manipulations.
>>> TLB flushes result in an up to 40x elongation of page faults in
>>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
>>> of memory population, which is unacceptable when wanting to use
>>> direct-map-removed guest_memfd as a drop-in replacement for existing
>>> workloads.
>>>
>>> TLB flushes are not needed for functional correctness (the virt->phys
>>> mapping technically stays "correct", the kernel should simply not use it
>>> for a while), so we can skip them to keep performance in-line with
>>> "traditional" VMs.
>>>
>>> Enabling this option means that the desired protection from
>>> Spectre-style attacks is not perfect, as an attacker could try to
>>> prevent a stale TLB entry from getting evicted, keeping it alive until
>>> the page it refers to is used by the guest for some sensitive data, and
>>> then targeting it using a spectre-gadget.
>>>
>>> Cc: Will Deacon <will@kernel.org>
>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>> ---
>>>    include/linux/kvm_host.h | 1 +
>>>    virt/kvm/guest_memfd.c   | 3 ++-
>>>    virt/kvm/kvm_main.c      | 3 +++
>>>    3 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index 73a15cade54a..4d2bc18860fc 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -2298,6 +2298,7 @@ extern unsigned int halt_poll_ns;
>>>    extern unsigned int halt_poll_ns_grow;
>>>    extern unsigned int halt_poll_ns_grow_start;
>>>    extern unsigned int halt_poll_ns_shrink;
>>> +extern bool guest_memfd_tlb_flush;
>>>
>>>    struct kvm_device {
>>>        const struct kvm_device_ops *ops;
>>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>>> index b7129c4868c5..d8dd24459f0d 100644
>>> --- a/virt/kvm/guest_memfd.c
>>> +++ b/virt/kvm/guest_memfd.c
>>> @@ -63,7 +63,8 @@ static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
>>>        if (!r) {
>>>                unsigned long addr = (unsigned long) folio_address(folio);
>>>                folio->private = (void *) ((u64) folio->private & KVM_GMEM_FOLIO_NO_DIRECT_MAP);
>>> -             flush_tlb_kernel_range(addr, addr + folio_size(folio));
>>> +             if (guest_memfd_tlb_flush)
>>> +                     flush_tlb_kernel_range(addr, addr + folio_size(folio));
>>>        }
>>>
>>>        return r;
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index b5e702d95230..753c06ebba7f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -95,6 +95,9 @@ unsigned int halt_poll_ns_shrink = 2;
>>>    module_param(halt_poll_ns_shrink, uint, 0644);
>>>    EXPORT_SYMBOL_GPL(halt_poll_ns_shrink);
>>>
>>> +bool guest_memfd_tlb_flush = true;
>>> +module_param(guest_memfd_tlb_flush, bool, 0444);
>>
>> The parameter name is a bit too generic. I think you somehow have to
>> incorporate the "direct_map" aspects.
> 
> Fair :)
> 
>> Also, I wonder if this could be a capability per vm/guest_memfd?
> 
> I don't really have any opinions on how to expose this knob, but I
> thought capabilities should be additive? (e.g. we only have
> KVM_ENABLE_EXTENSION(), and then having a capability with a negative
> polarity "enable to _not_ do TLB flushes" is a bit weird in my head).

Well, you are enabling the "skip-tlbflush" feature :) So a kernel that 
knows that extension could skip tlb flushes.

So I wouldn't see this as "perform-tlbflush" but "skip-tlbflush" / 
"no-tlbflush"

> Then again, if people are fine having TLB flushes be opt-in instead of
> opt-out (Will's comment on v6 makes me believe that the opt-out itself
> might already be controversial for arm64), a capability would work.

Yeah, I think this definitely should be opt-in: opt-in to slightly less 
security in a given timeframe by performing fewer TLB flushes.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 19:20         ` David Hildenbrand
@ 2025-09-25 19:59           ` Dave Hansen
  2025-09-25 20:13             ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2025-09-25 19:59 UTC (permalink / raw)
  To: David Hildenbrand, Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, song@kernel.org,
	jolsa@kernel.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@fomichev.me, haoluo@google.com,
	jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com,
	jannh@google.com, pfalcato@suse.de, shuah@kernel.org,
	seanjc@google.com, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 9/25/25 12:20, David Hildenbrand wrote:
> On 25.09.25 20:27, Dave Hansen wrote:
>> On 9/24/25 08:22, Roy, Patrick wrote:
>>> Add an option to not perform TLB flushes after direct map manipulations.
>>
>> I'd really prefer this be left out for now. It's a massive can of worms.
>> Let's agree on something that works and has well-defined behavior before
>> we go breaking it on purpose.
> 
> May I ask what the big concern here is?

It's not a _big_ concern. I just think we want to start on something
like this as simple, secure, and deterministic as possible.

Let's say that with all the unmaps that load_unaligned_zeropad() faults
start to bite us. It'll take longer to find them if the TLB isn't flushed.

Basically, it'll make the bad things happen sooner rather than later.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 19:59           ` Dave Hansen
@ 2025-09-25 20:13             ` David Hildenbrand
  2025-09-26  9:46               ` Patrick Roy
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-25 20:13 UTC (permalink / raw)
  To: Dave Hansen, Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, song@kernel.org,
	jolsa@kernel.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@fomichev.me, haoluo@google.com,
	jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com,
	jannh@google.com, pfalcato@suse.de, shuah@kernel.org,
	seanjc@google.com, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 25.09.25 21:59, Dave Hansen wrote:
> On 9/25/25 12:20, David Hildenbrand wrote:
>> On 25.09.25 20:27, Dave Hansen wrote:
>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>
>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>> Let's agree on something that works and has well-defined behavior before
>>> we go breaking it on purpose.
>>
>> May I ask what the big concern here is?
> 
> It's not a _big_ concern. 

Oh, I read "can of worms" and thought there is something seriously 
problematic :)

> I just think we want to start on something
> like this as simple, secure, and deterministic as possible.

Yes, I agree. And it should be the default. Less secure would have to be 
opt-in and documented thoroughly.

> 
> Let's say that with all the unmaps that load_unaligned_zeropad() faults
> start to bite us. It'll take longer to find them if the TLB isn't flushed.
> 
> Basically, it'll make the bad things happen sooner rather than later.

Agreed.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-25 20:13             ` David Hildenbrand
@ 2025-09-26  9:46               ` Patrick Roy
  2025-09-26 10:53                 ` Will Deacon
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2025-09-26  9:46 UTC (permalink / raw)
  To: David Hildenbrand, Dave Hansen, Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, lorenzo.stoakes@oracle.com,
	Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, song@kernel.org,
	jolsa@kernel.org, ast@kernel.org, daniel@iogearbox.net,
	andrii@kernel.org, martin.lau@linux.dev, eddyz87@gmail.com,
	yonghong.song@linux.dev, john.fastabend@gmail.com,
	kpsingh@kernel.org, sdf@fomichev.me, haoluo@google.com,
	jgg@ziepe.ca, jhubbard@nvidia.com, peterx@redhat.com,
	jannh@google.com, pfalcato@suse.de, shuah@kernel.org,
	seanjc@google.com, kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com



On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
> On 25.09.25 21:59, Dave Hansen wrote:
>> On 9/25/25 12:20, David Hildenbrand wrote:
>>> On 25.09.25 20:27, Dave Hansen wrote:
>>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>>
>>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>>> Let's agree on something that works and has well-defined behavior before
>>>> we go breaking it on purpose.
>>>
>>> May I ask what the big concern here is?
>>
>> It's not a _big_ concern. 
> 
> Oh, I read "can of worms" and thought there is something seriously problematic :)
> 
>> I just think we want to start on something
>> like this as simple, secure, and deterministic as possible.
> 
> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.

Yes, I am definitely happy to have the 100% secure behavior be the
default, and the skipping of TLB flushes be an opt-in, with thorough
documentation!

But I would like to include the "skip tlb flushes" option as part of
this patch series straight away, because as I was alluding to in the
commit message, with TLB flushes this is not usable for Firecracker for
performance reasons :(

>>
>> Let's say that with all the unmaps that load_unaligned_zeropad() faults
>> start to bite us. It'll take longer to find them if the TLB isn't flushed.
>>
>> Basically, it'll make the bad things happen sooner rather than later.
> 
> Agreed.
> 

Best,
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-26  9:46               ` Patrick Roy
@ 2025-09-26 10:53                 ` Will Deacon
  2025-09-26 20:09                   ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Will Deacon @ 2025-09-26 10:53 UTC (permalink / raw)
  To: Patrick Roy
  Cc: David Hildenbrand, Dave Hansen, Roy, Patrick, pbonzini@redhat.com,
	corbet@lwn.net, maz@kernel.org, oliver.upton@linux.dev,
	joey.gouly@arm.com, suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	willy@infradead.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
> 
> 
> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
> > On 25.09.25 21:59, Dave Hansen wrote:
> >> On 9/25/25 12:20, David Hildenbrand wrote:
> >>> On 25.09.25 20:27, Dave Hansen wrote:
> >>>> On 9/24/25 08:22, Roy, Patrick wrote:
> >>>>> Add an option to not perform TLB flushes after direct map manipulations.
> >>>>
> >>>> I'd really prefer this be left out for now. It's a massive can of worms.
> >>>> Let's agree on something that works and has well-defined behavior before
> >>>> we go breaking it on purpose.
> >>>
> >>> May I ask what the big concern here is?
> >>
> >> It's not a _big_ concern. 
> > 
> > Oh, I read "can of worms" and thought there is something seriously problematic :)
> > 
> >> I just think we want to start on something
> >> like this as simple, secure, and deterministic as possible.
> > 
> > Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.
> 
> Yes, I am definitely happy to have the 100% secure behavior be the
> default, and the skipping of TLB flushes be an opt-in, with thorough
> documentation!
> 
> But I would like to include the "skip tlb flushes" option as part of
> this patch series straight away, because as I was alluding to in the
> commit message, with TLB flushes this is not usable for Firecracker for
> performance reasons :(

I really don't want that option for arm64. If we're going to bother
unmapping from the linear map, we should invalidate the TLB.

Will

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map
  2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
  2025-09-25 11:00       ` David Hildenbrand
@ 2025-09-26 14:49       ` Patrick Roy
  1 sibling, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2025-09-26 14:49 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: pbonzini@redhat.com, corbet@lwn.net, maz@kernel.org,
	oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, will@kernel.org, tglx@linutronix.de,
	mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, luto@kernel.org,
	peterz@infradead.org, willy@infradead.org,
	akpm@linux-foundation.org, david@redhat.com,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com



On Wed, 2025-09-24 at 16:22 +0100, "Roy, Patrick" wrote:

[...]

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 55b8d739779f..b7129c4868c5 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -4,6 +4,9 @@
>  #include <linux/kvm_host.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/set_memory.h>
> +
> +#include <asm/tlbflush.h>
>  
>  #include "kvm_mm.h"
>  
> @@ -42,6 +45,44 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>  	return 0;
>  }
>  
> +#define KVM_GMEM_FOLIO_NO_DIRECT_MAP BIT(0)
> +
> +static bool kvm_gmem_folio_no_direct_map(struct folio *folio)
> +{
> +	return ((u64) folio->private) & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
> +}
> +
> +static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
> +{
> +	if (kvm_gmem_folio_no_direct_map(folio))
> +		return 0;
> +
> +	int r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +					 false);
> +
> +	if (!r) {
> +		unsigned long addr = (unsigned long) folio_address(folio);
> +		folio->private = (void *) ((u64) folio->private & KVM_GMEM_FOLIO_NO_DIRECT_MAP);
> +		flush_tlb_kernel_range(addr, addr + folio_size(folio));
> +	}
> +
> +	return r;
> +}

No idea how I managed to mess this function up so completely, but it
should be more like

static int kvm_gmem_folio_zap_direct_map(struct folio *folio)
{
	int r = 0;
	unsigned long addr = (unsigned long) folio_address(folio);
	u64 gmem_flags = (u64) folio_inode(folio)->i_private;

	if (kvm_gmem_folio_no_direct_map(folio) || !(gmem_flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP))
		goto out;

	r = set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio), false);

	if (r)
		goto out;

	folio->private = (void *) KVM_GMEM_FOLIO_NO_DIRECT_MAP;
	flush_tlb_kernel_range(addr, addr + folio_size(folio));

out:
	return r;
}

the version I sent (a) does not respect the flags passed to guest_memfd
on creation, and (b) does not correctly set the bit in folio->private.
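[Editorial note: bug (b) above comes down to using `&` where `|` (or a plain
assignment, as in the corrected version) was needed. A standalone user-space
sketch of the bit arithmetic, not kernel code:]

```c
#include <assert.h>
#include <stdint.h>

#define KVM_GMEM_FOLIO_NO_DIRECT_MAP (1ULL << 0)

/* The buggy update from the original patch: AND can only clear bits,
 * so starting from an empty private field the flag is never set. */
static uint64_t set_flag_buggy(uint64_t private)
{
	return private & KVM_GMEM_FOLIO_NO_DIRECT_MAP;
}

/* The intended update: OR sets the flag and preserves any other bits. */
static uint64_t set_flag_fixed(uint64_t private)
{
	return private | KVM_GMEM_FOLIO_NO_DIRECT_MAP;
}
```

With `folio->private` initially NULL (as for a freshly allocated folio), the
buggy variant leaves it NULL, so `kvm_gmem_folio_no_direct_map()` keeps
returning false and the direct map is zapped again on every fault.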

> +static void kvm_gmem_folio_restore_direct_map(struct folio *folio)
> +{
> +	/*
> +	 * Direct map restoration cannot fail, as the only error condition
> +	 * for direct map manipulation is failure to allocate page tables
> +	 * when splitting huge pages, but this split would have already
> +	 * happened in set_direct_map_invalid_noflush() in kvm_gmem_folio_zap_direct_map().
> +	 * Thus set_direct_map_valid_noflush() here only updates prot bits.
> +	 */
> +	if (kvm_gmem_folio_no_direct_map(folio))
> +		set_direct_map_valid_noflush(folio_page(folio, 0), folio_nr_pages(folio),
> +					 true);
> +}
> +
>  static inline void kvm_gmem_mark_prepared(struct folio *folio)
>  {
>  	folio_mark_uptodate(folio);
> @@ -324,13 +365,14 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>  	struct inode *inode = file_inode(vmf->vma->vm_file);
>  	struct folio *folio;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
> +	int err;
>  
>  	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
>  		return VM_FAULT_SIGBUS;
>  
>  	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
>  	if (IS_ERR(folio)) {
> -		int err = PTR_ERR(folio);
> +		err = PTR_ERR(folio);
>  
>  		if (err == -EAGAIN)
>  			return VM_FAULT_RETRY;
> @@ -348,6 +390,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
>  		kvm_gmem_mark_prepared(folio);
>  	}
>  
> +	err = kvm_gmem_folio_zap_direct_map(folio);
> +
> +	if (err) {
> +		ret = vmf_error(err);
> +		goto out_folio;
> +	}
> +
>  	vmf->page = folio_file_page(folio, vmf->pgoff);
>  
>  out_folio:
> @@ -435,6 +484,8 @@ static void kvm_gmem_free_folio(struct folio *folio)
>  	kvm_pfn_t pfn = page_to_pfn(page);
>  	int order = folio_order(folio);
>  
> +	kvm_gmem_folio_restore_direct_map(folio);
> +
>  	kvm_arch_gmem_invalidate(pfn, pfn + (1ul << order));
>  }
>  
> @@ -499,6 +550,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
>  	/* Unmovable mappings are supposed to be marked unevictable as well. */
>  	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
>  
> +	if (flags & GUEST_MEMFD_FLAG_NO_DIRECT_MAP)
> +		mapping_set_no_direct_map(inode->i_mapping);
> +
>  	kvm_get_kvm(kvm);
>  	gmem->kvm = kvm;
>  	xa_init(&gmem->bindings);
> @@ -523,6 +577,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (kvm_arch_supports_gmem_mmap(kvm))
>  		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
>  
> +	if (kvm_arch_gmem_supports_no_direct_map())
> +		valid_flags |= GUEST_MEMFD_FLAG_NO_DIRECT_MAP;
> +
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
>  
> @@ -687,6 +744,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	if (!is_prepared)
>  		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
>  
> +	kvm_gmem_folio_zap_direct_map(folio);
> +
>  	folio_unlock(folio);
>  
>  	if (!r)

[...]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-26 10:53                 ` Will Deacon
@ 2025-09-26 20:09                   ` David Hildenbrand
  2025-09-27  7:38                     ` Patrick Roy
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-26 20:09 UTC (permalink / raw)
  To: Will Deacon, Patrick Roy
  Cc: Dave Hansen, Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net,
	maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	willy@infradead.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 26.09.25 12:53, Will Deacon wrote:
> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
>>
>>
>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
>>> On 25.09.25 21:59, Dave Hansen wrote:
>>>> On 9/25/25 12:20, David Hildenbrand wrote:
>>>>> On 25.09.25 20:27, Dave Hansen wrote:
>>>>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>>>>
>>>>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>>>>> Let's agree on something that works and has well-defined behavior before
>>>>>> we go breaking it on purpose.
>>>>>
>>>>> May I ask what the big concern here is?
>>>>
>>>> It's not a _big_ concern.
>>>
>>> Oh, I read "can of worms" and thought there is something seriously problematic :)
>>>
>>>> I just think we want to start on something
>>>> like this as simple, secure, and deterministic as possible.
>>>
>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.
>>
>> Yes, I am definitely happy to have the 100% secure behavior be the
>> default, and the skipping of TLB flushes be an opt-in, with thorough
>> documentation!
>>
>> But I would like to include the "skip tlb flushes" option as part of
>> this patch series straight away, because as I was alluding to in the
>> commit message, with TLB flushes this is not usable for Firecracker for
>> performance reasons :(
> 
> I really don't want that option for arm64. If we're going to bother
> unmapping from the linear map, we should invalidate the TLB.

Reading "TLB flushes result in an up to 40x elongation of page faults in
guest_memfd (scaling with the number of CPU cores), or a 5x elongation
of memory population,", I can understand why one would want that 
optimization :)

@Patrick, couldn't we use fallocate() to preallocate memory and batch 
the TLB flush within such an operation?

That is, we wouldn't flush after each individual direct-map modification 
but after multiple ones part of a single operation like fallocate of a 
larger range.

Likely wouldn't make all use cases happy.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-26 20:09                   ` David Hildenbrand
@ 2025-09-27  7:38                     ` Patrick Roy
  2025-09-29 10:20                       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick Roy @ 2025-09-27  7:38 UTC (permalink / raw)
  To: David Hildenbrand, Will Deacon
  Cc: Dave Hansen, Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net,
	maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	willy@infradead.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com



On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote:
> On 26.09.25 12:53, Will Deacon wrote:
>> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
>>>
>>>
>>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
>>>> On 25.09.25 21:59, Dave Hansen wrote:
>>>>> On 9/25/25 12:20, David Hildenbrand wrote:
>>>>>> On 25.09.25 20:27, Dave Hansen wrote:
>>>>>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>>>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>>>>>
>>>>>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>>>>>> Let's agree on something that works and has well-defined behavior before
>>>>>>> we go breaking it on purpose.
>>>>>>
>>>>>> May I ask what the big concern here is?
>>>>>
>>>>> It's not a _big_ concern.
>>>>
>>>> Oh, I read "can of worms" and thought there is something seriously problematic :)
>>>>
>>>>> I just think we want to start on something
>>>>> like this as simple, secure, and deterministic as possible.
>>>>
>>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.
>>>
>>> Yes, I am definitely happy to have the 100% secure behavior be the
>>> default, and the skipping of TLB flushes be an opt-in, with thorough
>>> documentation!
>>>
>>> But I would like to include the "skip tlb flushes" option as part of
>>> this patch series straight away, because as I was alluding to in the
>>> commit message, with TLB flushes this is not usable for Firecracker for
>>> performance reasons :(
>>
>> I really don't want that option for arm64. If we're going to bother
>> unmapping from the linear map, we should invalidate the TLB.
> 
> Reading "TLB flushes result in an up to 40x elongation of page faults in
> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
> of memory population,", I can understand why one would want that optimization :)
> 
> @Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB flush within such an operation?
> 
> That is, we wouldn't flush after each individual direct-map modification but after multiple ones part of a single operation like fallocate of a larger range.
> 
> Likely wouldn't make all use cases happy.
>

For Firecracker, we rely a lot on not preallocating _all_ VM memory, and
trying to ensure only the actual "working set" of a VM is faulted in (we
pack a lot more VMs onto a physical host than there is actual physical
memory available). For VMs that are restored from a snapshot, we know
pretty well what memory needs to be faulted in (that's where @Nikita's
write syscall comes in), so there we could try such an optimization. But
for everything else we very much rely on the on-demand nature of guest
memory allocation (and hence direct map removal). And even right now,
the long pole performance-wise are these on-demand faults, so really, we
don't want them to become even slower :(

Also, can we really batch multiple TLB flushes as you suggest? Even if
pages are at consecutive indices in guest_memfd, they're not guaranteed
to be contiguous physically, i.e. we couldn't just coalesce multiple
TLB flushes into a single TLB flush of a larger range.
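[Editorial note: the coalescing problem can be sketched in a few lines of
user-space C. The `flush_range` helper is hypothetical, purely for
illustration: because the covering range must span the minimum and maximum
zapped address, two physically distant pages force a flush range far larger
than what was actually zapped.]

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

struct flush_range {
	unsigned long start;
	unsigned long end;   /* exclusive */
};

/* Grow the pending flush range to cover one more zapped page. */
static void flush_range_add(struct flush_range *r, unsigned long addr)
{
	if (r->start == r->end) {        /* empty range: start fresh */
		r->start = addr;
		r->end = addr + PAGE_SIZE;
		return;
	}
	if (addr < r->start)
		r->start = addr;
	if (addr + PAGE_SIZE > r->end)
		r->end = addr + PAGE_SIZE;
}
```

For example, zapping one page at offset 1 MiB and one page 1 GiB higher
yields a covering range of just over 1 GiB, even though only 8 KiB of
direct-map mappings were actually invalidated.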

There's probably other things we can try. Backing guest_memfd with
hugepages would reduce the number of TLB flushes by 512x (although not all
users of Firecracker at Amazon [can] use hugepages).

And I do still wonder if it's possible to have "async TLB flushes" where
we simply don't wait for the IPI (x86 terminology, not sure what the
mechanism on arm64 is). Looking at
smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it
seems so? Although it seems like on ARM it's actually just handled by a
single instruction (TLBI) and not some inter-processor interrupt
thingy. Maybe there's a variant that's faster / better for this use case?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-27  7:38                     ` Patrick Roy
@ 2025-09-29 10:20                       ` David Hildenbrand
  2025-10-11 14:32                         ` Patrick Roy
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-09-29 10:20 UTC (permalink / raw)
  To: Patrick Roy, Will Deacon
  Cc: Dave Hansen, Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net,
	maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	willy@infradead.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

On 27.09.25 09:38, Patrick Roy wrote:
> 
> 
> On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote:
>> On 26.09.25 12:53, Will Deacon wrote:
>>> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
>>>>
>>>>
>>>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
>>>>> On 25.09.25 21:59, Dave Hansen wrote:
>>>>>> On 9/25/25 12:20, David Hildenbrand wrote:
>>>>>>> On 25.09.25 20:27, Dave Hansen wrote:
>>>>>>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>>>>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>>>>>>
>>>>>>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>>>>>>> Let's agree on something that works and has well-defined behavior before
>>>>>>>> we go breaking it on purpose.
>>>>>>>
>>>>>>> May I ask what the big concern here is?
>>>>>>
>>>>>> It's not a _big_ concern.
>>>>>
>>>>> Oh, I read "can of worms" and thought there is something seriously problematic :)
>>>>>
>>>>>> I just think we want to start on something
>>>>>> like this as simple, secure, and deterministic as possible.
>>>>>
>>>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.
>>>>
>>>> Yes, I am definitely happy to have the 100% secure behavior be the
>>>> default, and the skipping of TLB flushes be an opt-in, with thorough
>>>> documentation!
>>>>
>>>> But I would like to include the "skip tlb flushes" option as part of
>>>> this patch series straight away, because as I was alluding to in the
>>>> commit message, with TLB flushes this is not usable for Firecracker for
>>>> performance reasons :(
>>>
>>> I really don't want that option for arm64. If we're going to bother
>>> unmapping from the linear map, we should invalidate the TLB.
>>
>> Reading "TLB flushes result in an up to 40x elongation of page faults in
>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
>> of memory population,", I can understand why one would want that optimization :)
>>
>> @Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB flush within such an operation?
>>
>> That is, we wouldn't flush after each individual direct-map modification but after multiple ones part of a single operation like fallocate of a larger range.
>>
>> Likely wouldn't make all use cases happy.
>>
> 
> For Firecracker, we rely a lot on not preallocating _all_ VM memory, and
> trying to ensure only the actual "working set" of a VM is faulted in (we
> pack a lot more VMs onto a physical host than there is actual physical
> memory available). For VMs that are restored from a snapshot, we know
> pretty well what memory needs to be faulted in (that's where @Nikita's
> write syscall comes in), so there we could try such an optimization. But
> for everything else we very much rely on the on-demand nature of guest
> memory allocation (and hence direct map removal). And even right now,
> the long pole performance-wise are these on-demand faults, so really, we
> don't want them to become even slower :(

Makes sense. I guess even without support for large folios one could 
implement a kind of "fault around": for example, on access to one addr, 
allocate+prepare all pages in the same 2M chunk, flushing the TLB only 
once after adjusting all the direct-map entries.

> 
> Also, can we really batch multiple TLB flushes as you suggest? Even if
> pages are at consecutive indices in guest_memfd, they're not guaranteed
> to be contiguous physically, i.e. we couldn't just coalesce multiple
> TLB flushes into a single TLB flush of a larger range.

Well, there is the option of just flushing the complete TLB, of 
course :) When trying to flush a range you would indeed run into the 
problem of flushing an ever-growing range.

> 
> There's probably other things we can try. Backing guest_memfd with
> hugepages would reduce the number of TLB flushes by 512x (although not all
> users of Firecracker at Amazon [can] use hugepages).

Right.

> 
> And I do still wonder if it's possible to have "async TLB flushes" where
> we simply don't wait for the IPI (x86 terminology, not sure what the
> mechanism on arm64 is). Looking at
> smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it
> seems so? Although it seems like on ARM it's actually just handled by a
> single instruction (TLBI) and not some inter-processor interrupt
> thingy. Maybe there's a variant that's faster / better for this use case?

Right, some architectures (and IIRC also x86 with some extension) are 
able to flush remote TLBs without IPIs.

Doing a quick search, there seems to be some research on async TLB 
flushing, e.g., [1].

In the context here, I wonder whether an async TLB flush would be 
significantly better than not doing an explicit TLB flush: in both 
cases, it's not really deterministic when the relevant TLB entries will 
vanish: with the async variant it might happen faster on average I guess.


[1] https://cs.yale.edu/homes/abhishek/kumar-taco20.pdf

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing
  2025-09-29 10:20                       ` David Hildenbrand
@ 2025-10-11 14:32                         ` Patrick Roy
  0 siblings, 0 replies; 34+ messages in thread
From: Patrick Roy @ 2025-10-11 14:32 UTC (permalink / raw)
  To: David Hildenbrand, Will Deacon
  Cc: Dave Hansen, Roy, Patrick, pbonzini@redhat.com, corbet@lwn.net,
	maz@kernel.org, oliver.upton@linux.dev, joey.gouly@arm.com,
	suzuki.poulose@arm.com, yuzenghui@huawei.com,
	catalin.marinas@arm.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, luto@kernel.org, peterz@infradead.org,
	willy@infradead.org, akpm@linux-foundation.org,
	lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, song@kernel.org, jolsa@kernel.org,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	martin.lau@linux.dev, eddyz87@gmail.com, yonghong.song@linux.dev,
	john.fastabend@gmail.com, kpsingh@kernel.org, sdf@fomichev.me,
	haoluo@google.com, jgg@ziepe.ca, jhubbard@nvidia.com,
	peterx@redhat.com, jannh@google.com, pfalcato@suse.de,
	shuah@kernel.org, seanjc@google.com, kvm@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, Cali, Marco,
	Kalyazin, Nikita, Thomson, Jack, derekmn@amazon.co.uk,
	tabba@google.com, ackerleytng@google.com

Hey all,

sorry it took me a while to get back to this, turns out moving
internationally is more time consuming than I expected.

On Mon, 2025-09-29 at 12:20 +0200, David Hildenbrand wrote:
> On 27.09.25 09:38, Patrick Roy wrote:
>> On Fri, 2025-09-26 at 21:09 +0100, David Hildenbrand wrote:
>>> On 26.09.25 12:53, Will Deacon wrote:
>>>> On Fri, Sep 26, 2025 at 10:46:15AM +0100, Patrick Roy wrote:
>>>>> On Thu, 2025-09-25 at 21:13 +0100, David Hildenbrand wrote:
>>>>>> On 25.09.25 21:59, Dave Hansen wrote:
>>>>>>> On 9/25/25 12:20, David Hildenbrand wrote:
>>>>>>>> On 25.09.25 20:27, Dave Hansen wrote:
>>>>>>>>> On 9/24/25 08:22, Roy, Patrick wrote:
>>>>>>>>>> Add an option to not perform TLB flushes after direct map manipulations.
>>>>>>>>>
>>>>>>>>> I'd really prefer this be left out for now. It's a massive can of worms.
>>>>>>>>> Let's agree on something that works and has well-defined behavior before
>>>>>>>>> we go breaking it on purpose.
>>>>>>>>
>>>>>>>> May I ask what the big concern here is?
>>>>>>>
>>>>>>> It's not a _big_ concern.
>>>>>>
>>>>>> Oh, I read "can of worms" and thought there is something seriously problematic :)
>>>>>>
>>>>>>> I just think we want to start on something
>>>>>>> like this as simple, secure, and deterministic as possible.
>>>>>>
>>>>>> Yes, I agree. And it should be the default. Less secure would have to be opt-in and documented thoroughly.
>>>>>
>>>>> Yes, I am definitely happy to have the 100% secure behavior be the
>>>>> default, and the skipping of TLB flushes be an opt-in, with thorough
>>>>> documentation!
>>>>>
>>>>> But I would like to include the "skip tlb flushes" option as part of
>>>>> this patch series straight away, because as I was alluding to in the
>>>>> commit message, with TLB flushes this is not usable for Firecracker for
>>>>> performance reasons :(
>>>>
>>>> I really don't want that option for arm64. If we're going to bother
>>>> unmapping from the linear map, we should invalidate the TLB.
>>>
>>> Reading "TLB flushes result in an up to 40x elongation of page faults in
>>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
>>> of memory population,", I can understand why one would want that optimization :)
>>>
>>> @Patrick, couldn't we use fallocate() to preallocate memory and batch the TLB flush within such an operation?
>>>
>>> That is, we wouldn't flush after each individual direct-map modification but after multiple ones part of a single operation like fallocate of a larger range.
>>>
>>> Likely wouldn't make all use cases happy.
>>>
>>
>> For Firecracker, we rely a lot on not preallocating _all_ VM memory, and
>> trying to ensure only the actual "working set" of a VM is faulted in (we
>> pack a lot more VMs onto a physical host than there is actual physical
>> memory available). For VMs that are restored from a snapshot, we know
>> pretty well what memory needs to be faulted in (that's where @Nikita's
>> write syscall comes in), so there we could try such an optimization. But
>> for everything else we very much rely on the on-demand nature of guest
>> memory allocation (and hence direct map removal). And even right now,
>> the long pole performance-wise are these on-demand faults, so really, we
>> don't want them to become even slower :(
> 
> Makes sense. I guess even without support for large folios one could implement a kind of "fault around": for example, on access to one addr, allocate+prepare all pages in the same 2M chunk, flushing the TLB only once after adjusting all the direct-map entries.
> 
>>
>> Also, can we really batch multiple TLB flushes as you suggest? Even if
>> pages are at consecutive indices in guest_memfd, they're not guaranteed
>> to be contiguous physically, i.e. we couldn't just coalesce multiple
>> TLB flushes into a single TLB flush of a larger range.
> 
> Well, there is the option of just flushing the complete TLB, of course :) When trying to flush a range you would indeed run into the problem of flushing an ever-growing range.

In the last guest_memfd upstream call (over a week ago now), we've
discussed the option of batching and deferring TLB flushes, while
providing a sort of "deadline" at which a TLB flush will
deterministically be done.  E.g. guest_memfd would keep a counter of how
many pages got direct map zapped, and do a flush of a range that
contains all zapped pages every 512 allocated pages (and to ensure the
flushes even happen in a timely manner if no allocations happen for a
long time, also every, say, 5 seconds or something like that). Would
that work for everyone? I briefly tested the performance of
batch-flushes with secretmem in QEMU, and it's within 30% of the "no
TLB flushes at all" solution in a simple benchmark that just memsets
2GiB of memory.
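[Editorial note: the accounting side of that "flush every 512 zapped pages"
scheme could look roughly like the user-space sketch below. The
`zap_batch` structure and the threshold are illustrative only; the
time-based 5-second deadline and the actual flush call
(`flush_tlb_kernel_range()`) are stubbed out.]

```c
#include <assert.h>

#define ZAP_FLUSH_THRESHOLD 512

struct zap_batch {
	unsigned long nr_zapped;   /* pages zapped since the last flush */
	unsigned long nr_flushes;  /* for illustration: flushes issued  */
};

/* Account one direct-map zap of nr_pages pages; once the batch reaches
 * ZAP_FLUSH_THRESHOLD pages, issue a (stubbed) flush and reset. */
static void zap_batch_account(struct zap_batch *b, unsigned long nr_pages)
{
	b->nr_zapped += nr_pages;
	if (b->nr_zapped >= ZAP_FLUSH_THRESHOLD) {
		b->nr_flushes++;   /* stand-in for flush_tlb_kernel_range() */
		b->nr_zapped = 0;
	}
}
```

With this shape, 1024 single-page zaps trigger exactly two flushes instead
of 1024, which is where the bulk of the claimed speedup would come from.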

I think something like this, together with the batch-flushing at the end
of fallocate() / write() as David suggested above should work for
Firecracker.

>> There's probably other things we can try. Backing guest_memfd with
>> hugepages would reduce the number of TLB flushes by 512x (although not all
>> users of Firecracker at Amazon [can] use hugepages).
> 
> Right.
> 
>>
>> And I do still wonder if it's possible to have "async TLB flushes" where
>> we simply don't wait for the IPI (x86 terminology, not sure what the
>> mechanism on arm64 is). Looking at
>> smp_call_function_many_cond()/invlpgb_kernel_range_flush() on x86, it
>> seems so? Although it seems like on ARM it's actually just handled by a
>> single instruction (TLBI) and not some inter-processor interrupt
>> thingy. Maybe there's a variant that's faster / better for this use case?
> 
> Right, some architectures (and IIRC also x86 with some extension) are able to flush remote TLBs without IPIs.
> 
> Doing a quick search, there seems to be some research on async TLB flushing, e.g., [1].
> 
> In the context here, I wonder whether an async TLB flush would be
> significantly better than not doing an explicit TLB flush: in both
> cases, it's not really deterministic when the relevant TLB entries
> will vanish: with the async variant it might happen faster on average
> I guess.

I actually did end up playing around with this a while ago, and it made
things slightly better performance-wise, but it was still too bad to be
useful :(

> 
> [1] https://cs.yale.edu/homes/abhishek/kumar-taco20.pdf
>

Best, 
Patrick

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-10-11 14:32 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-24 15:10 [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Patrick Roy
2025-09-24 15:10 ` [PATCH v7 01/12] arch: export set_direct_map_valid_noflush to KVM module Patrick Roy
2025-09-24 15:10 ` [PATCH v7 02/12] x86/tlb: export flush_tlb_kernel_range " Patrick Roy
2025-09-24 15:10 ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP Patrick Roy
2025-09-24 15:22   ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 05/12] KVM: guest_memfd: Add flag to remove from direct map Roy, Patrick
2025-09-25 11:00       ` David Hildenbrand
2025-09-25 15:52         ` Roy, Patrick
2025-09-25 19:28           ` David Hildenbrand
2025-09-26 14:49       ` Patrick Roy
2025-09-24 15:22     ` [PATCH v7 06/12] KVM: guest_memfd: add module param for disabling TLB flushing Roy, Patrick
2025-09-25 11:02       ` David Hildenbrand
2025-09-25 15:50         ` Roy, Patrick
2025-09-25 19:32           ` David Hildenbrand
2025-09-25 18:27       ` Dave Hansen
2025-09-25 19:20         ` David Hildenbrand
2025-09-25 19:59           ` Dave Hansen
2025-09-25 20:13             ` David Hildenbrand
2025-09-26  9:46               ` Patrick Roy
2025-09-26 10:53                 ` Will Deacon
2025-09-26 20:09                   ` David Hildenbrand
2025-09-27  7:38                     ` Patrick Roy
2025-09-29 10:20                       ` David Hildenbrand
2025-10-11 14:32                         ` Patrick Roy
2025-09-24 15:22     ` [PATCH v7 07/12] KVM: selftests: load elf via bounce buffer Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 08/12] KVM: selftests: set KVM_MEM_GUEST_MEMFD in vm_mem_add() if guest_memfd != -1 Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 09/12] KVM: selftests: Add guest_memfd based vm_mem_backing_src_types Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 10/12] KVM: selftests: cover GUEST_MEMFD_FLAG_NO_DIRECT_MAP in existing selftests Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 11/12] KVM: selftests: stuff vm_mem_backing_src_type into vm_shape Roy, Patrick
2025-09-24 15:22     ` [PATCH v7 12/12] KVM: selftests: Test guest execution from direct map removed gmem Roy, Patrick
2025-09-25 10:26     ` [PATCH v7 04/12] KVM: guest_memfd: Add stub for kvm_arch_gmem_invalidate David Hildenbrand
2025-09-25 10:25   ` [PATCH v7 03/12] mm: introduce AS_NO_DIRECT_MAP David Hildenbrand
2025-09-24 15:29 ` [PATCH v7 00/12] Direct Map Removal Support for guest_memfd Roy, Patrick
2025-09-24 15:38   ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).