* [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs
@ 2025-07-09 10:59 Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 01/20] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM Fuad Tabba
                   ` (19 more replies)
  0 siblings, 20 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Main changes since v12 [1]:
* Rename various functions and variables
* Expand and clarify commit messages
* Rebase on Linux 6.16-rc5

This patch series enables host userspace mapping of guest_memfd-backed
memory for non-CoCo VMs. This is required for several evolving KVM use
cases:

* Allows VMMs like Firecracker to run guests entirely backed by
  guest_memfd [2]. This provides a unified memory management model for
  both confidential and non-confidential guests, simplifying VMM design.

* Enhanced security via direct map removal: When combined with Patrick's
  series for direct map removal [3], this provides additional hardening
  against Spectre-like transient execution attacks by eliminating the
  need for host kernel direct maps of guest memory.

* Lays the groundwork for *restricted* mmap() support for
  guest_memfd-backed memory on CoCo platforms [4] that permit in-place
  sharing of guest memory with the host.

Patch breakdown:

Patches 1-7: Primarily infrastructure refactorings and renames to decouple
guest_memfd from the concept of "private" memory.

Patches 8-9: Add support for the host to map guest_memfd-backed memory
for non-CoCo VMs, which includes support for mmap() and fault handling.
This is gated by a new configuration option, toggled by a new flag, and
advertised to userspace by a new capability (introduced in patch 18).

Patches 10-14: Implement x86 guest_memfd mmap support.

Patches 15-17: Implement arm64 guest_memfd mmap support.

Patch 18: Introduce the new capability to advertise this support and
update the documentation.

Patches 19-20: Update and expand selftests for guest_memfd to include
mmap functionality and improve portability.

To test this patch series and boot a guest utilizing the new features,
please refer to the instructions in v8 of the series [5]. Note that
kvmtool for Linux 6.16 (available at [6]) is required, as the
KVM_CAP_GMEM_MMAP capability number has changed. Additionally, drop the
--sw_protected kvmtool parameter to test with the default VM type.
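
For a rough idea of the userspace flow this series enables, below is a
minimal sketch (hedged: GUEST_MEMFD_FLAG_MMAP and the KVM_CAP_GMEM_MMAP
check are uapi additions from this series; VM creation and error
reporting are elided):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* vm_fd: VM file descriptor obtained via KVM_CREATE_VM (not shown). */
  static void *map_guest_memfd(int vm_fd, size_t size)
  {
	struct kvm_create_guest_memfd gmem = {
		.size = size,
		.flags = GUEST_MEMFD_FLAG_MMAP,
	};
	int gmem_fd;
	void *mem;

	/* The new capability advertises guest_memfd mmap support. */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_MMAP) <= 0)
		return NULL;

	gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
	if (gmem_fd < 0)
		return NULL;

	/* guest_memfd only accepts shared mappings. */
	mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   gmem_fd, 0);
	return mem == MAP_FAILED ? NULL : mem;
  }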

Cheers,
/fuad

[1] https://lore.kernel.org/all/20250611133330.1514028-3-tabba@google.com/T/
[2] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[3] https://lore.kernel.org/all/20250221160728.1584559-1-roypat@amazon.co.uk/
[4] https://lore.kernel.org/all/20250328153133.3504118-1-tabba@google.com/
[5] https://lore.kernel.org/all/20250430165655.605595-1-tabba@google.com/
[6] https://android-kvm.googlesource.com/kvmtool/+/refs/heads/tabba/guestmem-basic-6.16

Ackerley Tng (4):
  KVM: x86/mmu: Generalize private_max_mapping_level x86 op to
    max_mapping_level
  KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level
  KVM: x86/mmu: Consult guest_memfd when computing max_mapping_level
  KVM: x86/mmu: Handle guest page faults for guest_memfd with shared
    memory

Fuad Tabba (16):
  KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM
  KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to
    CONFIG_KVM_GENERIC_GMEM_POPULATE
  KVM: Introduce kvm_arch_supports_gmem()
  KVM: x86: Introduce kvm->arch.supports_gmem
  KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
  KVM: Fix comments that refer to slots_lock
  KVM: Fix comment that refers to kvm uapi header path
  KVM: guest_memfd: Allow host to map guest_memfd pages
  KVM: guest_memfd: Track guest_memfd mmap support in memslot
  KVM: x86: Enable guest_memfd mmap for default VM type
  KVM: arm64: Refactor user_mem_abort()
  KVM: arm64: Handle guest_memfd-backed guest page faults
  KVM: arm64: Enable host mapping of shared guest_memfd memory
  KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP
  KVM: selftests: Do not use hardcoded page sizes in guest_memfd test
  KVM: selftests: guest_memfd mmap() test when mmap is supported

 Documentation/virt/kvm/api.rst                |   9 +
 arch/arm64/include/asm/kvm_host.h             |   4 +
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/mmu.c                          | 190 ++++++++++++----
 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  18 +-
 arch/x86/kvm/Kconfig                          |   7 +-
 arch/x86/kvm/mmu/mmu.c                        | 115 ++++++----
 arch/x86/kvm/svm/sev.c                        |  12 +-
 arch/x86/kvm/svm/svm.c                        |   3 +-
 arch/x86/kvm/svm/svm.h                        |   4 +-
 arch/x86/kvm/vmx/main.c                       |   6 +-
 arch/x86/kvm/vmx/tdx.c                        |   6 +-
 arch/x86/kvm/vmx/x86_ops.h                    |   2 +-
 arch/x86/kvm/x86.c                            |   5 +-
 include/linux/kvm_host.h                      |  64 +++++-
 include/uapi/linux/kvm.h                      |   2 +
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../testing/selftests/kvm/guest_memfd_test.c  | 208 +++++++++++++++---
 virt/kvm/Kconfig                              |  14 +-
 virt/kvm/Makefile.kvm                         |   2 +-
 virt/kvm/guest_memfd.c                        |  96 +++++++-
 virt/kvm/kvm_main.c                           |  14 +-
 virt/kvm/kvm_mm.h                             |   4 +-
 24 files changed, 622 insertions(+), 167 deletions(-)


base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 01/20] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 02/20] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_KVM_GENERIC_GMEM_POPULATE Fuad Tabba
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM. The
original name implied that the feature only supported "private" memory.
However, CONFIG_KVM_PRIVATE_MEM enables guest_memfd in general, which is
not exclusively for private memory. Subsequent patches in this series
will add guest_memfd support for non-CoCo VMs, whose memory is not
private.

Renaming the Kconfig option to CONFIG_KVM_GMEM more accurately reflects
its broader scope as the main Kconfig option for all guest_memfd-backed
memory. This provides clearer semantics for the option and avoids
confusion as new features are introduced.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 include/linux/kvm_host.h        | 14 +++++++-------
 virt/kvm/Kconfig                |  8 ++++----
 virt/kvm/Makefile.kvm           |  2 +-
 virt/kvm/kvm_main.c             |  4 ++--
 virt/kvm/kvm_mm.h               |  4 ++--
 6 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 639d9bcee842..66bdd0759d27 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2269,7 +2269,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #else
 #define kvm_arch_has_private_mem(kvm) false
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 3bde4fb5c6aa..755b09dcafce 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -601,7 +601,7 @@ struct kvm_memory_slot {
 	short id;
 	u16 as_id;
 
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 	struct {
 		/*
 		 * Writes protected by kvm->slots_lock.  Acquiring a
@@ -719,10 +719,10 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 #endif
 
 /*
- * Arch code must define kvm_arch_has_private_mem if support for private memory
- * is enabled.
+ * Arch code must define kvm_arch_has_private_mem if support for guest_memfd is
+ * enabled.
  */
-#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_PRIVATE_MEM)
+#if !defined(kvm_arch_has_private_mem) && !IS_ENABLED(CONFIG_KVM_GMEM)
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
@@ -2527,7 +2527,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
-	return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) &&
+	return IS_ENABLED(CONFIG_KVM_GMEM) &&
 	       kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
 }
 #else
@@ -2537,7 +2537,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 }
 #endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
 
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
 		     int *max_order);
@@ -2550,7 +2550,7 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
-#endif /* CONFIG_KVM_PRIVATE_MEM */
+#endif /* CONFIG_KVM_GMEM */
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 727b542074e7..49df4e32bff7 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -112,19 +112,19 @@ config KVM_GENERIC_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
-config KVM_PRIVATE_MEM
+config KVM_GMEM
        select XARRAY_MULTI
        bool
 
 config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
-       select KVM_PRIVATE_MEM
+       select KVM_GMEM
        bool
 
 config HAVE_KVM_ARCH_GMEM_PREPARE
        bool
-       depends on KVM_PRIVATE_MEM
+       depends on KVM_GMEM
 
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
-       depends on KVM_PRIVATE_MEM
+       depends on KVM_GMEM
diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm
index 724c89af78af..8d00918d4c8b 100644
--- a/virt/kvm/Makefile.kvm
+++ b/virt/kvm/Makefile.kvm
@@ -12,4 +12,4 @@ kvm-$(CONFIG_KVM_ASYNC_PF) += $(KVM)/async_pf.o
 kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o
 kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o
 kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o
-kvm-$(CONFIG_KVM_PRIVATE_MEM) += $(KVM)/guest_memfd.o
+kvm-$(CONFIG_KVM_GMEM) += $(KVM)/guest_memfd.o
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index eec82775c5bf..898c3d5a7ba8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4910,7 +4910,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_MEMORY_ATTRIBUTES:
 		return kvm_supported_mem_attributes(kvm);
 #endif
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
@@ -5344,7 +5344,7 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_GET_STATS_FD:
 		r = kvm_vm_ioctl_get_stats_fd(kvm);
 		break;
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 	case KVM_CREATE_GUEST_MEMFD: {
 		struct kvm_create_guest_memfd guest_memfd;
 
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index acef3f5c582a..ec311c0d6718 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -67,7 +67,7 @@ static inline void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
 }
 #endif /* HAVE_KVM_PFNCACHE */
 
-#ifdef CONFIG_KVM_PRIVATE_MEM
+#ifdef CONFIG_KVM_GMEM
 void kvm_gmem_init(struct module *module);
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
@@ -91,6 +91,6 @@ static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 {
 	WARN_ON_ONCE(1);
 }
-#endif /* CONFIG_KVM_PRIVATE_MEM */
+#endif /* CONFIG_KVM_GMEM */
 
 #endif /* __KVM_MM_H__ */
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 02/20] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_KVM_GENERIC_GMEM_POPULATE
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 01/20] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 03/20] KVM: Introduce kvm_arch_supports_gmem() Fuad Tabba
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

The original name was vague regarding its functionality. This Kconfig
option specifically enables and gates the kvm_gmem_populate() function,
which is responsible for populating a GPA range with guest data.

The new name, KVM_GENERIC_GMEM_POPULATE, describes the purpose of the
option: to enable generic guest_memfd population mechanisms. This
improves clarity for developers and ensures the name accurately reflects
the functionality it controls, especially as guest_memfd support expands
beyond purely "private" memory scenarios.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/kvm/Kconfig     | 6 +++---
 include/linux/kvm_host.h | 2 +-
 virt/kvm/Kconfig         | 2 +-
 virt/kvm/guest_memfd.c   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2eeffcec5382..df1fdbb4024b 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -46,7 +46,7 @@ config KVM_X86
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_PRE_FAULT_MEMORY
-	select KVM_GENERIC_PRIVATE_MEM if KVM_SW_PROTECTED_VM
+	select KVM_GENERIC_GMEM_POPULATE if KVM_SW_PROTECTED_VM
 	select KVM_WERROR if WERROR
 
 config KVM
@@ -95,7 +95,7 @@ config KVM_SW_PROTECTED_VM
 config KVM_INTEL
 	tristate "KVM for Intel (and compatible) processors support"
 	depends on KVM && IA32_FEAT_CTL
-	select KVM_GENERIC_PRIVATE_MEM if INTEL_TDX_HOST
+	select KVM_GENERIC_GMEM_POPULATE if INTEL_TDX_HOST
 	select KVM_GENERIC_MEMORY_ATTRIBUTES if INTEL_TDX_HOST
 	help
 	  Provides support for KVM on processors equipped with Intel's VT
@@ -157,7 +157,7 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_GENERIC_PRIVATE_MEM
+	select KVM_GENERIC_GMEM_POPULATE
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	help
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 755b09dcafce..359baaae5e9f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2556,7 +2556,7 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 int kvm_arch_gmem_prepare(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int max_order);
 #endif
 
-#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+#ifdef CONFIG_KVM_GENERIC_GMEM_POPULATE
 /**
  * kvm_gmem_populate() - Populate/prepare a GPA range with guest data
  *
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 49df4e32bff7..559c93ad90be 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -116,7 +116,7 @@ config KVM_GMEM
        select XARRAY_MULTI
        bool
 
-config KVM_GENERIC_PRIVATE_MEM
+config KVM_GENERIC_GMEM_POPULATE
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_GMEM
        bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b2aa6bf24d3a..befea51bbc75 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -638,7 +638,7 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
-#ifdef CONFIG_KVM_GENERIC_PRIVATE_MEM
+#ifdef CONFIG_KVM_GENERIC_GMEM_POPULATE
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
 		       kvm_gmem_populate_cb post_populate, void *opaque)
 {
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 03/20] KVM: Introduce kvm_arch_supports_gmem()
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 01/20] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 02/20] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_KVM_GENERIC_GMEM_POPULATE Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 04/20] KVM: x86: Introduce kvm->arch.supports_gmem Fuad Tabba
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Introduce kvm_arch_supports_gmem() to explicitly indicate whether an
architecture supports guest_memfd.

Previously, kvm_arch_has_private_mem() was used to check for guest_memfd
support. However, this conflated guest_memfd with "private" memory,
implying that guest_memfd was exclusively for CoCo VMs or other private
memory use cases.

With the expansion of guest_memfd to support non-private memory, such as
shared host mappings, it is necessary to decouple these concepts. The
new kvm_arch_supports_gmem() function provides a clear way to check for
guest_memfd support.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++-
 include/linux/kvm_host.h        | 11 +++++++++++
 virt/kvm/kvm_main.c             |  4 ++--
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66bdd0759d27..09f4f6240d9d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2271,8 +2271,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 
 #ifdef CONFIG_KVM_GMEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
+#define kvm_arch_supports_gmem(kvm) kvm_arch_has_private_mem(kvm)
 #else
 #define kvm_arch_has_private_mem(kvm) false
+#define kvm_arch_supports_gmem(kvm) false
 #endif
 
 #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
@@ -2325,7 +2327,7 @@ enum {
 #define HF_SMM_INSIDE_NMI_MASK	(1 << 2)
 
 # define KVM_MAX_NR_ADDRESS_SPACES	2
-/* SMM is currently unsupported for guests with private memory. */
+/* SMM is currently unsupported for guests with guest_memfd private memory. */
 # define kvm_arch_nr_memslot_as_ids(kvm) (kvm_arch_has_private_mem(kvm) ? 1 : 2)
 # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
 # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 359baaae5e9f..ab1bde048034 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -729,6 +729,17 @@ static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 }
 #endif
 
+/*
+ * Arch code must define kvm_arch_supports_gmem if support for guest_memfd is
+ * enabled.
+ */
+#if !defined(kvm_arch_supports_gmem) && !IS_ENABLED(CONFIG_KVM_GMEM)
+static inline bool kvm_arch_supports_gmem(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 898c3d5a7ba8..afbc025ce4d3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1588,7 +1588,7 @@ static int check_memory_region_flags(struct kvm *kvm,
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
-	if (kvm_arch_has_private_mem(kvm))
+	if (kvm_arch_supports_gmem(kvm))
 		valid_flags |= KVM_MEM_GUEST_MEMFD;
 
 	/* Dirty logging private memory is not currently supported. */
@@ -4912,7 +4912,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #endif
 #ifdef CONFIG_KVM_GMEM
 	case KVM_CAP_GUEST_MEMFD:
-		return !kvm || kvm_arch_has_private_mem(kvm);
+		return !kvm || kvm_arch_supports_gmem(kvm);
 #endif
 	default:
 		break;
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 04/20] KVM: x86: Introduce kvm->arch.supports_gmem
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (2 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 03/20] KVM: Introduce kvm_arch_supports_gmem() Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 05/20] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() Fuad Tabba
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Introduce a new boolean member, supports_gmem, to kvm->arch.

Previously, the has_private_mem boolean within kvm->arch was implicitly
used to indicate whether guest_memfd was supported for a KVM instance.
However, with the broader support for guest_memfd, it's not exclusively
for private or confidential memory. Therefore, it's necessary to
distinguish between a VM's general guest_memfd capabilities and its
support for private memory.

This new supports_gmem member will now explicitly indicate guest_memfd
support for a given VM, allowing has_private_mem to represent only
support for private memory.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/svm/svm.c          | 1 +
 arch/x86/kvm/vmx/tdx.c          | 1 +
 arch/x86/kvm/x86.c              | 4 ++--
 4 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 09f4f6240d9d..ebddedf0a1f2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1342,6 +1342,7 @@ struct kvm_arch {
 	u8 mmu_valid_gen;
 	u8 vm_type;
 	bool has_private_mem;
+	bool supports_gmem;
 	bool has_protected_state;
 	bool pre_fault_allowed;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
@@ -2271,7 +2272,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 
 #ifdef CONFIG_KVM_GMEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
-#define kvm_arch_supports_gmem(kvm) kvm_arch_has_private_mem(kvm)
+#define kvm_arch_supports_gmem(kvm)  ((kvm)->arch.supports_gmem)
 #else
 #define kvm_arch_has_private_mem(kvm) false
 #define kvm_arch_supports_gmem(kvm) false
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ab9b947dbf4f..d1c484eaa8ad 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5181,6 +5181,7 @@ static int svm_vm_init(struct kvm *kvm)
 		to_kvm_sev_info(kvm)->need_init = true;
 
 		kvm->arch.has_private_mem = (type == KVM_X86_SNP_VM);
+		kvm->arch.supports_gmem = (type == KVM_X86_SNP_VM);
 		kvm->arch.pre_fault_allowed = !kvm->arch.has_private_mem;
 	}
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1ad20c273f3b..c227516e6a02 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -625,6 +625,7 @@ int tdx_vm_init(struct kvm *kvm)
 
 	kvm->arch.has_protected_state = true;
 	kvm->arch.has_private_mem = true;
+	kvm->arch.supports_gmem = true;
 	kvm->arch.disabled_quirks |= KVM_X86_QUIRK_IGNORE_GUEST_PAT;
 
 	/*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a9d992d5652f..b34236029383 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12778,8 +12778,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 		return -EINVAL;
 
 	kvm->arch.vm_type = type;
-	kvm->arch.has_private_mem =
-		(type == KVM_X86_SW_PROTECTED_VM);
+	kvm->arch.has_private_mem = (type == KVM_X86_SW_PROTECTED_VM);
+	kvm->arch.supports_gmem = (type == KVM_X86_SW_PROTECTED_VM);
 	/* Decided by the vendor code for other VM types.  */
 	kvm->arch.pre_fault_allowed =
 		type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM;
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 05/20] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (3 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 04/20] KVM: x86: Introduce kvm->arch.supports_gmem Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 06/20] KVM: Fix comments that refer to slots_lock Fuad Tabba
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.

The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".

As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 4 ++--
 arch/x86/kvm/svm/sev.c   | 4 ++--
 include/linux/kvm_host.h | 2 +-
 virt/kvm/guest_memfd.c   | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4e06e2e89a8f..213904daf1e5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3285,7 +3285,7 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn)
 {
-	bool is_private = kvm_slot_can_be_private(slot) &&
+	bool is_private = kvm_slot_has_gmem(slot) &&
 			  kvm_mem_is_private(kvm, gfn);
 
 	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private);
@@ -4498,7 +4498,7 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 {
 	int max_order, r;
 
-	if (!kvm_slot_can_be_private(fault->slot)) {
+	if (!kvm_slot_has_gmem(fault->slot)) {
 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
 		return -EFAULT;
 	}
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 459c3b791fd4..ade7a5b36c68 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -2319,7 +2319,7 @@ static int snp_launch_update(struct kvm *kvm, struct kvm_sev_cmd *argp)
 	mutex_lock(&kvm->slots_lock);
 
 	memslot = gfn_to_memslot(kvm, params.gfn_start);
-	if (!kvm_slot_can_be_private(memslot)) {
+	if (!kvm_slot_has_gmem(memslot)) {
 		ret = -EINVAL;
 		goto out;
 	}
@@ -4670,7 +4670,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code)
 	}
 
 	slot = gfn_to_memslot(kvm, gfn);
-	if (!kvm_slot_can_be_private(slot)) {
+	if (!kvm_slot_has_gmem(slot)) {
 		pr_warn_ratelimited("SEV: Unexpected RMP fault, non-private slot for GPA 0x%llx\n",
 				    gpa);
 		return;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ab1bde048034..ed00c2b40e4b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -614,7 +614,7 @@ struct kvm_memory_slot {
 #endif
 };
 
-static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot)
+static inline bool kvm_slot_has_gmem(const struct kvm_memory_slot *slot)
 {
 	return slot && (slot->flags & KVM_MEM_GUEST_MEMFD);
 }
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index befea51bbc75..6db515833f61 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -654,7 +654,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 		return -EINVAL;
 
 	slot = gfn_to_memslot(kvm, start_gfn);
-	if (!kvm_slot_can_be_private(slot))
+	if (!kvm_slot_has_gmem(slot))
 		return -EINVAL;
 
 	file = kvm_gmem_get_file(slot);
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 06/20] KVM: Fix comments that refer to slots_lock
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (4 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 05/20] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 07/20] KVM: Fix comment that refers to kvm uapi header path Fuad Tabba
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Fix comments so that they refer to slots_lock instead of slots_locks
(remove trailing s).

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 2 +-
 virt/kvm/kvm_main.c      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed00c2b40e4b..9c654dfb6dce 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -870,7 +870,7 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
-	/* Protected by slots_locks (for writes) and RCU (for reads) */
+	/* Protected by slots_lock (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index afbc025ce4d3..81bb18fa8655 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -331,7 +331,7 @@ void kvm_flush_remote_tlbs_memslot(struct kvm *kvm,
 	 * All current use cases for flushing the TLBs for a specific memslot
 	 * are related to dirty logging, and many do the TLB flush out of
 	 * mmu_lock. The interaction between the various operations on memslot
-	 * must be serialized by slots_locks to ensure the TLB flush from one
+	 * must be serialized by slots_lock to ensure the TLB flush from one
 	 * operation is observed by any other operation on the same memslot.
 	 */
 	lockdep_assert_held(&kvm->slots_lock);
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 07/20] KVM: Fix comment that refers to kvm uapi header path
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (5 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 06/20] KVM: Fix comments that refer to slots_lock Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 08/20] KVM: guest_memfd: Allow host to map guest_memfd pages Fuad Tabba
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

The comment that points to the header where the user-visible memslot
flags are defined refers to an outdated path and has a typo.

Update the comment to refer to the correct path.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9c654dfb6dce..1ec71648824c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -52,7 +52,7 @@
 /*
  * The bit 16 ~ bit 31 of kvm_userspace_memory_region::flags are internally
  * used in kvm, other bits are visible for userspace which are defined in
- * include/linux/kvm_h.
+ * include/uapi/linux/kvm.h.
  */
 #define KVM_MEMSLOT_INVALID	(1UL << 16)
 
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 08/20] KVM: guest_memfd: Allow host to map guest_memfd pages
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (6 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 07/20] KVM: Fix comment that refers to kvm uapi header path Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot Fuad Tabba
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Introduce the core infrastructure to enable host userspace to mmap()
guest_memfd-backed memory. This is needed for several evolving KVM use
cases:

* Non-CoCo VM backing: Allows VMMs like Firecracker to run guests
  entirely backed by guest_memfd, even for non-CoCo VMs [1]. This
  provides a unified memory management model and simplifies guest memory
  handling.

* Direct map removal for enhanced security: This is an important step
  for direct map removal of guest memory [2]. By allowing host userspace
  to fault in guest_memfd pages directly, we can avoid maintaining host
  kernel direct maps of guest memory. This provides additional hardening
  against Spectre-like transient execution attacks by removing a
  potential attack surface within the kernel.

* Future guest_memfd features: This also lays the groundwork for future
  enhancements to guest_memfd, such as supporting huge pages and
  enabling in-place sharing of guest memory with the host for CoCo
  platforms that permit it [3].

Therefore, enable the basic mmap and fault handling logic within
guest_memfd. However, this functionality is not yet exposed to userspace
and remains inactive until two conditions are met in subsequent patches:

* Kconfig Gate (CONFIG_KVM_GMEM_SUPPORTS_MMAP): A new Kconfig option,
  KVM_GMEM_SUPPORTS_MMAP, is introduced later in this series. This
  option gates the compilation and availability of this mmap
  functionality at a system level. While the code changes in this patch
  might seem small, the Kconfig option is introduced to explicitly
  signal the intent to enable this new capability and to provide a clear
  compile-time switch for it. It also helps ensure that the necessary
  architecture-specific glue (like kvm_arch_supports_gmem_mmap) is
  properly defined.

* Per-instance opt-in (GUEST_MEMFD_FLAG_MMAP): On a per-instance basis,
  this functionality is enabled by the guest_memfd flag
  GUEST_MEMFD_FLAG_MMAP, which will be set in the KVM_CREATE_GUEST_MEMFD
  ioctl. This flag is crucial because when host userspace maps
  guest_memfd pages, KVM must *not* manage these memory regions in
  the same way it does for traditional KVM memory slots. The presence of
  GUEST_MEMFD_FLAG_MMAP on a guest_memfd instance allows mmap() and
  faulting of guest_memfd memory to host userspace (see the usage sketch
  after the reference links below). Additionally, it
  informs KVM to always consume guest faults to this memory from
  guest_memfd, regardless of whether it is a shared or a private fault.
  This opt-in mechanism ensures compatibility and prevents conflicts
  with existing KVM memory management. This is a per-guest_memfd flag
  rather than a per-memslot or per-VM capability because the ability to
  mmap directly applies to the specific guest_memfd object, regardless
  of how it might be used within various memory slots or VMs.

[1] https://github.com/firecracker-microvm/firecracker/tree/feature/secret-hiding
[2] https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com
[3] https://lore.kernel.org/all/c1c9591d-218a-495c-957b-ba356c8f8e09@redhat.com/T/#u
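
A brief usage sketch of the opt-in (hedged: GUEST_MEMFD_FLAG_MMAP is
added by this patch, while the advertising capability only arrives
later in the series; vm_fd setup and error handling are elided).
kvm_gmem_mmap() accepts only shared mappings, so MAP_PRIVATE fails with
EINVAL, and mmap() of a guest_memfd created without the flag fails with
ENODEV:

  struct kvm_create_guest_memfd args = {
	.size = 0x200000,
	.flags = GUEST_MEMFD_FLAG_MMAP,
  };
  int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);

  /* Succeeds: MAP_SHARED sets VM_SHARED | VM_MAYSHARE on the VMA. */
  void *ok = mmap(NULL, args.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		  fd, 0);

  /* Fails with EINVAL: kvm_gmem_mmap() rejects private mappings. */
  void *bad = mmap(NULL, args.size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   fd, 0);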

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Acked-by: David Hildenbrand <david@redhat.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 13 +++++++
 include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig         |  4 +++
 virt/kvm/guest_memfd.c   | 73 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 91 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 1ec71648824c..9ac21985f3b5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -740,6 +740,19 @@ static inline bool kvm_arch_supports_gmem(struct kvm *kvm)
 }
 #endif
 
+/*
+ * Returns true if this VM supports mmap() in guest_memfd.
+ *
+ * Arch code must define kvm_arch_supports_gmem_mmap if support for guest_memfd
+ * is enabled.
+ */
+#if !defined(kvm_arch_supports_gmem_mmap)
+static inline bool kvm_arch_supports_gmem_mmap(struct kvm *kvm)
+{
+	return false;
+}
+#endif
+
 #ifndef kvm_arch_has_readonly_mem
 static inline bool kvm_arch_has_readonly_mem(struct kvm *kvm)
 {
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 37891580d05d..c71348db818f 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1592,6 +1592,7 @@ struct kvm_memory_attributes {
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+#define GUEST_MEMFD_FLAG_MMAP	(1ULL << 0)
 
 struct kvm_create_guest_memfd {
 	__u64 size;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 559c93ad90be..fa4acbedb953 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -128,3 +128,7 @@ config HAVE_KVM_ARCH_GMEM_PREPARE
 config HAVE_KVM_ARCH_GMEM_INVALIDATE
        bool
        depends on KVM_GMEM
+
+config KVM_GMEM_SUPPORTS_MMAP
+       select KVM_GMEM
+       bool
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6db515833f61..07a4b165471d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -312,7 +312,77 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static bool kvm_gmem_supports_mmap(struct inode *inode)
+{
+	const u64 flags = (u64)inode->i_private;
+
+	if (!IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP))
+		return false;
+
+	return flags & GUEST_MEMFD_FLAG_MMAP;
+}
+
+static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
+{
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	struct folio *folio;
+	vm_fault_t ret = VM_FAULT_LOCKED;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return VM_FAULT_SIGBUS;
+
+	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	if (IS_ERR(folio)) {
+		int err = PTR_ERR(folio);
+
+		if (err == -EAGAIN)
+			return VM_FAULT_RETRY;
+
+		return vmf_error(err);
+	}
+
+	if (WARN_ON_ONCE(folio_test_large(folio))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_folio;
+	}
+
+	if (!folio_test_uptodate(folio)) {
+		clear_highpage(folio_page(folio, 0));
+		kvm_gmem_mark_prepared(folio);
+	}
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+
+out_folio:
+	if (ret != VM_FAULT_LOCKED) {
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	return ret;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault_user_mapping,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if (!kvm_gmem_supports_mmap(file_inode(file)))
+		return -ENODEV;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
@@ -463,6 +533,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 	u64 flags = args->flags;
 	u64 valid_flags = 0;
 
+	if (kvm_arch_supports_gmem_mmap(kvm))
+		valid_flags |= GUEST_MEMFD_FLAG_MMAP;
+
 	if (flags & ~valid_flags)
 		return -EINVAL;
 
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (7 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 08/20] KVM: guest_memfd: Allow host to map guest_memfd pages Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  8:34   ` Shivank Garg
  2025-07-09 10:59 ` [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level Fuad Tabba
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of
memslot->flags. This flag tracks when a guest_memfd-backed memory slot
supports host userspace mmap operations. It's strictly for KVM's
internal use.

This optimization avoids repeatedly checking the underlying guest_memfd
file for mmap support, which would otherwise require taking and
releasing a reference on the file for each check. By caching this
information directly in the memslot, we reduce overhead and simplify the
logic involved in handling guest_memfd-backed pages for host mappings.
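
For illustration, a hedged sketch of the binding step that leads to
this flag being set (KVM_SET_USER_MEMORY_REGION2 and struct
kvm_userspace_memory_region2 are pre-existing uapi; gmem_fd is assumed
to have been created with GUEST_MEMFD_FLAG_MMAP and mem to be its
mmap()ed view):

  struct kvm_userspace_memory_region2 region = {
	.slot = 0,
	.flags = KVM_MEM_GUEST_MEMFD,
	.guest_phys_addr = 0,
	.memory_size = size,
	.userspace_addr = (__u64)(unsigned long)mem,
	.guest_memfd = gmem_fd,
	.guest_memfd_offset = 0,
  };

  /*
   * During binding, kvm_gmem_bind() sees that the inode supports mmap
   * and marks the slot with KVM_MEMSLOT_GMEM_ONLY, so later checks via
   * kvm_memslot_is_gmem_only() need not take a reference on the file.
   */
  ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);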

Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 include/linux/kvm_host.h | 11 ++++++++++-
 virt/kvm/guest_memfd.c   |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9ac21985f3b5..d2218ec57ceb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -54,7 +54,8 @@
  * used in kvm, other bits are visible for userspace which are defined in
  * include/uapi/linux/kvm.h.
  */
-#define KVM_MEMSLOT_INVALID	(1UL << 16)
+#define KVM_MEMSLOT_INVALID			(1UL << 16)
+#define KVM_MEMSLOT_GMEM_ONLY			(1UL << 17)
 
 /*
  * Bit 63 of the memslot generation number is an "update in-progress flag",
@@ -2536,6 +2537,14 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
 
+static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
+{
+	if (!IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP))
+		return false;
+
+	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
+}
+
 #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 07a4b165471d..2b00f8796a15 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -592,6 +592,8 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 	 */
 	WRITE_ONCE(slot->gmem.file, file);
 	slot->gmem.pgoff = start;
+	if (kvm_gmem_supports_mmap(inode))
+		slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
 
 	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
 	filemap_invalidate_unlock(inode->i_mapping);
-- 
2.50.0.727.gbf7dc18ff4-goog




* [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (8 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  9:36   ` David Hildenbrand
  2025-07-09 10:59 ` [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level Fuad Tabba
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

From: Ackerley Tng <ackerleytng@google.com>

Generalize the private_max_mapping_level x86 operation to
max_mapping_level.

The private_max_mapping_level operation allows platform-specific code to
limit mapping levels (e.g., forcing 4K pages for certain memory types).
While it was previously used exclusively for private memory, guest_memfd
can now back both private and non-private memory. Platforms may have
specific mapping level restrictions that apply to guest_memfd memory
regardless of its privacy attribute. Therefore, generalize this
operation.

Rename the operation: remove the "private" prefix to reflect its
broader applicability to any guest_memfd-backed memory.

Pass kvm_page_fault information: The operation is updated to receive a
struct kvm_page_fault object instead of just the pfn. This provides
platform-specific implementations (e.g., for TDX or SEV) with additional
context about the fault, such as whether it is private or shared,
allowing them to apply different mapping level rules as needed.

Enforce "private-only" behavior (for now): Since the current consumers
of this hook (TDX and SEV) still primarily use it to enforce private
memory constraints, platform-specific implementations are made to return
0 for non-private pages. A return value of 0 signals to callers that
platform-specific input should be ignored for that particular fault,
indicating no specific platform-imposed mapping level limits for
non-private pages. This allows the core MMU to continue determining the
mapping level based on generic rules for such cases.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +-
 arch/x86/include/asm/kvm_host.h    |  2 +-
 arch/x86/kvm/mmu/mmu.c             | 11 ++++++-----
 arch/x86/kvm/svm/sev.c             |  8 ++++++--
 arch/x86/kvm/svm/svm.c             |  2 +-
 arch/x86/kvm/svm/svm.h             |  4 ++--
 arch/x86/kvm/vmx/main.c            |  6 +++---
 arch/x86/kvm/vmx/tdx.c             |  5 ++++-
 arch/x86/kvm/vmx/x86_ops.h         |  2 +-
 9 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8d50e3e0a19b..02301fbad449 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -146,7 +146,7 @@ KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
 KVM_X86_OP_OPTIONAL(get_untagged_addr)
 KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
-KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
+KVM_X86_OP_OPTIONAL_RET0(max_mapping_level)
 KVM_X86_OP_OPTIONAL(gmem_invalidate)
 
 #undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ebddedf0a1f2..4c764faa12f3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1901,7 +1901,7 @@ struct kvm_x86_ops {
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
-	int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+	int (*max_mapping_level)(struct kvm *kvm, struct kvm_page_fault *fault);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 213904daf1e5..bb925994cbc5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4467,9 +4467,11 @@ static inline u8 kvm_max_level_for_order(int order)
 	return PG_LEVEL_4K;
 }
 
-static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
-					u8 max_level, int gmem_order)
+static u8 kvm_max_private_mapping_level(struct kvm *kvm,
+					struct kvm_page_fault *fault,
+					int gmem_order)
 {
+	u8 max_level = fault->max_level;
 	u8 req_max_level;
 
 	if (max_level == PG_LEVEL_4K)
@@ -4479,7 +4481,7 @@ static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
-	req_max_level = kvm_x86_call(private_max_mapping_level)(kvm, pfn);
+	req_max_level = kvm_x86_call(max_mapping_level)(kvm, fault);
 	if (req_max_level)
 		max_level = min(max_level, req_max_level);
 
@@ -4511,8 +4513,7 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
-	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
-							 fault->max_level, max_order);
+	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault, max_order);
 
 	return RET_PF_CONTINUE;
 }
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index ade7a5b36c68..58116439d7c0 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -29,6 +29,7 @@
 #include <asm/msr.h>
 #include <asm/sev.h>
 
+#include "mmu/mmu_internal.h"
 #include "mmu.h"
 #include "x86.h"
 #include "svm.h"
@@ -4898,7 +4899,7 @@ void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end)
 	}
 }
 
-int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+int sev_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault)
 {
 	int level, rc;
 	bool assigned;
@@ -4906,7 +4907,10 @@ int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 	if (!sev_snp_guest(kvm))
 		return 0;
 
-	rc = snp_lookup_rmpentry(pfn, &assigned, &level);
+	if (!fault->is_private)
+		return 0;
+
+	rc = snp_lookup_rmpentry(fault->pfn, &assigned, &level);
 	if (rc || !assigned)
 		return PG_LEVEL_4K;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d1c484eaa8ad..6ad047189210 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5347,7 +5347,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 
 	.gmem_prepare = sev_gmem_prepare,
 	.gmem_invalidate = sev_gmem_invalidate,
-	.private_max_mapping_level = sev_private_max_mapping_level,
+	.max_mapping_level = sev_max_mapping_level,
 };
 
 /*
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index e6f3c6a153a0..c2579f7df734 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -787,7 +787,7 @@ void sev_handle_rmp_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code);
 void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
-int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int sev_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault);
 struct vmcb_save_area *sev_decrypt_vmsa(struct kvm_vcpu *vcpu);
 void sev_free_decrypted_vmsa(struct kvm_vcpu *vcpu, struct vmcb_save_area *vmsa);
 #else
@@ -816,7 +816,7 @@ static inline int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, in
 	return 0;
 }
 static inline void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) {}
-static inline int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+static inline int sev_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault)
 {
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d1e02e567b57..8e53554932ba 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -871,10 +871,10 @@ static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return tdx_vcpu_ioctl(vcpu, argp);
 }
 
-static int vt_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+static int vt_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault)
 {
 	if (is_td(kvm))
-		return tdx_gmem_private_max_mapping_level(kvm, pfn);
+		return tdx_gmem_max_mapping_level(kvm, fault);
 
 	return 0;
 }
@@ -1044,7 +1044,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.mem_enc_ioctl = vt_op_tdx_only(mem_enc_ioctl),
 	.vcpu_mem_enc_ioctl = vt_op_tdx_only(vcpu_mem_enc_ioctl),
 
-	.private_max_mapping_level = vt_op_tdx_only(gmem_private_max_mapping_level)
+	.max_mapping_level = vt_op_tdx_only(gmem_max_mapping_level)
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c227516e6a02..1607b1f6be21 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3292,8 +3292,11 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	return ret;
 }
 
-int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
+int tdx_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault)
 {
+	if (!fault->is_private)
+		return 0;
+
 	return PG_LEVEL_4K;
 }
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b4596f651232..ca7bc9e0fce5 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -163,7 +163,7 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 void tdx_flush_tlb_current(struct kvm_vcpu *vcpu);
 void tdx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
-int tdx_gmem_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int tdx_gmem_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault);
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (9 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  9:38   ` David Hildenbrand
  2025-07-09 10:59 ` [PATCH v13 12/20] KVM: x86/mmu: Consult guest_memfd when computing max_mapping_level Fuad Tabba
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

From: Ackerley Tng <ackerleytng@google.com>

Refactor kvm_max_private_mapping_level() to accept a NULL kvm_page_fault
pointer and rename it to kvm_gmem_max_mapping_level().

The max_mapping_level x86 operation (previously private_max_mapping_level)
is designed to potentially be called without an active page fault, for
instance, when kvm_mmu_max_mapping_level() is determining the maximum
mapping level for a gfn proactively.

Allow NULL fault pointer: Modify kvm_max_private_mapping_level() to
safely handle a NULL fault argument. This aligns its interface with the
kvm_x86_ops.max_mapping_level operation it wraps, which can also be
called with NULL.

Rename function to kvm_gmem_max_mapping_level(): This reinforces that
the function's scope is for guest_memfd-backed memory, which can be
either private or non-private, removing any remaining "private"
connotation from its name.

Optimize max_level checks: Introduce a check in the caller to skip
querying for max_mapping_level if the current max_level is already
PG_LEVEL_4K, as no further reduction is possible.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index bb925994cbc5..495dcedaeafa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4467,17 +4467,13 @@ static inline u8 kvm_max_level_for_order(int order)
 	return PG_LEVEL_4K;
 }
 
-static u8 kvm_max_private_mapping_level(struct kvm *kvm,
-					struct kvm_page_fault *fault,
-					int gmem_order)
+static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
+				     struct kvm_page_fault *fault)
 {
-	u8 max_level = fault->max_level;
 	u8 req_max_level;
+	u8 max_level;
 
-	if (max_level == PG_LEVEL_4K)
-		return PG_LEVEL_4K;
-
-	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
+	max_level = kvm_max_level_for_order(order);
 	if (max_level == PG_LEVEL_4K)
 		return PG_LEVEL_4K;
 
@@ -4513,7 +4509,10 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	}
 
 	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
-	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault, max_order);
+	if (fault->max_level >= PG_LEVEL_4K) {
+		fault->max_level = kvm_gmem_max_mapping_level(vcpu->kvm,
+							      max_order, fault);
+	}
 
 	return RET_PF_CONTINUE;
 }
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 12/20] KVM: x86/mmu: Consult guest_memfd when computing max_mapping_level
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (10 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 13/20] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory Fuad Tabba
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

From: Ackerley Tng <ackerleytng@google.com>

Modify kvm_mmu_max_mapping_level() to consult guest_memfd for memory
regions backed by it when computing the maximum mapping level,
especially during huge page recovery.

Previously, kvm_mmu_max_mapping_level() was designed primarily for
host-backed memory and private pages. With guest_memfd now supporting
non-private memory, it's necessary to factor in guest_memfd's influence
on mapping levels for such memory.

Since guest_memfd can now back non-private memory, make
kvm_mmu_max_mapping_level() take input from guest_memfd when recovering
huge pages.

Input is taken from guest_memfd as long as a fault to that slot and gfn
would have been served from guest_memfd. For now, take a shortcut if the
slot and gfn point to memory that is private, since huge page recovery
isn't supported for private memory yet.

Since guest_memfd memory can also be faulted into host page tables,
__kvm_mmu_max_mapping_level() still applies, as consulting lpage_info
and the host page tables is still required.

Move functions kvm_max_level_for_order() and
kvm_gmem_max_mapping_level() so kvm_mmu_max_mapping_level() can use
those functions.

Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/kvm/mmu/mmu.c   | 90 ++++++++++++++++++++++++----------------
 include/linux/kvm_host.h |  7 ++++
 virt/kvm/guest_memfd.c   | 17 ++++++++
 3 files changed, 79 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 495dcedaeafa..6d997063f76f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3282,13 +3282,67 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
 	return min(host_level, max_level);
 }
 
+static u8 kvm_max_level_for_order(int order)
+{
+	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
+
+	KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
+			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
+		return PG_LEVEL_1G;
+
+	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
+		return PG_LEVEL_2M;
+
+	return PG_LEVEL_4K;
+}
+
+static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
+				     struct kvm_page_fault *fault)
+{
+	u8 req_max_level;
+	u8 max_level;
+
+	max_level = kvm_max_level_for_order(order);
+	if (max_level == PG_LEVEL_4K)
+		return PG_LEVEL_4K;
+
+	req_max_level = kvm_x86_call(max_mapping_level)(kvm, fault);
+	if (req_max_level)
+		max_level = min(max_level, req_max_level);
+
+	return max_level;
+}
+
 int kvm_mmu_max_mapping_level(struct kvm *kvm,
 			      const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	bool is_private = kvm_slot_has_gmem(slot) &&
 			  kvm_mem_is_private(kvm, gfn);
+	int max_level = PG_LEVEL_NUM;
+
+	/*
+	 * For now, kvm_mmu_max_mapping_level() is only called from
+	 * kvm_mmu_recover_huge_pages(), and that's not yet supported for
+	 * private memory, hence we can take a shortcut and return early.
+	 */
+	if (is_private)
+		return PG_LEVEL_4K;
 
-	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private);
+	/*
+	 * For non-private pages that would have been faulted from guest_memfd,
+	 * let guest_memfd influence max_mapping_level.
+	 */
+	if (kvm_memslot_is_gmem_only(slot)) {
+		int order = kvm_gmem_mapping_order(slot, gfn);
+
+		max_level = min(max_level,
+				kvm_gmem_max_mapping_level(kvm, order, NULL));
+	}
+
+	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, max_level, is_private);
 }
 
 void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
@@ -4450,40 +4504,6 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 		vcpu->stat.pf_fixed++;
 }
 
-static inline u8 kvm_max_level_for_order(int order)
-{
-	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
-
-	KVM_MMU_WARN_ON(order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G) &&
-			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M) &&
-			order != KVM_HPAGE_GFN_SHIFT(PG_LEVEL_4K));
-
-	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
-		return PG_LEVEL_1G;
-
-	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
-		return PG_LEVEL_2M;
-
-	return PG_LEVEL_4K;
-}
-
-static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
-				     struct kvm_page_fault *fault)
-{
-	u8 req_max_level;
-	u8 max_level;
-
-	max_level = kvm_max_level_for_order(order);
-	if (max_level == PG_LEVEL_4K)
-		return PG_LEVEL_4K;
-
-	req_max_level = kvm_x86_call(max_mapping_level)(kvm, fault);
-	if (req_max_level)
-		max_level = min(max_level, req_max_level);
-
-	return max_level;
-}
-
 static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
 				      struct kvm_page_fault *fault, int r)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d2218ec57ceb..662271314778 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2574,6 +2574,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
 		     int *max_order);
+int kvm_gmem_mapping_order(const struct kvm_memory_slot *slot, gfn_t gfn);
 #else
 static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
@@ -2583,6 +2584,12 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 	KVM_BUG_ON(1, kvm);
 	return -EIO;
 }
+static inline int kvm_gmem_mapping_order(const struct kvm_memory_slot *slot,
+					 gfn_t gfn)
+{
+	WARN_ONCE(1, "Unexpected call since gmem is disabled.");
+	return 0;
+}
 #endif /* CONFIG_KVM_GMEM */
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 2b00f8796a15..d01bd7a2c2bd 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -713,6 +713,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_get_pfn);
 
+/**
+ * kvm_gmem_mapping_order() - Get the mapping order for this @gfn in @slot.
+ *
+ * @slot: the memslot that gfn belongs to.
+ * @gfn: the gfn to look up mapping order for.
+ *
+ * This is equal to max_order that would be returned if kvm_gmem_get_pfn() were
+ * called now.
+ *
+ * Return: the mapping order for this @gfn in @slot.
+ */
+int kvm_gmem_mapping_order(const struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_mapping_order);
+
 #ifdef CONFIG_KVM_GENERIC_GMEM_POPULATE
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
 		       kvm_gmem_populate_cb post_populate, void *opaque)
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 13/20] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (11 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 12/20] KVM: x86/mmu: Consult guest_memfd when computing max_mapping_level Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type Fuad Tabba
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

From: Ackerley Tng <ackerleytng@google.com>

Update the KVM MMU fault handler to service guest page faults
for memory slots backed by guest_memfd with mmap support. For such
slots, the MMU must always fault in pages directly from guest_memfd,
bypassing the host's userspace_addr.

This ensures that guest_memfd-backed memory is always handled through
the guest_memfd specific faulting path, regardless of whether it's for
private or non-private (shared) use cases.

Additionally, rename kvm_mmu_faultin_pfn_private() to
kvm_mmu_faultin_pfn_gmem(), as this function is now used to fault in
pages from guest_memfd for both private and non-private memory,
accommodating the new use cases.

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6d997063f76f..cc4cdfea343b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4511,8 +4511,8 @@ static void kvm_mmu_finish_page_fault(struct kvm_vcpu *vcpu,
 				 r == RET_PF_RETRY, fault->map_writable);
 }
 
-static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
-				       struct kvm_page_fault *fault)
+static int kvm_mmu_faultin_pfn_gmem(struct kvm_vcpu *vcpu,
+				    struct kvm_page_fault *fault)
 {
 	int max_order, r;
 
@@ -4537,13 +4537,18 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
 	return RET_PF_CONTINUE;
 }
 
+static bool fault_from_gmem(struct kvm_page_fault *fault)
+{
+	return fault->is_private || kvm_memslot_is_gmem_only(fault->slot);
+}
+
 static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 				 struct kvm_page_fault *fault)
 {
 	unsigned int foll = fault->write ? FOLL_WRITE : 0;
 
-	if (fault->is_private)
-		return kvm_mmu_faultin_pfn_private(vcpu, fault);
+	if (fault_from_gmem(fault))
+		return kvm_mmu_faultin_pfn_gmem(vcpu, fault);
 
 	foll |= FOLL_NOWAIT;
 	fault->pfn = __kvm_faultin_pfn(fault->slot, fault->gfn, foll,
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (12 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 13/20] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  1:14   ` kernel test robot
  2025-07-11  9:45   ` David Hildenbrand
  2025-07-09 10:59 ` [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort() Fuad Tabba
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Enable host userspace mmap support for guest_memfd-backed memory when
running KVM with the KVM_X86_DEFAULT_VM type:

* Define kvm_arch_supports_gmem_mmap() for KVM_X86_DEFAULT_VM: Introduce
  the architecture-specific kvm_arch_supports_gmem_mmap() macro,
  specifically enabling mmap support for KVM_X86_DEFAULT_VM instances.
  This macro, gated by CONFIG_KVM_GMEM_SUPPORTS_MMAP, ensures that only
  the default VM type can leverage guest_memfd mmap functionality on
  x86. This explicit enablement prevents CoCo VMs, which use guest_memfd
  primarily for private memory and rely on hardware-enforced privacy,
  from accidentally exposing guest memory via host userspace mappings.

* Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in KVM_X86: Enable the
  CONFIG_KVM_GMEM_SUPPORTS_MMAP Kconfig option when KVM_X86 is selected.
  This ensures that the necessary code for guest_memfd mmap support
  (introduced earlier) is compiled into the kernel for x86. This Kconfig
  option acts as a system-wide gate for the guest_memfd mmap capability.
  It implicitly enables CONFIG_KVM_GMEM, making guest_memfd available,
  and then layers the mmap capability on top specifically for the
  default VM.

These changes make guest_memfd a more versatile memory backing for
standard KVM guests, allowing VMMs to use a unified guest_memfd model
for both private (CoCo) and non-private (default) VMs. This is a
prerequisite for use cases such as running Firecracker guests entirely
backed by guest_memfd and implementing direct map removal for non-CoCo
VMs.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/x86/include/asm/kvm_host.h | 9 +++++++++
 arch/x86/kvm/Kconfig            | 1 +
 arch/x86/kvm/x86.c              | 3 ++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4c764faa12f3..4c89feaa1910 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2273,9 +2273,18 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 #ifdef CONFIG_KVM_GMEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #define kvm_arch_supports_gmem(kvm)  ((kvm)->arch.supports_gmem)
+
+/*
+ * CoCo VMs with hardware support that use guest_memfd only for backing private
+ * memory, e.g., TDX, cannot use guest_memfd with userspace mapping enabled.
+ */
+#define kvm_arch_supports_gmem_mmap(kvm)		\
+	(IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP) &&	\
+	 (kvm)->arch.vm_type == KVM_X86_DEFAULT_VM)
 #else
 #define kvm_arch_has_private_mem(kvm) false
 #define kvm_arch_supports_gmem(kvm) false
+#define kvm_arch_supports_gmem_mmap(kvm) false
 #endif
 
 #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index df1fdbb4024b..239637b663dc 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -47,6 +47,7 @@ config KVM_X86
 	select KVM_GENERIC_HARDWARE_ENABLING
 	select KVM_GENERIC_PRE_FAULT_MEMORY
 	select KVM_GENERIC_GMEM_POPULATE if KVM_SW_PROTECTED_VM
+	select KVM_GMEM_SUPPORTS_MMAP
 	select KVM_WERROR if WERROR
 
 config KVM
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b34236029383..17c655e5716e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12779,7 +12779,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	kvm->arch.vm_type = type;
 	kvm->arch.has_private_mem = (type == KVM_X86_SW_PROTECTED_VM);
-	kvm->arch.supports_gmem = (type == KVM_X86_SW_PROTECTED_VM);
+	kvm->arch.supports_gmem =
+		type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM;
 	/* Decided by the vendor code for other VM types.  */
 	kvm->arch.pre_fault_allowed =
 		type == KVM_X86_DEFAULT_VM || type == KVM_X86_SW_PROTECTED_VM;
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort()
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (13 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11 13:25   ` Marc Zyngier
  2025-07-09 10:59 ` [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Refactor user_mem_abort() to improve code clarity and simplify
assumptions within the function.

Key changes include:

* Immediately set force_pte to true at the beginning of the function if
  logging_active is true. This simplifies the flow and makes the
  condition for forcing a PTE more explicit.

* Remove the misleading comment stating that logging_active is
  guaranteed to never be true for VM_PFNMAP memslots, as this assertion
  is not entirely correct.

* Extract reusable code blocks into new helper functions:
  * prepare_mmu_memcache(): Encapsulates the logic for preparing and
    topping up the MMU page cache.
  * adjust_nested_fault_perms(): Isolates the adjustments to shadow S2
    permissions and the encoding of nested translation levels.

* Update min(a, (long)b) to min_t(long, a, b) for better type safety and
  consistency.

* Perform other minor tidying up of the code.

These changes primarily aim to simplify user_mem_abort() and make its
logic easier to understand and maintain, setting the stage for future
modifications.

No functional change intended.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 100 ++++++++++++++++++++++++-------------------
 1 file changed, 55 insertions(+), 45 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 2942ec92c5a4..58662e0ef13e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1470,13 +1470,56 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_MTE_ALLOWED;
 }
 
+static int prepare_mmu_memcache(struct kvm_vcpu *vcpu, bool topup_memcache,
+				void **memcache)
+{
+	int min_pages;
+
+	if (!is_protected_kvm_enabled())
+		*memcache = &vcpu->arch.mmu_page_cache;
+	else
+		*memcache = &vcpu->arch.pkvm_memcache;
+
+	if (!topup_memcache)
+		return 0;
+
+	min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
+
+	if (!is_protected_kvm_enabled())
+		return kvm_mmu_topup_memory_cache(*memcache, min_pages);
+
+	return topup_hyp_memcache(*memcache, min_pages);
+}
+
+/*
+ * Potentially reduce shadow S2 permissions to match the guest's own S2. For
+ * exec faults, we'd only reach this point if the guest actually allowed it (see
+ * kvm_s2_handle_perm_fault).
+ *
+ * Also encode the level of the original translation in the SW bits of the leaf
+ * entry as a proxy for the span of that translation. This will be retrieved on
+ * TLB invalidation from the guest and used to limit the invalidation scope if a
+ * TTL hint or a range isn't provided.
+ */
+static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
+				      enum kvm_pgtable_prot *prot,
+				      bool *writable)
+{
+	*writable &= kvm_s2_trans_writable(nested);
+	if (!kvm_s2_trans_readable(nested))
+		*prot &= ~KVM_PGTABLE_PROT_R;
+
+	*prot |= kvm_encode_nested_level(nested);
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
 			  bool fault_is_perm)
 {
 	int ret = 0;
-	bool write_fault, writable, force_pte = false;
+	bool topup_memcache;
+	bool write_fault, writable;
 	bool exec_fault, mte_allowed;
 	bool device = false, vfio_allow_any_uc = false;
 	unsigned long mmu_seq;
@@ -1488,6 +1531,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 	bool logging_active = memslot_is_logging(memslot);
+	bool force_pte = logging_active;
 	long vma_pagesize, fault_granule;
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
@@ -1505,28 +1549,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
-	if (!is_protected_kvm_enabled())
-		memcache = &vcpu->arch.mmu_page_cache;
-	else
-		memcache = &vcpu->arch.pkvm_memcache;
-
 	/*
 	 * Permission faults just need to update the existing leaf entry,
 	 * and so normally don't require allocations from the memcache. The
 	 * only exception to this is when dirty logging is enabled at runtime
 	 * and a write fault needs to collapse a block entry into a table.
 	 */
-	if (!fault_is_perm || (logging_active && write_fault)) {
-		int min_pages = kvm_mmu_cache_min_pages(vcpu->arch.hw_mmu);
-
-		if (!is_protected_kvm_enabled())
-			ret = kvm_mmu_topup_memory_cache(memcache, min_pages);
-		else
-			ret = topup_hyp_memcache(memcache, min_pages);
-
-		if (ret)
-			return ret;
-	}
+	topup_memcache = !fault_is_perm || (logging_active && write_fault);
+	ret = prepare_mmu_memcache(vcpu, topup_memcache, &memcache);
+	if (ret)
+		return ret;
 
 	/*
 	 * Let's check if we will get back a huge page backed by hugetlbfs, or
@@ -1540,16 +1572,10 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		return -EFAULT;
 	}
 
-	/*
-	 * logging_active is guaranteed to never be true for VM_PFNMAP
-	 * memslots.
-	 */
-	if (logging_active) {
-		force_pte = true;
+	if (force_pte)
 		vma_shift = PAGE_SHIFT;
-	} else {
+	else
 		vma_shift = get_vma_page_shift(vma, hva);
-	}
 
 	switch (vma_shift) {
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -1601,7 +1627,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			max_map_size = PAGE_SIZE;
 
 		force_pte = (max_map_size == PAGE_SIZE);
-		vma_pagesize = min(vma_pagesize, (long)max_map_size);
+		vma_pagesize = min_t(long, vma_pagesize, max_map_size);
 	}
 
 	/*
@@ -1630,7 +1656,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	 * Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
 	 * with the smp_wmb() in kvm_mmu_invalidate_end().
 	 */
-	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
+	mmu_seq = kvm->mmu_invalidate_seq;
 	mmap_read_unlock(current->mm);
 
 	pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
@@ -1665,24 +1691,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (exec_fault && device)
 		return -ENOEXEC;
 
-	/*
-	 * Potentially reduce shadow S2 permissions to match the guest's own
-	 * S2. For exec faults, we'd only reach this point if the guest
-	 * actually allowed it (see kvm_s2_handle_perm_fault).
-	 *
-	 * Also encode the level of the original translation in the SW bits
-	 * of the leaf entry as a proxy for the span of that translation.
-	 * This will be retrieved on TLB invalidation from the guest and
-	 * used to limit the invalidation scope if a TTL hint or a range
-	 * isn't provided.
-	 */
-	if (nested) {
-		writable &= kvm_s2_trans_writable(nested);
-		if (!kvm_s2_trans_readable(nested))
-			prot &= ~KVM_PGTABLE_PROT_R;
-
-		prot |= kvm_encode_nested_level(nested);
-	}
+	if (nested)
+		adjust_nested_fault_perms(nested, &prot, &writable);
 
 	kvm_fault_lock(kvm);
 	pgt = vcpu->arch.hw_mmu->pgt;
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (14 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort() Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  9:59   ` Roy, Patrick
  2025-07-11 16:37   ` Marc Zyngier
  2025-07-09 10:59 ` [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory Fuad Tabba
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Add arm64 architecture support for handling guest page faults on memory
slots backed by guest_memfd.

This change introduces a new function, gmem_abort(), which encapsulates
the fault handling logic specific to guest_memfd-backed memory. The
kvm_handle_guest_abort() entry point is updated to dispatch to
gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
determined by kvm_slot_has_gmem()).

Until guest_memfd gains support for huge pages, the fault granule for
these memory regions is restricted to PAGE_SIZE.

Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: James Houghton <jthoughton@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/kvm/mmu.c | 82 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 58662e0ef13e..71f8b53683e7 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1512,6 +1512,78 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
 	*prot |= kvm_encode_nested_level(nested);
 }
 
+#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
+
+static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+		      struct kvm_s2_trans *nested,
+		      struct kvm_memory_slot *memslot, bool is_perm)
+{
+	bool write_fault, exec_fault, writable;
+	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
+	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
+	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
+	struct page *page;
+	struct kvm *kvm = vcpu->kvm;
+	void *memcache;
+	kvm_pfn_t pfn;
+	gfn_t gfn;
+	int ret;
+
+	ret = prepare_mmu_memcache(vcpu, true, &memcache);
+	if (ret)
+		return ret;
+
+	if (nested)
+		gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
+	else
+		gfn = fault_ipa >> PAGE_SHIFT;
+
+	write_fault = kvm_is_write_fault(vcpu);
+	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
+
+	if (write_fault && exec_fault) {
+		kvm_err("Simultaneous write and execution fault\n");
+		return -EFAULT;
+	}
+
+	if (is_perm && !write_fault && !exec_fault) {
+		kvm_err("Unexpected L2 read permission error\n");
+		return -EFAULT;
+	}
+
+	ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
+	if (ret) {
+		kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
+					      write_fault, exec_fault, false);
+		return ret;
+	}
+
+	writable = !(memslot->flags & KVM_MEM_READONLY);
+
+	if (nested)
+		adjust_nested_fault_perms(nested, &prot, &writable);
+
+	if (writable)
+		prot |= KVM_PGTABLE_PROT_W;
+
+	if (exec_fault ||
+	    (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
+	     (!nested || kvm_s2_trans_executable(nested))))
+		prot |= KVM_PGTABLE_PROT_X;
+
+	kvm_fault_lock(kvm);
+	ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
+						 __pfn_to_phys(pfn), prot,
+						 memcache, flags);
+	kvm_release_faultin_page(kvm, page, !!ret, writable);
+	kvm_fault_unlock(kvm);
+
+	if (writable && !ret)
+		mark_page_dirty_in_slot(kvm, memslot, gfn);
+
+	return ret != -EAGAIN ? ret : 0;
+}
+
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 			  struct kvm_s2_trans *nested,
 			  struct kvm_memory_slot *memslot, unsigned long hva,
@@ -1536,7 +1608,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
 	struct kvm_pgtable *pgt;
 	struct page *page;
-	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED;
+	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
 
 	if (fault_is_perm)
 		fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
@@ -1963,8 +2035,12 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 	}
 
-	ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
-			     esr_fsc_is_permission_fault(esr));
+	if (kvm_slot_has_gmem(memslot))
+		ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
+				 esr_fsc_is_permission_fault(esr));
+	else
+		ret = user_mem_abort(vcpu, fault_ipa, nested, memslot, hva,
+				     esr_fsc_is_permission_fault(esr));
 	if (ret == 0)
 		ret = 1;
 out:
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (15 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11 14:25   ` Marc Zyngier
  2025-07-09 10:59 ` [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP Fuad Tabba
                   ` (2 subsequent siblings)
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Enable host userspace mmap support for guest_memfd-backed memory on
arm64. This change allows the host on arm64 to map guest memory
directly from guest_memfd:

* Define kvm_arch_supports_gmem_mmap() for arm64: The
  kvm_arch_supports_gmem_mmap() macro is defined for arm64 to be true if
  CONFIG_KVM_GMEM_SUPPORTS_MMAP is enabled. For existing arm64 KVM VM
  types that support guest_memfd, this enables them to use guest_memfd
  with host userspace mappings. This provides a consistent behavior as
  there are currently no arm64 CoCo VMs that rely on guest_memfd solely
  for private, non-mappable memory. Future arm64 VM types can override
  or restrict this behavior via the kvm_arch_supports_gmem_mmap() hook
  if needed.

* Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in arm64 Kconfig.

* Enforce KVM_MEMSLOT_GMEM_ONLY for guest_memfd on arm64: Compile-time
  and runtime checks are added to ensure that if guest_memfd is enabled
  on arm64, KVM_GMEM_SUPPORTS_MMAP is also enabled. This means
  guest_memfd-backed memory slots on arm64 are currently only supported
  if they are intended for shared memory use cases (i.e.,
  kvm_memslot_is_gmem_only() is true). This design reflects the current
  arm64 KVM ecosystem where guest_memfd is primarily being introduced
  for VMs that support shared memory.

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 arch/arm64/include/asm/kvm_host.h | 4 ++++
 arch/arm64/kvm/Kconfig            | 1 +
 arch/arm64/kvm/mmu.c              | 8 ++++++++
 3 files changed, 13 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index d27079968341..bd2af5470c66 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -1675,5 +1675,9 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
 void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
 void check_feature_map(void);
 
+#ifdef CONFIG_KVM_GMEM
+#define kvm_arch_supports_gmem(kvm) true
+#define kvm_arch_supports_gmem_mmap(kvm) IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP)
+#endif
 
 #endif /* __ARM64_KVM_HOST_H__ */
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 713248f240e0..28539479f083 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -37,6 +37,7 @@ menuconfig KVM
 	select HAVE_KVM_VCPU_RUN_PID_CHANGE
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
+	select KVM_GMEM_SUPPORTS_MMAP
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 71f8b53683e7..b92ce4d9b4e0 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2274,6 +2274,14 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	if ((new->base_gfn + new->npages) > (kvm_phys_size(&kvm->arch.mmu) >> PAGE_SHIFT))
 		return -EFAULT;
 
+	/*
+	 * Only support guest_memfd backed memslots with mappable memory, since
+	 * there aren't any CoCo VMs that support only private memory on arm64.
+	 */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_KVM_GMEM) && !IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP));
+	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
+		return -EINVAL;
+
 	hva = new->userspace_addr;
 	reg_end = hva + (new->npages << PAGE_SHIFT);
 
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (16 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-11  8:48   ` Shivank Garg
  2025-07-09 10:59 ` [PATCH v13 19/20] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 20/20] KVM: selftests: guest_memfd mmap() test when mmap is supported Fuad Tabba
  19 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Introduce the new KVM capability KVM_CAP_GMEM_MMAP. This capability
signals to userspace that a KVM instance supports host userspace mapping
of guest_memfd-backed memory.

The availability of this capability is determined per architecture, and
its enablement for a specific guest_memfd instance is controlled by the
GUEST_MEMFD_FLAG_MMAP flag at creation time.

Update the KVM API documentation to detail the KVM_CAP_GMEM_MMAP
capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential
information regarding support for mmap in guest_memfd.
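
A minimal userspace sketch of the intended flow (illustrative only:
error handling and #includes are omitted, and the flag, capability, and
ioctl constants for guest_memfd mmap are the ones introduced by this
series):

	int kvm_fd = open("/dev/kvm", O_RDWR);
	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_MMAP) > 0) {
		struct kvm_create_guest_memfd gmem = {
			.size  = 0x200000,
			.flags = GUEST_MEMFD_FLAG_MMAP,
		};
		int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

		/* guest_memfd must be mapped MAP_SHARED. */
		void *mem = mmap(NULL, gmem.size, PROT_READ | PROT_WRITE,
				 MAP_SHARED, gmem_fd, 0);
	}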

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 Documentation/virt/kvm/api.rst | 9 +++++++++
 include/uapi/linux/kvm.h       | 1 +
 virt/kvm/kvm_main.c            | 4 ++++
 3 files changed, 14 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9abf93ee5f65..70261e189162 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6407,6 +6407,15 @@ most one mapping per page, i.e. binding multiple memory regions to a single
 guest_memfd range is not allowed (any number of memory regions can be bound to
 a single guest_memfd file, but the bound ranges must not overlap).
 
+When the capability KVM_CAP_GMEM_MMAP is supported, the 'flags' field supports
+GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation enables mmap()
+and faulting of guest_memfd memory to host userspace.
+
+When the KVM MMU performs a PFN lookup to service a guest fault and the backing
+guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
+consumed from guest_memfd, regardless of whether it is a shared or a private
+fault.
+
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
 4.143 KVM_PRE_FAULT_MEMORY
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index c71348db818f..cbf28237af79 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -956,6 +956,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_EL2 240
 #define KVM_CAP_ARM_EL2_E2H0 241
 #define KVM_CAP_RISCV_MP_STATE_RESET 242
+#define KVM_CAP_GMEM_MMAP 243
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 81bb18fa8655..5463e81b08b9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4913,6 +4913,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 #ifdef CONFIG_KVM_GMEM
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_supports_gmem(kvm);
+#endif
+#ifdef CONFIG_KVM_GMEM_SUPPORTS_MMAP
+	case KVM_CAP_GMEM_MMAP:
+		return !kvm || kvm_arch_supports_gmem_mmap(kvm);
 #endif
 	default:
 		break;
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 19/20] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (17 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  2025-07-09 10:59 ` [PATCH v13 20/20] KVM: selftests: guest_memfd mmap() test when mmap is supported Fuad Tabba
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Update the guest_memfd_test selftest to use getpagesize() instead of
hardcoded 4KB page size values.

Using hardcoded page sizes can cause test failures on architectures or
systems configured with larger page sizes, such as arm64 with 64KB
pages. By dynamically querying the system's page size, the test becomes
more portable and robust across different environments.

Additionally, build the guest_memfd_test selftest for arm64.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Suggested-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm       |  1 +
 tools/testing/selftests/kvm/guest_memfd_test.c | 11 ++++++-----
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index 38b95998e1e6..e11ed9e59ab5 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -172,6 +172,7 @@ TEST_GEN_PROGS_arm64 += arch_timer
 TEST_GEN_PROGS_arm64 += coalesced_io_test
 TEST_GEN_PROGS_arm64 += dirty_log_perf_test
 TEST_GEN_PROGS_arm64 += get-reg-list
+TEST_GEN_PROGS_arm64 += guest_memfd_test
 TEST_GEN_PROGS_arm64 += memslot_modification_stress_test
 TEST_GEN_PROGS_arm64 += memslot_perf_test
 TEST_GEN_PROGS_arm64 += mmu_stress_test
diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ce687f8d248f..341ba616cf55 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -146,24 +146,25 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 {
 	int fd1, fd2, ret;
 	struct stat st1, st2;
+	size_t page_size = getpagesize();
 
-	fd1 = __vm_create_guest_memfd(vm, 4096, 0);
+	fd1 = __vm_create_guest_memfd(vm, page_size, 0);
 	TEST_ASSERT(fd1 != -1, "memfd creation should succeed");
 
 	ret = fstat(fd1, &st1);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st1.st_size == 4096, "memfd st_size should match requested size");
+	TEST_ASSERT(st1.st_size == page_size, "memfd st_size should match requested size");
 
-	fd2 = __vm_create_guest_memfd(vm, 8192, 0);
+	fd2 = __vm_create_guest_memfd(vm, page_size * 2, 0);
 	TEST_ASSERT(fd2 != -1, "memfd creation should succeed");
 
 	ret = fstat(fd2, &st2);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st2.st_size == 8192, "second memfd st_size should match requested size");
+	TEST_ASSERT(st2.st_size == page_size * 2, "second memfd st_size should match requested size");
 
 	ret = fstat(fd1, &st1);
 	TEST_ASSERT(ret != -1, "memfd fstat should succeed");
-	TEST_ASSERT(st1.st_size == 4096, "first memfd st_size should still match requested size");
+	TEST_ASSERT(st1.st_size == page_size, "first memfd st_size should still match requested size");
 	TEST_ASSERT(st1.st_ino != st2.st_ino, "different memfd should have different inode numbers");
 
 	close(fd2);
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v13 20/20] KVM: selftests: guest_memfd mmap() test when mmap is supported
  2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
                   ` (18 preceding siblings ...)
  2025-07-09 10:59 ` [PATCH v13 19/20] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test Fuad Tabba
@ 2025-07-09 10:59 ` Fuad Tabba
  19 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-09 10:59 UTC (permalink / raw)
  To: kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny, tabba

Expand the guest_memfd selftests to comprehensively test host userspace
mmap functionality for guest_memfd-backed memory when supported by the
VM type.

Introduce new test cases to verify the following:

* Successful mmap operations: Ensure that MAP_SHARED mappings succeed
  when guest_memfd mmap is enabled.

* Data integrity: Validate that data written to the mmap'd region is
  correctly persisted and remains readable.

* fallocate interaction: Test that fallocate(FALLOC_FL_PUNCH_HOLE)
  correctly zeros out mapped pages.

* Out-of-bounds access: Verify that accessing memory beyond the
  guest_memfd's size correctly triggers a SIGBUS signal.

* Unsupported mmap: Confirm that mmap attempts fail as expected when
  guest_memfd mmap support is not enabled for the specific guest_memfd
  instance or VM type.

* Flag validity: Introduce test_vm_type_gmem_flag_validity() to
  systematically test that only allowed guest_memfd creation flags are
  accepted for different VM types (e.g., GUEST_MEMFD_FLAG_MMAP for
  default VMs, no flags for CoCo VMs).

The existing tests for guest_memfd creation (multiple instances, invalid
sizes), file read/write, file size, and invalid punch hole operations
are integrated into the new test_with_type() framework to allow testing
across different VM types.
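
For orientation, the selftest helpers used below reduce to roughly the
following raw UAPI sequence (a sketch only: KVM_CREATE_GUEST_MEMFD is
pre-existing UAPI, GUEST_MEMFD_FLAG_MMAP is introduced by this series,
and error handling is omitted):

	struct kvm_create_guest_memfd args = {
		.size  = total_size,	/* must be page-aligned */
		.flags = GUEST_MEMFD_FLAG_MMAP,
	};
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
	char *mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);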

Reviewed-by: James Houghton <jthoughton@google.com>
Reviewed-by: Gavin Shan <gshan@redhat.com>
Reviewed-by: Shivank Garg <shivankg@amd.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 197 ++++++++++++++++--
 1 file changed, 176 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 341ba616cf55..1252e74fbb8f 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -13,6 +13,8 @@
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <setjmp.h>
+#include <signal.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -34,12 +36,83 @@ static void test_file_read_write(int fd)
 		    "pwrite on a guest_mem fd should fail");
 }
 
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_supported(int fd, size_t page_size, size_t total_size)
+{
+	const char val = 0xaa;
+	char *mem;
+	size_t i;
+	int ret;
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
+	TEST_ASSERT(mem == MAP_FAILED, "Copy-on-write not allowed by guest_memfd.");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+
+	memset(mem, val, total_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+	ret = fallocate(fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+			page_size);
+	TEST_ASSERT(!ret, "fallocate the first page should succeed.");
+
+	for (i = 0; i < page_size; i++)
+		TEST_ASSERT_EQ(READ_ONCE(mem[i]), 0x00);
+	for (; i < total_size; i++)
+		TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+	memset(mem, val, page_size);
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap() should succeed.");
+}
+
+static sigjmp_buf jmpbuf;
+void fault_sigbus_handler(int signum)
+{
+	siglongjmp(jmpbuf, 1);
+}
+
+static void test_fault_overflow(int fd, size_t page_size, size_t total_size)
+{
+	struct sigaction sa_old, sa_new = {
+		.sa_handler = fault_sigbus_handler,
+	};
+	size_t map_size = total_size * 4;
+	const char val = 0xaa;
+	char *mem;
+	size_t i;
+	int ret;
+
+	mem = mmap(NULL, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() for guest_memfd should succeed.");
+
+	sigaction(SIGBUS, &sa_new, &sa_old);
+	if (sigsetjmp(jmpbuf, 1) == 0) {
+		memset(mem, 0xaa, map_size);
+		TEST_ASSERT(false, "memset() should have triggered SIGBUS.");
+	}
+	sigaction(SIGBUS, &sa_old, NULL);
+
+	for (i = 0; i < total_size; i++)
+		TEST_ASSERT_EQ(READ_ONCE(mem[i]), val);
+
+	ret = munmap(mem, map_size);
+	TEST_ASSERT(!ret, "munmap() should succeed.");
+}
+
+static void test_mmap_not_supported(int fd, size_t page_size, size_t total_size)
 {
 	char *mem;
 
 	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 	TEST_ASSERT_EQ(mem, MAP_FAILED);
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT_EQ(mem, MAP_FAILED);
 }
 
 static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -120,26 +193,19 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
 	}
 }
 
-static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
+static void test_create_guest_memfd_invalid_sizes(struct kvm_vm *vm,
+						  uint64_t guest_memfd_flags,
+						  size_t page_size)
 {
-	size_t page_size = getpagesize();
-	uint64_t flag;
 	size_t size;
 	int fd;
 
 	for (size = 1; size < page_size; size++) {
-		fd = __vm_create_guest_memfd(vm, size, 0);
-		TEST_ASSERT(fd == -1 && errno == EINVAL,
+		fd = __vm_create_guest_memfd(vm, size, guest_memfd_flags);
+		TEST_ASSERT(fd < 0 && errno == EINVAL,
 			    "guest_memfd() with non-page-aligned page size '0x%lx' should fail with EINVAL",
 			    size);
 	}
-
-	for (flag = BIT(0); flag; flag <<= 1) {
-		fd = __vm_create_guest_memfd(vm, page_size, flag);
-		TEST_ASSERT(fd == -1 && errno == EINVAL,
-			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
-			    flag);
-	}
 }
 
 static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
@@ -171,30 +237,119 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
-int main(int argc, char *argv[])
+static bool check_vm_type(unsigned long vm_type)
 {
-	size_t page_size;
+	/*
+	 * Not all architectures support KVM_CAP_VM_TYPES. However, those that
+	 * support guest_memfd have that support for the default VM type.
+	 */
+	if (vm_type == VM_TYPE_DEFAULT)
+		return true;
+
+	return kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(vm_type);
+}
+
+static void test_with_type(unsigned long vm_type, uint64_t guest_memfd_flags,
+			   bool expect_mmap_allowed)
+{
+	struct kvm_vm *vm;
 	size_t total_size;
+	size_t page_size;
 	int fd;
-	struct kvm_vm *vm;
 
-	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+	if (!check_vm_type(vm_type))
+		return;
 
 	page_size = getpagesize();
 	total_size = page_size * 4;
 
-	vm = vm_create_barebones();
+	vm = vm_create_barebones_type(vm_type);
 
-	test_create_guest_memfd_invalid(vm);
 	test_create_guest_memfd_multiple(vm);
+	test_create_guest_memfd_invalid_sizes(vm, guest_memfd_flags, page_size);
 
-	fd = vm_create_guest_memfd(vm, total_size, 0);
+	fd = vm_create_guest_memfd(vm, total_size, guest_memfd_flags);
 
 	test_file_read_write(fd);
-	test_mmap(fd, page_size);
+
+	if (expect_mmap_allowed) {
+		test_mmap_supported(fd, page_size, total_size);
+		test_fault_overflow(fd, page_size, total_size);
+
+	} else {
+		test_mmap_not_supported(fd, page_size, total_size);
+	}
+
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
 
 	close(fd);
+	kvm_vm_free(vm);
+}
+
+static void test_vm_type_gmem_flag_validity(unsigned long vm_type,
+					    uint64_t expected_valid_flags)
+{
+	size_t page_size = getpagesize();
+	struct kvm_vm *vm;
+	uint64_t flag = 0;
+	int fd;
+
+	if (!check_vm_type(vm_type))
+		return;
+
+	vm = vm_create_barebones_type(vm_type);
+
+	for (flag = BIT(0); flag; flag <<= 1) {
+		fd = __vm_create_guest_memfd(vm, page_size, flag);
+
+		if (flag & expected_valid_flags) {
+			TEST_ASSERT(fd >= 0,
+				    "guest_memfd() with flag '0x%lx' should be valid",
+				    flag);
+			close(fd);
+		} else {
+			TEST_ASSERT(fd < 0 && errno == EINVAL,
+				    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
+				    flag);
+		}
+	}
+
+	kvm_vm_free(vm);
+}
+
+static void test_gmem_flag_validity(void)
+{
+	uint64_t non_coco_vm_valid_flags = 0;
+
+	if (kvm_has_cap(KVM_CAP_GMEM_MMAP))
+		non_coco_vm_valid_flags = GUEST_MEMFD_FLAG_MMAP;
+
+	test_vm_type_gmem_flag_validity(VM_TYPE_DEFAULT, non_coco_vm_valid_flags);
+
+#ifdef __x86_64__
+	test_vm_type_gmem_flag_validity(KVM_X86_SW_PROTECTED_VM, 0);
+	test_vm_type_gmem_flag_validity(KVM_X86_SEV_VM, 0);
+	test_vm_type_gmem_flag_validity(KVM_X86_SEV_ES_VM, 0);
+	test_vm_type_gmem_flag_validity(KVM_X86_SNP_VM, 0);
+	test_vm_type_gmem_flag_validity(KVM_X86_TDX_VM, 0);
+#endif
+}
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+
+	test_gmem_flag_validity();
+
+	test_with_type(VM_TYPE_DEFAULT, 0, false);
+	if (kvm_has_cap(KVM_CAP_GMEM_MMAP)) {
+		test_with_type(VM_TYPE_DEFAULT, GUEST_MEMFD_FLAG_MMAP,
+			       true);
+	}
+
+#ifdef __x86_64__
+	test_with_type(KVM_X86_SW_PROTECTED_VM, 0, false);
+#endif
 }
-- 
2.50.0.727.gbf7dc18ff4-goog



^ permalink raw reply related	[flat|nested] 40+ messages in thread
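(To run the updated test, the usual KVM selftests flow should apply —
paths assume a mainline tree:

	make -C tools/testing/selftests/kvm
	./tools/testing/selftests/kvm/guest_memfd_test

The mmap cases are exercised only when KVM_CAP_GMEM_MMAP is reported;
otherwise the GUEST_MEMFD_FLAG_MMAP pass in main() is simply skipped.)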

* Re: [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type
  2025-07-09 10:59 ` [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type Fuad Tabba
@ 2025-07-11  1:14   ` kernel test robot
  2025-07-11  9:45   ` David Hildenbrand
  1 sibling, 0 replies; 40+ messages in thread
From: kernel test robot @ 2025-07-11  1:14 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: oe-kbuild-all, pbonzini, chenhuacai, mpe, anup, paul.walmsley,
	palmer, aou, seanjc, viro, brauner, willy, akpm, xiaoyao.li,
	yilun.xu, chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata,
	mic, vbabka, vannapurve, ackerleytng, mail, david, michael.roth

Hi Fuad,

kernel test robot noticed the following build errors:

[auto build test ERROR on d7b8f8e20813f0179d8ef519541a3527e7661d3a]

url:    https://github.com/intel-lab-lkp/linux/commits/Fuad-Tabba/KVM-Rename-CONFIG_KVM_PRIVATE_MEM-to-CONFIG_KVM_GMEM/20250709-190344
base:   d7b8f8e20813f0179d8ef519541a3527e7661d3a
patch link:    https://lore.kernel.org/r/20250709105946.4009897-15-tabba%40google.com
patch subject: [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type
config: i386-randconfig-013-20250711 (https://download.01.org/0day-ci/archive/20250711/202507110822.EaBBAGre-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250711/202507110822.EaBBAGre-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507110822.EaBBAGre-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/kvm/../../../virt/kvm/guest_memfd.c: In function 'kvm_gmem_supports_mmap':
   arch/x86/kvm/../../../virt/kvm/guest_memfd.c:317:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     317 |         const u64 flags = (u64)inode->i_private;
         |                           ^
   In file included from <command-line>:
   arch/x86/kvm/../../../virt/kvm/guest_memfd.c: In function 'kvm_gmem_bind':
>> include/linux/compiler_types.h:568:45: error: call to '__compiletime_assert_624' declared with attribute error: BUILD_BUG_ON failed: sizeof(gfn_t) != sizeof(slot->gmem.pgoff)
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:549:25: note: in definition of macro '__compiletime_assert'
     549 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:568:9: note: in expansion of macro '_compiletime_assert'
     568 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   arch/x86/kvm/../../../virt/kvm/guest_memfd.c:558:9: note: in expansion of macro 'BUILD_BUG_ON'
     558 |         BUILD_BUG_ON(sizeof(gfn_t) != sizeof(slot->gmem.pgoff));
         |         ^~~~~~~~~~~~


vim +/__compiletime_assert_624 +568 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  554  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  555  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  556  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  557  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  558  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  559   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  560   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  561   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  562   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  563   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  564   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  565   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  566   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  567  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @568  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  569  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 40+ messages in thread
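(Reading the robot report above: the failing config is i386, and the two
diagnostics share a root cause — gfn_t is always u64, while
slot->gmem.pgoff is a pgoff_t (an unsigned long, so 4 bytes on 32-bit),
and inode->i_private is a 32-bit pointer being cast to u64. Illustrative
only, assuming those type definitions:

	typedef u64 gfn_t;	/* 8 bytes on every architecture */
	pgoff_t pgoff;		/* unsigned long: 4 bytes on i386 */

	BUILD_BUG_ON(sizeof(gfn_t) != sizeof(pgoff));	/* fires on 32-bit */

David's suggestion further down the thread — selecting the Kconfig
option only on X86_64 — sidesteps exactly this.)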

* Re: [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot
  2025-07-09 10:59 ` [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot Fuad Tabba
@ 2025-07-11  8:34   ` Shivank Garg
  0 siblings, 0 replies; 40+ messages in thread
From: Shivank Garg @ 2025-07-11  8:34 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny



On 7/9/2025 4:29 PM, Fuad Tabba wrote:
> Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of
> memslot->flags. This flag tracks when a guest_memfd-backed memory slot
> supports host userspace mmap operations. It's strictly for KVM's
> internal use.
> 
> This optimization avoids repeatedly checking the underlying guest_memfd
> file for mmap support, which would otherwise require taking and
> releasing a reference on the file for each check. By caching this
> information directly in the memslot, we reduce overhead and simplify the
> logic involved in handling guest_memfd-backed pages for host mappings.
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  include/linux/kvm_host.h | 11 ++++++++++-
>  virt/kvm/guest_memfd.c   |  2 ++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9ac21985f3b5..d2218ec57ceb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -54,7 +54,8 @@
>   * used in kvm, other bits are visible for userspace which are defined in
>   * include/uapi/linux/kvm.h.
>   */
> -#define KVM_MEMSLOT_INVALID	(1UL << 16)
> +#define KVM_MEMSLOT_INVALID			(1UL << 16)
> +#define KVM_MEMSLOT_GMEM_ONLY			(1UL << 17)
>  
>  /*
>   * Bit 63 of the memslot generation number is an "update in-progress flag",
> @@ -2536,6 +2537,14 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
>  }
>  
> +static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
> +{
> +	if (!IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP))
> +		return false;
> +
> +	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
> +}
> +
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>  static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>  {
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 07a4b165471d..2b00f8796a15 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -592,6 +592,8 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	 */
>  	WRITE_ONCE(slot->gmem.file, file);
>  	slot->gmem.pgoff = start;
> +	if (kvm_gmem_supports_mmap(inode))
> +		slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
>  
>  	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
>  	filemap_invalidate_unlock(inode->i_mapping);

Reviewed-by: Shivank Garg <shivankg@amd.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread
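(To illustrate what the cached flag buys: a fault path can branch on the
memslot alone, without taking a reference on the guest_memfd file. A
hypothetical caller, loosely modelled on the arm64 patches later in the
series:

	if (kvm_memslot_is_gmem_only(slot))
		return gmem_abort(vcpu, fault_ipa, nested, slot, is_perm);

versus re-deriving mmap support from inode->i_private on every fault,
which would need a file reference each time.)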

* Re: [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP
  2025-07-09 10:59 ` [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP Fuad Tabba
@ 2025-07-11  8:48   ` Shivank Garg
  0 siblings, 0 replies; 40+ messages in thread
From: Shivank Garg @ 2025-07-11  8:48 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, david, michael.roth,
	wei.w.wang, liam.merwick, isaku.yamahata, kirill.shutemov,
	suzuki.poulose, steven.price, quic_eberman, quic_mnalajal,
	quic_tsoni, quic_svaddagi, quic_cvanscha, quic_pderrin,
	quic_pheragu, catalin.marinas, james.morse, yuzenghui,
	oliver.upton, maz, will, qperret, keirf, roypat, shuah, hch, jgg,
	rientjes, jhubbard, fvdl, hughd, jthoughton, peterx, pankaj.gupta,
	ira.weiny



On 7/9/2025 4:29 PM, Fuad Tabba wrote:
> Introduce the new KVM capability KVM_CAP_GMEM_MMAP. This capability
> signals to userspace that a KVM instance supports host userspace mapping
> of guest_memfd-backed memory.
> 
> The availability of this capability is determined per architecture, and
> its enablement for a specific guest_memfd instance is controlled by the
> GUEST_MEMFD_FLAG_MMAP flag at creation time.
> 
> Update the KVM API documentation to detail the KVM_CAP_GMEM_MMAP
> capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential
> information regarding support for mmap in guest_memfd.
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  Documentation/virt/kvm/api.rst | 9 +++++++++
>  include/uapi/linux/kvm.h       | 1 +
>  virt/kvm/kvm_main.c            | 4 ++++
>  3 files changed, 14 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9abf93ee5f65..70261e189162 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6407,6 +6407,15 @@ most one mapping per page, i.e. binding multiple memory regions to a single
>  guest_memfd range is not allowed (any number of memory regions can be bound to
>  a single guest_memfd file, but the bound ranges must not overlap).
>  
> +When the capability KVM_CAP_GMEM_MMAP is supported, the 'flags' field supports
> +GUEST_MEMFD_FLAG_MMAP.  Setting this flag on guest_memfd creation enables mmap()
> +and faulting of guest_memfd memory to host userspace.
> +
> +When the KVM MMU performs a PFN lookup to service a guest fault and the backing
> +guest_memfd has the GUEST_MEMFD_FLAG_MMAP set, then the fault will always be
> +consumed from guest_memfd, regardless of whether it is a shared or a private
> +fault.
> +
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>  
>  4.143 KVM_PRE_FAULT_MEMORY
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index c71348db818f..cbf28237af79 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -956,6 +956,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_ARM_EL2 240
>  #define KVM_CAP_ARM_EL2_E2H0 241
>  #define KVM_CAP_RISCV_MP_STATE_RESET 242
> +#define KVM_CAP_GMEM_MMAP 243
>  
>  struct kvm_irq_routing_irqchip {
>  	__u32 irqchip;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 81bb18fa8655..5463e81b08b9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4913,6 +4913,10 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
>  #ifdef CONFIG_KVM_GMEM
>  	case KVM_CAP_GUEST_MEMFD:
>  		return !kvm || kvm_arch_supports_gmem(kvm);
> +#endif
> +#ifdef CONFIG_KVM_GMEM_SUPPORTS_MMAP
> +	case KVM_CAP_GMEM_MMAP:
> +		return !kvm || kvm_arch_supports_gmem_mmap(kvm);
>  #endif
>  	default:
>  		break;

Reviewed-by: Shivank Garg <shivankg@amd.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread
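(A minimal userspace probe for the new bits, matching the documentation
hunk above — sketch only; KVM_CHECK_EXTENSION is pre-existing UAPI:

	uint64_t gmem_flags = 0;

	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_GMEM_MMAP) > 0)
		gmem_flags |= GUEST_MEMFD_FLAG_MMAP;
	/* then pass gmem_flags to KVM_CREATE_GUEST_MEMFD */

Per the kvm_main.c hunk, the check is valid on both /dev/kvm and a VM
fd, but only the VM fd reflects the per-VM-type restriction.)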

* Re: [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level
  2025-07-09 10:59 ` [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level Fuad Tabba
@ 2025-07-11  9:36   ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-11  9:36 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl,
	hughd, jthoughton, peterx, pankaj.gupta, ira.weiny

On 09.07.25 12:59, Fuad Tabba wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Generalize the private_max_mapping_level x86 operation to
> max_mapping_level.
> 
> The private_max_mapping_level operation allows platform-specific code to
> limit mapping levels (e.g., forcing 4K pages for certain memory types).
> While it was previously used exclusively for private memory, guest_memfd
> can now back both private and non-private memory. Platforms may have
> specific mapping level restrictions that apply to guest_memfd memory
> regardless of its privacy attribute. Therefore, generalize this
> operation.
> 
> Rename the operation: Removes the "private" prefix to reflect its
> broader applicability to any guest_memfd-backed memory.
> 
> Pass kvm_page_fault information: The operation is updated to receive a
> struct kvm_page_fault object instead of just the pfn. This provides
> platform-specific implementations (e.g., for TDX or SEV) with additional
> context about the fault, such as whether it is private or shared,
> allowing them to apply different mapping level rules as needed.
> 
> Enforce "private-only" behavior (for now): Since the current consumers
> of this hook (TDX and SEV) still primarily use it to enforce private
> memory constraints, platform-specific implementations are made to return
> 0 for non-private pages. A return value of 0 signals to callers that
> platform-specific input should be ignored for that particular fault,
> indicating no specific platform-imposed mapping level limits for
> non-private pages. This allows the core MMU to continue determining the
> mapping level based on generic rules for such cases.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread
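(The "return 0 for non-private pages" contract described above, seen
from the caller's side — hypothetical shape, hook name illustrative:

	u8 lvl = max_mapping_level_hook(kvm, fault);	/* 0 == no limit */

	if (lvl)
		max_level = min(max_level, lvl);

i.e. a zero from the platform hook simply drops out of the min() clamp,
leaving the generic MMU rules in charge.)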

* Re: [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level
  2025-07-09 10:59 ` [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level Fuad Tabba
@ 2025-07-11  9:38   ` David Hildenbrand
  0 siblings, 0 replies; 40+ messages in thread
From: David Hildenbrand @ 2025-07-11  9:38 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl,
	hughd, jthoughton, peterx, pankaj.gupta, ira.weiny

On 09.07.25 12:59, Fuad Tabba wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Refactor kvm_max_private_mapping_level() to accept a NULL kvm_page_fault
> pointer and rename it to kvm_gmem_max_mapping_level().
> 
> The max_mapping_level x86 operation (previously private_max_mapping_level)
> is designed to potentially be called without an active page fault, for
> instance, when kvm_mmu_max_mapping_level() is determining the maximum
> mapping level for a gfn proactively.
> 
> Allow NULL fault pointer: Modify kvm_max_private_mapping_level() to
> safely handle a NULL fault argument. This aligns its interface with the
> kvm_x86_ops.max_mapping_level operation it wraps, which can also be
> called with NULL.
> 
> Rename function to kvm_gmem_max_mapping_level(): This reinforces that
> the function's scope is for guest_memfd-backed memory, which can be
> either private or non-private, removing any remaining "private"
> connotation from its name.
> 
> Optimize max_level checks: Introduce a check in the caller to skip
> querying for max_mapping_level if the current max_level is already
> PG_LEVEL_4K, as no further reduction is possible.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 17 ++++++++---------
>   1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index bb925994cbc5..495dcedaeafa 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4467,17 +4467,13 @@ static inline u8 kvm_max_level_for_order(int order)
>   	return PG_LEVEL_4K;
>   }
>   
> -static u8 kvm_max_private_mapping_level(struct kvm *kvm,
> -					struct kvm_page_fault *fault,
> -					int gmem_order)
> +static u8 kvm_gmem_max_mapping_level(struct kvm *kvm, int order,
> +				     struct kvm_page_fault *fault)
>   {
> -	u8 max_level = fault->max_level;
>   	u8 req_max_level;
> +	u8 max_level;
>   
> -	if (max_level == PG_LEVEL_4K)
> -		return PG_LEVEL_4K;
> -
> -	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
> +	max_level = kvm_max_level_for_order(order);
>   	if (max_level == PG_LEVEL_4K)
>   		return PG_LEVEL_4K;
>   
> @@ -4513,7 +4509,10 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
>   	}
>   
>   	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> -	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault, max_order);
> +	if (fault->max_level >= PG_LEVEL_4K) {
> +		fault->max_level = kvm_gmem_max_mapping_level(vcpu->kvm,
> +							      max_order, fault);
> +	}

Could drop {}

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type
  2025-07-09 10:59 ` [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type Fuad Tabba
  2025-07-11  1:14   ` kernel test robot
@ 2025-07-11  9:45   ` David Hildenbrand
  2025-07-11 11:09     ` Fuad Tabba
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2025-07-11  9:45 UTC (permalink / raw)
  To: Fuad Tabba, kvm, linux-arm-msm, linux-mm, kvmarm
  Cc: pbonzini, chenhuacai, mpe, anup, paul.walmsley, palmer, aou,
	seanjc, viro, brauner, willy, akpm, xiaoyao.li, yilun.xu,
	chao.p.peng, jarkko, amoorthy, dmatlack, isaku.yamahata, mic,
	vbabka, vannapurve, ackerleytng, mail, michael.roth, wei.w.wang,
	liam.merwick, isaku.yamahata, kirill.shutemov, suzuki.poulose,
	steven.price, quic_eberman, quic_mnalajal, quic_tsoni,
	quic_svaddagi, quic_cvanscha, quic_pderrin, quic_pheragu,
	catalin.marinas, james.morse, yuzenghui, oliver.upton, maz, will,
	qperret, keirf, roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl,
	hughd, jthoughton, peterx, pankaj.gupta, ira.weiny

On 09.07.25 12:59, Fuad Tabba wrote:
> Enable host userspace mmap support for guest_memfd-backed memory when
> running KVM with the KVM_X86_DEFAULT_VM type:
> 
> * Define kvm_arch_supports_gmem_mmap() for KVM_X86_DEFAULT_VM: Introduce
>    the architecture-specific kvm_arch_supports_gmem_mmap() macro,
>    specifically enabling mmap support for KVM_X86_DEFAULT_VM instances.
>    This macro, gated by CONFIG_KVM_GMEM_SUPPORTS_MMAP, ensures that only
>    the default VM type can leverage guest_memfd mmap functionality on
>    x86. This explicit enablement prevents CoCo VMs, which use guest_memfd
>    primarily for private memory and rely on hardware-enforced privacy,
>    from accidentally exposing guest memory via host userspace mappings.
> 
> * Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in KVM_X86: Enable the
>    CONFIG_KVM_GMEM_SUPPORTS_MMAP Kconfig option when KVM_X86 is selected.
>    This ensures that the necessary code for guest_memfd mmap support
>    (introduced earlier) is compiled into the kernel for x86. This Kconfig
>    option acts as a system-wide gate for the guest_memfd mmap capability.
>    It implicitly enables CONFIG_KVM_GMEM, making guest_memfd available,
>    and then layers the mmap capability on top specifically for the
>    default VM.
> 
> These changes make guest_memfd a more versatile memory backing for
> standard KVM guests, allowing VMMs to use a unified guest_memfd model
> for both private (CoCo) and non-private (default) VMs. This is a
> prerequisite for use cases such as running Firecracker guests entirely
> backed by guest_memfd and implementing direct map removal for non-CoCo
> VMs.
> 
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>   arch/x86/include/asm/kvm_host.h | 9 +++++++++
>   arch/x86/kvm/Kconfig            | 1 +
>   arch/x86/kvm/x86.c              | 3 ++-
>   3 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4c764faa12f3..4c89feaa1910 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2273,9 +2273,18 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>   #ifdef CONFIG_KVM_GMEM
>   #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>   #define kvm_arch_supports_gmem(kvm)  ((kvm)->arch.supports_gmem)
> +
> +/*
> + * CoCo VMs with hardware support that use guest_memfd only for backing private
> + * memory, e.g., TDX, cannot use guest_memfd with userspace mapping enabled.
> + */
> +#define kvm_arch_supports_gmem_mmap(kvm)		\
> +	(IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP) &&	\
> +	 (kvm)->arch.vm_type == KVM_X86_DEFAULT_VM)
>   #else
>   #define kvm_arch_has_private_mem(kvm) false
>   #define kvm_arch_supports_gmem(kvm) false
> +#define kvm_arch_supports_gmem_mmap(kvm) false
>   #endif
>   
>   #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index df1fdbb4024b..239637b663dc 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -47,6 +47,7 @@ config KVM_X86
>   	select KVM_GENERIC_HARDWARE_ENABLING
>   	select KVM_GENERIC_PRE_FAULT_MEMORY
>   	select KVM_GENERIC_GMEM_POPULATE if KVM_SW_PROTECTED_VM
> +	select KVM_GMEM_SUPPORTS_MMAP
>   	select KVM_WERROR if WERROR

Given the error, likely we want to limit to 64BIT.

select KVM_GMEM_SUPPORTS_MMAP if X86_64

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-09 10:59 ` [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
@ 2025-07-11  9:59   ` Roy, Patrick
  2025-07-11 11:08     ` Fuad Tabba
  2025-07-11 13:49     ` Marc Zyngier
  2025-07-11 16:37   ` Marc Zyngier
  1 sibling, 2 replies; 40+ messages in thread
From: Roy, Patrick @ 2025-07-11  9:59 UTC (permalink / raw)
  To: tabba@google.com
  Cc: ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
	michael.roth@amd.com, mpe@ellerman.id.au, oliver.upton@linux.dev,
	palmer@dabbelt.com, pankaj.gupta@amd.com,
	paul.walmsley@sifive.com, pbonzini@redhat.com, peterx@redhat.com,
	qperret@google.com, quic_cvanscha@quicinc.com,
	quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
	quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
	rientjes@google.com, Roy, Patrick, seanjc@google.com,
	shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com


Hi Fuad,

On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:
> -snip-
> +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> +
> +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> +                     struct kvm_s2_trans *nested,
> +                     struct kvm_memory_slot *memslot, bool is_perm)
> +{
> +       bool write_fault, exec_fault, writable;
> +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +       struct page *page;
> +       struct kvm *kvm = vcpu->kvm;
> +       void *memcache;
> +       kvm_pfn_t pfn;
> +       gfn_t gfn;
> +       int ret;
> +
> +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> +       if (ret)
> +               return ret;
> +
> +       if (nested)
> +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> +       else
> +               gfn = fault_ipa >> PAGE_SHIFT;
> +
> +       write_fault = kvm_is_write_fault(vcpu);
> +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> +
> +       if (write_fault && exec_fault) {
> +               kvm_err("Simultaneous write and execution fault\n");
> +               return -EFAULT;
> +       }
> +
> +       if (is_perm && !write_fault && !exec_fault) {
> +               kvm_err("Unexpected L2 read permission error\n");
> +               return -EFAULT;
> +       }
> +
> +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> +       if (ret) {
> +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> +                                             write_fault, exec_fault, false);
> +               return ret;
> +       }
> +
> +       writable = !(memslot->flags & KVM_MEM_READONLY);
> +
> +       if (nested)
> +               adjust_nested_fault_perms(nested, &prot, &writable);
> +
> +       if (writable)
> +               prot |= KVM_PGTABLE_PROT_W;
> +
> +       if (exec_fault ||
> +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> +            (!nested || kvm_s2_trans_executable(nested))))
> +               prot |= KVM_PGTABLE_PROT_X;
> +
> +       kvm_fault_lock(kvm);

Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
gmem invalidation occurs, don't we end up with stage-2 page tables
referring to a stale host page? In user_mem_abort() there's the "grab
mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
after grabbing mmu_lock" which prevents this, but I don't really see an
equivalent here.

> +       ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
> +                                                __pfn_to_phys(pfn), prot,
> +                                                memcache, flags);
> +       kvm_release_faultin_page(kvm, page, !!ret, writable);
> +       kvm_fault_unlock(kvm);
> +
> +       if (writable && !ret)
> +               mark_page_dirty_in_slot(kvm, memslot, gfn);
> +
> +       return ret != -EAGAIN ? ret : 0;
> +}
> +
> -snip-

Best,
Patrick




^ permalink raw reply	[flat|nested] 40+ messages in thread
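(The user_mem_abort() idiom Patrick refers to looks roughly like this —
a sketch of the existing pattern, not of the eventual fix:

	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();	/* pairs with the invalidate side */

	/* ... resolve the pfn outside mmu_lock ... */

	kvm_fault_lock(kvm);
	if (mmu_invalidate_retry(kvm, mmu_seq)) {
		ret = -EAGAIN;
		goto out_unlock;
	}

so an invalidation that lands between the pfn lookup and the stage-2 map
forces a retry instead of installing a stale mapping.)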

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11  9:59   ` Roy, Patrick
@ 2025-07-11 11:08     ` Fuad Tabba
  2025-07-11 13:49     ` Marc Zyngier
  1 sibling, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-11 11:08 UTC (permalink / raw)
  To: Roy, Patrick
  Cc: ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, maz@kernel.org, mic@digikod.net,
	michael.roth@amd.com, mpe@ellerman.id.au, oliver.upton@linux.dev,
	palmer@dabbelt.com, pankaj.gupta@amd.com,
	paul.walmsley@sifive.com, pbonzini@redhat.com, peterx@redhat.com,
	qperret@google.com, quic_cvanscha@quicinc.com,
	quic_eberman@quicinc.com, quic_mnalajal@quicinc.com,
	quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
	quic_svaddagi@quicinc.com, quic_tsoni@quicinc.com,
	rientjes@google.com, seanjc@google.com, shuah@kernel.org,
	steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

Hi Patrick,


On Fri, 11 Jul 2025 at 10:59, Roy, Patrick <roypat@amazon.co.uk> wrote:
>
>
> Hi Fuad,
>
> On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:
> > -snip-
> > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > +
> > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +                     struct kvm_s2_trans *nested,
> > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > +{
> > +       bool write_fault, exec_fault, writable;
> > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > +       struct page *page;
> > +       struct kvm *kvm = vcpu->kvm;
> > +       void *memcache;
> > +       kvm_pfn_t pfn;
> > +       gfn_t gfn;
> > +       int ret;
> > +
> > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (nested)
> > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > +       else
> > +               gfn = fault_ipa >> PAGE_SHIFT;
> > +
> > +       write_fault = kvm_is_write_fault(vcpu);
> > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > +
> > +       if (write_fault && exec_fault) {
> > +               kvm_err("Simultaneous write and execution fault\n");
> > +               return -EFAULT;
> > +       }
> > +
> > +       if (is_perm && !write_fault && !exec_fault) {
> > +               kvm_err("Unexpected L2 read permission error\n");
> > +               return -EFAULT;
> > +       }
> > +
> > +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > +       if (ret) {
> > +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > +                                             write_fault, exec_fault, false);
> > +               return ret;
> > +       }
> > +
> > +       writable = !(memslot->flags & KVM_MEM_READONLY);
> > +
> > +       if (nested)
> > +               adjust_nested_fault_perms(nested, &prot, &writable);
> > +
> > +       if (writable)
> > +               prot |= KVM_PGTABLE_PROT_W;
> > +
> > +       if (exec_fault ||
> > +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > +            (!nested || kvm_s2_trans_executable(nested))))
> > +               prot |= KVM_PGTABLE_PROT_X;
> > +
> > +       kvm_fault_lock(kvm);
>
> Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> gmem invalidation occurs, don't we end up with stage-2 page tables
> refering to a stale host page? In user_mem_abort() there's the "grab
> mmu_invalidate_seq before dropping mmap_lock and check it hasnt changed
> after grabbing mmu_lock" which prevents this, but I don't really see an
> equivalent here.

You're right. I'll add a check for this.

Thanks for pointing this out,
/fuad

> > +       ret = KVM_PGT_FN(kvm_pgtable_stage2_map)(pgt, fault_ipa, PAGE_SIZE,
> > +                                                __pfn_to_phys(pfn), prot,
> > +                                                memcache, flags);
> > +       kvm_release_faultin_page(kvm, page, !!ret, writable);
> > +       kvm_fault_unlock(kvm);
> > +
> > +       if (writable && !ret)
> > +               mark_page_dirty_in_slot(kvm, memslot, gfn);
> > +
> > +       return ret != -EAGAIN ? ret : 0;
> > +}
> > +
> > -snip-
>
> Best,
> Patrick
>
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type
  2025-07-11  9:45   ` David Hildenbrand
@ 2025-07-11 11:09     ` Fuad Tabba
  0 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-11 11:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, michael.roth, wei.w.wang, liam.merwick, isaku.yamahata,
	kirill.shutemov, suzuki.poulose, steven.price, quic_eberman,
	quic_mnalajal, quic_tsoni, quic_svaddagi, quic_cvanscha,
	quic_pderrin, quic_pheragu, catalin.marinas, james.morse,
	yuzenghui, oliver.upton, maz, will, qperret, keirf, roypat, shuah,
	hch, jgg, rientjes, jhubbard, fvdl, hughd, jthoughton, peterx,
	pankaj.gupta, ira.weiny

Hi David,

On Fri, 11 Jul 2025 at 10:45, David Hildenbrand <david@redhat.com> wrote:
>
> On 09.07.25 12:59, Fuad Tabba wrote:
> > Enable host userspace mmap support for guest_memfd-backed memory when
> > running KVM with the KVM_X86_DEFAULT_VM type:
> >
> > * Define kvm_arch_supports_gmem_mmap() for KVM_X86_DEFAULT_VM: Introduce
> >    the architecture-specific kvm_arch_supports_gmem_mmap() macro,
> >    specifically enabling mmap support for KVM_X86_DEFAULT_VM instances.
> >    This macro, gated by CONFIG_KVM_GMEM_SUPPORTS_MMAP, ensures that only
> >    the default VM type can leverage guest_memfd mmap functionality on
> >    x86. This explicit enablement prevents CoCo VMs, which use guest_memfd
> >    primarily for private memory and rely on hardware-enforced privacy,
> >    from accidentally exposing guest memory via host userspace mappings.
> >
> > * Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in KVM_X86: Enable the
> >    CONFIG_KVM_GMEM_SUPPORTS_MMAP Kconfig option when KVM_X86 is selected.
> >    This ensures that the necessary code for guest_memfd mmap support
> >    (introduced earlier) is compiled into the kernel for x86. This Kconfig
> >    option acts as a system-wide gate for the guest_memfd mmap capability.
> >    It implicitly enables CONFIG_KVM_GMEM, making guest_memfd available,
> >    and then layers the mmap capability on top specifically for the
> >    default VM.
> >
> > These changes make guest_memfd a more versatile memory backing for
> > standard KVM guests, allowing VMMs to use a unified guest_memfd model
> > for both private (CoCo) and non-private (default) VMs. This is a
> > prerequisite for use cases such as running Firecracker guests entirely
> > backed by guest_memfd and implementing direct map removal for non-CoCo
> > VMs.
> >
> > Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >   arch/x86/include/asm/kvm_host.h | 9 +++++++++
> >   arch/x86/kvm/Kconfig            | 1 +
> >   arch/x86/kvm/x86.c              | 3 ++-
> >   3 files changed, 12 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 4c764faa12f3..4c89feaa1910 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -2273,9 +2273,18 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >   #ifdef CONFIG_KVM_GMEM
> >   #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> >   #define kvm_arch_supports_gmem(kvm)  ((kvm)->arch.supports_gmem)
> > +
> > +/*
> > + * CoCo VMs with hardware support that use guest_memfd only for backing private
> > + * memory, e.g., TDX, cannot use guest_memfd with userspace mapping enabled.
> > + */
> > +#define kvm_arch_supports_gmem_mmap(kvm)             \
> > +     (IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP) &&   \
> > +      (kvm)->arch.vm_type == KVM_X86_DEFAULT_VM)
> >   #else
> >   #define kvm_arch_has_private_mem(kvm) false
> >   #define kvm_arch_supports_gmem(kvm) false
> > +#define kvm_arch_supports_gmem_mmap(kvm) false
> >   #endif
> >
> >   #define kvm_arch_has_readonly_mem(kvm) (!(kvm)->arch.has_protected_state)
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index df1fdbb4024b..239637b663dc 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -47,6 +47,7 @@ config KVM_X86
> >       select KVM_GENERIC_HARDWARE_ENABLING
> >       select KVM_GENERIC_PRE_FAULT_MEMORY
> >       select KVM_GENERIC_GMEM_POPULATE if KVM_SW_PROTECTED_VM
> > +     select KVM_GMEM_SUPPORTS_MMAP
> >       select KVM_WERROR if WERROR
>
> Given the error, likely we want to limit to 64BIT.
>
> select KVM_GMEM_SUPPORTS_MMAP if X86_64

Will do.

Cheers,
/fuad

> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort()
  2025-07-09 10:59 ` [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort() Fuad Tabba
@ 2025-07-11 13:25   ` Marc Zyngier
  0 siblings, 0 replies; 40+ messages in thread
From: Marc Zyngier @ 2025-07-11 13:25 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

On Wed, 09 Jul 2025 11:59:41 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> Refactor user_mem_abort() to improve code clarity and simplify
> assumptions within the function.
> 
> Key changes include:
> 
> * Immediately set force_pte to true at the beginning of the function if
>   logging_active is true. This simplifies the flow and makes the
>   condition for forcing a PTE more explicit.
> 
> * Remove the misleading comment stating that logging_active is
>   guaranteed to never be true for VM_PFNMAP memslots, as this assertion
>   is not entirely correct.
> 
> * Extract reusable code blocks into new helper functions:
>   * prepare_mmu_memcache(): Encapsulates the logic for preparing and
>     topping up the MMU page cache.
>   * adjust_nested_fault_perms(): Isolates the adjustments to shadow S2
>     permissions and the encoding of nested translation levels.
> 
> * Update min(a, (long)b) to min_t(long, a, b) for better type safety and
>   consistency.
> 
> * Perform other minor tidying up of the code.
> 
> These changes primarily aim to simplify user_mem_abort() and make its
> logic easier to understand and maintain, setting the stage for future
> modifications.
> 
> No functional change intended.
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread
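(On the min_t() point in the commit message: min(a, (long)b) casts only
one operand, so the kernel's min() type check still depends on the other
side's type; min_t(long, a, b) casts both operands before comparing —
illustratively, with made-up variable names:

	len = min_t(long, nr_pages, remaining);	/* both compared as long */

hence "better type safety and consistency".)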

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11  9:59   ` Roy, Patrick
  2025-07-11 11:08     ` Fuad Tabba
@ 2025-07-11 13:49     ` Marc Zyngier
  2025-07-11 14:17       ` Fuad Tabba
  1 sibling, 1 reply; 40+ messages in thread
From: Marc Zyngier @ 2025-07-11 13:49 UTC (permalink / raw)
  To: Roy, Patrick, Fuad Tabba
  Cc: ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, mic@digikod.net, michael.roth@amd.com,
	mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com,
	pankaj.gupta@amd.com, paul.walmsley@sifive.com,
	pbonzini@redhat.com, peterx@redhat.com, qperret@google.com,
	quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
	quic_tsoni@quicinc.com, rientjes@google.com, seanjc@google.com,
	shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

On Fri, 11 Jul 2025 10:59:39 +0100,
"Roy, Patrick" <roypat@amazon.co.uk> wrote:
> 
> 
> Hi Fuad,
> 
> On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:
> > -snip-
> > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > +
> > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +                     struct kvm_s2_trans *nested,
> > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > +{
> > +       bool write_fault, exec_fault, writable;
> > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > +       struct page *page;
> > +       struct kvm *kvm = vcpu->kvm;
> > +       void *memcache;
> > +       kvm_pfn_t pfn;
> > +       gfn_t gfn;
> > +       int ret;
> > +
> > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > +       if (ret)
> > +               return ret;
> > +
> > +       if (nested)
> > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > +       else
> > +               gfn = fault_ipa >> PAGE_SHIFT;
> > +
> > +       write_fault = kvm_is_write_fault(vcpu);
> > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > +
> > +       if (write_fault && exec_fault) {
> > +               kvm_err("Simultaneous write and execution fault\n");
> > +               return -EFAULT;
> > +       }
> > +
> > +       if (is_perm && !write_fault && !exec_fault) {
> > +               kvm_err("Unexpected L2 read permission error\n");
> > +               return -EFAULT;
> > +       }
> > +
> > +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > +       if (ret) {
> > +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > +                                             write_fault, exec_fault, false);
> > +               return ret;
> > +       }
> > +
> > +       writable = !(memslot->flags & KVM_MEM_READONLY);
> > +
> > +       if (nested)
> > +               adjust_nested_fault_perms(nested, &prot, &writable);
> > +
> > +       if (writable)
> > +               prot |= KVM_PGTABLE_PROT_W;
> > +
> > +       if (exec_fault ||
> > +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > +            (!nested || kvm_s2_trans_executable(nested))))
> > +               prot |= KVM_PGTABLE_PROT_X;
> > +
> > +       kvm_fault_lock(kvm);
> 
> Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> gmem invalidation occurs, don't we end up with stage-2 page tables
> referring to a stale host page? In user_mem_abort() there's the "grab
> mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
> after grabbing mmu_lock" which prevents this, but I don't really see an
> equivalent here.

Indeed. We have a similar construct in kvm_translate_vncr() as well,
and I'd definitely expect something of the sort 'round here. If for
some reason this is not needed, then a comment explaining why would be
welcome.

But this brings me to another interesting bit: kvm_translate_vncr() is
another path that deals with a guest translation fault (despite being
caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens
when the backing store is gmem? Probably nothing good.

I don't immediately see why NV and gmem should be incompatible, so
something must be done on that front too (including the return to
userspace if the page is gone).

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11 13:49     ` Marc Zyngier
@ 2025-07-11 14:17       ` Fuad Tabba
  2025-07-11 15:48         ` Marc Zyngier
  0 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-11 14:17 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Roy, Patrick, ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, mic@digikod.net, michael.roth@amd.com,
	mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com,
	pankaj.gupta@amd.com, paul.walmsley@sifive.com,
	pbonzini@redhat.com, peterx@redhat.com, qperret@google.com,
	quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
	quic_tsoni@quicinc.com, rientjes@google.com, seanjc@google.com,
	shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

Hi Marc,

On Fri, 11 Jul 2025 at 14:50, Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 11 Jul 2025 10:59:39 +0100,
> "Roy, Patrick" <roypat@amazon.co.uk> wrote:
> >
> >
> > Hi Fuad,
> >
> > On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:> -snip-
> > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > +
> > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > +                     struct kvm_s2_trans *nested,
> > > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > > +{
> > > +       bool write_fault, exec_fault, writable;
> > > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > +       struct page *page;
> > > +       struct kvm *kvm = vcpu->kvm;
> > > +       void *memcache;
> > > +       kvm_pfn_t pfn;
> > > +       gfn_t gfn;
> > > +       int ret;
> > > +
> > > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > +       if (ret)
> > > +               return ret;
> > > +
> > > +       if (nested)
> > > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > +       else
> > > +               gfn = fault_ipa >> PAGE_SHIFT;
> > > +
> > > +       write_fault = kvm_is_write_fault(vcpu);
> > > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > +
> > > +       if (write_fault && exec_fault) {
> > > +               kvm_err("Simultaneous write and execution fault\n");
> > > +               return -EFAULT;
> > > +       }
> > > +
> > > +       if (is_perm && !write_fault && !exec_fault) {
> > > +               kvm_err("Unexpected L2 read permission error\n");
> > > +               return -EFAULT;
> > > +       }
> > > +
> > > +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > > +       if (ret) {
> > > +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > > +                                             write_fault, exec_fault, false);
> > > +               return ret;
> > > +       }
> > > +
> > > +       writable = !(memslot->flags & KVM_MEM_READONLY);
> > > +
> > > +       if (nested)
> > > +               adjust_nested_fault_perms(nested, &prot, &writable);
> > > +
> > > +       if (writable)
> > > +               prot |= KVM_PGTABLE_PROT_W;
> > > +
> > > +       if (exec_fault ||
> > > +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > > +            (!nested || kvm_s2_trans_executable(nested))))
> > > +               prot |= KVM_PGTABLE_PROT_X;
> > > +
> > > +       kvm_fault_lock(kvm);
> >
> > Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> > E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> > gmem invalidation occurs, don't we end up with stage-2 page tables
> > referring to a stale host page? In user_mem_abort() there's the "grab
> > mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
> > after grabbing mmu_lock" which prevents this, but I don't really see an
> > equivalent here.
>
> Indeed. We have a similar construct in kvm_translate_vncr() as well,
> and I'd definitely expect something of the sort 'round here. If for
> some reason this is not needed, then a comment explaining why would be
> welcome.
>
> But this brings me to another interesting bit: kvm_translate_vncr() is
> another path that deals with a guest translation fault (despite being
> caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens
> when the backing store is gmem? Probably nothing.

I'll add guest_memfd handling logic to kvm_translate_vncr().

> I don't immediately see why NV and gmem should be incompatible, so
> something must be done on that front too (including the return to
> userspace if the page is gone).

Should it return to userspace or go back to the guest?
user_mem_abort() returns to the guest if the page disappears (I don't
quite understand the rationale behind that, but it was a deliberate
change [1]): on mmu_invalidate_retry() it sets ret to -EAGAIN [2],
which gets flipped to 0 on returning from user_mem_abort() [3].

[1] https://lore.kernel.org/all/20210114121350.123684-4-wangyanan55@huawei.com/
[2] https://elixir.bootlin.com/linux/v6.16-rc5/source/arch/arm64/kvm/mmu.c#L1690
[3] https://elixir.bootlin.com/linux/v6.16-rc5/source/arch/arm64/kvm/mmu.c#L1764
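
(For reference, the tail of user_mem_abort() is roughly the following —
quoted from memory, so only a sketch:

	out_unlock:
		kvm_fault_unlock(kvm);
		...
		/* an invalidation race (-EAGAIN) is swallowed here */
		return ret != -EAGAIN ? ret : 0;

i.e. the vCPU simply replays the faulting access.)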

Cheers,
/fuad


> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory
  2025-07-09 10:59 ` [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory Fuad Tabba
@ 2025-07-11 14:25   ` Marc Zyngier
  2025-07-11 14:34     ` Fuad Tabba
  0 siblings, 1 reply; 40+ messages in thread
From: Marc Zyngier @ 2025-07-11 14:25 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

On Wed, 09 Jul 2025 11:59:43 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> Enable host userspace mmap support for guest_memfd-backed memory on
> arm64. This change provides arm64 with the capability to map guest
> memory at the host directly from guest_memfd:
> 
> * Define kvm_arch_supports_gmem_mmap() for arm64: The
>   kvm_arch_supports_gmem_mmap() macro is defined for arm64 to be true if
>   CONFIG_KVM_GMEM_SUPPORTS_MMAP is enabled. For existing arm64 KVM VM
>   types that support guest_memfd, this enables them to use guest_memfd
>   with host userspace mappings. This provides a consistent behavior as
>   there are currently no arm64 CoCo VMs that rely on guest_memfd solely
>   for private, non-mappable memory. Future arm64 VM types can override
>   or restrict this behavior via the kvm_arch_supports_gmem_mmap() hook
>   if needed.
> 
> * Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in arm64 Kconfig.
> 
> * Enforce KVM_MEMSLOT_GMEM_ONLY for guest_memfd on arm64: Compile and
>   runtime checks are added to ensure that if guest_memfd is enabled on
>   arm64, KVM_GMEM_SUPPORTS_MMAP must also be enabled. This means
>   guest_memfd-backed memory slots on arm64 are currently only supported
>   if they are intended for shared memory use cases (i.e.,
>   kvm_memslot_is_gmem_only() is true). This design reflects the current
>   arm64 KVM ecosystem where guest_memfd is primarily being introduced
>   for VMs that support shared memory.
>
> Reviewed-by: James Houghton <jthoughton@google.com>
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 4 ++++
>  arch/arm64/kvm/Kconfig            | 1 +
>  arch/arm64/kvm/mmu.c              | 8 ++++++++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index d27079968341..bd2af5470c66 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -1675,5 +1675,9 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
>  void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
>  void check_feature_map(void);
>  
> +#ifdef CONFIG_KVM_GMEM
> +#define kvm_arch_supports_gmem(kvm) true
> +#define kvm_arch_supports_gmem_mmap(kvm) IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP)
> +#endif
>  
>  #endif /* __ARM64_KVM_HOST_H__ */
> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> index 713248f240e0..28539479f083 100644
> --- a/arch/arm64/kvm/Kconfig
> +++ b/arch/arm64/kvm/Kconfig
> @@ -37,6 +37,7 @@ menuconfig KVM
>  	select HAVE_KVM_VCPU_RUN_PID_CHANGE
>  	select SCHED_INFO
>  	select GUEST_PERF_EVENTS if PERF_EVENTS
> +	select KVM_GMEM_SUPPORTS_MMAP
>  	help
>  	  Support hosting virtualized guest machines.
>  
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 71f8b53683e7..b92ce4d9b4e0 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -2274,6 +2274,14 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	if ((new->base_gfn + new->npages) > (kvm_phys_size(&kvm->arch.mmu) >> PAGE_SHIFT))
>  		return -EFAULT;
>  
> +	/*
> +	 * Only support guest_memfd backed memslots with mappable memory, since
> +	 * there aren't any CoCo VMs that support only private memory on arm64.
> +	 */
> +	BUILD_BUG_ON(IS_ENABLED(CONFIG_KVM_GMEM) && !IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP));
> +	if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> +		return -EINVAL;
> +
>  	hva = new->userspace_addr;
>  	reg_end = hva + (new->npages << PAGE_SHIFT);
>  

Honestly, I don't see the point in making CONFIG_KVM_GMEM a buy-in. We
have *no* configurability for KVM/arm64, the only exception being the
PMU support, and that has been a pain at every step of the way.

Either KVM is enabled, and it comes with "batteries included", or it's
not. Either way, we know exactly what we're getting, and it makes
reproducing problems much easier.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory
  2025-07-11 14:25   ` Marc Zyngier
@ 2025-07-11 14:34     ` Fuad Tabba
  0 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-11 14:34 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

Hi Marc,

On Fri, 11 Jul 2025 at 15:25, Marc Zyngier <maz@kernel.org> wrote:
>
> On Wed, 09 Jul 2025 11:59:43 +0100,
> Fuad Tabba <tabba@google.com> wrote:
> >
> > Enable host userspace mmap support for guest_memfd-backed memory on
> > arm64. This change provides arm64 with the capability to map guest
> > memory at the host directly from guest_memfd:
> >
> > * Define kvm_arch_supports_gmem_mmap() for arm64: The
> >   kvm_arch_supports_gmem_mmap() macro is defined for arm64 to be true if
> >   CONFIG_KVM_GMEM_SUPPORTS_MMAP is enabled. For existing arm64 KVM VM
> >   types that support guest_memfd, this enables them to use guest_memfd
> >   with host userspace mappings. This provides a consistent behavior as
> >   there are currently no arm64 CoCo VMs that rely on guest_memfd solely
> >   for private, non-mappable memory. Future arm64 VM types can override
> >   or restrict this behavior via the kvm_arch_supports_gmem_mmap() hook
> >   if needed.
> >
> > * Select CONFIG_KVM_GMEM_SUPPORTS_MMAP in arm64 Kconfig.
> >
> > * Enforce KVM_MEMSLOT_GMEM_ONLY for guest_memfd on arm64: Compile and
> >   runtime checks are added to ensure that if guest_memfd is enabled on
> >   arm64, KVM_GMEM_SUPPORTS_MMAP must also be enabled. This means
> >   guest_memfd-backed memory slots on arm64 are currently only supported
> >   if they are intended for shared memory use cases (i.e.,
> >   kvm_memslot_is_gmem_only() is true). This design reflects the current
> >   arm64 KVM ecosystem where guest_memfd is primarily being introduced
> >   for VMs that support shared memory.
> >
> > Reviewed-by: James Houghton <jthoughton@google.com>
> > Reviewed-by: Gavin Shan <gshan@redhat.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_host.h | 4 ++++
> >  arch/arm64/kvm/Kconfig            | 1 +
> >  arch/arm64/kvm/mmu.c              | 8 ++++++++
> >  3 files changed, 13 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> > index d27079968341..bd2af5470c66 100644
> > --- a/arch/arm64/include/asm/kvm_host.h
> > +++ b/arch/arm64/include/asm/kvm_host.h
> > @@ -1675,5 +1675,9 @@ void compute_fgu(struct kvm *kvm, enum fgt_group_id fgt);
> >  void get_reg_fixed_bits(struct kvm *kvm, enum vcpu_sysreg reg, u64 *res0, u64 *res1);
> >  void check_feature_map(void);
> >
> > +#ifdef CONFIG_KVM_GMEM
> > +#define kvm_arch_supports_gmem(kvm) true
> > +#define kvm_arch_supports_gmem_mmap(kvm) IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP)
> > +#endif
> >
> >  #endif /* __ARM64_KVM_HOST_H__ */
> > diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> > index 713248f240e0..28539479f083 100644
> > --- a/arch/arm64/kvm/Kconfig
> > +++ b/arch/arm64/kvm/Kconfig
> > @@ -37,6 +37,7 @@ menuconfig KVM
> >       select HAVE_KVM_VCPU_RUN_PID_CHANGE
> >       select SCHED_INFO
> >       select GUEST_PERF_EVENTS if PERF_EVENTS
> > +     select KVM_GMEM_SUPPORTS_MMAP
> >       help
> >         Support hosting virtualized guest machines.
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 71f8b53683e7..b92ce4d9b4e0 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -2274,6 +2274,14 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
> >       if ((new->base_gfn + new->npages) > (kvm_phys_size(&kvm->arch.mmu) >> PAGE_SHIFT))
> >               return -EFAULT;
> >
> > +     /*
> > +      * Only support guest_memfd backed memslots with mappable memory, since
> > +      * there aren't any CoCo VMs that support only private memory on arm64.
> > +      */
> > +     BUILD_BUG_ON(IS_ENABLED(CONFIG_KVM_GMEM) && !IS_ENABLED(CONFIG_KVM_GMEM_SUPPORTS_MMAP));
> > +     if (kvm_slot_has_gmem(new) && !kvm_memslot_is_gmem_only(new))
> > +             return -EINVAL;
> > +
> >       hva = new->userspace_addr;
> >       reg_end = hva + (new->npages << PAGE_SHIFT);
> >
>
> Honestly, I don't see the point in making CONFIG_KVM_GMEM a buy-in. We
> have *no* configurability for KVM/arm64, the only exception being the
> PMU support, and that has been a pain at every step of the way.
>
> Either KVM is enabled, and it comes with "batteries included", or it's
> not. Either way, we know exactly what we're getting, and it makes
> reproducing problems much easier.

Batteries included is always best, I think (all the times I got
disappointed as a kid... *sigh* :) ). I'll always enable guest_memfd
when KVM is enabled on arm64.

Cheers,
/fuad

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11 14:17       ` Fuad Tabba
@ 2025-07-11 15:48         ` Marc Zyngier
  2025-07-14  6:35           ` Fuad Tabba
  0 siblings, 1 reply; 40+ messages in thread
From: Marc Zyngier @ 2025-07-11 15:48 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Roy, Patrick, ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, mic@digikod.net, michael.roth@amd.com,
	mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com,
	pankaj.gupta@amd.com, paul.walmsley@sifive.com,
	pbonzini@redhat.com, peterx@redhat.com, qperret@google.com,
	quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
	quic_tsoni@quicinc.com, rientjes@google.com, seanjc@google.com,
	shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

On Fri, 11 Jul 2025 15:17:46 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> Hi Marc,
> 
> On Fri, 11 Jul 2025 at 14:50, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Fri, 11 Jul 2025 10:59:39 +0100,
> > "Roy, Patrick" <roypat@amazon.co.uk> wrote:
> > >
> > >
> > > Hi Fuad,
> > >
> > > On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:> -snip-
> > > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > > +
> > > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > +                     struct kvm_s2_trans *nested,
> > > > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > > > +{
> > > > +       bool write_fault, exec_fault, writable;
> > > > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > +       struct page *page;
> > > > +       struct kvm *kvm = vcpu->kvm;
> > > > +       void *memcache;
> > > > +       kvm_pfn_t pfn;
> > > > +       gfn_t gfn;
> > > > +       int ret;
> > > > +
> > > > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > > +       if (ret)
> > > > +               return ret;
> > > > +
> > > > +       if (nested)
> > > > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > > +       else
> > > > +               gfn = fault_ipa >> PAGE_SHIFT;
> > > > +
> > > > +       write_fault = kvm_is_write_fault(vcpu);
> > > > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > +
> > > > +       if (write_fault && exec_fault) {
> > > > +               kvm_err("Simultaneous write and execution fault\n");
> > > > +               return -EFAULT;
> > > > +       }
> > > > +
> > > > +       if (is_perm && !write_fault && !exec_fault) {
> > > > +               kvm_err("Unexpected L2 read permission error\n");
> > > > +               return -EFAULT;
> > > > +       }
> > > > +
> > > > +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > > > +       if (ret) {
> > > > +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > > > +                                             write_fault, exec_fault, false);
> > > > +               return ret;
> > > > +       }
> > > > +
> > > > +       writable = !(memslot->flags & KVM_MEM_READONLY);
> > > > +
> > > > +       if (nested)
> > > > +               adjust_nested_fault_perms(nested, &prot, &writable);
> > > > +
> > > > +       if (writable)
> > > > +               prot |= KVM_PGTABLE_PROT_W;
> > > > +
> > > > +       if (exec_fault ||
> > > > +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > > > +            (!nested || kvm_s2_trans_executable(nested))))
> > > > +               prot |= KVM_PGTABLE_PROT_X;
> > > > +
> > > > +       kvm_fault_lock(kvm);
> > >
> > > Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> > > E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> > > gmem invalidation occurs, don't we end up with stage-2 page tables
> > > referring to a stale host page? In user_mem_abort() there's the "grab
> > > mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
> > > after grabbing mmu_lock" which prevents this, but I don't really see an
> > > equivalent here.
> >
> > Indeed. We have a similar construct in kvm_translate_vncr() as well,
> > and I'd definitely expect something of the sort 'round here. If for
> > some reason this is not needed, then a comment explaining why would be
> > welcome.
> >
> > But this brings me to another interesting bit: kvm_translate_vncr() is
> > another path that deals with a guest translation fault (despite being
> > caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens
> > when the backing store is gmem? Probably nothing.
> 
> I'll add guest_memfd handling logic to kvm_translate_vncr().
> 
> > I don't immediately see why NV and gmem should be incompatible, so
> > something must be done on that front too (including the return to
> > userspace if the page is gone).
> 
> Should it return to userspace or go back to the guest?
> user_mem_abort() returns to the guest if the page disappears (I don't
> quite understand the rationale behind that, but it was a deliberate
> change [1]): on mmu_invalidate_retry() it sets ret to -EAGAIN [2],
> which gets flipped to 0 on returning from user_mem_abort() [3].

Outside of gmem, racing with an invalidation (resulting in -EAGAIN) is
never a problem. We just replay the faulting instruction.  Also,
kvm_faultin_pfn() never fails outside of error cases (guest accessing
non-memory, or writing to RO memory). So returning to the guest is
always the right thing to do, and userspace never needs to see any of
that (I ignore userfaultfd here, as that's a different matter).

With gmem, you don't really have a choice. Whoever is in charge of the
memory told you it can't get to it, and it's only fair to go back to
userspace for it to sort it out (if at all possible).

So when it comes to VNCR faults, the behaviour should be the same,
given that the faulting page *is* a guest page, even if this isn't a
stage-2 mapping that we are dealing with.

I'd expect something along the lines of the hack below, (completely
untested, as usual).

Thanks,

	M.

diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
index 5b191f4dc5668..98b1d6d4688a6 100644
--- a/arch/arm64/kvm/nested.c
+++ b/arch/arm64/kvm/nested.c
@@ -1172,8 +1172,9 @@ static u64 read_vncr_el2(struct kvm_vcpu *vcpu)
 	return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 48);
 }
 
-static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
+static int kvm_translate_vncr(struct kvm_vcpu *vcpu, bool *gmem)
 {
+	struct kvm_memory_slot *memslot;
 	bool write_fault, writable;
 	unsigned long mmu_seq;
 	struct vncr_tlb *vt;
@@ -1216,9 +1217,21 @@ static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
 	smp_rmb();
 
 	gfn = vt->wr.pa >> PAGE_SHIFT;
-	pfn = kvm_faultin_pfn(vcpu, gfn, write_fault, &writable, &page);
-	if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
-		return -EFAULT;
+	memslot = gfn_to_memslot(vcpu->kvm, gfn);
+	*gmem = kvm_slot_has_gmem(memslot);
+	if (!*gmem) {
+		pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
+					&writable, &page);
+		if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
+			return -EFAULT;
+	} else {
+		ret = kvm_gmem_get_pfn(vcpu->kvm, memslot, gfn, &pfn, &page, NULL);
+		if (ret) {
+			kvm_prepare_memory_fault_exit(vcpu, vt->wr.pa, PAGE_SIZE,
+						      write_fault, false, false);
+			return ret;
+		}
+	}
 
 	scoped_guard(write_lock, &vcpu->kvm->mmu_lock) {
 		if (mmu_invalidate_retry(vcpu->kvm, mmu_seq))
@@ -1292,14 +1305,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
 	if (esr_fsc_is_permission_fault(esr)) {
 		inject_vncr_perm(vcpu);
 	} else if (esr_fsc_is_translation_fault(esr)) {
-		bool valid;
+		bool valid, gmem = false;
 		int ret;
 
 		scoped_guard(read_lock, &vcpu->kvm->mmu_lock)
 			valid = kvm_vncr_tlb_lookup(vcpu);
 
 		if (!valid)
-			ret = kvm_translate_vncr(vcpu);
+			ret = kvm_translate_vncr(vcpu, &gmem);
 		else
 			ret = -EPERM;
 
@@ -1309,6 +1322,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
 			/* Let's try again... */
 			break;
 		case -EFAULT:
+		case -EIO:
+			/*
+			 * FIXME: Add whatever other error cases the
+			 * GMEM stuff can spit out.
+			 */
+			if (gmem)
+				return 0;
+			fallthrough;
 		case -EINVAL:
 		case -ENOENT:
 		case -EACCES:

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-09 10:59 ` [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
  2025-07-11  9:59   ` Roy, Patrick
@ 2025-07-11 16:37   ` Marc Zyngier
  2025-07-14  7:42     ` Fuad Tabba
  1 sibling, 1 reply; 40+ messages in thread
From: Marc Zyngier @ 2025-07-11 16:37 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

On Wed, 09 Jul 2025 11:59:42 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> Add arm64 architecture support for handling guest page faults on memory
> slots backed by guest_memfd.
> 
> This change introduces a new function, gmem_abort(), which encapsulates
> the fault handling logic specific to guest_memfd-backed memory. The
> kvm_handle_guest_abort() entry point is updated to dispatch to
> gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
> determined by kvm_slot_has_gmem()).
> 
> Until guest_memfd gains support for huge pages, the fault granule for
> these memory regions is restricted to PAGE_SIZE.
> 
> Reviewed-by: Gavin Shan <gshan@redhat.com>
> Reviewed-by: James Houghton <jthoughton@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> ---
>  arch/arm64/kvm/mmu.c | 82 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 79 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 58662e0ef13e..71f8b53683e7 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1512,6 +1512,78 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
>  	*prot |= kvm_encode_nested_level(nested);
>  }
>  
> +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> +
> +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> +		      struct kvm_s2_trans *nested,
> +		      struct kvm_memory_slot *memslot, bool is_perm)
> +{
> +	bool write_fault, exec_fault, writable;
> +	enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> +	enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> +	struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> +	struct page *page;
> +	struct kvm *kvm = vcpu->kvm;
> +	void *memcache;
> +	kvm_pfn_t pfn;
> +	gfn_t gfn;
> +	int ret;
> +
> +	ret = prepare_mmu_memcache(vcpu, true, &memcache);
> +	if (ret)
> +		return ret;
> +
> +	if (nested)
> +		gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> +	else
> +		gfn = fault_ipa >> PAGE_SHIFT;
> +
> +	write_fault = kvm_is_write_fault(vcpu);
> +	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> +
> +	if (write_fault && exec_fault) {
> +		kvm_err("Simultaneous write and execution fault\n");
> +		return -EFAULT;
> +	}

I don't think we need to cargo-cult this stuff. This cannot happen
architecturally (data and instruction aborts are two different
exceptions, so you can't have both at the same time), and is only
there because we were young and foolish when we wrote this crap.

Now that we (the royal We) are only foolish, we can save a few bits by
dropping it. Or turn it into a VM_BUG_ON() if you really want to keep
it.

> +
> +	if (is_perm && !write_fault && !exec_fault) {
> +		kvm_err("Unexpected L2 read permission error\n");
> +		return -EFAULT;
> +	}

Again, this is copying something that was always a bit crap:

- it's not an "error", it's a permission fault
- it's not "L2", it's "stage-2"

But this should equally be turned into an assertion, ideally in a
single spot. See below for the usual untested hack.

Thanks,

	M.

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b92ce4d9b4e01..c79dc8fd45d5a 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1540,16 +1540,7 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 
 	write_fault = kvm_is_write_fault(vcpu);
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
-
-	if (write_fault && exec_fault) {
-		kvm_err("Simultaneous write and execution fault\n");
-		return -EFAULT;
-	}
-
-	if (is_perm && !write_fault && !exec_fault) {
-		kvm_err("Unexpected L2 read permission error\n");
-		return -EFAULT;
-	}
+	VM_BUG_ON(write_fault && exec_fault);
 
 	ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
 	if (ret) {
@@ -1616,11 +1607,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
 	VM_BUG_ON(write_fault && exec_fault);
 
-	if (fault_is_perm && !write_fault && !exec_fault) {
-		kvm_err("Unexpected L2 read permission error\n");
-		return -EFAULT;
-	}
-
 	/*
 	 * Permission faults just need to update the existing leaf entry,
 	 * and so normally don't require allocations from the memcache. The
@@ -2035,6 +2021,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 	}
 
+	VM_BUG_ON(kvm_vcpu_trap_is_permission_fault(vcpu) &&
+		  !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
+
 	if (kvm_slot_has_gmem(memslot))
 		ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
 				 esr_fsc_is_permission_fault(esr));

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11 15:48         ` Marc Zyngier
@ 2025-07-14  6:35           ` Fuad Tabba
  0 siblings, 0 replies; 40+ messages in thread
From: Fuad Tabba @ 2025-07-14  6:35 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Roy, Patrick, ackerleytng@google.com, akpm@linux-foundation.org,
	amoorthy@google.com, anup@brainfault.org, aou@eecs.berkeley.edu,
	brauner@kernel.org, catalin.marinas@arm.com,
	chao.p.peng@linux.intel.com, chenhuacai@kernel.org,
	david@redhat.com, dmatlack@google.com, fvdl@google.com,
	hch@infradead.org, hughd@google.com, ira.weiny@intel.com,
	isaku.yamahata@gmail.com, isaku.yamahata@intel.com,
	james.morse@arm.com, jarkko@kernel.org, jgg@nvidia.com,
	jhubbard@nvidia.com, jthoughton@google.com, keirf@google.com,
	kirill.shutemov@linux.intel.com, kvm@vger.kernel.org,
	kvmarm@lists.linux.dev, liam.merwick@oracle.com,
	linux-arm-msm@vger.kernel.org, linux-mm@kvack.org,
	mail@maciej.szmigiero.name, mic@digikod.net, michael.roth@amd.com,
	mpe@ellerman.id.au, oliver.upton@linux.dev, palmer@dabbelt.com,
	pankaj.gupta@amd.com, paul.walmsley@sifive.com,
	pbonzini@redhat.com, peterx@redhat.com, qperret@google.com,
	quic_cvanscha@quicinc.com, quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_pderrin@quicinc.com,
	quic_pheragu@quicinc.com, quic_svaddagi@quicinc.com,
	quic_tsoni@quicinc.com, rientjes@google.com, seanjc@google.com,
	shuah@kernel.org, steven.price@arm.com, suzuki.poulose@arm.com,
	vannapurve@google.com, vbabka@suse.cz, viro@zeniv.linux.org.uk,
	wei.w.wang@intel.com, will@kernel.org, willy@infradead.org,
	xiaoyao.li@intel.com, yilun.xu@intel.com, yuzenghui@huawei.com

Hi Marc,

On Fri, 11 Jul 2025 at 16:48, Marc Zyngier <maz@kernel.org> wrote:
>
> On Fri, 11 Jul 2025 15:17:46 +0100,
> Fuad Tabba <tabba@google.com> wrote:
> >
> > Hi Marc,
> >
> > On Fri, 11 Jul 2025 at 14:50, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On Fri, 11 Jul 2025 10:59:39 +0100,
> > > "Roy, Patrick" <roypat@amazon.co.uk> wrote:
> > > >
> > > >
> > > > Hi Fuad,
> > > >
> > > > On Wed, 2025-07-09 at 11:59 +0100, Fuad Tabba wrote:> -snip-
> > > > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > > > +
> > > > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > > > +                     struct kvm_s2_trans *nested,
> > > > > +                     struct kvm_memory_slot *memslot, bool is_perm)
> > > > > +{
> > > > > +       bool write_fault, exec_fault, writable;
> > > > > +       enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > > > +       enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > > > +       struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > > > +       struct page *page;
> > > > > +       struct kvm *kvm = vcpu->kvm;
> > > > > +       void *memcache;
> > > > > +       kvm_pfn_t pfn;
> > > > > +       gfn_t gfn;
> > > > > +       int ret;
> > > > > +
> > > > > +       ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > > > +       if (ret)
> > > > > +               return ret;
> > > > > +
> > > > > +       if (nested)
> > > > > +               gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > > > +       else
> > > > > +               gfn = fault_ipa >> PAGE_SHIFT;
> > > > > +
> > > > > +       write_fault = kvm_is_write_fault(vcpu);
> > > > > +       exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > > > +
> > > > > +       if (write_fault && exec_fault) {
> > > > > +               kvm_err("Simultaneous write and execution fault\n");
> > > > > +               return -EFAULT;
> > > > > +       }
> > > > > +
> > > > > +       if (is_perm && !write_fault && !exec_fault) {
> > > > > +               kvm_err("Unexpected L2 read permission error\n");
> > > > > +               return -EFAULT;
> > > > > +       }
> > > > > +
> > > > > +       ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
> > > > > +       if (ret) {
> > > > > +               kvm_prepare_memory_fault_exit(vcpu, fault_ipa, PAGE_SIZE,
> > > > > +                                             write_fault, exec_fault, false);
> > > > > +               return ret;
> > > > > +       }
> > > > > +
> > > > > +       writable = !(memslot->flags & KVM_MEM_READONLY);
> > > > > +
> > > > > +       if (nested)
> > > > > +               adjust_nested_fault_perms(nested, &prot, &writable);
> > > > > +
> > > > > +       if (writable)
> > > > > +               prot |= KVM_PGTABLE_PROT_W;
> > > > > +
> > > > > +       if (exec_fault ||
> > > > > +           (cpus_have_final_cap(ARM64_HAS_CACHE_DIC) &&
> > > > > +            (!nested || kvm_s2_trans_executable(nested))))
> > > > > +               prot |= KVM_PGTABLE_PROT_X;
> > > > > +
> > > > > +       kvm_fault_lock(kvm);
> > > >
> > > > Doesn't this race with gmem invalidations (e.g. fallocate(PUNCH_HOLE))?
> > > > E.g. if between kvm_gmem_get_pfn() above and this kvm_fault_lock() a
> > > > gmem invalidation occurs, don't we end up with stage-2 page tables
> > > > referring to a stale host page? In user_mem_abort() there's the "grab
> > > > mmu_invalidate_seq before dropping mmap_lock and check it hasn't changed
> > > > after grabbing mmu_lock" which prevents this, but I don't really see an
> > > > equivalent here.
> > >
> > > Indeed. We have a similar construct in kvm_translate_vncr() as well,
> > > and I'd definitely expect something of the sort 'round here. If for
> > > some reason this is not needed, then a comment explaining why would be
> > > welcome.
> > >
> > > But this brings me to another interesting bit: kvm_translate_vncr() is
> > > another path that deals with a guest translation fault (despite being
> > > caught as an EL2 S1 fault), and calls kvm_faultin_pfn(). What happens
> > > when the backing store is gmem? Probably nothing.
> >
> > I'll add guest_memfd handling logic to kvm_translate_vncr().
> >
> > > I don't immediately see why NV and gmem should be incompatible, so
> > > something must be done on that front too (including the return to
> > > userspace if the page is gone).
> >
> > Should it return to userspace or go back to the guest?
> > user_mem_abort() returns to the guest if the page disappears (I don't
> > quite understand the rationale behind that, but it was a deliberate
> > change [1]): on mmu_invalidate_retry() it sets ret to -EAGAIN [2],
> > which gets flipped to 0 on returning from user_mem_abort() [3].
>
> Outside of gmem, racing with an invalidation (resulting in -EAGAIN) is
> never a problem. We just replay the faulting instruction.  Also,
> kvm_faultin_pfn() never fails outside of error cases (guest accessing
> non-memory, or writing to RO memory). So returning to the guest is
> always the right thing to do, and userspace never needs to see any of
> that (I ignore userfaultfd here, as that's a different matter).
>
> With gmem, you don't really have a choice. Whoever is in charge of the
> memory told you it can't get to it, and it's only fair to go back to
> userspace for it to sort it out (if at all possible).

Makes sense.

> So when it comes to VNCR faults, the behaviour should be the same,
> given that the faulting page *is* a guest page, even if this isn't a
> stage-2 mapping that we are dealing with.
>
> I'd expect something along the lines of the hack below, (completely
> untested, as usual).

Thanks!
/fuad

> Thanks,
>
>         M.
>
> diff --git a/arch/arm64/kvm/nested.c b/arch/arm64/kvm/nested.c
> index 5b191f4dc5668..98b1d6d4688a6 100644
> --- a/arch/arm64/kvm/nested.c
> +++ b/arch/arm64/kvm/nested.c
> @@ -1172,8 +1172,9 @@ static u64 read_vncr_el2(struct kvm_vcpu *vcpu)
>         return (u64)sign_extend64(__vcpu_sys_reg(vcpu, VNCR_EL2), 48);
>  }
>
> -static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
> +static int kvm_translate_vncr(struct kvm_vcpu *vcpu, bool *gmem)
>  {
> +       struct kvm_memory_slot *memslot;
>         bool write_fault, writable;
>         unsigned long mmu_seq;
>         struct vncr_tlb *vt;
> @@ -1216,9 +1217,21 @@ static int kvm_translate_vncr(struct kvm_vcpu *vcpu)
>         smp_rmb();
>
>         gfn = vt->wr.pa >> PAGE_SHIFT;
> -       pfn = kvm_faultin_pfn(vcpu, gfn, write_fault, &writable, &page);
> -       if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
> -               return -EFAULT;
> +       memslot = gfn_to_memslot(vcpu->kvm, gfn);
> +       *gmem = kvm_slot_has_gmem(memslot);
> +       if (!*gmem) {
> +               pfn = __kvm_faultin_pfn(memslot, gfn, write_fault ? FOLL_WRITE : 0,
> +                                       &writable, &page);
> +               if (is_error_noslot_pfn(pfn) || (write_fault && !writable))
> +                       return -EFAULT;
> +       } else {
> +               ret = kvm_gmem_get_pfn(vcpu->kvm, memslot, gfn, &pfn, &page, NULL);
> +               if (ret) {
> +                       kvm_prepare_memory_fault_exit(vcpu, vt->wr.pa, PAGE_SIZE,
> +                                                     write_fault, false, false);
> +                       return ret;
> +               }
> +       }
>
>         scoped_guard(write_lock, &vcpu->kvm->mmu_lock) {
>                 if (mmu_invalidate_retry(vcpu->kvm, mmu_seq))
> @@ -1292,14 +1305,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
>         if (esr_fsc_is_permission_fault(esr)) {
>                 inject_vncr_perm(vcpu);
>         } else if (esr_fsc_is_translation_fault(esr)) {
> -               bool valid;
> +               bool valid, gmem = false;
>                 int ret;
>
>                 scoped_guard(read_lock, &vcpu->kvm->mmu_lock)
>                         valid = kvm_vncr_tlb_lookup(vcpu);
>
>                 if (!valid)
> -                       ret = kvm_translate_vncr(vcpu);
> +                       ret = kvm_translate_vncr(vcpu, &gmem);
>                 else
>                         ret = -EPERM;
>
> @@ -1309,6 +1322,14 @@ int kvm_handle_vncr_abort(struct kvm_vcpu *vcpu)
>                         /* Let's try again... */
>                         break;
>                 case -EFAULT:
> +               case -EIO:
> +                       /*
> +                        * FIXME: Add whatever other error cases the
> +                        * GMEM stuff can spit out.
> +                        */
> +                       if (gmem)
> +                               return 0;
> +                       fallthrough;
>                 case -EINVAL:
>                 case -ENOENT:
>                 case -EACCES:
>
> --
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-11 16:37   ` Marc Zyngier
@ 2025-07-14  7:42     ` Fuad Tabba
  2025-07-14  8:04       ` Marc Zyngier
  0 siblings, 1 reply; 40+ messages in thread
From: Fuad Tabba @ 2025-07-14  7:42 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

Hi Marc,


On Fri, 11 Jul 2025 at 17:38, Marc Zyngier <maz@kernel.org> wrote:
>
> On Wed, 09 Jul 2025 11:59:42 +0100,
> Fuad Tabba <tabba@google.com> wrote:
> >
> > Add arm64 architecture support for handling guest page faults on memory
> > slots backed by guest_memfd.
> >
> > This change introduces a new function, gmem_abort(), which encapsulates
> > the fault handling logic specific to guest_memfd-backed memory. The
> > kvm_handle_guest_abort() entry point is updated to dispatch to
> > gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
> > determined by kvm_slot_has_gmem()).
> >
> > Until guest_memfd gains support for huge pages, the fault granule for
> > these memory regions is restricted to PAGE_SIZE.
> >
> > Reviewed-by: Gavin Shan <gshan@redhat.com>
> > Reviewed-by: James Houghton <jthoughton@google.com>
> > Signed-off-by: Fuad Tabba <tabba@google.com>
> > ---
> >  arch/arm64/kvm/mmu.c | 82 ++++++++++++++++++++++++++++++++++++++++++--
> >  1 file changed, 79 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 58662e0ef13e..71f8b53683e7 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1512,6 +1512,78 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
> >       *prot |= kvm_encode_nested_level(nested);
> >  }
> >
> > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > +
> > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > +                   struct kvm_s2_trans *nested,
> > +                   struct kvm_memory_slot *memslot, bool is_perm)
> > +{
> > +     bool write_fault, exec_fault, writable;
> > +     enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > +     enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > +     struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > +     struct page *page;
> > +     struct kvm *kvm = vcpu->kvm;
> > +     void *memcache;
> > +     kvm_pfn_t pfn;
> > +     gfn_t gfn;
> > +     int ret;
> > +
> > +     ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if (nested)
> > +             gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > +     else
> > +             gfn = fault_ipa >> PAGE_SHIFT;
> > +
> > +     write_fault = kvm_is_write_fault(vcpu);
> > +     exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > +
> > +     if (write_fault && exec_fault) {
> > +             kvm_err("Simultaneous write and execution fault\n");
> > +             return -EFAULT;
> > +     }
>
> I don't think we need to cargo-cult this stuff. This cannot happen
> architecturally (data and instruction aborts are two different
> exceptions, so you can't have both at the same time), and is only
> there because we were young and foolish when we wrote this crap.
>
> Now that we (the royal We) are only foolish, we can save a few bits by
> dropping it. Or turn it into a VM_BUG_ON() if you really want to keep
> it.

Will do, but if you agree, I'll go with a VM_WARN_ON_ONCE() since
VM_BUG_ON is going away [1][2]

[1] https://lore.kernel.org/all/b247be59-c76e-4eb8-8a6a-f0129e330b11@redhat.com/
[2] https://lore.kernel.org/all/20250604140544.688711-1-david@redhat.com/T/#u
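
Concretely, in the hunks from your diff below that would become just
(sketch):

	VM_WARN_ON_ONCE(write_fault && exec_fault);

noting that VM_WARN_ON_ONCE() expands to a void expression, so unlike
WARN_ON_ONCE() it can't gate an early return — the -EFAULT returns stay
dropped, as in your diff.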


> > +
> > +     if (is_perm && !write_fault && !exec_fault) {
> > +             kvm_err("Unexpected L2 read permission error\n");
> > +             return -EFAULT;
> > +     }
>
> Again, this is copying something that was always a bit crap:
>
> - it's not an "error", it's a permission fault
> - it's not "L2", it's "stage-2"
>
> But this should equally be turned into an assertion, ideally in a
> single spot. See below for the usual untested hack.

Will do, but like above, with VM_WARN_ON_ONCE() if you agree.

Thanks!
/fuad

> Thanks,
>
>         M.
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index b92ce4d9b4e01..c79dc8fd45d5a 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1540,16 +1540,7 @@ static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>
>         write_fault = kvm_is_write_fault(vcpu);
>         exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> -
> -       if (write_fault && exec_fault) {
> -               kvm_err("Simultaneous write and execution fault\n");
> -               return -EFAULT;
> -       }
> -
> -       if (is_perm && !write_fault && !exec_fault) {
> -               kvm_err("Unexpected L2 read permission error\n");
> -               return -EFAULT;
> -       }
> +       VM_BUG_ON(write_fault && exec_fault);
>
>         ret = kvm_gmem_get_pfn(kvm, memslot, gfn, &pfn, &page, NULL);
>         if (ret) {
> @@ -1616,11 +1607,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>         exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
>         VM_BUG_ON(write_fault && exec_fault);
>
> -       if (fault_is_perm && !write_fault && !exec_fault) {
> -               kvm_err("Unexpected L2 read permission error\n");
> -               return -EFAULT;
> -       }
> -
>         /*
>          * Permission faults just need to update the existing leaf entry,
>          * and so normally don't require allocations from the memcache. The
> @@ -2035,6 +2021,9 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu)
>                 goto out_unlock;
>         }
>
> +       VM_BUG_ON(kvm_vcpu_trap_is_permission_fault(vcpu) &&
> +                 !write_fault && !kvm_vcpu_trap_is_exec_fault(vcpu));
> +
>         if (kvm_slot_has_gmem(memslot))
>                 ret = gmem_abort(vcpu, fault_ipa, nested, memslot,
>                                  esr_fsc_is_permission_fault(esr));
>
> --
> Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults
  2025-07-14  7:42     ` Fuad Tabba
@ 2025-07-14  8:04       ` Marc Zyngier
  0 siblings, 0 replies; 40+ messages in thread
From: Marc Zyngier @ 2025-07-14  8:04 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-arm-msm, linux-mm, kvmarm, pbonzini, chenhuacai, mpe,
	anup, paul.walmsley, palmer, aou, seanjc, viro, brauner, willy,
	akpm, xiaoyao.li, yilun.xu, chao.p.peng, jarkko, amoorthy,
	dmatlack, isaku.yamahata, mic, vbabka, vannapurve, ackerleytng,
	mail, david, michael.roth, wei.w.wang, liam.merwick,
	isaku.yamahata, kirill.shutemov, suzuki.poulose, steven.price,
	quic_eberman, quic_mnalajal, quic_tsoni, quic_svaddagi,
	quic_cvanscha, quic_pderrin, quic_pheragu, catalin.marinas,
	james.morse, yuzenghui, oliver.upton, will, qperret, keirf,
	roypat, shuah, hch, jgg, rientjes, jhubbard, fvdl, hughd,
	jthoughton, peterx, pankaj.gupta, ira.weiny

On Mon, 14 Jul 2025 08:42:00 +0100,
Fuad Tabba <tabba@google.com> wrote:
> 
> Hi Marc,
> 
> 
> On Fri, 11 Jul 2025 at 17:38, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On Wed, 09 Jul 2025 11:59:42 +0100,
> > Fuad Tabba <tabba@google.com> wrote:
> > >
> > > Add arm64 architecture support for handling guest page faults on memory
> > > slots backed by guest_memfd.
> > >
> > > This change introduces a new function, gmem_abort(), which encapsulates
> > > the fault handling logic specific to guest_memfd-backed memory. The
> > > kvm_handle_guest_abort() entry point is updated to dispatch to
> > > gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
> > > determined by kvm_slot_has_gmem()).
> > >
> > > Until guest_memfd gains support for huge pages, the fault granule for
> > > these memory regions is restricted to PAGE_SIZE.
> > >
> > > Reviewed-by: Gavin Shan <gshan@redhat.com>
> > > Reviewed-by: James Houghton <jthoughton@google.com>
> > > Signed-off-by: Fuad Tabba <tabba@google.com>
> > > ---
> > >  arch/arm64/kvm/mmu.c | 82 ++++++++++++++++++++++++++++++++++++++++++--
> > >  1 file changed, 79 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 58662e0ef13e..71f8b53683e7 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -1512,6 +1512,78 @@ static void adjust_nested_fault_perms(struct kvm_s2_trans *nested,
> > >       *prot |= kvm_encode_nested_level(nested);
> > >  }
> > >
> > > +#define KVM_PGTABLE_WALK_MEMABORT_FLAGS (KVM_PGTABLE_WALK_HANDLE_FAULT | KVM_PGTABLE_WALK_SHARED)
> > > +
> > > +static int gmem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> > > +                   struct kvm_s2_trans *nested,
> > > +                   struct kvm_memory_slot *memslot, bool is_perm)
> > > +{
> > > +     bool write_fault, exec_fault, writable;
> > > +     enum kvm_pgtable_walk_flags flags = KVM_PGTABLE_WALK_MEMABORT_FLAGS;
> > > +     enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
> > > +     struct kvm_pgtable *pgt = vcpu->arch.hw_mmu->pgt;
> > > +     struct page *page;
> > > +     struct kvm *kvm = vcpu->kvm;
> > > +     void *memcache;
> > > +     kvm_pfn_t pfn;
> > > +     gfn_t gfn;
> > > +     int ret;
> > > +
> > > +     ret = prepare_mmu_memcache(vcpu, true, &memcache);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     if (nested)
> > > +             gfn = kvm_s2_trans_output(nested) >> PAGE_SHIFT;
> > > +     else
> > > +             gfn = fault_ipa >> PAGE_SHIFT;
> > > +
> > > +     write_fault = kvm_is_write_fault(vcpu);
> > > +     exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
> > > +
> > > +     if (write_fault && exec_fault) {
> > > +             kvm_err("Simultaneous write and execution fault\n");
> > > +             return -EFAULT;
> > > +     }
> >
> > I don't think we need to cargo-cult this stuff. This cannot happen
> > architecturally (data and instruction aborts are two different
> > exceptions, so you can't have both at the same time), and is only
> > there because we were young and foolish when we wrote this crap.
> >
> > Now that we (the royal We) are only foolish, we can save a few bits by
> > dropping it. Or turn it into a VM_BUG_ON() if you really want to keep
> > it.
> 
> Will do, but if you agree, I'll go with a VM_WARN_ON_ONCE() since
> VM_BUG_ON is going away [1][2]
> 
> [1] https://lore.kernel.org/all/b247be59-c76e-4eb8-8a6a-f0129e330b11@redhat.com/
> [2] https://lore.kernel.org/all/20250604140544.688711-1-david@redhat.com/T/#u

Ah, sure. We've never seen these anyway in any situation other than
"I've mutated this kernel so badly it's closer to a hamster".

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread

Thread overview: 40+ messages
2025-07-09 10:59 [PATCH v13 00/20] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 01/20] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GMEM Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 02/20] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_KVM_GENERIC_GMEM_POPULATE Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 03/20] KVM: Introduce kvm_arch_supports_gmem() Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 04/20] KVM: x86: Introduce kvm->arch.supports_gmem Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 05/20] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 06/20] KVM: Fix comments that refer to slots_lock Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 07/20] KVM: Fix comment that refers to kvm uapi header path Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 08/20] KVM: guest_memfd: Allow host to map guest_memfd pages Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 09/20] KVM: guest_memfd: Track guest_memfd mmap support in memslot Fuad Tabba
2025-07-11  8:34   ` Shivank Garg
2025-07-09 10:59 ` [PATCH v13 10/20] KVM: x86/mmu: Generalize private_max_mapping_level x86 op to max_mapping_level Fuad Tabba
2025-07-11  9:36   ` David Hildenbrand
2025-07-09 10:59 ` [PATCH v13 11/20] KVM: x86/mmu: Allow NULL-able fault in kvm_max_private_mapping_level Fuad Tabba
2025-07-11  9:38   ` David Hildenbrand
2025-07-09 10:59 ` [PATCH v13 12/20] KVM: x86/mmu: Consult guest_memfd when computing max_mapping_level Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 13/20] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 14/20] KVM: x86: Enable guest_memfd mmap for default VM type Fuad Tabba
2025-07-11  1:14   ` kernel test robot
2025-07-11  9:45   ` David Hildenbrand
2025-07-11 11:09     ` Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 15/20] KVM: arm64: Refactor user_mem_abort() Fuad Tabba
2025-07-11 13:25   ` Marc Zyngier
2025-07-09 10:59 ` [PATCH v13 16/20] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
2025-07-11  9:59   ` Roy, Patrick
2025-07-11 11:08     ` Fuad Tabba
2025-07-11 13:49     ` Marc Zyngier
2025-07-11 14:17       ` Fuad Tabba
2025-07-11 15:48         ` Marc Zyngier
2025-07-14  6:35           ` Fuad Tabba
2025-07-11 16:37   ` Marc Zyngier
2025-07-14  7:42     ` Fuad Tabba
2025-07-14  8:04       ` Marc Zyngier
2025-07-09 10:59 ` [PATCH v13 17/20] KVM: arm64: Enable host mapping of shared guest_memfd memory Fuad Tabba
2025-07-11 14:25   ` Marc Zyngier
2025-07-11 14:34     ` Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 18/20] KVM: Introduce the KVM capability KVM_CAP_GMEM_MMAP Fuad Tabba
2025-07-11  8:48   ` Shivank Garg
2025-07-09 10:59 ` [PATCH v13 19/20] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test Fuad Tabba
2025-07-09 10:59 ` [PATCH v13 20/20] KVM: selftests: guest_memfd mmap() test when mmap is supported Fuad Tabba
