[RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 02/37] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES Ackerley Tng
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Start plumbing in guest_memfd support for in-place private<=>shared
conversions by tracking attributes via a maple tree.  KVM currently tracks
private vs. shared attributes on a per-VM basis, which made sense when a
guest_memfd _only_ supported private memory, but tracking per-VM simply
can't work for in-place conversions as the shareability of a given page
needs to be per-gmem_inode, not per-VM.

Use the filemap invalidation lock to protect the maple tree, as taking the
lock for read when faulting in memory (for userspace or the guest) isn't
expected to result in meaningful contention, and using a separate lock
would add significant complexity (avoid deadlock is quite difficult).

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
---
 virt/kvm/guest_memfd.c | 134 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 118 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fdaea3422c30..7a970e5fab67 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include <linux/falloc.h>
 #include <linux/fs.h>
 #include <linux/kvm_host.h>
+#include <linux/maple_tree.h>
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
@@ -32,6 +33,7 @@ struct gmem_inode {
 	struct inode vfs_inode;
 
 	u64 flags;
+	struct maple_tree attributes;
 };
 
 static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
@@ -59,6 +61,31 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
+{
+	struct maple_tree *mt = &GMEM_I(inode)->attributes;
+	void *entry = mtree_load(mt, index);
+
+	/*
+	 * The lock _must_ be held for lookups, as some maple tree operations,
+	 * e.g. append, are unsafe (return inaccurate information) with respect
+	 * to concurrent RCU-protected lookups.
+	 */
+	lockdep_assert(mt_lock_is_held(mt));
+
+	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
+}
+
+static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
+{
+	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+
+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return !kvm_gmem_is_private_mem(inode, index);
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -402,10 +429,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
-		return VM_FAULT_SIGBUS;
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	else
+		folio = ERR_PTR(-EACCES);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		if (PTR_ERR(folio) == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -561,6 +591,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 	return true;
 }
 
+static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
+{
+	struct gmem_inode *gi = GMEM_I(inode);
+	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
+	u64 attrs;
+	int r;
+
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+
+	/*
+	 * guest_memfd memory is neither migratable nor swappable: set
+	 * inaccessible to gate off both.
+	 */
+	mapping_set_inaccessible(inode->i_mapping);
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	gi->flags = flags;
+
+	mt_set_external_lock(&gi->attributes,
+			     &inode->i_mapping->invalidate_lock);
+
+	/*
+	 * Store default attributes for the entire gmem instance. Ensuring every
+	 * index is represented in the maple tree at all times simplifies the
+	 * conversion and merging logic.
+	 */
+	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	/*
+	 * Acquire the invalidation lock purely to make lockdep happy.  The
+	 * maple tree library expects all stores to be protected via the lock,
+	 * and the library can't know when the tree is reachable only by the
+	 * caller, as is the case here.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	static const char *name = "[kvm-gmem]";
@@ -591,16 +666,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fops;
 	}
 
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
-	GMEM_I(inode)->flags = flags;
+	err = kvm_gmem_init_inode(inode, size, flags);
+	if (err)
+		goto err_inode;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
 	if (IS_ERR(file)) {
@@ -804,9 +872,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
 
 	if (!is_prepared)
 		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
@@ -818,6 +890,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 
+out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	return r;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
@@ -931,13 +1005,41 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 	mpol_shared_policy_init(&gi->policy, NULL);
 
+	/*
+	 * Memory attributes are protected by the filemap invalidation lock, but
+	 * the lock structure isn't available at this time.  Immediately mark
+	 * maple tree as using external locking so that accessing the tree
+	 * before it's fully initialized results in NULL pointer dereferences
+	 * and not more subtle bugs.
+	 */
+	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN);
+
 	gi->flags = 0;
 	return &gi->vfs_inode;
 }
 
 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
-	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+	struct gmem_inode *gi = GMEM_I(inode);
+
+	mpol_free_shared_policy(&gi->policy);
+
+	/*
+	 * Note!  Checking for an empty tree is functionally necessary
+	 * to avoid explosions if the tree hasn't been fully
+	 * initialized, i.e. if the inode is being destroyed before
+	 * guest_memfd can set the external lock, lockdep would find
+	 * that the tree's internal ma_lock was not held.
+	 */
+	if (!mtree_empty(&gi->attributes)) {
+		/*
+		 * Acquire the invalidation lock purely to make lockdep happy,
+		 * the inode is unreachable at this point.
+		 */
+		filemap_invalidate_lock(inode->i_mapping);
+		__mt_destroy(&gi->attributes);
+		filemap_invalidate_unlock(inode->i_mapping);
+	}
 }
 
 static void kvm_gmem_free_inode(struct inode *inode)
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 02/37] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 03/37] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined Ackerley Tng
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Rename the per-VM memory attributes Kconfig to make it explicitly about
per-VM attributes in anticipation of adding memory attributes support to
guest_memfd, at which point it will be possible (and desirable) to have
memory attributes without the per-VM support, even in x86.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/Kconfig            |  6 +++---
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/x86.c              |  2 +-
 include/linux/kvm_host.h        |  8 ++++----
 include/trace/events/kvm.h      |  4 ++--
 virt/kvm/Kconfig                |  2 +-
 virt/kvm/kvm_main.c             | 14 +++++++-------
 8 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5a3bfa293e8b..d427cf9187c3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2307,7 +2307,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 278f08194ec8..1683dbab870e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -84,7 +84,7 @@ config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
 	depends on KVM_X86 && X86_64
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -135,7 +135,7 @@ config KVM_INTEL_TDX
 	bool "Intel Trust Domain Extensions (TDX) support"
 	default y
 	depends on INTEL_TDX_HOST
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_POPULATE
 	help
 	  Provides support for launching Intel Trust Domain Extensions (TDX)
@@ -159,7 +159,7 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_GENERIC_MEMORY_ATTRIBUTES
+	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	select HAVE_KVM_ARCH_GMEM_POPULATE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02c450686b4a..c7de8ff84fd2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7890,7 +7890,7 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 		vhost_task_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static bool hugepage_test_mixed(struct kvm_memory_slot *slot, gfn_t gfn,
 				int level)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff8812f3a129..b2e93f836dca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13437,7 +13437,7 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 		}
 	}
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	kvm_mmu_init_memslot_memory_attributes(kvm, slot);
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..66a5e05bb5b7 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -721,7 +721,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
@@ -871,7 +871,7 @@ struct kvm {
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
 	struct notifier_block pm_notifier;
 #endif
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	/* Protected by slots_lock (for writes) and RCU (for reads) */
 	struct xarray mem_attr_array;
 #endif
@@ -2515,7 +2515,7 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
 }
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
 	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
@@ -2537,7 +2537,7 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index b282e3a86769..1ba72bd73ea2 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -358,7 +358,7 @@ TRACE_EVENT(kvm_dirty_ring_exit,
 	TP_printk("vcpu %d", __entry->vcpu_id)
 );
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 /*
  * @start:	Starting address of guest memory range
  * @end:	End address of guest memory range
@@ -383,7 +383,7 @@ TRACE_EVENT(kvm_vm_set_mem_attributes,
 	TP_printk("%#016llx -- %#016llx [0x%lx]",
 		  __entry->start, __entry->end, __entry->attr)
 );
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 TRACE_EVENT(kvm_unmap_hva_range,
 	TP_PROTO(unsigned long start, unsigned long end),
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 267c7369c765..c12dc19f0a5e 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -105,7 +105,7 @@ config KVM_MMU_LOCKLESS_AGING
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
-config KVM_GENERIC_MEMORY_ATTRIBUTES
+config KVM_VM_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5b69c97665..11c34311b0ac 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1132,7 +1132,7 @@ static struct kvm *kvm_create_vm(unsigned long type, const char *fdname)
 	spin_lock_init(&kvm->mn_invalidate_lock);
 	rcuwait_init(&kvm->mn_memslots_update_rcuwait);
 	xa_init(&kvm->vcpu_array);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_init(&kvm->mem_attr_array);
 #endif
 
@@ -1323,7 +1323,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 	cleanup_srcu_struct(&kvm->irq_srcu);
 	srcu_barrier(&kvm->srcu);
 	cleanup_srcu_struct(&kvm->srcu);
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	xa_destroy(&kvm->mem_attr_array);
 #endif
 	kvm_arch_free_vm(kvm);
@@ -2441,7 +2441,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 	if (!kvm || kvm_arch_has_private_mem(kvm))
@@ -2646,7 +2646,7 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
 }
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
@@ -4943,7 +4943,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_SYSTEM_EVENT_DATA:
 	case KVM_CAP_DEVICE_CTRL:
 		return 1;
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
 		return kvm_supported_mem_attributes(kvm);
 #endif
@@ -5347,7 +5347,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
-#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_SET_MEMORY_ATTRIBUTES: {
 		struct kvm_memory_attributes attrs;
 
@@ -5358,7 +5358,7 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
 		break;
 	}
-#endif /* CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES */
+#endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 	case KVM_CREATE_DEVICE: {
 		struct kvm_create_device cd;
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 03/37] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 02/37] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 04/37] KVM: Stub in ability to disable per-VM memory attribute tracking Ackerley Tng
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Explicitly guard reporting support for KVM_MEMORY_ATTRIBUTE_PRIVATE based
on kvm_arch_has_private_mem being #defined in anticipation of decoupling
kvm_supported_mem_attributes() from CONFIG_KVM_VM_MEMORY_ATTRIBUTES.
guest_memfd support for memory attributes will be unconditional to avoid
yet more macros (all architectures that support guest_memfd are expected to
use per-gmem attributes at some point), at which point enumerating support
KVM_MEMORY_ATTRIBUTE_PRIVATE based solely on memory attributes being
supported _somewhere_ would result in KVM over-reporting support on arm64.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_host.h | 2 +-
 virt/kvm/kvm_main.c      | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 66a5e05bb5b7..af2fcfff7692 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -721,7 +721,7 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
 }
 #endif
 
-#ifndef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifndef kvm_arch_has_private_mem
 static inline bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return false;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11c34311b0ac..26e0d532ba03 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2444,8 +2444,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
+#ifdef kvm_arch_has_private_mem
 	if (!kvm || kvm_arch_has_private_mem(kvm))
 		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+#endif
 
 	return 0;
 }
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 04/37] KVM: Stub in ability to disable per-VM memory attribute tracking
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (2 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 03/37] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 05/37] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes Ackerley Tng
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Introduce the basic infrastructure to allow per-VM memory attribute
tracking to be disabled. This will be built-upon in a later patch, where a
module param can disable per-VM memory attribute tracking.

Split the Kconfig option into a base KVM_MEMORY_ATTRIBUTES and the
existing KVM_VM_MEMORY_ATTRIBUTES. The base option provides the core
plumbing, while the latter enables the full per-VM tracking via an xarray
and the associated ioctls.

kvm_get_memory_attributes() now performs a static call that either looks up
kvm->mem_attr_array with CONFIG_KVM_VM_MEMORY_ATTRIBUTES is enabled, or
just returns 0 otherwise. The static call can be patched depending on
whether per-VM tracking is enabled by the CONFIG.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 include/linux/kvm_host.h        | 23 ++++++++++-------
 virt/kvm/Kconfig                |  6 ++++-
 virt/kvm/kvm_main.c             | 44 ++++++++++++++++++++++++++++++++-
 4 files changed, 63 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d427cf9187c3..f8f0de0c367e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2307,7 +2307,7 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 		       int tdp_max_root_level, int tdp_huge_page_level);
 
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
 #endif
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index af2fcfff7692..0e7920a0f957 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2515,19 +2515,15 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 	return slot->flags & KVM_MEMSLOT_GMEM_ONLY;
 }
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
+typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
+DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
+
 static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
-	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+	return static_call(__kvm_get_memory_attributes)(kvm, gfn);
 }
 
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
-				     unsigned long mask, unsigned long attrs);
-bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
-					struct kvm_gfn_range *range);
-bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
-					 struct kvm_gfn_range *range);
-
 static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
@@ -2537,6 +2533,15 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 {
 	return false;
 }
+#endif
+
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				     unsigned long mask, unsigned long attrs);
+bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
+					struct kvm_gfn_range *range);
+bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
+					 struct kvm_gfn_range *range);
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index c12dc19f0a5e..02c4c653bb31 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -105,10 +105,14 @@ config KVM_MMU_LOCKLESS_AGING
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
-config KVM_VM_MEMORY_ATTRIBUTES
+config KVM_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
+config KVM_VM_MEMORY_ATTRIBUTES
+       select KVM_MEMORY_ATTRIBUTES
+       bool
+
 config KVM_GUEST_MEMFD
        depends on KVM_GENERIC_MMU_NOTIFIER
        select XARRAY_MULTI
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 26e0d532ba03..4dc9fd941ecb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -102,6 +102,17 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(halt_poll_ns_shrink);
 static bool allow_unsafe_mappings;
 module_param(allow_unsafe_mappings, bool, 0444);
 
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+static bool vm_memory_attributes = true;
+#else
+#define vm_memory_attributes false
+#endif
+DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
+EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_TRAMP(__kvm_get_memory_attributes));
+#endif
+
 /*
  * Ordering of locks:
  *
@@ -2441,7 +2452,7 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
-#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+#ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 {
 #ifdef kvm_arch_has_private_mem
@@ -2452,6 +2463,12 @@ static u64 kvm_supported_mem_attributes(struct kvm *kvm)
 	return 0;
 }
 
+#ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
+}
+
 /*
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * such that the bits in @mask match @attrs.
@@ -2648,7 +2665,24 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 
 	return kvm_vm_set_mem_attributes(kvm, start, end, attrs->attributes);
 }
+#else  /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	BUILD_BUG_ON(1);
+}
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
+static void kvm_init_memory_attributes(void)
+{
+	if (vm_memory_attributes)
+		static_call_update(__kvm_get_memory_attributes,
+				   kvm_get_vm_memory_attributes);
+	else
+		static_call_update(__kvm_get_memory_attributes,
+				   (void *)__static_call_return0);
+}
+#else  /* CONFIG_KVM_MEMORY_ATTRIBUTES */
+static void kvm_init_memory_attributes(void) { }
+#endif /* CONFIG_KVM_MEMORY_ATTRIBUTES */
 
 struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn)
 {
@@ -4947,6 +4981,9 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return 1;
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 	case KVM_CAP_MEMORY_ATTRIBUTES:
+		if (!vm_memory_attributes)
+			return 0;
+
 		return kvm_supported_mem_attributes(kvm);
 #endif
 #ifdef CONFIG_KVM_GUEST_MEMFD
@@ -5353,6 +5390,10 @@ static long kvm_vm_ioctl(struct file *filp,
 	case KVM_SET_MEMORY_ATTRIBUTES: {
 		struct kvm_memory_attributes attrs;
 
+		r = -ENOTTY;
+		if (!vm_memory_attributes)
+			goto out;
+
 		r = -EFAULT;
 		if (copy_from_user(&attrs, argp, sizeof(attrs)))
 			goto out;
@@ -6539,6 +6580,7 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module)
 	kvm_preempt_ops.sched_in = kvm_sched_in;
 	kvm_preempt_ops.sched_out = kvm_sched_out;
 
+	kvm_init_memory_attributes();
 	kvm_init_debug();
 
 	r = kvm_vfio_ops_init();
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 05/37] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (3 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 04/37] KVM: Stub in ability to disable per-VM memory attribute tracking Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 06/37] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes Ackerley Tng
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Implement kvm_gmem_get_memory_attributes() for guest_memfd to allow the KVM
core and architecture code to query per-GFN memory attributes.

kvm_gmem_get_memory_attributes() finds the memory slot for a given GFN and
queries the guest_memfd file's to determine if the page is marked as
private.

If vm_memory_attributes is not enabled, there is no shared/private tracking
at the VM level. Install the guest_memfd implementation as long as
guest_memfd is enabled to give guest_memfd a chance to respond on
attributes.

guest_memfd should look up attributes regardless of whether this memslot is
gmem-only since attributes are now tracked by gmem regardless of whether
mmap() is enabled.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/guest_memfd.c   | 37 +++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |  3 +++
 3 files changed, 42 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e7920a0f957..c12ee89392d8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2544,6 +2544,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
+unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
+
 #ifdef CONFIG_KVM_GUEST_MEMFD
 int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 		     gfn_t gfn, kvm_pfn_t *pfn, struct page **page,
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 7a970e5fab67..810fec394075 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -515,6 +515,43 @@ static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
+{
+	struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+	struct inode *inode;
+	unsigned long attrs;
+
+	/*
+	 * If this gfn has no associated memslot, there's no chance of the gfn
+	 * being backed by private memory, since guest_memfd must be used for
+	 * private memory, and guest_memfd must be associated with some memslot.
+	 */
+	if (!slot)
+		return 0;
+
+	CLASS(gmem_get_file, file)(slot);
+	if (!file)
+		return 0;
+
+	inode = file_inode(file);
+
+	/*
+	 * Acquire the filemap lock to ensure the mtree lookup gets a
+	 * stable result.  The caller _must_ still protect consumption
+	 * of private vs. shared by checking
+	 * mmu_invalidate_retry_gfn() under mmu_lock to serialize
+	 * against ongoing attribute updates.  Acquiring the filemap
+	 * lock only ensures a stable _lookup_, the result can become
+	 * stale as soon as the lock is dropped.
+	 */
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	attrs = kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
+	return attrs;
+}
+EXPORT_SYMBOL_GPL(kvm_gmem_get_memory_attributes);
+
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4dc9fd941ecb..5d06e2c74ae7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2676,6 +2676,9 @@ static void kvm_init_memory_attributes(void)
 	if (vm_memory_attributes)
 		static_call_update(__kvm_get_memory_attributes,
 				   kvm_get_vm_memory_attributes);
+	else if (IS_ENABLED(CONFIG_KVM_GUEST_MEMFD))
+		static_call_update(__kvm_get_memory_attributes,
+				   kvm_gmem_get_memory_attributes);
 	else
 		static_call_update(__kvm_get_memory_attributes,
 				   (void *)__static_call_return0);
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 06/37] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (4 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 05/37] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 07/37] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
                   ` (31 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Update the guest_memfd populate() flow to pull memory attributes from the
gmem instance instead of the VM when KVM is not configured to track
shared/private status in the VM.

Rename the per-VM API to make it clear that it retrieves per-VM
attributes, i.e. is not suitable for use outside of flows that are
specific to generic per-VM attributes.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c   |  2 +-
 include/linux/kvm_host.h | 14 +++++++++++++-
 virt/kvm/guest_memfd.c   | 26 +++++++++++++++++++++++---
 virt/kvm/kvm_main.c      |  8 +++-----
 4 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c7de8ff84fd2..25ab6b8901e2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -7979,7 +7979,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot,
 	const unsigned long end = start + KVM_PAGES_PER_HPAGE(level);
 
 	if (level == PG_LEVEL_2M)
-		return kvm_range_has_memory_attributes(kvm, start, end, ~0, attrs);
+		return kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attrs);
 
 	for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) {
 		if (hugepage_test_mixed(slot, gfn, level - 1) ||
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c12ee89392d8..8f1e10a503f4 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2536,12 +2536,24 @@ static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn)
 #endif
 
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+extern bool vm_memory_attributes;
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs);
 bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
 					struct kvm_gfn_range *range);
 bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
 					 struct kvm_gfn_range *range);
+#else
+#define vm_memory_attributes false
+static inline bool kvm_range_has_vm_memory_attributes(struct kvm *kvm,
+						      gfn_t start, gfn_t end,
+						      unsigned long mask,
+						      unsigned long attrs)
+{
+	WARN_ONCE(1, "Unexpected call to kvm_range_has_vm_memory_attributes()");
+
+	return false;
+}
 #endif /* CONFIG_KVM_VM_MEMORY_ATTRIBUTES */
 
 unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn);
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 810fec394075..6b3477b226ad 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -934,10 +934,30 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
+static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
+				      size_t nr_pages, struct kvm *kvm, gfn_t gfn)
+{
+	pgoff_t end = index + nr_pages - 1;
+	void *entry;
+
+	if (vm_memory_attributes)
+		return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
+						       KVM_MEMORY_ATTRIBUTE_PRIVATE,
+						       KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	mt_for_each(&gi->attributes, entry, index, end) {
+		if (xa_to_value(entry) != attributes)
+			return false;
+	}
+
+	return true;
+}
+
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
 		       kvm_gmem_populate_cb post_populate, void *opaque)
 {
 	struct kvm_memory_slot *slot;
+	struct gmem_inode *gi;
 	void __user *p;
 
 	int ret = 0, max_order;
@@ -956,6 +976,8 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 	if (!file)
 		return -EFAULT;
 
+	gi = GMEM_I(file_inode(file));
+
 	filemap_invalidate_lock(file->f_mapping);
 
 	npages = min_t(ulong, slot->npages - (start_gfn - slot->base_gfn), npages);
@@ -989,9 +1011,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 			(npages - i) < (1 << max_order));
 
 		ret = -EINVAL;
-		while (!kvm_range_has_memory_attributes(kvm, gfn, gfn + (1 << max_order),
-							KVM_MEMORY_ATTRIBUTE_PRIVATE,
-							KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+		while (!kvm_gmem_range_is_private(gi, index, 1 << max_order, kvm, gfn)) {
 			if (!max_order)
 				goto put_folio_and_exit;
 			max_order--;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5d06e2c74ae7..666d1f7fbf07 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -104,9 +104,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
 
 #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
-static bool vm_memory_attributes = true;
-#else
-#define vm_memory_attributes false
+bool vm_memory_attributes = true;
 #endif
 DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
@@ -2473,7 +2471,7 @@ static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
  * Returns true if _all_ gfns in the range [@start, @end) have attributes
  * such that the bits in @mask match @attrs.
  */
-bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+bool kvm_range_has_vm_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				     unsigned long mask, unsigned long attrs)
 {
 	XA_STATE(xas, &kvm->mem_attr_array, start);
@@ -2607,7 +2605,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_lock(&kvm->slots_lock);
 
 	/* Nothing to do if the entire range has the desired attributes. */
-	if (kvm_range_has_memory_attributes(kvm, start, end, ~0, attributes))
+	if (kvm_range_has_vm_memory_attributes(kvm, start, end, ~0, attributes))
 		goto out_unlock;
 
 	/*
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 07/37] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (5 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 06/37] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 08/37] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs Ackerley Tng
                   ` (30 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Introduce a "version 2" of KVM_SET_MEMORY_ATTRIBUTES to support returning
information back to userspace.

This new ioctl and structure will, in a later patch, be shared as a
guest_memfd ioctl, where the padding in the new kvm_memory_attributes2
structure will be for writing the response from the guest_memfd ioctl to
userspace.

A new ioctl is necessary for these reasons:

1. KVM_SET_MEMORY_ATTRIBUTES is currently a write-only ioctl and does not
   allow userspace to read fields. There's nothing in code (yet?) that
   validates this, but using _IOWR for consistency would be prudent.

2. KVM_SET_MEMORY_ATTRIBUTES, when used as a guest_memfd ioctl, will need
   an additional field to provide userspace with more error details.

Alternatively, a completely new ioctl could be defined, unrelated to
KVM_SET_MEMORY_ATTRIBUTES, but using the same ioctl number and struct for
the vm and guest_memfd ioctls streamlines the interface for userspace. In
addition, any memory attributes, implemented on the vm or guest_memfd
ioctl, can be easily shared with the other.

Add KVM_CAP_MEMORY_ATTRIBUTES2 to indicate that struct
kvm_memory_attributes2 exists and can be used either with
KVM_SET_MEMORY_ATTRIBUTES2 via the vm or guest_memfd ioctl.

Since KVM_SET_MEMORY_ATTRIBUTES2 is not limited to be used only with the vm
ioctl, return 1 for KVM_CAP_MEMORY_ATTRIBUTES2 as long as struct
kvm_memory_attributes2 and KVM_SET_MEMORY_ATTRIBUTES2 can be
used. KVM_CAP_MEMORY_ATTRIBUTES must still be used to actually get valid
attributes.

Handle KVM_CAP_MEMORY_ATTRIBUTES2 and return 1 regardless of
CONFIG_KVM_VM_MEMORY_ATTRIBUTES, since KVM_SET_MEMORY_ATTRIBUTES2 is not
limited to a vm ioctl and can also be used with the guest_memfd ioctl.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 Documentation/virt/kvm/api.rst | 32 +++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 12 ++++++++++++
 virt/kvm/kvm_main.c            | 35 +++++++++++++++++++++++++++++++---
 3 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 01a3abef8abb..23ec0b0c3e22 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6360,6 +6360,8 @@ S390:
 Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set.
 Returns -EINVAL if called on a protected VM.
 
+.. _KVM_SET_MEMORY_ATTRIBUTES:
+
 4.141 KVM_SET_MEMORY_ATTRIBUTES
 -------------------------------
 
@@ -6517,6 +6519,36 @@ the capability to be present.
 
 `flags` must currently be zero.
 
+4.144 KVM_SET_MEMORY_ATTRIBUTES2
+---------------------------------
+
+:Capability: KVM_CAP_MEMORY_ATTRIBUTES2
+:Architectures: x86
+:Type: vm ioctl
+:Parameters: struct kvm_memory_attributes2 (in/out)
+:Returns: 0 on success, <0 on error
+
+KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
+KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
+userspace.  The original (pre-extension) fields are shared with
+KVM_SET_MEMORY_ATTRIBUTES identically.
+
+Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
+
+::
+
+  struct kvm_memory_attributes2 {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+	__u64 reserved[12];
+  };
+
+  #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
+
 
 .. _kvm_run:
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dddb781b0507..7f8dac4f4fd3 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -974,6 +974,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_GUEST_MEMFD_FLAGS 244
 #define KVM_CAP_ARM_SEA_TO_USER 245
 #define KVM_CAP_S390_USER_OPEREXEC 246
+#define KVM_CAP_MEMORY_ATTRIBUTES2 247
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1607,6 +1608,17 @@ struct kvm_memory_attributes {
 	__u64 flags;
 };
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES2 */
+#define KVM_SET_MEMORY_ATTRIBUTES2              _IOWR(KVMIO,  0xd2, struct kvm_memory_attributes2)
+
+struct kvm_memory_attributes2 {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+	__u64 reserved[12];
+};
+
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 666d1f7fbf07..ad70101c2e3f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2637,7 +2637,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	return r;
 }
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
-					   struct kvm_memory_attributes *attrs)
+					   struct kvm_memory_attributes2 *attrs)
 {
 	gfn_t start, end;
 
@@ -4981,6 +4981,7 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_DEVICE_CTRL:
 		return 1;
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+	case KVM_CAP_MEMORY_ATTRIBUTES2:
 	case KVM_CAP_MEMORY_ATTRIBUTES:
 		if (!vm_memory_attributes)
 			return 0;
@@ -5206,6 +5207,14 @@ do {										\
 		     sizeof_field(struct kvm_userspace_memory_region2, field));	\
 } while (0)
 
+#define SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(field)				\
+do {										\
+	BUILD_BUG_ON(offsetof(struct kvm_memory_attributes, field) !=		\
+		     offsetof(struct kvm_memory_attributes2, field));		\
+	BUILD_BUG_ON(sizeof_field(struct kvm_memory_attributes, field) !=	\
+		     sizeof_field(struct kvm_memory_attributes2, field));	\
+} while (0)
+
 static long kvm_vm_ioctl(struct file *filp,
 			   unsigned int ioctl, unsigned long arg)
 {
@@ -5388,15 +5397,35 @@ static long kvm_vm_ioctl(struct file *filp,
 	}
 #endif /* CONFIG_HAVE_KVM_IRQ_ROUTING */
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
+	case KVM_SET_MEMORY_ATTRIBUTES2:
 	case KVM_SET_MEMORY_ATTRIBUTES: {
-		struct kvm_memory_attributes attrs;
+		struct kvm_memory_attributes2 attrs;
+		unsigned long size;
+
+		if (ioctl == KVM_SET_MEMORY_ATTRIBUTES) {
+			/*
+			 * Fields beyond struct kvm_memory_attributes shouldn't
+			 * be accessed, but avoid leaking kernel memory in case
+			 * of a bug.
+			 */
+			memset(&attrs, 0, sizeof(attrs));
+			size = sizeof(struct kvm_memory_attributes);
+		} else {
+			size = sizeof(struct kvm_memory_attributes2);
+		}
+
+		/* Ensure the common parts of the two structs are identical. */
+		SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(address);
+		SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(size);
+		SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(attributes);
+		SANITY_CHECK_MEMORY_ATTRIBUTES_FIELD(flags);
 
 		r = -ENOTTY;
 		if (!vm_memory_attributes)
 			goto out;
 
 		r = -EFAULT;
-		if (copy_from_user(&attrs, argp, sizeof(attrs)))
+		if (copy_from_user(&attrs, argp, size))
 			goto out;
 
 		r = kvm_vm_ioctl_set_mem_attributes(kvm, &attrs);
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 08/37] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (6 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 07/37] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Now that guest_memfd supports tracking private vs. shared within gmem
itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
are disabled, i.e. when it's actually possible for a guest_memfd instance
to contain shared memory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b2e93f836dca..6518cdb4569f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13984,14 +13984,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
 }
 
 #ifdef CONFIG_KVM_GUEST_MEMFD
-/*
- * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
- * with private memory (the private vs. shared tracking needs to be moved into
- * guest_memfd).
- */
 bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 {
-	return !kvm_arch_has_private_mem(kvm);
+	/*
+	 * INIT_SHARED isn't supported if the memory attributes are per-VM,
+	 * in which case guest_memfd can _only_ be used for private memory.
+	 */
+	return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);
 }
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (7 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 08/37] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-14 20:09   ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 10/37] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check Ackerley Tng
                   ` (28 subsequent siblings)
  37 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

For shared to private conversions, if refcounts on any of the folios
within the range are elevated, fail the conversion with -EAGAIN.

At the point of shared to private conversion, all folios in range are
also unmapped. The filemap_invalidate_lock() is held, so no faulting
can occur. Hence, from that point on, only transient refcounts can be
taken on the folios associated with that guest_memfd.

Hence, it is safe to do the conversion from shared to private.

After conversion is complete, refcounts may become elevated, but that
is fine since users of transient refcounts don't actually access
memory.

For private to shared conversions, there are no refcount checks, since
the guest is the only user of private pages, and guest_memfd will be the
only holder of refcounts on private pages.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  48 +++++++-
 include/linux/kvm_host.h       |  10 ++
 include/uapi/linux/kvm.h       |   9 +-
 virt/kvm/Kconfig               |   2 +-
 virt/kvm/guest_memfd.c         | 210 +++++++++++++++++++++++++++++++--
 virt/kvm/kvm_main.c            |  15 +--
 6 files changed, 263 insertions(+), 31 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 23ec0b0c3e22..26e80745c8b4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -117,7 +117,7 @@ description:
       x86 includes both i386 and x86_64.
 
   Type:
-      system, vm, or vcpu.
+      system, vm, vcpu or guest_memfd.
 
   Parameters:
       what parameters are accepted by the ioctl.
@@ -6523,11 +6523,22 @@ the capability to be present.
 ---------------------------------
 
 :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
-:Architectures: x86
-:Type: vm ioctl
+:Architectures: all
+:Type: vm, guest_memfd ioctl
 :Parameters: struct kvm_memory_attributes2 (in/out)
 :Returns: 0 on success, <0 on error
 
+Errors:
+
+  ========== ===============================================================
+  EINVAL     The specified `offset` or `size` were invalid (e.g. not
+             page aligned, causes an overflow, or size is zero).
+  EFAULT     The parameter address was invalid.
+  EAGAIN     Some page within requested range had unexpected refcounts. The
+             offset of the page will be returned in `error_offset`.
+  ENOMEM     Ran out of memory trying to track private/shared state
+  ========== ===============================================================
+
 KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
 KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
 userspace.  The original (pre-extension) fields are shared with
@@ -6538,15 +6549,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
 ::
 
   struct kvm_memory_attributes2 {
-	__u64 address;
+	/* in */
+	union {
+		__u64 address;
+		__u64 offset;
+	};
 	__u64 size;
 	__u64 attributes;
 	__u64 flags;
-	__u64 reserved[12];
+	/* out */
+	__u64 error_offset;
+	__u64 reserved[11];
   };
 
   #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
 
+Set attributes for a range of offsets within a guest_memfd to
+KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
+memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
+supported, after a successful call to set
+KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
+into host userspace and will only be mappable by the guest.
+
+To allow the range to be mappable into host userspace again, call
+KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
+KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
+
+If this ioctl returns -EAGAIN, the offset of the page with unexpected
+refcounts will be returned in `error_offset`. This can occur if there
+are transient refcounts on the pages, taken by other parts of the
+kernel.
+
+Userspace is expected to figure out how to remove all known refcounts
+on the shared pages, such as refcounts taken by get_user_pages(), and
+try the ioctl again. A possible source of these long term refcounts is
+if the guest_memfd memory was pinned in IOMMU page tables.
+
 See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
 
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8f1e10a503f4..8276187f5e4c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2516,6 +2516,16 @@ static inline bool kvm_memslot_is_gmem_only(const struct kvm_memory_slot *slot)
 }
 
 #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
+static inline u64 kvm_supported_mem_attributes(struct kvm *kvm)
+{
+#ifdef kvm_arch_has_private_mem
+	if (!kvm || kvm_arch_has_private_mem(kvm))
+		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
+#endif
+
+	return 0;
+}
+
 typedef unsigned long (kvm_get_memory_attributes_t)(struct kvm *kvm, gfn_t gfn);
 DECLARE_STATIC_CALL(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 7f8dac4f4fd3..fba5e0c851f2 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -975,6 +975,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_ARM_SEA_TO_USER 245
 #define KVM_CAP_S390_USER_OPEREXEC 246
 #define KVM_CAP_MEMORY_ATTRIBUTES2 247
+#define KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES 248
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
@@ -1612,11 +1613,15 @@ struct kvm_memory_attributes {
 #define KVM_SET_MEMORY_ATTRIBUTES2              _IOWR(KVMIO,  0xd2, struct kvm_memory_attributes2)
 
 struct kvm_memory_attributes2 {
-	__u64 address;
+	union {
+		__u64 address;
+		__u64 offset;
+	};
 	__u64 size;
 	__u64 attributes;
 	__u64 flags;
-	__u64 reserved[12];
+	__u64 error_offset;
+	__u64 reserved[11];
 };
 
 #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 02c4c653bb31..f42bc6e7de44 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -114,7 +114,7 @@ config KVM_VM_MEMORY_ATTRIBUTES
        bool
 
 config KVM_GUEST_MEMFD
-       depends on KVM_GENERIC_MMU_NOTIFIER
+       select KVM_MEMORY_ATTRIBUTES
        select XARRAY_MULTI
        bool
 
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 6b3477b226ad..1cd8024cdb39 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -86,6 +86,23 @@ static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
 	return !kvm_gmem_is_private_mem(inode, index);
 }
 
+static bool kvm_gmem_range_has_attributes(struct maple_tree *mt,
+					  pgoff_t index, size_t nr_pages,
+					  u64 attributes)
+{
+	pgoff_t end = index + nr_pages - 1;
+	void *entry;
+
+	lockdep_assert(mt_lock_is_held(mt));
+
+	mt_for_each(mt, entry, index, end) {
+		if (xa_to_value(entry) != attributes)
+			return false;
+	}
+
+	return true;
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -183,10 +200,12 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 
 static enum kvm_gfn_range_filter kvm_gmem_get_invalidate_filter(struct inode *inode)
 {
-	if (GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED)
-		return KVM_FILTER_SHARED;
-
-	return KVM_FILTER_PRIVATE;
+	/*
+	 * TODO: Limit invalidations based on the to-be-invalidated range, i.e.
+	 *       invalidate shared/private if and only if there can possibly be
+	 *       such mappings.
+	 */
+	return KVM_FILTER_SHARED | KVM_FILTER_PRIVATE;
 }
 
 static void __kvm_gmem_invalidate_begin(struct gmem_file *f, pgoff_t start,
@@ -552,11 +571,183 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_get_memory_attributes);
 
+static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
+					    size_t nr_pages, pgoff_t *err_index)
+{
+	struct address_space *mapping = inode->i_mapping;
+	const int filemap_get_folios_refcount = 1;
+	pgoff_t last = start + nr_pages - 1;
+	struct folio_batch fbatch;
+	bool safe = true;
+	int i;
+
+	folio_batch_init(&fbatch);
+	while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
+
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio = fbatch.folios[i];
+
+			if (folio_ref_count(folio) !=
+			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
+				safe = false;
+				*err_index = folio->index;
+				break;
+			}
+		}
+
+		folio_batch_release(&fbatch);
+	}
+
+	return safe;
+}
+
+/*
+ * Preallocate memory for attributes to be stored on a maple tree, pointed to
+ * by mas.  Adjacent ranges with attributes identical to the new attributes
+ * will be merged.  Also sets mas's bounds up for storing attributes.
+ *
+ * This maintains the invariant that ranges with the same attributes will
+ * always be merged.
+ */
+static int kvm_gmem_mas_preallocate(struct ma_state *mas, u64 attributes,
+				    pgoff_t start, size_t nr_pages)
+{
+	pgoff_t end = start + nr_pages;
+	pgoff_t last = end - 1;
+	void *entry;
+
+	/* Try extending range. entry is NULL on overflow/wrap-around. */
+	mas_set_range(mas, end, end);
+	entry = mas_find(mas, end);
+	if (entry && xa_to_value(entry) == attributes)
+		last = mas->last;
+
+	if (start > 0) {
+		mas_set_range(mas, start - 1, start - 1);
+		entry = mas_find(mas, start - 1);
+		if (entry && xa_to_value(entry) == attributes)
+			start = mas->index;
+	}
+
+	mas_set_range(mas, start, last);
+	return mas_preallocate(mas, xa_mk_value(attributes), GFP_KERNEL);
+}
+
+static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
+				     size_t nr_pages, uint64_t attrs,
+				     pgoff_t *err_index)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct gmem_inode *gi = GMEM_I(inode);
+	pgoff_t end = start + nr_pages;
+	struct maple_tree *mt;
+	struct ma_state mas;
+	int r;
+
+	mt = &gi->attributes;
+
+	filemap_invalidate_lock(mapping);
+
+	mas_init(&mas, mt, start);
+
+	if (kvm_gmem_range_has_attributes(mt, start, nr_pages, attrs)) {
+		r = 0;
+		goto out;
+	}
+
+	r = kvm_gmem_mas_preallocate(&mas, attrs, start, nr_pages);
+	if (r) {
+		*err_index = start;
+		goto out;
+	}
+
+	if (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE) {
+		unmap_mapping_pages(mapping, start, nr_pages, false);
+
+		if (!kvm_gmem_is_safe_for_conversion(inode, start, nr_pages,
+						     err_index)) {
+			mas_destroy(&mas);
+			r = -EAGAIN;
+			goto out;
+		}
+	}
+
+	kvm_gmem_invalidate_begin(inode, start, end);
+
+	mas_store_prealloc(&mas, xa_mk_value(attrs));
+
+	kvm_gmem_invalidate_end(inode, start, end);
+out:
+	filemap_invalidate_unlock(mapping);
+	return r;
+}
+
+static long kvm_gmem_set_attributes(struct file *file, void __user *argp)
+{
+	struct gmem_file *f = file->private_data;
+	struct inode *inode = file_inode(file);
+	struct kvm_memory_attributes2 attrs;
+	pgoff_t err_index;
+	size_t nr_pages;
+	pgoff_t index;
+	int i, r;
+
+	if (copy_from_user(&attrs, argp, sizeof(attrs)))
+		return -EFAULT;
+
+	if (attrs.flags)
+		return -EINVAL;
+	if (attrs.error_offset)
+		return -EINVAL;
+	for (i = 0; i < ARRAY_SIZE(attrs.reserved); i++) {
+		if (attrs.reserved[i])
+			return -EINVAL;
+	}
+	if (attrs.attributes & ~kvm_supported_mem_attributes(f->kvm))
+		return -EINVAL;
+	if (attrs.size == 0 || attrs.offset + attrs.size < attrs.offset)
+		return -EINVAL;
+	if (!PAGE_ALIGNED(attrs.offset) || !PAGE_ALIGNED(attrs.size))
+		return -EINVAL;
+
+	if (attrs.offset >= inode->i_size ||
+	    attrs.offset + attrs.size > inode->i_size)
+		return -EINVAL;
+
+	nr_pages = attrs.size >> PAGE_SHIFT;
+	index = attrs.offset >> PAGE_SHIFT;
+	r = __kvm_gmem_set_attributes(inode, index, nr_pages, attrs.attributes,
+				      &err_index);
+	if (r) {
+		attrs.error_offset = err_index << PAGE_SHIFT;
+
+		if (copy_to_user(argp, &attrs, sizeof(attrs)))
+			return -EFAULT;
+	}
+
+	return r;
+}
+
+static long kvm_gmem_ioctl(struct file *file, unsigned int ioctl,
+			   unsigned long arg)
+{
+	switch (ioctl) {
+	case KVM_SET_MEMORY_ATTRIBUTES2:
+		if (vm_memory_attributes)
+			return -ENOTTY;
+
+		return kvm_gmem_set_attributes(file, (void __user *)arg);
+	default:
+		return -ENOTTY;
+	}
+}
+
 static struct file_operations kvm_gmem_fops = {
 	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
+	.unlocked_ioctl	= kvm_gmem_ioctl,
 };
 
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -937,20 +1128,13 @@ EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
 static bool kvm_gmem_range_is_private(struct gmem_inode *gi, pgoff_t index,
 				      size_t nr_pages, struct kvm *kvm, gfn_t gfn)
 {
-	pgoff_t end = index + nr_pages - 1;
-	void *entry;
-
 	if (vm_memory_attributes)
 		return kvm_range_has_vm_memory_attributes(kvm, gfn, gfn + nr_pages,
 						       KVM_MEMORY_ATTRIBUTE_PRIVATE,
 						       KVM_MEMORY_ATTRIBUTE_PRIVATE);
 
-	mt_for_each(&gi->attributes, entry, index, end) {
-		if (xa_to_value(entry) != attributes)
-			return false;
-	}
-
-	return true;
+	return kvm_gmem_range_has_attributes(&gi->attributes, index, nr_pages,
+					     KVM_MEMORY_ATTRIBUTE_PRIVATE);
 }
 
 long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long npages,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ad70101c2e3f..cf56cc892e7c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2451,16 +2451,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm,
 #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */
 
 #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
-static u64 kvm_supported_mem_attributes(struct kvm *kvm)
-{
-#ifdef kvm_arch_has_private_mem
-	if (!kvm || kvm_arch_has_private_mem(kvm))
-		return KVM_MEMORY_ATTRIBUTE_PRIVATE;
-#endif
-
-	return 0;
-}
-
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 static unsigned long kvm_get_vm_memory_attributes(struct kvm *kvm, gfn_t gfn)
 {
@@ -4993,6 +4983,11 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 		return 1;
 	case KVM_CAP_GUEST_MEMFD_FLAGS:
 		return kvm_gmem_get_supported_flags(kvm);
+	case KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES:
+		if (vm_memory_attributes)
+			return 0;
+
+		return kvm_supported_mem_attributes(kvm);
 #endif
 	default:
 		break;
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 10/37] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (8 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 11/37] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86 Ackerley Tng
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

When checking if a guest_memfd folio is safe for conversion, its refcount
is examined. A folio may be present in a per-CPU lru_add fbatch, which
temporarily increases its refcount. This can lead to a false positive,
incorrectly indicating that the folio is in use and preventing the
conversion, even if it is otherwise safe. The conversion process might not
be on the same CPU that holds the folio in its fbatch, making a simple
per-CPU check insufficient.

To address this, drain all CPUs' lru_add fbatches if an unexpectedly high
refcount is encountered during the safety check. This is performed at most
once per conversion request.

guest_memfd folios are unevictable, so they can only reside in the lru_add
fbatch. If the folio's refcount is still unsafe after draining, then the
conversion is truly deemed unsafe.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1cd8024cdb39..a9d12abfacb5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -8,6 +8,7 @@
 #include <linux/mempolicy.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
+#include <linux/swap.h>
 
 #include "kvm_mm.h"
 
@@ -571,25 +572,34 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_get_memory_attributes);
 
-static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
-					    size_t nr_pages, pgoff_t *err_index)
+static bool kvm_gmem_is_safe_for_conversion(struct inode *inode,
+					    pgoff_t start, size_t nr_pages,
+					    pgoff_t *err_index)
 {
 	struct address_space *mapping = inode->i_mapping;
 	const int filemap_get_folios_refcount = 1;
 	pgoff_t last = start + nr_pages - 1;
 	struct folio_batch fbatch;
+	bool lru_drained = false;
 	bool safe = true;
 	int i;
 
 	folio_batch_init(&fbatch);
 	while (safe && filemap_get_folios(mapping, &start, last, &fbatch)) {
 
-		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+		for (i = 0; i < folio_batch_count(&fbatch);) {
 			struct folio *folio = fbatch.folios[i];
 
-			if (folio_ref_count(folio) !=
-			    folio_nr_pages(folio) + filemap_get_folios_refcount) {
-				safe = false;
+			safe = (folio_ref_count(folio) ==
+				folio_nr_pages(folio) +
+				filemap_get_folios_refcount);
+
+			if (safe) {
+				++i;
+			} else if (!lru_drained) {
+				lru_add_drain_all();
+				lru_drained = true;
+			} else {
 				*err_index = folio->index;
 				break;
 			}
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 11/37] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (9 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 10/37] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 12/37] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes Ackerley Tng
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Bury KVM_VM_MEMORY_ATTRIBUTES in x86 to discourage other architectures
from adding support for per-VM memory attributes, because tracking private
vs. shared memory on a per-VM basis is now deprecated in favor of tracking
on a per-guest_memfd basis, and no other memory attributes are on the
horizon.

This will also allow modifying KVM_VM_MEMORY_ATTRIBUTES to be
user-selectable (in x86) without creating weirdness in KVM's Kconfigs.
Now that guest_memfd support memory attributes, it's entirely possible to
run x86 CoCo VMs without support for KVM_VM_MEMORY_ATTRIBUTES.

Leave the code itself in common KVM so that it's trivial to undo this
change if new per-VM attributes do come along.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/Kconfig | 4 ++++
 virt/kvm/Kconfig     | 4 ----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 1683dbab870e..385f26da48ae 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -80,6 +80,10 @@ config KVM_WERROR
 
 	  If in doubt, say "N".
 
+config KVM_VM_MEMORY_ATTRIBUTES
+	select KVM_MEMORY_ATTRIBUTES
+	bool
+
 config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index f42bc6e7de44..6f6e7da895e3 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,10 +109,6 @@ config KVM_MEMORY_ATTRIBUTES
        depends on KVM_GENERIC_MMU_NOTIFIER
        bool
 
-config KVM_VM_MEMORY_ATTRIBUTES
-       select KVM_MEMORY_ATTRIBUTES
-       bool
-
 config KVM_GUEST_MEMFD
        select KVM_MEMORY_ATTRIBUTES
        select XARRAY_MULTI
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 12/37] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (10 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 11/37] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86 Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 13/37] KVM: selftests: Create gmem fd before "regular" fd when adding memslot Ackerley Tng
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Make vm_memory_attributes a module parameter so that userspace can disable
the use of memory attributes on the VM level.

To avoid inconsistencies in the way memory attributes are tracked in KVM
and guest_memfd, the vm_memory_attributes module_param is made
read-only (0444).

Make CONFIG_KVM_VM_MEMORY_ATTRIBUTES selectable, only for (CoCo) VM types
that might use vm_memory_attributes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 arch/x86/kvm/Kconfig | 13 +++++++++----
 virt/kvm/kvm_main.c  |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 385f26da48ae..fea786906599 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -82,13 +82,20 @@ config KVM_WERROR
 
 config KVM_VM_MEMORY_ATTRIBUTES
 	select KVM_MEMORY_ATTRIBUTES
-	bool
+	depends on KVM_SW_PROTECTED_VM || KVM_INTEL_TDX || KVM_AMD_SEV
+	bool "Enable per-VM memory attributes (for CoCo VMs)"
+	help
+	  Enable support for per-VM memory attributes, which are deprecated in
+	  favor of tracking memory attributes in guest_memfd.  Select this if
+	  you need to run CoCo VMs using a VMM that doesn't support guest_memfd
+	  memory attributes.
+
+	  If unsure, say N.
 
 config KVM_SW_PROTECTED_VM
 	bool "Enable support for KVM software-protected VMs"
 	depends on EXPERT
 	depends on KVM_X86 && X86_64
-	select KVM_VM_MEMORY_ATTRIBUTES
 	help
 	  Enable support for KVM software-protected VMs.  Currently, software-
 	  protected VMs are purely a development and testing vehicle for
@@ -139,7 +146,6 @@ config KVM_INTEL_TDX
 	bool "Intel Trust Domain Extensions (TDX) support"
 	default y
 	depends on INTEL_TDX_HOST
-	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_POPULATE
 	help
 	  Provides support for launching Intel Trust Domain Extensions (TDX)
@@ -163,7 +169,6 @@ config KVM_AMD_SEV
 	depends on KVM_AMD && X86_64
 	depends on CRYPTO_DEV_SP_PSP && !(KVM_AMD=y && CRYPTO_DEV_CCP_DD=m)
 	select ARCH_HAS_CC_PLATFORM
-	select KVM_VM_MEMORY_ATTRIBUTES
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
 	select HAVE_KVM_ARCH_GMEM_POPULATE
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index cf56cc892e7c..2226b4061bad 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -105,6 +105,7 @@ module_param(allow_unsafe_mappings, bool, 0444);
 #ifdef CONFIG_KVM_MEMORY_ATTRIBUTES
 #ifdef CONFIG_KVM_VM_MEMORY_ATTRIBUTES
 bool vm_memory_attributes = true;
+module_param(vm_memory_attributes, bool, 0444);
 #endif
 DEFINE_STATIC_CALL_RET0(__kvm_get_memory_attributes, kvm_get_memory_attributes_t);
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(STATIC_CALL_KEY(__kvm_get_memory_attributes));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 13/37] KVM: selftests: Create gmem fd before "regular" fd when adding memslot
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (11 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 12/37] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 14/37] KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset} Ackerley Tng
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

When adding a memslot associated a guest_memfd instance, create/dup the
guest_memfd before creating the "normal" backing file.  This will allow
dup'ing the gmem fd as the normal fd when guest_memfd supports mmap(),
i.e. to make guest_memfd the _only_ backing source for the memslot.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 45 +++++++++++-----------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 8279b6ced8d2..1d69baf900a2 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1028,6 +1028,29 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	if (alignment > 1)
 		region->mmap_size += alignment;
 
+	if (flags & KVM_MEM_GUEST_MEMFD) {
+		if (guest_memfd < 0) {
+			uint32_t guest_memfd_flags = 0;
+
+			TEST_ASSERT(!guest_memfd_offset,
+				    "Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+		} else {
+			/*
+			 * Install a unique fd for each memslot so that the fd
+			 * can be closed when the region is deleted without
+			 * needing to track if the fd is owned by the framework
+			 * or by the caller.
+			 */
+			guest_memfd = kvm_dup(guest_memfd);
+		}
+
+		region->region.guest_memfd = guest_memfd;
+		region->region.guest_memfd_offset = guest_memfd_offset;
+	} else {
+		region->region.guest_memfd = -1;
+	}
+
 	region->fd = -1;
 	if (backing_src_is_shared(src_type))
 		region->fd = kvm_memfd_alloc(region->mmap_size,
@@ -1057,28 +1080,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 
 	region->backing_src_type = src_type;
 
-	if (flags & KVM_MEM_GUEST_MEMFD) {
-		if (guest_memfd < 0) {
-			uint32_t guest_memfd_flags = 0;
-			TEST_ASSERT(!guest_memfd_offset,
-				    "Offset must be zero when creating new guest_memfd");
-			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
-		} else {
-			/*
-			 * Install a unique fd for each memslot so that the fd
-			 * can be closed when the region is deleted without
-			 * needing to track if the fd is owned by the framework
-			 * or by the caller.
-			 */
-			guest_memfd = kvm_dup(guest_memfd);
-		}
-
-		region->region.guest_memfd = guest_memfd;
-		region->region.guest_memfd_offset = guest_memfd_offset;
-	} else {
-		region->region.guest_memfd = -1;
-	}
-
 	region->unused_phy_pages = sparsebit_alloc();
 	if (vm_arch_has_protected_memory(vm))
 		region->protected_phy_pages = sparsebit_alloc();
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 14/37] KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (12 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 13/37] KVM: selftests: Create gmem fd before "regular" fd when adding memslot Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 15/37] KVM: selftests: Add support for mmap() on guest_memfd in core library Ackerley Tng
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Rename local variables and function parameters for the guest memory file
descriptor and its offset to use a "gmem_" prefix instead of
"guest_memfd_".

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 26 +++++++++++-----------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1d69baf900a2..6daeb4f945a0 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -914,7 +914,7 @@ void vm_set_user_memory_region(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 
 int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 				 uint64_t gpa, uint64_t size, void *hva,
-				 uint32_t guest_memfd, uint64_t guest_memfd_offset)
+				 uint32_t gmem_fd, uint64_t gmem_offset)
 {
 	struct kvm_userspace_memory_region2 region = {
 		.slot = slot,
@@ -922,8 +922,8 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
 		.guest_phys_addr = gpa,
 		.memory_size = size,
 		.userspace_addr = (uintptr_t)hva,
-		.guest_memfd = guest_memfd,
-		.guest_memfd_offset = guest_memfd_offset,
+		.guest_memfd = gmem_fd,
+		.guest_memfd_offset = gmem_offset,
 	};
 
 	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
@@ -933,10 +933,10 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
 
 void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags,
 				uint64_t gpa, uint64_t size, void *hva,
-				uint32_t guest_memfd, uint64_t guest_memfd_offset)
+				uint32_t gmem_fd, uint64_t gmem_offset)
 {
 	int ret = __vm_set_user_memory_region2(vm, slot, flags, gpa, size, hva,
-					       guest_memfd, guest_memfd_offset);
+					       gmem_fd, gmem_offset);
 
 	TEST_ASSERT(!ret, "KVM_SET_USER_MEMORY_REGION2 failed, errno = %d (%s)",
 		    errno, strerror(errno));
@@ -946,7 +946,7 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t gpa, uint32_t slot, uint64_t npages, uint32_t flags,
-		int guest_memfd, uint64_t guest_memfd_offset)
+		int gmem_fd, uint64_t gmem_offset)
 {
 	int ret;
 	struct userspace_mem_region *region;
@@ -1029,12 +1029,12 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		region->mmap_size += alignment;
 
 	if (flags & KVM_MEM_GUEST_MEMFD) {
-		if (guest_memfd < 0) {
-			uint32_t guest_memfd_flags = 0;
+		if (gmem_fd < 0) {
+			uint32_t gmem_flags = 0;
 
-			TEST_ASSERT(!guest_memfd_offset,
+			TEST_ASSERT(!gmem_offset,
 				    "Offset must be zero when creating new guest_memfd");
-			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
+			gmem_fd = vm_create_guest_memfd(vm, mem_size, gmem_flags);
 		} else {
 			/*
 			 * Install a unique fd for each memslot so that the fd
@@ -1042,11 +1042,11 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 			 * needing to track if the fd is owned by the framework
 			 * or by the caller.
 			 */
-			guest_memfd = kvm_dup(guest_memfd);
+			gmem_fd = kvm_dup(gmem_fd);
 		}
 
-		region->region.guest_memfd = guest_memfd;
-		region->region.guest_memfd_offset = guest_memfd_offset;
+		region->region.guest_memfd = gmem_fd;
+		region->region.guest_memfd_offset = gmem_offset;
 	} else {
 		region->region.guest_memfd = -1;
 	}
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 15/37] KVM: selftests: Add support for mmap() on guest_memfd in core library
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (13 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 14/37] KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset} Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 16/37] KVM: selftests: Add selftests global for guest memory attributes capability Ackerley Tng
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Accept gmem_flags in vm_mem_add() to be able to create a guest_memfd within
vm_mem_add().

When vm_mem_add() is used to set up a guest_memfd for a memslot, set up the
provided (or created) gmem_fd as the fd for the user memory region. This
makes it available to be mmap()-ed from just like fds from other memory
sources. mmap() from guest_memfd using the provided gmem_flags and
gmem_offset.

Add a kvm_slot_to_fd() helper to provide convenient access to the file
descriptor of a memslot.

Update existing callers of vm_mem_add() to pass 0 for gmem_flags to
preserve existing behavior.

Signed-off-by: Sean Christopherson <seanjc@google.com>
[For guest_memfds, mmap() using gmem_offset instead of 0 all the time.]
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  |  7 ++++++-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 19 +++++++++++--------
 .../kvm/x86/private_mem_conversions_test.c    |  2 +-
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 81f4355ff28a..a64aae271a6a 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -678,7 +678,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 				 uint32_t flags);
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t gpa, uint32_t slot, uint64_t npages, uint32_t flags,
-		int guest_memfd_fd, uint64_t guest_memfd_offset);
+		int gmem_fd, uint64_t gmem_offset, uint64_t gmem_flags);
 
 #ifndef vm_arch_has_protected_memory
 static inline bool vm_arch_has_protected_memory(struct kvm_vm *vm)
@@ -712,6 +712,11 @@ void *addr_gva2hva(struct kvm_vm *vm, vm_vaddr_t gva);
 vm_paddr_t addr_hva2gpa(struct kvm_vm *vm, void *hva);
 void *addr_gpa2alias(struct kvm_vm *vm, vm_paddr_t gpa);
 
+static inline int kvm_slot_to_fd(struct kvm_vm *vm, uint32_t slot)
+{
+	return memslot2region(vm, slot)->fd;
+}
+
 #ifndef vcpu_arch_put_guest
 #define vcpu_arch_put_guest(mem, val) do { (mem) = (val); } while (0)
 #endif
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 6daeb4f945a0..ce2b0273b26c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -946,12 +946,13 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags
 /* FIXME: This thing needs to be ripped apart and rewritten. */
 void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 		uint64_t gpa, uint32_t slot, uint64_t npages, uint32_t flags,
-		int gmem_fd, uint64_t gmem_offset)
+		int gmem_fd, uint64_t gmem_offset, uint64_t gmem_flags)
 {
 	int ret;
 	struct userspace_mem_region *region;
 	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
 	size_t mem_size = npages * vm->page_size;
+	off_t mmap_offset = 0;
 	size_t alignment;
 
 	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
@@ -1030,8 +1031,6 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 
 	if (flags & KVM_MEM_GUEST_MEMFD) {
 		if (gmem_fd < 0) {
-			uint32_t gmem_flags = 0;
-
 			TEST_ASSERT(!gmem_offset,
 				    "Offset must be zero when creating new guest_memfd");
 			gmem_fd = vm_create_guest_memfd(vm, mem_size, gmem_flags);
@@ -1052,13 +1051,17 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 	}
 
 	region->fd = -1;
-	if (backing_src_is_shared(src_type))
+	if (flags & KVM_MEM_GUEST_MEMFD && gmem_flags & GUEST_MEMFD_FLAG_MMAP) {
+		region->fd = kvm_dup(gmem_fd);
+		mmap_offset = gmem_offset;
+	} else if (backing_src_is_shared(src_type)) {
 		region->fd = kvm_memfd_alloc(region->mmap_size,
 					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
+	}
 
-	region->mmap_start = kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
-				      vm_mem_backing_src_alias(src_type)->flag,
-				      region->fd);
+	region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
+					vm_mem_backing_src_alias(src_type)->flag,
+					region->fd, mmap_offset);
 
 	TEST_ASSERT(!is_backing_src_hugetlb(src_type) ||
 		    region->mmap_start == align_ptr_up(region->mmap_start, backing_src_pagesz),
@@ -1119,7 +1122,7 @@ void vm_userspace_mem_region_add(struct kvm_vm *vm,
 				 uint64_t gpa, uint32_t slot, uint64_t npages,
 				 uint32_t flags)
 {
-	vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0);
+	vm_mem_add(vm, src_type, gpa, slot, npages, flags, -1, 0, 0);
 }
 
 /*
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 1969f4ab9b28..41f6b38f0407 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -399,7 +399,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
 			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
+			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
 
 	for (i = 0; i < nr_vcpus; i++) {
 		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 16/37] KVM: selftests: Add selftests global for guest memory attributes capability
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (14 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 15/37] KVM: selftests: Add support for mmap() on guest_memfd in core library Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 17/37] KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Add a global variable, kvm_has_gmem_attributes, to make the result of
checking for KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES available to all tests.

kvm_has_gmem_attributes is true if guest_memfd tracks memory attributes, as
opposed to VM-level tracking.

This global variable is synced to the guest for testing convenience, to
avoid introducing subtle bugs when host/guest state is desynced.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/test_util.h | 2 ++
 tools/testing/selftests/kvm/lib/kvm_util.c      | 5 +++++
 2 files changed, 7 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index b4872ba8ed12..2871a4292847 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -113,6 +113,8 @@ struct guest_random_state {
 extern uint32_t guest_random_seed;
 extern struct guest_random_state guest_rng;
 
+extern bool kvm_has_gmem_attributes;
+
 struct guest_random_state new_guest_random_state(uint32_t seed);
 uint32_t guest_random_u32(struct guest_random_state *state);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index ce2b0273b26c..4f464ad8dffd 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -24,6 +24,8 @@ uint32_t guest_random_seed;
 struct guest_random_state guest_rng;
 static uint32_t last_guest_seed;
 
+bool kvm_has_gmem_attributes;
+
 static size_t vcpu_mmap_sz(void);
 
 int __open_path_or_exit(const char *path, int flags, const char *enoent_help)
@@ -488,6 +490,7 @@ struct kvm_vm *__vm_create(struct vm_shape shape, uint32_t nr_runnable_vcpus,
 	}
 	guest_rng = new_guest_random_state(guest_random_seed);
 	sync_global_to_guest(vm, guest_rng);
+	sync_global_to_guest(vm, kvm_has_gmem_attributes);
 
 	kvm_arch_vm_post_create(vm, nr_runnable_vcpus);
 
@@ -2332,6 +2335,8 @@ void __attribute((constructor)) kvm_selftest_init(void)
 	guest_random_seed = last_guest_seed = random();
 	pr_info("Random seed: 0x%x\n", guest_random_seed);
 
+	kvm_has_gmem_attributes = kvm_has_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES);
+
 	kvm_selftest_arch_init();
 }
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 17/37] KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (15 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 16/37] KVM: selftests: Add selftests global for guest memory attributes capability Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 18/37] KVM: selftests: Add helpers for calling ioctls on guest_memfd Ackerley Tng
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Update KVM selftest framework to use KVM_SET_MEMORY_ATTRIBUTES2 and the
accompanying struct kvm_memory_attributes2.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index a64aae271a6a..872988197a5c 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -397,7 +397,7 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 					    uint64_t size, uint64_t attributes)
 {
-	struct kvm_memory_attributes attr = {
+	struct kvm_memory_attributes2 attr = {
 		.attributes = attributes,
 		.address = gpa,
 		.size = size,
@@ -405,13 +405,16 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 	};
 
 	/*
-	 * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows
+	 * KVM_SET_MEMORY_ATTRIBUTES2 overwrites _all_ attributes.  These flows
 	 * need significant enhancements to support multiple attributes.
 	 */
 	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
 		    "Update me to support multiple attributes!");
 
-	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+	__TEST_REQUIRE(kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES2) > 0,
+		       "No valid attributes for VM fd ioctl!");
+
+	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
 }
 
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 18/37] KVM: selftests: Add helpers for calling ioctls on guest_memfd
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (16 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 17/37] KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 19/37] KVM: selftests: Test using guest_memfd for guest private memory Ackerley Tng
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Add helper functions to kvm_util.h to support calling ioctls, specifically
KVM_SET_MEMORY_ATTRIBUTES2, on a guest_memfd file descriptor.

Introduce gmem_ioctl() and __gmem_ioctl() macros, modeled after the
existing vm_ioctl() helpers, to provide a standard way to call ioctls
on a guest_memfd.

Add gmem_set_memory_attributes() and its derivatives (gmem_set_private(),
gmem_set_shared()) to set memory attributes on a guest_memfd region.
Also provide "__" variants that return the ioctl error code instead of
aborting the test. These helpers will be used by upcoming guest_memfd
tests.

To avoid code duplication, factor out the check for supported memory
attributes into a new macro, TEST_ASSERT_SUPPORTED_ATTRIBUTES, and use
it in both the existing vm_set_memory_attributes() and the new
gmem_set_memory_attributes() helpers.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 89 +++++++++++++++++--
 1 file changed, 82 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 872988197a5c..e767f9a99a7b 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -310,6 +310,16 @@ static inline bool kvm_has_cap(long cap)
 	TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret));	\
 })
 
+#define __gmem_ioctl(gmem_fd, cmd, arg)				\
+	kvm_do_ioctl(gmem_fd, cmd, arg)
+
+#define gmem_ioctl(gmem_fd, cmd, arg)				\
+({								\
+	int ret = __gmem_ioctl(gmem_fd, cmd, arg);		\
+								\
+	TEST_ASSERT(!ret, __KVM_IOCTL_ERROR(#cmd, ret));	\
+})
+
 static __always_inline void static_assert_is_vm(struct kvm_vm *vm) { }
 
 #define __vm_ioctl(vm, cmd, arg)				\
@@ -394,6 +404,14 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+/*
+ * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows need
+ * significant enhancements to support multiple attributes.
+ */
+#define TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes)				\
+	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,	\
+		    "Update me to support multiple attributes!")
+
 static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 					    uint64_t size, uint64_t attributes)
 {
@@ -404,12 +422,7 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 		.flags = 0,
 	};
 
-	/*
-	 * KVM_SET_MEMORY_ATTRIBUTES2 overwrites _all_ attributes.  These flows
-	 * need significant enhancements to support multiple attributes.
-	 */
-	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
-		    "Update me to support multiple attributes!");
+	TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
 
 	__TEST_REQUIRE(kvm_check_cap(KVM_CAP_MEMORY_ATTRIBUTES2) > 0,
 		       "No valid attributes for VM fd ioctl!");
@@ -417,7 +430,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
 }
 
-
 static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
 				      uint64_t size)
 {
@@ -430,6 +442,69 @@ static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
 	vm_set_memory_attributes(vm, gpa, size, 0);
 }
 
+static inline int __gmem_set_memory_attributes(int fd, loff_t offset,
+					       uint64_t size,
+					       uint64_t attributes,
+					       loff_t *error_offset)
+{
+	struct kvm_memory_attributes2 attr = {
+		.attributes = attributes,
+		.offset = offset,
+		.size = size,
+		.flags = 0,
+	};
+	int r;
+
+	TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
+
+	r = __gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
+	if (r)
+		*error_offset = attr.error_offset;
+	return r;
+}
+
+static inline int __gmem_set_private(int fd, loff_t offset, uint64_t size,
+				     loff_t *error_offset)
+{
+	return __gmem_set_memory_attributes(fd, offset, size,
+					    KVM_MEMORY_ATTRIBUTE_PRIVATE,
+					    error_offset);
+}
+
+static inline int __gmem_set_shared(int fd, loff_t offset, uint64_t size,
+				    loff_t *error_offset)
+{
+	return __gmem_set_memory_attributes(fd, offset, size, 0, error_offset);
+}
+
+static inline void gmem_set_memory_attributes(int fd, loff_t offset,
+					      uint64_t size, uint64_t attributes)
+{
+	struct kvm_memory_attributes2 attr = {
+		.attributes = attributes,
+		.offset = offset,
+		.size = size,
+		.flags = 0,
+	};
+
+	TEST_ASSERT_SUPPORTED_ATTRIBUTES(attributes);
+
+	__TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0,
+		       "No valid attributes for guest_memfd ioctl!");
+
+	gmem_ioctl(fd, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
+}
+
+static inline void gmem_set_private(int fd, loff_t offset, uint64_t size)
+{
+	gmem_set_memory_attributes(fd, offset, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void gmem_set_shared(int fd, loff_t offset, uint64_t size)
+{
+	gmem_set_memory_attributes(fd, offset, size, 0);
+}
+
 void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
 			    bool punch_hole);
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 19/37] KVM: selftests: Test using guest_memfd for guest private memory
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (17 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 18/37] KVM: selftests: Add helpers for calling ioctls on guest_memfd Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 20/37] KVM: selftests: Test basic single-page conversion flow Ackerley Tng
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a selftest to verify that a memory region backed by a guest_memfd
can be used as private guest memory. This is a key use case for
confidential computing guests where the host should not have access to the
guest's memory contents.

The new test, test_guest_private_mem, creates a protected VM, maps a
guest_memfd into the guest's address space, and then marks the region as
private. The guest code then writes to and reads from this private memory
region to verify it is accessible.

To better distinguish between the test cases, rename the existing test
that verifies shared host/guest access from test_guest_memfd_guest to
test_guest_shared_mem.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/guest_memfd_test.c  | 57 +++++++++++++++++--
 1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 618c937f3c90..ecb0cbcacbec 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -406,7 +406,7 @@ static void test_guest_memfd(unsigned long vm_type)
 	kvm_vm_free(vm);
 }
 
-static void guest_code(uint8_t *mem, uint64_t size)
+static void guest_code_test_guest_shared_mem(uint8_t *mem, uint64_t size)
 {
 	size_t i;
 
@@ -418,7 +418,7 @@ static void guest_code(uint8_t *mem, uint64_t size)
 	GUEST_DONE();
 }
 
-static void test_guest_memfd_guest(void)
+static void test_guest_shared_mem(void)
 {
 	/*
 	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
@@ -437,7 +437,8 @@ static void test_guest_memfd_guest(void)
 	if (!kvm_check_cap(KVM_CAP_GUEST_MEMFD_FLAGS))
 		return;
 
-	vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, guest_code);
+	vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1,
+					     guest_code_test_guest_shared_mem);
 
 	TEST_ASSERT(vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS) & GUEST_MEMFD_FLAG_MMAP,
 		    "Default VM type should support MMAP, supported flags = 0x%x",
@@ -469,6 +470,53 @@ static void test_guest_memfd_guest(void)
 	kvm_vm_free(vm);
 }
 
+static void guest_code_test_guest_private_mem(uint8_t *mem)
+{
+	WRITE_ONCE(mem[0], 0xff);
+	GUEST_ASSERT_EQ(READ_ONCE(mem[0]), 0xff);
+
+	GUEST_DONE();
+}
+
+static void test_guest_private_mem(void)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+	/*
+	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
+	 * the guest's code, stack, and page tables, and low memory contains
+	 * the PCI hole and other MMIO regions that need to be avoided.
+	 */
+	const uint64_t gpa = SZ_4G;
+	const int slot = 1;
+
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	size_t npages;
+	int fd;
+
+	npages = page_size / getpagesize();
+	vm = __vm_create_shape_with_one_vcpu(shape, &vcpu, npages,
+					     guest_code_test_guest_private_mem);
+
+	fd = vm_create_guest_memfd(vm, page_size, 0);
+	vm_mem_add(vm, VM_MEM_SRC_SHMEM, gpa, slot, npages, KVM_MEM_GUEST_MEMFD,
+		   fd, 0, 0);
+
+	virt_map(vm, gpa, gpa, npages);
+	vm_mem_set_private(vm, gpa, page_size);
+
+	vcpu_args_set(vcpu, 1, gpa);
+	vcpu_run(vcpu);
+
+	TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE);
+
+	close(fd);
+	kvm_vm_free(vm);
+}
+
 int main(int argc, char *argv[])
 {
 	unsigned long vm_types, vm_type;
@@ -488,5 +536,6 @@ int main(int argc, char *argv[])
 	for_each_set_bit(vm_type, &vm_types, BITS_PER_TYPE(vm_types))
 		test_guest_memfd(vm_type);
 
-	test_guest_memfd_guest();
+	test_guest_shared_mem();
+	test_guest_private_mem();
 }
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 20/37] KVM: selftests: Test basic single-page conversion flow
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (18 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 19/37] KVM: selftests: Test using guest_memfd for guest private memory Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 21/37] KVM: selftests: Test conversion flow when INIT_SHARED Ackerley Tng
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a selftest for the guest_memfd memory attribute conversion ioctls.
The test starts the guest_memfd as all-private (the default state), and
verifies the basic flow of converting a single page to shared and then back
to private.

Add infrastructure that supports extensions to other conversion flow
tests. This infrastructure will be used in upcoming patches for other
conversion tests.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/guest_memfd_conversions_test.c        | 199 ++++++++++++++++++
 2 files changed, 200 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index ba5c2b643efa..1cbb20d51b01 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -143,6 +143,7 @@ TEST_GEN_PROGS_x86 += access_tracking_perf_test
 TEST_GEN_PROGS_x86 += coalesced_io_test
 TEST_GEN_PROGS_x86 += dirty_log_perf_test
 TEST_GEN_PROGS_x86 += guest_memfd_test
+TEST_GEN_PROGS_x86 += guest_memfd_conversions_test
 TEST_GEN_PROGS_x86 += hardware_disable_test
 TEST_GEN_PROGS_x86 += memslot_modification_stress_test
 TEST_GEN_PROGS_x86 += memslot_perf_test
diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
new file mode 100644
index 000000000000..48265215f218
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, Google LLC.
+ */
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include <linux/align.h>
+#include <linux/kvm.h>
+#include <linux/sizes.h>
+
+#include "kvm_util.h"
+#include "kselftest_harness.h"
+#include "test_util.h"
+#include "ucall_common.h"
+
+FIXTURE(gmem_conversions) {
+	struct kvm_vcpu *vcpu;
+	int gmem_fd;
+	/* HVA of the first byte of the memory mmap()-ed from gmem_fd. */
+	char *mem;
+};
+
+typedef FIXTURE_DATA(gmem_conversions) test_data_t;
+
+FIXTURE_SETUP(gmem_conversions) { }
+
+static uint64_t page_size;
+
+static void guest_do_rmw(void);
+#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
+
+/*
+ * Defer setup until the individual test is invoked so that tests can specify
+ * the number of pages and flags for the guest_memfd instance.
+ */
+static void gmem_conversions_do_setup(test_data_t *t, int nr_pages,
+				      int gmem_flags)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+	/*
+	 * Use high GPA above APIC_DEFAULT_PHYS_BASE to avoid clashing with
+	 * APIC_DEFAULT_PHYS_BASE.
+	 */
+	const uint64_t gpa = SZ_4G;
+	const uint32_t slot = 1;
+	struct kvm_vm *vm;
+
+	vm = __vm_create_shape_with_one_vcpu(shape, &t->vcpu, nr_pages, guest_do_rmw);
+
+	vm_mem_add(vm, VM_MEM_SRC_SHMEM, gpa, slot, nr_pages,
+		   KVM_MEM_GUEST_MEMFD, -1, 0, gmem_flags);
+
+	t->gmem_fd = kvm_slot_to_fd(vm, slot);
+	t->mem = addr_gpa2hva(vm, gpa);
+	virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, gpa, nr_pages);
+}
+
+static void gmem_conversions_do_teardown(test_data_t *t)
+{
+	/* No need to close gmem_fd, it's owned by the VM structure. */
+	kvm_vm_free(t->vcpu->vm);
+}
+
+FIXTURE_TEARDOWN(gmem_conversions)
+{
+	gmem_conversions_do_teardown(self);
+}
+
+/*
+ * In these test definition macros, __nr_pages and nr_pages is used to set up
+ * the total number of pages in the guest_memfd under test. This will be
+ * available in the test definitions as nr_pages.
+ */
+
+#define __GMEM_CONVERSION_TEST(test, __nr_pages, flags)				\
+static void __gmem_conversions_##test(test_data_t *t, int nr_pages);		\
+										\
+TEST_F(gmem_conversions, test)							\
+{										\
+	gmem_conversions_do_setup(self, __nr_pages, flags);			\
+	__gmem_conversions_##test(self, __nr_pages);				\
+}										\
+static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
+
+#define GMEM_CONVERSION_TEST(test, __nr_pages, flags)				\
+	__GMEM_CONVERSION_TEST(test, __nr_pages, (flags) | GUEST_MEMFD_FLAG_MMAP)
+
+#define __GMEM_CONVERSION_TEST_INIT_PRIVATE(test, __nr_pages)			\
+	GMEM_CONVERSION_TEST(test, __nr_pages, 0)
+
+#define GMEM_CONVERSION_TEST_INIT_PRIVATE(test)					\
+	__GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
+
+struct guest_check_data {
+	void *mem;
+	char expected_val;
+	char write_val;
+};
+static struct guest_check_data guest_data;
+
+static void guest_do_rmw(void)
+{
+	for (;;) {
+		char *mem = READ_ONCE(guest_data.mem);
+
+		GUEST_ASSERT_EQ(READ_ONCE(*mem), READ_ONCE(guest_data.expected_val));
+		WRITE_ONCE(*mem, READ_ONCE(guest_data.write_val));
+
+		GUEST_SYNC(0);
+	}
+}
+
+static void run_guest_do_rmw(struct kvm_vcpu *vcpu, loff_t pgoff,
+			     char expected_val, char write_val)
+{
+	struct ucall uc;
+	int r;
+
+	guest_data.mem = (void *)GUEST_MEMFD_SHARING_TEST_GVA + pgoff * page_size;
+	guest_data.expected_val = expected_val;
+	guest_data.write_val = write_val;
+	sync_global_to_guest(vcpu->vm, guest_data);
+
+	do {
+		r = __vcpu_run(vcpu);
+	} while (r == -1 && errno == EINTR);
+
+	TEST_ASSERT_EQ(r, 0);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+	case UCALL_SYNC:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall %lu", uc.cmd);
+	}
+}
+
+static void host_do_rmw(char *mem, loff_t pgoff, char expected_val,
+			char write_val)
+{
+	TEST_ASSERT_EQ(READ_ONCE(mem[pgoff * page_size]), expected_val);
+	WRITE_ONCE(mem[pgoff * page_size], write_val);
+}
+
+static void test_private(test_data_t *t, loff_t pgoff, char starting_val,
+			 char write_val)
+{
+	TEST_EXPECT_SIGBUS(WRITE_ONCE(t->mem[pgoff * page_size], write_val));
+	run_guest_do_rmw(t->vcpu, pgoff, starting_val, write_val);
+	TEST_EXPECT_SIGBUS(READ_ONCE(t->mem[pgoff * page_size]));
+}
+
+static void test_convert_to_private(test_data_t *t, loff_t pgoff,
+				    char starting_val, char write_val)
+{
+	gmem_set_private(t->gmem_fd, pgoff * page_size, page_size);
+	test_private(t, pgoff, starting_val, write_val);
+}
+
+static void test_shared(test_data_t *t, loff_t pgoff, char starting_val,
+			char host_write_val, char write_val)
+{
+	host_do_rmw(t->mem, pgoff, starting_val, host_write_val);
+	run_guest_do_rmw(t->vcpu, pgoff, host_write_val, write_val);
+	TEST_ASSERT_EQ(READ_ONCE(t->mem[pgoff * page_size]), write_val);
+}
+
+static void test_convert_to_shared(test_data_t *t, loff_t pgoff,
+				   char starting_val, char host_write_val,
+				   char write_val)
+{
+	gmem_set_shared(t->gmem_fd, pgoff * page_size, page_size);
+	test_shared(t, pgoff, starting_val, host_write_val, write_val);
+}
+
+GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
+{
+	test_private(t, 0, 0, 'A');
+	test_convert_to_shared(t, 0, 'A', 'B', 'C');
+	test_convert_to_private(t, 0, 'C', 'E');
+}
+
+
+int main(int argc, char *argv[])
+{
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) &
+		     KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	page_size = getpagesize();
+
+	return test_harness_run(argc, argv);
+}
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 21/37] KVM: selftests: Test conversion flow when INIT_SHARED
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (19 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 20/37] KVM: selftests: Test basic single-page conversion flow Ackerley Tng
@ 2026-02-02 22:29 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 22/37] KVM: selftests: Test indexing in guest_memfd Ackerley Tng
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a test case to verify that conversions between private and shared
memory work correctly when the memory is initially created as shared.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/guest_memfd_conversions_test.c     | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 48265215f218..438937980f04 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -95,6 +95,12 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
 #define GMEM_CONVERSION_TEST_INIT_PRIVATE(test)					\
 	__GMEM_CONVERSION_TEST_INIT_PRIVATE(test, 1)
 
+#define __GMEM_CONVERSION_TEST_INIT_SHARED(test, __nr_pages)			\
+	GMEM_CONVERSION_TEST(test, __nr_pages, GUEST_MEMFD_FLAG_INIT_SHARED)
+
+#define GMEM_CONVERSION_TEST_INIT_SHARED(test)					\
+	__GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
+
 struct guest_check_data {
 	void *mem;
 	char expected_val;
@@ -186,6 +192,12 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(init_private)
 	test_convert_to_private(t, 0, 'C', 'E');
 }
 
+GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
+{
+	test_shared(t, 0, 0, 'A', 'B');
+	test_convert_to_private(t, 0, 'B', 'C');
+	test_convert_to_shared(t, 0, 'C', 'D', 'E');
+}
 
 int main(int argc, char *argv[])
 {
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 22/37] KVM: selftests: Test indexing in guest_memfd
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (20 preceding siblings ...)
  2026-02-02 22:29 ` [RFC PATCH v2 21/37] KVM: selftests: Test conversion flow when INIT_SHARED Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 23/37] KVM: selftests: Test conversion before allocation Ackerley Tng
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

The existing guest_memfd conversion tests only use single-page memory
regions. This provides no coverage for multi-page guest_memfd objects,
specifically whether KVM correctly handles the page index for conversion
operations. An incorrect implementation could, for example, always operate
on the first page regardless of the index provided.

Add a new test case to verify that conversions between private and shared
memory correctly target the specified page within a multi-page guest_memfd.

To support this test, add a new GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED
macro that handles setting up and tearing down the VM for each page
iteration. The teardown logic is adjusted to prevent a double-free in this
new scenario.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 56 +++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 438937980f04..8044581d5e5e 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -63,6 +63,9 @@ static void gmem_conversions_do_teardown(test_data_t *t)
 {
 	/* No need to close gmem_fd, it's owned by the VM structure. */
 	kvm_vm_free(t->vcpu->vm);
+
+	/* NULL this out to avoid second free on full teardown in multipage tests. */
+	t->vcpu->vm = NULL;
 }
 
 FIXTURE_TEARDOWN(gmem_conversions)
@@ -101,6 +104,29 @@ static void __gmem_conversions_##test(test_data_t *t, int nr_pages)		\
 #define GMEM_CONVERSION_TEST_INIT_SHARED(test)					\
 	__GMEM_CONVERSION_TEST_INIT_SHARED(test, 1)
 
+/*
+ * Repeats test over nr_pages in a guest_memfd of size nr_pages, providing each
+ * test iteration with test_page, the index of the page under test in
+ * guest_memfd. test_page takes values 0..(nr_pages - 1) inclusive.
+ */
+#define GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(test, __nr_pages)		\
+static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages,	\
+						const int test_page);		\
+										\
+TEST_F(gmem_conversions, test)							\
+{										\
+	const uint64_t flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED; \
+	int i;									\
+										\
+	for (i = 0; i < __nr_pages; ++i) {					\
+		gmem_conversions_do_setup(self, __nr_pages, flags);		\
+		__gmem_conversions_multipage_##test(self, __nr_pages, i);	\
+		gmem_conversions_do_teardown(self);				\
+	}									\
+}										\
+static void __gmem_conversions_multipage_##test(test_data_t *t, int nr_pages,	\
+						const int test_page)
+
 struct guest_check_data {
 	void *mem;
 	char expected_val;
@@ -199,6 +225,36 @@ GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
 	test_convert_to_shared(t, 0, 'C', 'D', 'E');
 }
 
+/*
+ * Test indexing of pages within guest_memfd, using test data that is a multiple
+ * of page index.
+ */
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
+{
+	int i;
+
+	/*
+	 * Start with the highest index, to catch any errors when, perhaps, the
+	 * first page is returned even for the last index.
+	 */
+	for (i = nr_pages - 1; i >= 0; --i)
+		test_shared(t, i, 0, i, i * 2);
+
+	for (i = 0; i < nr_pages; ++i) {
+		if (i == test_page)
+			test_convert_to_private(t, i, i * 2, i * 4);
+		else
+			test_shared(t, i, i * 2, i * 3, i * 4);
+	}
+
+	for (i = 0; i < nr_pages; ++i) {
+		if (i == test_page)
+			test_convert_to_shared(t, i, i * 4, i * 5, i * 6);
+		else
+			test_shared(t, i, i * 4, i * 5, i * 6);
+	}
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 23/37] KVM: selftests: Test conversion before allocation
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (21 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 22/37] KVM: selftests: Test indexing in guest_memfd Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 24/37] KVM: selftests: Convert with allocated folios in different layouts Ackerley Tng
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add two test cases to the guest_memfd conversions selftest to cover
the scenario where a conversion is requested before any memory has been
allocated in the guest_memfd region.

The KVM_MEMORY_CONVERT_GUEST ioctl can be called on a memory region at any
time. If the guest has not yet faulted in any pages for that region, the
kernel must record the conversion request and apply the requested state
when the pages are eventually allocated.

The new tests cover both conversion directions.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/guest_memfd_conversions_test.c   | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 8044581d5e5e..b48aa5d9f8cd 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -255,6 +255,20 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
 	}
 }
 
+/*
+ * Test that even if there are no folios yet, conversion requests are recorded
+ * in guest_memfd.
+ */
+GMEM_CONVERSION_TEST_INIT_SHARED(before_allocation_shared)
+{
+	test_convert_to_private(t, 0, 0, 'A');
+}
+
+GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
+{
+	test_convert_to_shared(t, 0, 0, 'A', 'B');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 24/37] KVM: selftests: Convert with allocated folios in different layouts
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (22 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 23/37] KVM: selftests: Test conversion before allocation Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 25/37] KVM: selftests: Test precision of conversion Ackerley Tng
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a guest_memfd selftest to verify that memory conversions work
correctly with allocated folios in different layouts.

By iterating through which pages are initially faulted, the test covers
various layouts of contiguous allocated and unallocated regions, exercising
conversion with different range layouts.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 30 +++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index b48aa5d9f8cd..9dc47316112f 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -269,6 +269,36 @@ GMEM_CONVERSION_TEST_INIT_PRIVATE(before_allocation_private)
 	test_convert_to_shared(t, 0, 0, 'A', 'B');
 }
 
+/*
+ * Test that when some of the folios in the conversion range are allocated,
+ * conversion requests are handled correctly in guest_memfd.  Vary the ranges
+ * allocated before conversion, using test_page, to cover various layouts of
+ * contiguous allocated and unallocated regions.
+ */
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
+{
+	const int second_page_to_fault = 4;
+	int i;
+
+	/*
+	 * Fault 2 of the pages to test filemap range operations except when
+	 * test_page == second_page_to_fault.
+	 */
+	host_do_rmw(t->mem, test_page, 0, 'A');
+	if (test_page != second_page_to_fault)
+		host_do_rmw(t->mem, second_page_to_fault, 0, 'A');
+
+	gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
+	for (i = 0; i < nr_pages; ++i) {
+		char expected = (i == test_page || i == second_page_to_fault) ? 'A' : 0;
+
+		test_private(t, i, expected, 'B');
+	}
+
+	for (i = 0; i < nr_pages; ++i)
+		test_convert_to_shared(t, i, 'B', 'C', 'D');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 25/37] KVM: selftests: Test precision of conversion
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (23 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 24/37] KVM: selftests: Convert with allocated folios in different layouts Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 26/37] KVM: selftests: Test that truncation does not change shared/private status Ackerley Tng
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Enhance the guest_memfd indexing selftest to also verify the precision of
memory conversions between private and shared.

The existing test converted a single page within a multi-page mapping but
did not explicitly check the state of the surrounding pages after the
conversion loop.

Add checks to confirm that converting a single page from shared to private
only affects the target page. Iterate through all other pages in the
guest_memfd region to ensure they remain in their original shared state,
thus verifying that the conversion operation is precise and does not have
unintended side effects.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/guest_memfd_conversions_test.c    | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 9dc47316112f..b109f078bc6b 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -227,7 +227,8 @@ GMEM_CONVERSION_TEST_INIT_SHARED(init_shared)
 
 /*
  * Test indexing of pages within guest_memfd, using test data that is a multiple
- * of page index.
+ * of page index.  Also test the precision of conversion, that it does not
+ * affect surrounding pages.
  */
 GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
 {
@@ -247,12 +248,20 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(indexing, 4)
 			test_shared(t, i, i * 2, i * 3, i * 4);
 	}
 
+	/* Confirm that only one page was converted */
 	for (i = 0; i < nr_pages; ++i) {
 		if (i == test_page)
-			test_convert_to_shared(t, i, i * 4, i * 5, i * 6);
+			test_private(t, i, i * 4, i * 6);
 		else
 			test_shared(t, i, i * 4, i * 5, i * 6);
 	}
+
+	for (i = 0; i < nr_pages; ++i) {
+		if (i == test_page)
+			test_convert_to_shared(t, i, i * 6, i * 7, i * 8);
+		else
+			test_shared(t, i, i * 6, i * 7, i * 8);
+	}
 }
 
 /*
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 26/37] KVM: selftests: Test that truncation does not change shared/private status
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (24 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 25/37] KVM: selftests: Test precision of conversion Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 27/37] KVM: selftests: Test that shared/private status is consistent across processes Ackerley Tng
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a test to verify that deallocating a page in a guest memfd region via
fallocate() with FALLOC_FL_PUNCH_HOLE does not alter the shared or private
status of the corresponding memory range.

When a page backing a guest memfd mapping is deallocated, e.g., by punching
a hole or truncating the file, and then subsequently faulted back in, the
new page must inherit the correct shared/private status tracked by
guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/guest_memfd_conversions_test.c   | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index b109f078bc6b..89881a71902e 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -10,6 +10,7 @@
 #include <linux/sizes.h>
 
 #include "kvm_util.h"
+#include "kvm_syscalls.h"
 #include "kselftest_harness.h"
 #include "test_util.h"
 #include "ucall_common.h"
@@ -308,6 +309,19 @@ GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(unallocated_folios, 8)
 		test_convert_to_shared(t, i, 'B', 'C', 'D');
 }
 
+/* Truncation should not affect shared/private status. */
+GMEM_CONVERSION_TEST_INIT_SHARED(truncate)
+{
+	host_do_rmw(t->mem, 0, 0, 'A');
+	kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
+	host_do_rmw(t->mem, 0, 0, 'A');
+
+	test_convert_to_private(t, 0, 'A', 'B');
+
+	kvm_fallocate(t->gmem_fd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0, page_size);
+	test_private(t, 0, 0, 'A');
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 27/37] KVM: selftests: Test that shared/private status is consistent across processes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (25 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 26/37] KVM: selftests: Test that truncation does not change shared/private status Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 28/37] KVM: selftests: Test conversion with elevated page refcount Ackerley Tng
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Add a test to verify that a guest_memfd's shared/private status is
consistent across processes, and that any shared pages previously mapped in
any process are unmapped from all processes.

The test forks a child process after creating the shared guest_memfd
region so that the second process exists alongside the main process for the
entire test.

The processes then take turns to access memory to check that the
shared/private status is consistent across processes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 74 +++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index 89881a71902e..c1a9cc7c9fae 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -322,6 +322,80 @@ GMEM_CONVERSION_TEST_INIT_SHARED(truncate)
 	test_private(t, 0, 0, 'A');
 }
 
+/* Test that shared/private memory protections work and are seen from any process. */
+GMEM_CONVERSION_TEST_INIT_SHARED(forked_accesses)
+{
+	/*
+	 * No races are intended in this test, shared memory is only used to
+	 * coordinate between processes.
+	 */
+	static enum {
+		STATE_INIT,
+		STATE_CHECK_SHARED,
+		STATE_DONE_CHECKING_SHARED,
+		STATE_CHECK_PRIVATE,
+		STATE_DONE_CHECKING_PRIVATE,
+	} *test_state;
+	pid_t child_pid;
+
+	test_state = kvm_mmap(sizeof(*test_state), PROT_READ | PROT_WRITE,
+			      MAP_SHARED | MAP_ANONYMOUS, -1);
+
+#define TEST_STATE_AWAIT(__state)						\
+	while (READ_ONCE(*test_state) != __state) {				\
+		if (child_pid != 0) {						\
+			int status;						\
+			pid_t pid;						\
+			do {							\
+				pid = waitpid(child_pid, &status, WNOHANG);	\
+			} while (pid == -1 && errno == EINTR);			\
+			if (pid == -1)						\
+				TEST_FAIL("Couldn't check child status.");	\
+			else if (pid != 0)					\
+				TEST_FAIL("Child exited prematurely.");		\
+		}								\
+	}
+
+#define TEST_STATE_SET(__state) WRITE_ONCE(*test_state, __state)
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		const char inconsequential = 0xdd;
+
+		TEST_STATE_AWAIT(STATE_CHECK_SHARED);
+
+		/*
+		 * This maps the pages into the child process as well, and tests
+		 * that the conversion process will unmap the guest_memfd memory
+		 * from all processes.
+		 */
+		host_do_rmw(t->mem, 0, 0xB, 0xC);
+
+		TEST_STATE_SET(STATE_DONE_CHECKING_SHARED);
+		TEST_STATE_AWAIT(STATE_CHECK_PRIVATE);
+
+		TEST_EXPECT_SIGBUS(READ_ONCE(t->mem[0]));
+		TEST_EXPECT_SIGBUS(WRITE_ONCE(t->mem[0], inconsequential));
+
+		TEST_STATE_SET(STATE_DONE_CHECKING_PRIVATE);
+		exit(0);
+	}
+
+	test_shared(t, 0, 0, 0xA, 0xB);
+
+	TEST_STATE_SET(STATE_CHECK_SHARED);
+	TEST_STATE_AWAIT(STATE_DONE_CHECKING_SHARED);
+
+	test_convert_to_private(t, 0, 0xC, 0xD);
+
+	TEST_STATE_SET(STATE_CHECK_PRIVATE);
+	TEST_STATE_AWAIT(STATE_DONE_CHECKING_PRIVATE);
+
+	kvm_munmap(test_state, sizeof(*test_state));
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 28/37] KVM: selftests: Test conversion with elevated page refcount
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (26 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 27/37] KVM: selftests: Test that shared/private status is consistent across processes Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 29/37] KVM: selftests: Reset shared memory after hole-punching Ackerley Tng
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a selftest to verify that converting a shared guest_memfd page to a
private page fails if the page has an elevated reference count.

When KVM converts a shared page to a private one, it expects the page to
have a reference count equal to the reference counts taken by the
filemap. If another kernel subsystem holds a reference to the page, for
example via pin_user_pages(), the conversion must be aborted.

This test uses vmsplice to increment the refcount of a specific page. The
reference is kept on the page by not reading data out from vmsplice's
destination pipe. It then attempts to convert a range of pages, including
the page with elevated refcount, from shared to private.

The test asserts that both bulk and single-page conversion attempts
correctly fail with EAGAIN for the pinned page. After the page is unpinned,
the test verifies that subsequent conversions succeed.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/guest_memfd_conversions_test.c        | 78 +++++++++++++++++++
 1 file changed, 78 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
index c1a9cc7c9fae..872747432545 100644
--- a/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_conversions_test.c
@@ -396,6 +396,84 @@ GMEM_CONVERSION_TEST_INIT_SHARED(forked_accesses)
 	kvm_munmap(test_state, sizeof(*test_state));
 }
 
+static int pin_pipe[2] = { -1, -1 };
+
+static void pin_pages(void *vaddr, uint64_t size)
+{
+	struct iovec iov = {
+		.iov_base = vaddr,
+		.iov_len = size,
+	};
+
+	if (pin_pipe[1] < 0)
+		TEST_ASSERT_EQ(pipe(pin_pipe), 0);
+
+	TEST_ASSERT_EQ(vmsplice(pin_pipe[1], &iov, 1, 0), size);
+}
+
+static void unpin_pages(void)
+{
+	close(pin_pipe[1]);
+	pin_pipe[1] = -1;
+	close(pin_pipe[0]);
+	pin_pipe[0] = -1;
+}
+
+static void test_convert_to_private_fails(test_data_t *t, loff_t pgoff,
+					  size_t nr_pages,
+					  loff_t expected_error_offset)
+{
+	loff_t offset = pgoff * page_size;
+	loff_t error_offset = -1ul;
+	int ret;
+
+	do {
+		ret = __gmem_set_private(t->gmem_fd, offset,
+					 nr_pages * page_size, &error_offset);
+	} while (ret == -1 && errno == EINTR);
+	TEST_ASSERT(ret == -1 && errno == EAGAIN,
+		    "Wanted EAGAIN on page %lu, got %d (ret = %d)", pgoff,
+		    errno, ret);
+	TEST_ASSERT_EQ(error_offset, expected_error_offset);
+}
+
+GMEM_CONVERSION_MULTIPAGE_TEST_INIT_SHARED(elevated_refcount, 4)
+{
+	int i;
+
+	pin_pages(t->mem + test_page * page_size, page_size);
+
+	for (i = 0; i < nr_pages; i++)
+		test_shared(t, i, 0, 'A', 'B');
+
+	/*
+	 * Converting in bulk should fail as long any page in the range has
+	 * unexpected refcounts.
+	 */
+	test_convert_to_private_fails(t, 0, nr_pages, test_page * page_size);
+
+	for (i = 0; i < nr_pages; i++) {
+		/*
+		 * Converting page-wise should also fail as long any page in the
+		 * range has unexpected refcounts.
+		 */
+		if (i == test_page)
+			test_convert_to_private_fails(t, i, 1, test_page * page_size);
+		else
+			test_convert_to_private(t, i, 'B', 'C');
+	}
+
+	unpin_pages();
+
+	gmem_set_private(t->gmem_fd, 0, nr_pages * page_size);
+
+	for (i = 0; i < nr_pages; i++) {
+		char expected = i == test_page ? 'B' : 'C';
+
+		test_private(t, i, expected, 'D');
+	}
+}
+
 int main(int argc, char *argv[])
 {
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 29/37] KVM: selftests: Reset shared memory after hole-punching
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (27 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 28/37] KVM: selftests: Test conversion with elevated page refcount Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 30/37] KVM: selftests: Provide function to look up guest_memfd details from gpa Ackerley Tng
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

private_mem_conversions_test used to reset the shared memory that was used
for the test to an initial pattern at the end of each test iteration. Then,
it would punch out the pages, which would zero memory.

Without in-place conversion, the resetting would write shared memory, and
hole-punching will zero private memory, hence resetting the test to the
state at the beginning of the for loop.

With in-place conversion, resetting writes memory as shared, and
hole-punching zeroes the same physical memory, hence undoing the reset
done before the hole punch.

Move the resetting after the hole-punching, and reset the entire
PER_CPU_DATA_SIZE instead of just the tested range.

With in-place conversion, this zeroes and then resets the same physical
memory. Without in-place conversion, the private memory is zeroed, and the
shared memory is reset to init_p.

This is sufficient since at each test stage, the memory is assumed to start
as shared, and private memory is always assumed to start zeroed. Conversion
zeroes memory, so the future test stages will work as expected.

Fixes: 43f623f350ce1 ("KVM: selftests: Add x86-only selftest for private memory conversions")
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/x86/private_mem_conversions_test.c     | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 41f6b38f0407..47f1eb921259 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -202,15 +202,18 @@ static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
 		guest_sync_shared(gpa, size, p3, p4);
 		memcmp_g(gpa, p4, size);
 
-		/* Reset the shared memory back to the initial pattern. */
-		memset((void *)gpa, init_p, size);
-
 		/*
 		 * Free (via PUNCH_HOLE) *all* private memory so that the next
 		 * iteration starts from a clean slate, e.g. with respect to
 		 * whether or not there are pages/folios in guest_mem.
 		 */
 		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+
+		/*
+		 * Hole-punching above zeroed private memory. Reset shared
+		 * memory in preparation for the next GUEST_STAGE.
+		 */
+		memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
 	}
 }
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 30/37] KVM: selftests: Provide function to look up guest_memfd details from gpa
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (28 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 29/37] KVM: selftests: Reset shared memory after hole-punching Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 31/37] KVM: selftests: Provide common function to set memory attributes Ackerley Tng
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Introduce a new helper, kvm_gpa_to_guest_memfd(), to find the
guest_memfd-related details of a memory region that contains a given guest
physical address (GPA).

The function returns the file descriptor for the memfd, the offset into
the file that corresponds to the GPA, and the number of bytes remaining
in the region from that GPA.

kvm_gpa_to_guest_memfd() was factored out from vm_guest_mem_fallocate();
refactor vm_guest_mem_fallocate() to use the new helper.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  |  3 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 34 ++++++++++++-------
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index e767f9a99a7b..b370b70442e8 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -404,6 +404,9 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
+int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, vm_paddr_t gpa, off_t *fd_offset,
+			   uint64_t *nr_bytes);
+
 /*
  * KVM_SET_MEMORY_ATTRIBUTES overwrites _all_ attributes.  These flows need
  * significant enhancements to support multiple attributes.
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 4f464ad8dffd..61adfd7e623c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1258,27 +1258,19 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
 			    bool punch_hole)
 {
 	const int mode = FALLOC_FL_KEEP_SIZE | (punch_hole ? FALLOC_FL_PUNCH_HOLE : 0);
-	struct userspace_mem_region *region;
 	uint64_t end = base + size;
 	uint64_t gpa, len;
 	off_t fd_offset;
-	int ret;
+	int fd, ret;
 
 	for (gpa = base; gpa < end; gpa += len) {
-		uint64_t offset;
-
-		region = userspace_mem_region_find(vm, gpa, gpa);
-		TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
-			    "Private memory region not found for GPA 0x%lx", gpa);
+		fd = kvm_gpa_to_guest_memfd(vm, gpa, &fd_offset, &len);
+		len = min(end - gpa, len);
 
-		offset = gpa - region->region.guest_phys_addr;
-		fd_offset = region->region.guest_memfd_offset + offset;
-		len = min_t(uint64_t, end - gpa, region->region.memory_size - offset);
-
-		ret = fallocate(region->region.guest_memfd, mode, fd_offset, len);
+		ret = fallocate(fd, mode, fd_offset, len);
 		TEST_ASSERT(!ret, "fallocate() failed to %s at %lx (len = %lu), fd = %d, mode = %x, offset = %lx",
 			    punch_hole ? "punch hole" : "allocate", gpa, len,
-			    region->region.guest_memfd, mode, fd_offset);
+			    fd, mode, fd_offset);
 	}
 }
 
@@ -1684,6 +1676,22 @@ void *addr_gpa2alias(struct kvm_vm *vm, vm_paddr_t gpa)
 	return (void *) ((uintptr_t) region->host_alias + offset);
 }
 
+int kvm_gpa_to_guest_memfd(struct kvm_vm *vm, vm_paddr_t gpa, off_t *fd_offset,
+			   uint64_t *nr_bytes)
+{
+	struct userspace_mem_region *region;
+	vm_paddr_t gpa_offset;
+
+	region = userspace_mem_region_find(vm, gpa, gpa);
+	TEST_ASSERT(region && region->region.flags & KVM_MEM_GUEST_MEMFD,
+		    "guest_memfd memory region not found for GPA 0x%lx", gpa);
+
+	gpa_offset = gpa - region->region.guest_phys_addr;
+	*fd_offset = region->region.guest_memfd_offset + gpa_offset;
+	*nr_bytes = region->region.memory_size - gpa_offset;
+	return region->region.guest_memfd;
+}
+
 /* Create an interrupt controller chip for the specified VM. */
 void vm_create_irqchip(struct kvm_vm *vm)
 {
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 31/37] KVM: selftests: Provide common function to set memory attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (29 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 30/37] KVM: selftests: Provide function to look up guest_memfd details from gpa Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 32/37] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot Ackerley Tng
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Introduce vm_mem_set_memory_attributes(), which handles setting of memory
attributes for a range of guest physical addresses, regardless of whether
the attributes should be set via guest_memfd or via the memory attributes
at the VM level.

Refactor existing vm_mem_set_{shared,private} functions to use the new
function.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/kvm_util.h  | 44 ++++++++++++++-----
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index b370b70442e8..33905f2ed5c9 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -433,18 +433,6 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES2, &attr);
 }
 
-static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
-				      uint64_t size)
-{
-	vm_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
-}
-
-static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
-				     uint64_t size)
-{
-	vm_set_memory_attributes(vm, gpa, size, 0);
-}
-
 static inline int __gmem_set_memory_attributes(int fd, loff_t offset,
 					       uint64_t size,
 					       uint64_t attributes,
@@ -508,6 +496,38 @@ static inline void gmem_set_shared(int fd, loff_t offset, uint64_t size)
 	gmem_set_memory_attributes(fd, offset, size, 0);
 }
 
+static inline void vm_mem_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+						uint64_t size, uint64_t attrs)
+{
+	if (kvm_has_gmem_attributes) {
+		uint64_t end = gpa + size;
+		uint64_t addr, len;
+		off_t fd_offset;
+		int fd;
+
+		for (addr = gpa; addr < end; addr += len) {
+			fd = kvm_gpa_to_guest_memfd(vm, addr, &fd_offset, &len);
+			len = min(end - addr, len);
+
+			gmem_set_memory_attributes(fd, fd_offset, len, attrs);
+		}
+	} else {
+		vm_set_memory_attributes(vm, gpa, size, attrs);
+	}
+}
+
+static inline void vm_mem_set_private(struct kvm_vm *vm, uint64_t gpa,
+				      uint64_t size)
+{
+	vm_mem_set_memory_attributes(vm, gpa, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
+				     uint64_t size)
+{
+	vm_mem_set_memory_attributes(vm, gpa, size, 0);
+}
+
 void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
 			    bool punch_hole);
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 32/37] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (30 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 31/37] KVM: selftests: Provide common function to set memory attributes Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe Ackerley Tng
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

From: Sean Christopherson <seanjc@google.com>

Check that a valid fd provided to mmap() must be accompanied by MAP_SHARED.

With an invalid fd (usually used for anonymous mappings), there are no
constraints on mmap() flags.

Add this check to make sure that when a guest_memfd is used as region->fd,
the flag provided to mmap() will include MAP_SHARED.

Signed-off-by: Sean Christopherson <seanjc@google.com>
[Rephrase assertion message.]
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/lib/kvm_util.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 61adfd7e623c..aec7b24418ab 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1062,6 +1062,9 @@ void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
 					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
 	}
 
+	TEST_ASSERT(region->fd == -1 || backing_src_is_shared(src_type),
+		    "A valid fd provided to mmap() must be accompanied by MAP_SHARED.");
+
 	region->mmap_start = __kvm_mmap(region->mmap_size, PROT_READ | PROT_WRITE,
 					vm_mem_backing_src_alias(src_type)->flag,
 					region->fd, mmap_offset);
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (31 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 32/37] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-14 19:49   ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 34/37] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd Ackerley Tng
                   ` (4 subsequent siblings)
  37 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

The TEST_EXPECT_SIGBUS macro is not thread-safe as it uses a global
sigjmp_buf and installs a global SIGBUS signal handler. If multiple threads
execute the macro concurrently, they will race on installing the signal
handler and stomp on other threads' jump buffers, leading to incorrect test
behavior.

Make TEST_EXPECT_SIGBUS thread-safe with the following changes:

Share the KVM tests' global signal handler. sigaction() applies to all
threads; without sharing a global signal handler, one thread may have
removed the signal handler that another thread added, hence leading to
unexpected signals.

The alternative of layering signal handlers was considered, but calling
sigaction() within TEST_EXPECT_SIGBUS() necessarily creates a race. To
avoid adding new setup and teardown routines to do sigaction() and keep
usage of TEST_EXPECT_SIGBUS() simple, share the KVM tests' global signal
handler.

Opportunistically rename report_unexpected_signal to
catchall_signal_handler.

To continue to only expect SIGBUS within specific regions of code, use a
thread-specific variable, expecting_sigbus, to replace installing and
removing signal handlers.

Make the execution environment for the thread, sigjmp_buf, a
thread-specific variable.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../testing/selftests/kvm/include/test_util.h | 29 +++++++++----------
 tools/testing/selftests/kvm/lib/kvm_util.c    | 18 ++++++++----
 tools/testing/selftests/kvm/lib/test_util.c   |  7 -----
 3 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 2871a4292847..0e4e6f7dab8f 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -80,22 +80,19 @@ do {									\
 	__builtin_unreachable(); \
 } while (0)
 
-extern sigjmp_buf expect_sigbus_jmpbuf;
-void expect_sigbus_handler(int signum);
-
-#define TEST_EXPECT_SIGBUS(action)						\
-do {										\
-	struct sigaction sa_old, sa_new = {					\
-		.sa_handler = expect_sigbus_handler,				\
-	};									\
-										\
-	sigaction(SIGBUS, &sa_new, &sa_old);					\
-	if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {				\
-		action;								\
-		TEST_FAIL("'%s' should have triggered SIGBUS", #action);	\
-	}									\
-	sigaction(SIGBUS, &sa_old, NULL);					\
-} while (0)
+extern __thread sigjmp_buf expect_sigbus_jmpbuf;
+extern __thread bool expecting_sigbus;
+
+#define TEST_EXPECT_SIGBUS(action)                                     \
+	do {                                                           \
+		expecting_sigbus = true;			       \
+		if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {         \
+			action;                                        \
+			TEST_FAIL("'%s' should have triggered SIGBUS", \
+				  #action);                            \
+		}                                                      \
+		expecting_sigbus = false;			       \
+	} while (0)
 
 size_t parse_size(const char *size);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index aec7b24418ab..18ced8bdde36 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2314,13 +2314,20 @@ __weak void kvm_selftest_arch_init(void)
 {
 }
 
-static void report_unexpected_signal(int signum)
+__thread sigjmp_buf expect_sigbus_jmpbuf;
+__thread bool expecting_sigbus;
+
+static void catchall_signal_handler(int signum)
 {
+	switch (signum) {
+	case SIGBUS: {
+		if (expecting_sigbus)
+			siglongjmp(expect_sigbus_jmpbuf, 1);
+
+		TEST_FAIL("Unexpected SIGBUS (%d)\n", signum);
+	}
 #define KVM_CASE_SIGNUM(sig)					\
 	case sig: TEST_FAIL("Unexpected " #sig " (%d)\n", signum)
-
-	switch (signum) {
-	KVM_CASE_SIGNUM(SIGBUS);
 	KVM_CASE_SIGNUM(SIGSEGV);
 	KVM_CASE_SIGNUM(SIGILL);
 	KVM_CASE_SIGNUM(SIGFPE);
@@ -2332,12 +2339,13 @@ static void report_unexpected_signal(int signum)
 void __attribute((constructor)) kvm_selftest_init(void)
 {
 	struct sigaction sig_sa = {
-		.sa_handler = report_unexpected_signal,
+		.sa_handler = catchall_signal_handler,
 	};
 
 	/* Tell stdout not to buffer its content. */
 	setbuf(stdout, NULL);
 
+	expecting_sigbus = false;
 	sigaction(SIGBUS, &sig_sa, NULL);
 	sigaction(SIGSEGV, &sig_sa, NULL);
 	sigaction(SIGILL, &sig_sa, NULL);
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 8a1848586a85..03eb99af9b8d 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -18,13 +18,6 @@
 
 #include "test_util.h"
 
-sigjmp_buf expect_sigbus_jmpbuf;
-
-void __attribute__((used)) expect_sigbus_handler(int signum)
-{
-	siglongjmp(expect_sigbus_jmpbuf, 1);
-}
-
 /*
  * Random number generator that is usable from guest code. This is the
  * Park-Miller LCG using standard constants.
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 34/37] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (32 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 35/37] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Update the private memory conversions selftest to also test conversions
that are done "in-place" via per-guest_memfd memory attributes. In-place
conversions require the host to be able to mmap() the guest_memfd so that
the host and guest can share the same backing physical memory.

This includes several updates, that are conditioned on the system
supporting per-guest_memfd attributes (kvm_has_gmem_attributes):

1. Set up guest_memfd requesting MMAP and INIT_SHARED.

2. With in-place conversions, the host's mapping points directly to the
   guest's memory. When the guest converts a region to private, host access
   to that region is blocked. Update the test to expect a SIGBUS when
   attempting to access the host virtual address (HVA) of private memory.

3. Use vm_mem_set_memory_attributes(), which chooses how to set memory
   attributes based on whether kvm_has_gmem_attributes.

Restrict the test to using VM_MEM_SRC_SHMEM because guest_memfd's required
mmap() flags and page sizes happens to align with those of
VM_MEM_SRC_SHMEM. As long as VM_MEM_SRC_SHMEM is used for src_type,
vm_mem_add() works as intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/private_mem_conversions_test.c    | 39 +++++++++++++++----
 1 file changed, 32 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
index 47f1eb921259..f85717662a73 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.c
@@ -307,8 +307,8 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
 		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
 
 	if (set_attributes)
-		vm_set_memory_attributes(vm, gpa, size,
-					 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+		vm_mem_set_memory_attributes(vm, gpa, size,
+					     map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
 	run->hypercall.ret = 0;
 }
 
@@ -352,8 +352,20 @@ static void *__test_mem_conversions(void *__vcpu)
 				size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
 				uint8_t *hva = addr_gpa2hva(vm, gpa + i);
 
-				/* In all cases, the host should observe the shared data. */
-				memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				/*
+				 * When using per-guest_memfd memory attributes,
+				 * i.e. in-place conversion, host accesses will
+				 * point at guest memory and should SIGBUS when
+				 * guest memory is private.  When using per-VM
+				 * attributes, i.e. separate backing for shared
+				 * vs. private, the host should always observe
+				 * the shared data.
+				 */
+				if (kvm_has_gmem_attributes &&
+				    uc.args[0] == SYNC_PRIVATE)
+					TEST_EXPECT_SIGBUS(READ_ONCE(*hva));
+				else
+					memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
 
 				/* For shared, write the new pattern to guest memory. */
 				if (uc.args[0] == SYNC_SHARED)
@@ -382,6 +394,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 	const size_t slot_size = memfd_size / nr_memslots;
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	pthread_t threads[KVM_MAX_VCPUS];
+	uint64_t gmem_flags;
 	struct kvm_vm *vm;
 	int memfd, i;
 
@@ -397,12 +410,17 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	if (kvm_has_gmem_attributes)
+		gmem_flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED;
+	else
+		gmem_flags = 0;
+
+	memfd = vm_create_guest_memfd(vm, memfd_size, gmem_flags);
 
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
 			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, 0);
+			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i, gmem_flags);
 
 	for (i = 0; i < nr_vcpus; i++) {
 		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
@@ -452,17 +470,24 @@ static void usage(const char *cmd)
 
 int main(int argc, char *argv[])
 {
-	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	enum vm_mem_backing_src_type src_type;
 	uint32_t nr_memslots = 1;
 	uint32_t nr_vcpus = 1;
 	int opt;
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
+	src_type = kvm_has_gmem_attributes ? VM_MEM_SRC_SHMEM :
+					     DEFAULT_VM_MEM_SRC;
+
 	while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
+			TEST_ASSERT(!kvm_has_gmem_attributes ||
+				    src_type == VM_MEM_SRC_SHMEM,
+				    "Testing in-place conversions, only %s mem_type supported\n",
+				    vm_mem_backing_src_alias(VM_MEM_SRC_SHMEM)->name);
 			break;
 		case 'n':
 			nr_vcpus = atoi_positive("nr_vcpus", optarg);
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 35/37] KVM: selftests: Add script to exercise private_mem_conversions_test
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (33 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 34/37] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 36/37] KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes Ackerley Tng
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

Add a wrapper script to simplify running the private_mem_conversions_test
with a variety of configurations. Manually invoking the test for all
supported memory backing source types is tedious.

The script automatically detects the availability of 2MB and 1GB hugepages
and builds a list of source types to test. It then iterates through the
list, running the test for each type with both a single memslot and
multiple memslots.

This makes it easier to get comprehensive test coverage across different
memory configurations.

Use python to be able to issue an ioctl to /dev/kvm.

Update .gitignore to allowlist python scripts.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 .../kvm/x86/private_mem_conversions_test.py   | 152 ++++++++++++++++++
 2 files changed, 153 insertions(+)
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.py

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 1d41a046a7bf..d7e9c1d97e37 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -4,6 +4,7 @@
 !*.c
 !*.h
 !*.S
+!*.py
 !*.sh
 !.gitignore
 !config
diff --git a/tools/testing/selftests/kvm/x86/private_mem_conversions_test.py b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.py
new file mode 100755
index 000000000000..17f46c21e85e
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/private_mem_conversions_test.py
@@ -0,0 +1,152 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper script which runs different test setups of
+# private_mem_conversions_test.
+#
+# Copyright (C) 2025, Google LLC.
+
+import os
+import fcntl
+import sys
+import subprocess
+
+
+NUM_VCPUS_TO_TEST = 4
+NUM_MEMSLOTS_TO_TEST = NUM_VCPUS_TO_TEST
+
+# Required pages are based on the test setup in the C code.
+# These static requirements are set to the maximum required for
+# NUM_VCPUS_TO_TEST, over all the hugetlb-related tests
+REQUIRED_NUM_2M_HUGEPAGES = 1024 * NUM_VCPUS_TO_TEST
+REQUIRED_NUM_1G_HUGEPAGES = 2 * NUM_VCPUS_TO_TEST
+
+
+def get_hugepage_count(page_size_kb: int) -> int:
+    """Reads the current number of hugepages available for a given size."""
+    try:
+        path = f"/sys/kernel/mm/hugepages/hugepages-{page_size_kb}kB/nr_hugepages"
+        with open(path, 'r') as f:
+            return int(f.read().strip())
+    except (FileNotFoundError, ValueError):
+        return 0
+
+
+def get_default_hugepage_size_in_kb():
+    """Reads the default hugepage size from /proc/meminfo."""
+    try:
+        with open("/proc/meminfo", 'r') as f:
+            for line in f:
+                if line.startswith("Hugepagesize:"):
+                    parts = line.split()
+                    if len(parts) >= 2 and parts[1].isdigit():
+                        return int(parts[1])
+    except FileNotFoundError:
+        return None
+
+
+def run_tests(executable_path: str, src_type: str, num_memslots: int, num_vcpus: int) -> None:
+    """Runs the test executable with different arguments."""
+    command = [executable_path, "-s", src_type, "-m", str(num_memslots), "-n", str(num_vcpus)]
+    print(" ".join(command))
+    _ = subprocess.run(command, check=True)
+
+
+def kvm_check_cap(capability: int) -> int:
+    KVM_CHECK_EXTENSION = 0xAE03
+    KVM_DEVICE = '/dev/kvm'
+
+    if not os.path.exists(KVM_DEVICE):
+        print(f"Error: KVM device not found at {KVM_DEVICE}. Is the 'kvm' module loaded?")
+        return -1
+
+    try:
+        fd = os.open(KVM_DEVICE, os.O_RDONLY)
+
+        result = fcntl.ioctl(fd, KVM_CHECK_EXTENSION, capability)
+
+        os.close(fd)
+        return result
+    except OSError as e:
+        print(f"Error issuing KVM ioctl on {KVM_DEVICE}: {e}", file=sys.stderr)
+        if fd > 0:
+            os.close(fd)
+        return -1
+
+
+def kvm_has_gmem_attributes() -> bool:
+    KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES = 246
+
+    return kvm_check_cap(KVM_CAP_GUEST_MEMFD_MEMORY_ATTRIBUTES) > 0
+
+
+def get_backing_source_types() -> list[str]:
+    hugepage_2mb_count = get_hugepage_count(2048)
+    hugepage_2mb_enabled = hugepage_2mb_count >= REQUIRED_NUM_2M_HUGEPAGES
+    hugepage_1gb_count = get_hugepage_count(1048576)
+    hugepage_1gb_enabled = hugepage_1gb_count >= REQUIRED_NUM_1G_HUGEPAGES
+
+    default_hugepage_size_kb = get_default_hugepage_size_in_kb()
+    hugepage_default_enabled = False
+    if default_hugepage_size_kb == 2048:
+        hugepage_default_enabled = hugepage_2mb_enabled
+    elif default_hugepage_size_kb == 1048576:
+        hugepage_default_enabled = hugepage_1gb_enabled
+
+    backing_src_types: list[str] = ["anonymous", "anonymous_thp"]
+
+    if hugepage_default_enabled:
+        backing_src_types.append("anonymous_hugetlb")
+    else:
+        print("skipping anonymous_hugetlb backing source type")
+
+    if hugepage_2mb_enabled:
+        backing_src_types.append("anonymous_hugetlb_2mb")
+    else:
+        print("skipping anonymous_hugetlb_2mb backing source type")
+
+    if hugepage_1gb_enabled:
+        backing_src_types.append("anonymous_hugetlb_1gb")
+    else:
+        print("skipping anonymous_hugetlb_1gb backing source type")
+
+    backing_src_types.append("shmem")
+
+    if hugepage_default_enabled:
+        backing_src_types.append("shared_hugetlb")
+    else:
+        print("skipping shared_hugetlb backing source type")
+
+    return backing_src_types
+
+
+def main():
+    script_dir = os.path.dirname(os.path.abspath(__file__))
+    test_executable = os.path.join(script_dir, "private_mem_conversions_test")
+
+    if not os.path.exists(test_executable):
+        print(f"Error: Test executable not found at '{test_executable}'", file=sys.stderr)
+        sys.exit(1)
+
+    return_code = 0
+
+    backing_src_types = ["shmem"] if kvm_has_gmem_attributes() else get_backing_source_types()
+    try:
+        for i, src_type in enumerate(backing_src_types):
+            if i > 0:
+                print()
+            run_tests(test_executable, src_type, num_memslots=1, num_vcpus=1)
+            run_tests(test_executable, src_type, num_memslots=1, num_vcpus=NUM_VCPUS_TO_TEST)
+            run_tests(test_executable, src_type, num_memslots=NUM_MEMSLOTS_TO_TEST, num_vcpus=NUM_VCPUS_TO_TEST)
+    except subprocess.CalledProcessError as e:
+        print(f"Test failed for source type '{src_type}'. Command: {' '.join(e.cmd)}", file=sys.stderr)
+        return_code = e.returncode
+    except Exception as e:
+        print(f"An unexpected error occurred: {e}", file=sys.stderr)
+        return_code = 1
+
+    sys.exit(return_code)
+
+
+if __name__ == "__main__":
+    main()
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 36/37] KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (34 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 35/37] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-02 22:30 ` [RFC PATCH v2 37/37] KVM: selftests: Update private memory exits test work with per-gmem attributes Ackerley Tng
  2026-02-20  9:09 ` [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Lisa Wang
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Skip setting memory to private in the pre-fault memory test when using
per-gmem memory attributes, as memory is initialized to private by default
for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
doable, but would need to be conditional and is ultimately unnecessary).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/pre_fault_memory_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/pre_fault_memory_test.c b/tools/testing/selftests/kvm/pre_fault_memory_test.c
index 93e603d91311..831b612449ec 100644
--- a/tools/testing/selftests/kvm/pre_fault_memory_test.c
+++ b/tools/testing/selftests/kvm/pre_fault_memory_test.c
@@ -187,7 +187,7 @@ static void __test_pre_fault_memory(unsigned long vm_type, bool private)
 				    TEST_NPAGES, private ? KVM_MEM_GUEST_MEMFD : 0);
 	virt_map(vm, gva, gpa, TEST_NPAGES);
 
-	if (private)
+	if (!kvm_has_gmem_attributes && private)
 		vm_mem_set_private(vm, gpa, TEST_SIZE);
 
 	pre_fault_memory(vcpu, gpa, 0, SZ_2M, 0, private);
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 37/37] KVM: selftests: Update private memory exits test work with per-gmem attributes
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (35 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 36/37] KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes Ackerley Tng
@ 2026-02-02 22:30 ` Ackerley Tng
  2026-02-20  9:09 ` [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Lisa Wang
  37 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

From: Sean Christopherson <seanjc@google.com>

Skip setting memory to private in the private memory exits test when using
per-gmem memory attributes, as memory is initialized to private by default
for guest_memfd, and using vm_mem_set_private() on a guest_memfd instance
requires creating guest_memfd with GUEST_MEMFD_FLAG_MMAP (which is totally
doable, but would need to be conditional and is ultimately unnecessary).

Expect an emulated MMIO instead of a memory fault exit when attributes are
per-gmem, as deleting the memslot effectively drops the private status,
i.e. the GPA becomes shared and thus supports emulated MMIO.

Skip the "memslot not private" test entirely, as private vs. shared state
for x86 software-protected VMs comes from the memory attributes themselves,
and so when doing in-place conversions there can never be a disconnect
between the expected and actual states.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../kvm/x86/private_mem_kvm_exits_test.c      | 36 +++++++++++++++----
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
index 13e72fcec8dd..10be67441d45 100644
--- a/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
+++ b/tools/testing/selftests/kvm/x86/private_mem_kvm_exits_test.c
@@ -62,8 +62,9 @@ static void test_private_access_memslot_deleted(void)
 
 	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
 
-	/* Request to access page privately */
-	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
+	/* Request to access page privately. */
+	if (!kvm_has_gmem_attributes)
+		vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
 
 	pthread_create(&vm_thread, NULL,
 		       (void *(*)(void *))run_vcpu_get_exit_reason,
@@ -74,10 +75,26 @@ static void test_private_access_memslot_deleted(void)
 	pthread_join(vm_thread, &thread_return);
 	exit_reason = (uint32_t)(uint64_t)thread_return;
 
-	TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
-	TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+	/*
+	 * If attributes are tracked per-gmem, deleting the memslot that points
+	 * at the gmem instance effectively makes the memory shared, and so the
+	 * read should trigger emulated MMIO.
+	 *
+	 * If attributes are tracked per-VM, deleting the memslot shouldn't
+	 * affect the private attribute, and so KVM should generate a memory
+	 * fault exit (emulated MMIO on private GPAs is disallowed).
+	 */
+	if (kvm_has_gmem_attributes) {
+		TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MMIO);
+		TEST_ASSERT_EQ(vcpu->run->mmio.phys_addr, EXITS_TEST_GPA);
+		TEST_ASSERT_EQ(vcpu->run->mmio.len, sizeof(uint64_t));
+		TEST_ASSERT_EQ(vcpu->run->mmio.is_write, false);
+	} else {
+		TEST_ASSERT_EQ(exit_reason, KVM_EXIT_MEMORY_FAULT);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.flags, KVM_MEMORY_EXIT_FLAG_PRIVATE);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.gpa, EXITS_TEST_GPA);
+		TEST_ASSERT_EQ(vcpu->run->memory_fault.size, EXITS_TEST_SIZE);
+	}
 
 	kvm_vm_free(vm);
 }
@@ -88,6 +105,13 @@ static void test_private_access_memslot_not_private(void)
 	struct kvm_vcpu *vcpu;
 	uint32_t exit_reason;
 
+	/*
+	 * Accessing non-private memory as private with a software-protected VM
+	 * isn't possible when doing in-place conversions.
+	 */
+	if (kvm_has_gmem_attributes)
+		return;
+
 	vm = vm_create_shape_with_one_vcpu(protected_vm_shape, &vcpu,
 					   guest_repeatedly_read);
 
-- 
2.53.0.rc1.225.gd81095ad13-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [RFC PATCH v2 00/37] guest_memfd: In-place conversion support
@ 2026-02-02 22:36 Ackerley Tng
  2026-02-02 22:29 ` [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
                   ` (37 more replies)
  0 siblings, 38 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-02 22:36 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao, Ackerley Tng

(resending to fix Message-ID)

Here's a second revision of guest_memfd In-place conversion support.

In this version, other than addressing comments from RFCv1 [1], the largest
change is that guest_memfd now does not avoid participation in LRU; it
participates in LRU by joining the unevictable list (no change from before this
series).

While checking for elevated refcounts during shared to private conversions,
guest_memfd will now do an lru_add_drain_all() if elevated refcounts were found,
before concluding that there are true users of the shared folio and erroring
out.

I'd still like feedback on these points, if any:

1. Having private/shared status stored in a maple tree (Thanks Michael for your
   support of using maple trees over xarrays for performance! [5]).
2. Having a new guest_memfd ioctl (not a vm ioctl) that performs conversions.
3. Using ioctls/structs/input attribute similar to the existing vm ioctl
   KVM_SET_MEMORY_ATTRIBUTES to perform conversions.
4. Storing requested attributes directly in the maple tree.
5. Using a KVM module-wide param to toggle between setting memory attributes via
   vm and guest_memfd ioctls (making them mututally exclusive - a single loaded
   KVM module can only do one of the two.).

This series is based on kvm/next as at 2026-01-21, and here's the tree for your
convenience:

https://github.com/googleprodkernel/linux-cc/commits/guest_memfd-inplace-conversion-v2

The "Don't set FGP_ACCESSED when getting folios" patch from RFCv1 is still
useful but no longer related to conversion, and was posted separately [6].

Older series:

+ RFCv1 is at [1]
+ Previous versions of this feature, part of other series, are available at
  [2][3][4].

[1] https://lore.kernel.org/all/cover.1760731772.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com/
[3] https://lore.kernel.org/all/20250117163001.2326672-6-tabba@google.com/
[4] https://lore.kernel.org/all/b784326e9ccae6a08388f1bf39db70a2204bdc51.1747264138.git.ackerleytng@google.com/
[5] https://lore.kernel.org/all/20250529054227.hh2f4jmyqf6igd3i@amd.com/
[6] https://lore.kernel.org/all/20260129172646.2361462-1-ackerleytng@google.com/

Ackerley Tng (19):
  KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes
  KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2
  KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion
    safety check
  KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2
  KVM: selftests: Test using guest_memfd for guest private memory
  KVM: selftests: Test basic single-page conversion flow
  KVM: selftests: Test conversion flow when INIT_SHARED
  KVM: selftests: Test indexing in guest_memfd
  KVM: selftests: Test conversion before allocation
  KVM: selftests: Convert with allocated folios in different layouts
  KVM: selftests: Test precision of conversion
  KVM: selftests: Test that truncation does not change shared/private
    status
  KVM: selftests: Test conversion with elevated page refcount
  KVM: selftests: Reset shared memory after hole-punching
  KVM: selftests: Provide function to look up guest_memfd details from
    gpa
  KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
  KVM: selftests: Update private_mem_conversions_test to mmap()
    guest_memfd
  KVM: selftests: Add script to exercise private_mem_conversions_test

Sean Christopherson (18):
  KVM: guest_memfd: Introduce per-gmem attributes, use to guard user
    mappings
  KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES
  KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem
    is defined
  KVM: Stub in ability to disable per-VM memory attribute tracking
  KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem
    attributes
  KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
  KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86
  KVM: Let userspace disable per-VM mem attributes, enable per-gmem
    attributes
  KVM: selftests: Create gmem fd before "regular" fd when adding memslot
  KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset}
  KVM: selftests: Add support for mmap() on guest_memfd in core library
  KVM: selftests: Add selftests global for guest memory attributes
    capability
  KVM: selftests: Add helpers for calling ioctls on guest_memfd
  KVM: selftests: Test that shared/private status is consistent across
    processes
  KVM: selftests: Provide common function to set memory attributes
  KVM: selftests: Check fd/flags provided to mmap() when setting up
    memslot
  KVM: selftests: Update pre-fault test to work with per-guest_memfd
    attributes
  KVM: selftests: Update private memory exits test work with per-gmem
    attributes

 Documentation/virt/kvm/api.rst                |  72 ++-
 arch/x86/include/asm/kvm_host.h               |   2 +-
 arch/x86/kvm/Kconfig                          |  15 +-
 arch/x86/kvm/mmu/mmu.c                        |   4 +-
 arch/x86/kvm/x86.c                            |  13 +-
 include/linux/kvm_host.h                      |  53 +-
 include/trace/events/kvm.h                    |   4 +-
 include/uapi/linux/kvm.h                      |  17 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile.kvm      |   1 +
 .../kvm/guest_memfd_conversions_test.c        | 486 ++++++++++++++++++
 .../testing/selftests/kvm/guest_memfd_test.c  |  57 +-
 .../testing/selftests/kvm/include/kvm_util.h  | 128 ++++-
 .../testing/selftests/kvm/include/test_util.h |  31 +-
 tools/testing/selftests/kvm/lib/kvm_util.c    | 130 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |   7 -
 .../selftests/kvm/pre_fault_memory_test.c     |   2 +-
 .../kvm/x86/private_mem_conversions_test.c    |  48 +-
 .../kvm/x86/private_mem_conversions_test.py   | 152 ++++++
 .../kvm/x86/private_mem_kvm_exits_test.c      |  36 +-
 virt/kvm/Kconfig                              |   4 +-
 virt/kvm/guest_memfd.c                        | 399 +++++++++++++-
 virt/kvm/kvm_main.c                           | 104 +++-
 23 files changed, 1590 insertions(+), 176 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_conversions_test.c
 create mode 100755 tools/testing/selftests/kvm/x86/private_mem_conversions_test.py

--
2.53.0.rc1.225.gd81095ad13-goog

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe
  2026-02-02 22:30 ` [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe Ackerley Tng
@ 2026-02-14 19:49   ` Ackerley Tng
  0 siblings, 0 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-14 19:49 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

Ackerley Tng <ackerleytng@google.com> writes:

> The TEST_EXPECT_SIGBUS macro is not thread-safe as it uses a global
> sigjmp_buf and installs a global SIGBUS signal handler. If multiple threads
> execute the macro concurrently, they will race on installing the signal
> handler and stomp on other threads' jump buffers, leading to incorrect test
> behavior.
>
> Make TEST_EXPECT_SIGBUS thread-safe with the following changes:
>
> Share the KVM tests' global signal handler. sigaction() applies to all
> threads; without sharing a global signal handler, one thread may have
> removed the signal handler that another thread added, hence leading to
> unexpected signals.
>
> The alternative of layering signal handlers was considered, but calling
> sigaction() within TEST_EXPECT_SIGBUS() necessarily creates a race. To
> avoid adding new setup and teardown routines to do sigaction() and keep
> usage of TEST_EXPECT_SIGBUS() simple, share the KVM tests' global signal
> handler.
>
> Opportunistically rename report_unexpected_signal to
> catchall_signal_handler.
>
> To continue to only expect SIGBUS within specific regions of code, use a
> thread-specific variable, expecting_sigbus, to replace installing and
> removing signal handlers.
>
> Make the execution environment for the thread, sigjmp_buf, a
> thread-specific variable.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  .../testing/selftests/kvm/include/test_util.h | 29 +++++++++----------
>  tools/testing/selftests/kvm/lib/kvm_util.c    | 18 ++++++++----
>  tools/testing/selftests/kvm/lib/test_util.c   |  7 -----
>  3 files changed, 26 insertions(+), 28 deletions(-)
>
> diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
> index 2871a4292847..0e4e6f7dab8f 100644
> --- a/tools/testing/selftests/kvm/include/test_util.h
> +++ b/tools/testing/selftests/kvm/include/test_util.h
> @@ -80,22 +80,19 @@ do {									\
>  	__builtin_unreachable(); \
>  } while (0)
>
> -extern sigjmp_buf expect_sigbus_jmpbuf;
> -void expect_sigbus_handler(int signum);
> -
> -#define TEST_EXPECT_SIGBUS(action)						\
> -do {										\
> -	struct sigaction sa_old, sa_new = {					\
> -		.sa_handler = expect_sigbus_handler,				\
> -	};									\
> -										\
> -	sigaction(SIGBUS, &sa_new, &sa_old);					\
> -	if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {				\
> -		action;								\
> -		TEST_FAIL("'%s' should have triggered SIGBUS", #action);	\
> -	}									\
> -	sigaction(SIGBUS, &sa_old, NULL);					\
> -} while (0)
> +extern __thread sigjmp_buf expect_sigbus_jmpbuf;
> +extern __thread bool expecting_sigbus;
> +
> +#define TEST_EXPECT_SIGBUS(action)                                     \
> +	do {                                                           \
> +		expecting_sigbus = true;			       \
> +		if (sigsetjmp(expect_sigbus_jmpbuf, 1) == 0) {         \
> +			action;                                        \
> +			TEST_FAIL("'%s' should have triggered SIGBUS", \
> +				  #action);                            \
> +		}                                                      \
> +		expecting_sigbus = false;			       \
> +	} while (0)
>
>  size_t parse_size(const char *size);
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index aec7b24418ab..18ced8bdde36 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -2314,13 +2314,20 @@ __weak void kvm_selftest_arch_init(void)
>  {
>  }
>
> -static void report_unexpected_signal(int signum)
> +__thread sigjmp_buf expect_sigbus_jmpbuf;
> +__thread bool expecting_sigbus;
> +
> +static void catchall_signal_handler(int signum)
>  {
> +	switch (signum) {
> +	case SIGBUS: {
> +		if (expecting_sigbus)


Transferring/summarizing an internal comment from Sean upstream:

This assumes that tests are indeed using the catchall_signal_handler as
the global signal handler.

In the next revision, I will make TEST_EXPECT_SIGBUS() assert that the
default signal handler is installed, so that developers get a clear,
explicit failure if/when something goes wrong.

> +			siglongjmp(expect_sigbus_jmpbuf, 1);
> +
> +		TEST_FAIL("Unexpected SIGBUS (%d)\n", signum);
> +	}
>  #define KVM_CASE_SIGNUM(sig)					\
>  	case sig: TEST_FAIL("Unexpected " #sig " (%d)\n", signum)
> -
> -	switch (signum) {
> -	KVM_CASE_SIGNUM(SIGBUS);
>  	KVM_CASE_SIGNUM(SIGSEGV);
>  	KVM_CASE_SIGNUM(SIGILL);
>  	KVM_CASE_SIGNUM(SIGFPE);
> @@ -2332,12 +2339,13 @@ static void report_unexpected_signal(int signum)
>  void __attribute((constructor)) kvm_selftest_init(void)
>  {
>  	struct sigaction sig_sa = {
> -		.sa_handler = report_unexpected_signal,
> +		.sa_handler = catchall_signal_handler,
>  	};
>
>  	/* Tell stdout not to buffer its content. */
>  	setbuf(stdout, NULL);
>
> +	expecting_sigbus = false;
>  	sigaction(SIGBUS, &sig_sa, NULL);
>  	sigaction(SIGSEGV, &sig_sa, NULL);
>  	sigaction(SIGILL, &sig_sa, NULL);
> diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
> index 8a1848586a85..03eb99af9b8d 100644
> --- a/tools/testing/selftests/kvm/lib/test_util.c
> +++ b/tools/testing/selftests/kvm/lib/test_util.c
> @@ -18,13 +18,6 @@
>
>  #include "test_util.h"
>
> -sigjmp_buf expect_sigbus_jmpbuf;
> -
> -void __attribute__((used)) expect_sigbus_handler(int signum)
> -{
> -	siglongjmp(expect_sigbus_jmpbuf, 1);
> -}
> -
>  /*
>   * Random number generator that is usable from guest code. This is the
>   * Park-Miller LCG using standard constants.
> --
> 2.53.0.rc1.225.gd81095ad13-goog

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-02 22:29 ` [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
@ 2026-02-14 20:09   ` Ackerley Tng
  2026-02-17 23:04     ` Sean Christopherson
                       ` (3 more replies)
  0 siblings, 4 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-02-14 20:09 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

Ackerley Tng <ackerleytng@google.com> writes:

>
> [...snip...]
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 23ec0b0c3e22..26e80745c8b4 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -117,7 +117,7 @@ description:
>        x86 includes both i386 and x86_64.
>
>    Type:
> -      system, vm, or vcpu.
> +      system, vm, vcpu or guest_memfd.
>
>    Parameters:
>        what parameters are accepted by the ioctl.
> @@ -6523,11 +6523,22 @@ the capability to be present.
>  ---------------------------------
>
>  :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
> -:Architectures: x86
> -:Type: vm ioctl
> +:Architectures: all
> +:Type: vm, guest_memfd ioctl
>  :Parameters: struct kvm_memory_attributes2 (in/out)
>  :Returns: 0 on success, <0 on error
>
> +Errors:
> +
> +  ========== ===============================================================
> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
> +             page aligned, causes an overflow, or size is zero).
> +  EFAULT     The parameter address was invalid.
> +  EAGAIN     Some page within requested range had unexpected refcounts. The
> +             offset of the page will be returned in `error_offset`.
> +  ENOMEM     Ran out of memory trying to track private/shared state
> +  ========== ===============================================================
> +
>  KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
>  KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
>  userspace.  The original (pre-extension) fields are shared with
> @@ -6538,15 +6549,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
>  ::
>
>    struct kvm_memory_attributes2 {
> -	__u64 address;
> +	/* in */
> +	union {
> +		__u64 address;
> +		__u64 offset;
> +	};
>  	__u64 size;
>  	__u64 attributes;
>  	__u64 flags;
> -	__u64 reserved[12];
> +	/* out */
> +	__u64 error_offset;
> +	__u64 reserved[11];
>    };
>
>    #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>
> +Set attributes for a range of offsets within a guest_memfd to
> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
> +supported, after a successful call to set
> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
> +into host userspace and will only be mappable by the guest.
> +
> +To allow the range to be mappable into host userspace again, call
> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
> +
> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
> +refcounts will be returned in `error_offset`. This can occur if there
> +are transient refcounts on the pages, taken by other parts of the
> +kernel.
> +
> +Userspace is expected to figure out how to remove all known refcounts
> +on the shared pages, such as refcounts taken by get_user_pages(), and
> +try the ioctl again. A possible source of these long term refcounts is
> +if the guest_memfd memory was pinned in IOMMU page tables.
> +
>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>

Transferring/re-summarizing an internal comment from Sean upstream here!
We can also follow up on this topic at the next guest_memfd biweekly.

Before this lands, Sean wants, at the very minimum, an in-principle
agreement on guest_memfd behavior with respect to whether or not memory
should be preserved on conversion.

Sean is against deferring whether to preserve memory to the underlying
hardware because that is letting (effectively) micro-architectural
behavior to define KVM's ABI. KVM's uAPI cannot let behavior be
undefined, or be based on vendor, and maybe even on firmware version.

Sean says that all decisions that affect guest data must be made by
userspace. The architecture can restrict what is possible, e.g. neither
SNP nor TDX currently support "generic" in-place conversion, but whether
or not data is to be preserved must be an explicit request from
userspace. If preserving data is impossible, then KVM needs to reject
the request.

(Vendor specific ioctls are out-of-scope, SNP and TDX cases were brought
up purely to highlight that there's nothing that fundamentally prevents
preserving data on conversion.)

I suggested a few uAPI options for configuring content preservation on
conversion:

1. guest_memfd creation time flag like
   GUEST_MEMFD_FLAG_PRESERVE_CONTENTS. This can be valid only if the
   kernel and vendor support content preservation

This was rejected because we should not assume all current and future
use cases will want the same content preservation config for a given
guest_memfd.

2. KConfig: automatically select to preserve contents if the
   architecture supports content preservation

This was rejected because it's not a decision explicitly made by
userspace.

3. KVM module param to configure content preservation.

This was rejected because the configuration may not generalize across
all VMs on the same host.

4. guest_memfd ioctl flag
   SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS. -EINVAL if kernel and
   vendor don't support content preservation

Specifying a flag to choose whether content should be preserved at
conversion-time is the current best suggestion.

What does the rest of the community think of a conversion ioctl flag to
choose whether to preserve memory contents on conversion?

Fuad, I think you also made a related comment on an earlier internal
version we were working on. What do you/pKVM think?

>
> [...snip...]
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-14 20:09   ` Ackerley Tng
@ 2026-02-17 23:04     ` Sean Christopherson
  2026-02-19 12:43     ` Fuad Tabba
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 54+ messages in thread
From: Sean Christopherson @ 2026-02-17 23:04 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

On Sat, Feb 14, 2026, Ackerley Tng wrote:
> Sean is against deferring whether to preserve memory to the underlying
> hardware because that is letting (effectively) micro-architectural
> behavior to define KVM's ABI. KVM's uAPI cannot let behavior be
> undefined, or be based on vendor, and maybe even on firmware version.

Concretely, we'll end up with a giant mess if TDX and/or SNP adds support for
preserving contents on conversion.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-14 20:09   ` Ackerley Tng
  2026-02-17 23:04     ` Sean Christopherson
@ 2026-02-19 12:43     ` Fuad Tabba
  2026-02-24 10:14     ` Ackerley Tng
  2026-03-12  5:44     ` Ackerley Tng
  3 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2026-02-19 12:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Hi Ackerley,

Sorry, but I completely missed this!

[...snip...]

> 4. guest_memfd ioctl flag
>    SET_MEMORY_ATTRIBUTES2_FLAG_PRESERVE_CONTENTS. -EINVAL if kernel and
>    vendor don't support content preservation
>
> Specifying a flag to choose whether content should be preserved at
> conversion-time is the current best suggestion.
>
> What does the rest of the community think of a conversion ioctl flag to
> choose whether to preserve memory contents on conversion?
>
> Fuad, I think you also made a related comment on an earlier internal
> version we were working on. What do you/pKVM think?

Introducing PRESERVE_CONTENTS seems like the correct architectural
approach to me. pKVM fully supports requiring the PRESERVE_CONTENTS
flag on every conversion where data retention is expected. As you
know, pKVM operates at EL2 with control over Stage-2 page tables, so
we intrinsically support in-place state transitions. Requiring the VMM
to explicitly pass this flag when injecting payloads ensures we have
explicit userspace intent without artificially restricting pKVM's
capabilities.

If userspace omits the PRESERVE_CONTENTS flag, KVM must zero the page
contents in software and proceed. Returning -EINVAL would be incorrect
because it would make the flag mandatory, eliminating userspace's
ability to request clean-slate, destructive conversions when needed.
Additionally, software zeroing ensures deterministic behavior across
pKVM, TDX, and SNP. The guest is guaranteed a clean page, isolating
the KVM uAPI from micro-architectural data destruction nuances.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 00/37] guest_memfd: In-place conversion support
  2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
                   ` (36 preceding siblings ...)
  2026-02-02 22:30 ` [RFC PATCH v2 37/37] KVM: selftests: Update private memory exits test work with per-gmem attributes Ackerley Tng
@ 2026-02-20  9:09 ` Lisa Wang
  37 siblings, 0 replies; 54+ messages in thread
From: Lisa Wang @ 2026-02-20  9:09 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, yan.y.zhao

On Mon, Feb 02, 2026 at 02:36:37PM -0800, Ackerley Tng wrote:
> (resending to fix Message-ID)
> 
> Here's a second revision of guest_memfd In-place conversion support.
> 
> In this version, other than addressing comments from RFCv1 [1], the largest
> change is that guest_memfd now does not avoid participation in LRU; it
> participates in LRU by joining the unevictable list (no change from before this
> series).
> 
> While checking for elevated refcounts during shared to private conversions,
> guest_memfd will now do an lru_add_drain_all() if elevated refcounts were found,
> before concluding that there are true users of the shared folio and erroring
> out.
> 
> I'd still like feedback on these points, if any:
> 
> 1. Having private/shared status stored in a maple tree (Thanks Michael for your
>    support of using maple trees over xarrays for performance! [5]).
> 2. Having a new guest_memfd ioctl (not a vm ioctl) that performs conversions.
> 3. Using ioctls/structs/input attribute similar to the existing vm ioctl
>    KVM_SET_MEMORY_ATTRIBUTES to perform conversions.
> 4. Storing requested attributes directly in the maple tree.
> 5. Using a KVM module-wide param to toggle between setting memory attributes via
>    vm and guest_memfd ioctls (making them mututally exclusive - a single loaded
>    KVM module can only do one of the two.).
> 
> [...snip...]
>
> 
> --
> 2.53.0.rc1.225.gd81095ad13-goog

I’ve tested memory failure handling after applying this series and here’s what
memory_failure() does:

Shared memory: In line with other in-memory filesystems, the memory_failure()
handler unmaps the page if it is currently mapped, and issues a SIGBUS
  - if memory failure was injected with MF_ACTION_REQUIRED or
  - if the test process’s memory corruption kill policy is PR_MCE_KILL_EARLY

Here’s the above, in table form:

| MF_ACTION_REQUIRED | Kill Policy         | Mapped | Dirty | Result: SIGBUS |
|--------------------|---------------------|--------|-------|----------------|
| false              | PR_MCE_KILL_EARLY   | true   | true  | true           |
| false              | PR_MCE_KILL_EARLY   | true   | false | false          |
| false              | PR_MCE_KILL_EARLY   | false  | true  | false          |
| false              | PR_MCE_KILL_EARLY   | false  | false | false          |
| false              | PR_MCE_KILL_LATE    | true   | true  | false          |
| false              | PR_MCE_KILL_LATE    | true   | false | false          |
| false              | PR_MCE_KILL_LATE    | false  | true  | false          |
| false              | PR_MCE_KILL_LATE    | false  | false | false          |
| true               | Any Policy          | true   | true  | true           |
| true               | Any Policy          | true   | false | false          |

(I used MADV_HWPOISON to inject memory failures with MF_ACTION_REQUIRED set, and
there was no way to use MADV_HWPOISON without first mapping the page in. To
inject memory failures without MF_ACTION_REQUIRED set, I used debugfs’
hwpoison/corrupt-pfn.)

Private memory: The handler unmaps the page for the stage 2 page table and does
not issue a SIGBUS - the page is never mapped to the host, since it is private
to the guest.

| MF_ACTION_REQUIRED | Kill Policy         | Mapped | Dirty | Result: SIGBUS |
|--------------------|---------------------|--------|-------|----------------|
| false              | PR_MCE_KILL_EARLY   | false  | true  | false          |
| false              | PR_MCE_KILL_EARLY   | false  | false | false          |
| false              | PR_MCE_KILL_LATE    | false  | true  | false          |
| false              | PR_MCE_KILL_LATE    | false  | false | false          |

(I couldn’t use MADV_HWPOISON since private memory could not be mapped and hence
will not have a userspace address)

I’ll post updated memory failure tests together with the next revision of this
series [1] to fix MF_DELAYED handling on memory failure.

[1] https://lore.kernel.org/all/cover.1760551864.git.wyihan@google.com/T/


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-14 20:09   ` Ackerley Tng
  2026-02-17 23:04     ` Sean Christopherson
  2026-02-19 12:43     ` Fuad Tabba
@ 2026-02-24 10:14     ` Ackerley Tng
  2026-02-25 11:00       ` Fuad Tabba
  2026-03-12  5:44     ` Ackerley Tng
  3 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2026-02-24 10:14 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

Ackerley Tng <ackerleytng@google.com> writes:

> Ackerley Tng <ackerleytng@google.com> writes:
>
>>
>> [...snip...]
>>
> Before this lands, Sean wants, at the very minimum, an in-principle
> agreement on guest_memfd behavior with respect to whether or not memory
> should be preserved on conversion.
>>
>> [...snip...]
>>

Here's what I've come up with, following up from last guest_memfd
biweekly.

Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
enum set_memory_attributes_content_policy:

    enum set_memory_attributes_content_policy {
    	SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
    	SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
    	SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
    }

Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
will make an arch call

    kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)

where every arch will get to return some error if the requested policy
is not supported for the given range.

ZERO is the simplest of the above, it means that after the conversion
the memory will be zeroed for the next reader.

+ TDX and SNP today will support ZERO since the firmware handles
  zeroing.
+ pKVM and SW_PROTECTED_VM will apply software zeroing.
+ Purpose: having this policy in the API allows userspace to be sure
  that the memory is zeroed after the conversion - there is no need to
  zero again in userspace (addresses concern that Sean pointed out)

READABLE means that after the conversion, the memory is readable by
userspace (if converting to shared) or readable by the guest (if
converting to private).

+ TDX and SNP (today) can't support this, so return -EOPNOTSUPP
+ SW_PROTECTED_VM will support this and do nothing extra on
  conversion, since there is no encryption anyway and all content
  remains readable.
+ pKVM will make use of the arch function above.

Here's where I need input: (David's questions during the call about
the full flow beginning with the guest prompted this).

Since pKVM doesn't encrypt the memory contents, there must be some way
that pKVM can say no when userspace requests to convert and retain
READABLE contents? I think pKVM's arch function can be used to check
if the guest previously made a conversion request. Fuad, to check that
the guest made a conversion request, what's other parameters are
needed other than gfn and nr_pages?

ENCRYPTED means that after the conversion, the memory contents are
retained as-is, with no decryption.

+ TDX and SNP (today) can't support this, so return -EOPNOTSUPP
+ pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
  READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
  should return -EOPNOTSUPP.
+ Michael, you mentioned during the call that SNP is planning to
  introduce a policy that retains the ENCRYPTED version for a special
  GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
  assuming that SNP should only support this policy given some
  conditions, so would the arch call as described above work?
+ If this policy is specified on conversion from shared to private,
  always return -EOPNOTSUPP.
+ When this first lands, ENCRYPTED will not be a valid option, but I'm
  listing it here so we have line of sight to having this support.

READABLE and ENCRYPTED defines the state after conversion clearly
(instead of DONT_CARE or similar).

DESTROY could be another policy, which means that after the
conversion, the memory is unreadable. This is the option to address
what David brought up during the call, for cases where userspace knows
it is going to free the memory already and doesn't care about the
state as long as nobody gets to read it. This will not implemented
when feature first lands, but is presented here just to show how this
can be extended in future.

Right now, I'm thinking that one of the above policies MUST be
specified (not specifying a policy will result in -EINVAL).

How does this sound?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-24 10:14     ` Ackerley Tng
@ 2026-02-25 11:00       ` Fuad Tabba
  2026-02-26  4:16         ` Ackerley Tng
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2026-02-25 11:00 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Hi Ackerley,

Here are my thoughts, at least when it comes to pKVM.

On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Ackerley Tng <ackerleytng@google.com> writes:
>
> > Ackerley Tng <ackerleytng@google.com> writes:
> >
> >>
> >> [...snip...]
> >>
> > Before this lands, Sean wants, at the very minimum, an in-principle
> > agreement on guest_memfd behavior with respect to whether or not memory
> > should be preserved on conversion.
> >>
> >> [...snip...]
> >>
>
> Here's what I've come up with, following up from last guest_memfd
> biweekly.
>
> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
> enum set_memory_attributes_content_policy:
>
>     enum set_memory_attributes_content_policy {
>         SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
>         SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
>         SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
>     }
>
> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
> will make an arch call
>
>     kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
>
> where every arch will get to return some error if the requested policy
> is not supported for the given range.

This hook provides the validation mechanism pKVM requires.

> ZERO is the simplest of the above, it means that after the conversion
> the memory will be zeroed for the next reader.
>
> + TDX and SNP today will support ZERO since the firmware handles
>   zeroing.
> + pKVM and SW_PROTECTED_VM will apply software zeroing.
> + Purpose: having this policy in the API allows userspace to be sure
>   that the memory is zeroed after the conversion - there is no need to
>   zero again in userspace (addresses concern that Sean pointed out)
>
> READABLE means that after the conversion, the memory is readable by
> userspace (if converting to shared) or readable by the guest (if
> converting to private).
>
> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> + SW_PROTECTED_VM will support this and do nothing extra on
>   conversion, since there is no encryption anyway and all content
>   remains readable.
> + pKVM will make use of the arch function above.
>
> Here's where I need input: (David's questions during the call about
> the full flow beginning with the guest prompted this).
>
> Since pKVM doesn't encrypt the memory contents, there must be some way
> that pKVM can say no when userspace requests to convert and retain
> READABLE contents? I think pKVM's arch function can be used to check
> if the guest previously made a conversion request. Fuad, to check that
> the guest made a conversion request, what's other parameters are
> needed other than gfn and nr_pages?

The gfn and nr_pages parameters are enough I think.

To clarify how pKVM would use this hook: all memory sharing and
unsharing must be initiated by the guest via a hypercall. When the
guest issues this hypercall, the pKVM hypervisor (EL2) exits to the
host kernel (EL1). The host kernel records the exit reason (share or
unshare) along with the specific memory address in the kvm_run
structure before exiting to userspace.

We do not track this pending conversion state in the hypervisor. If a
compromised host kernel wants to lie and corrupt the state, it can
crash the system or the guest (which is an accepted DOS risk), but it
cannot compromise guest confidentiality because EL2 still strictly
enforces Stage-2 permissions. Our primary goal here is to prevent a
malicious or buggy userspace VMM from crashing the system.

When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
with the READABLE policy, we will use the
kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
ioctl. We will cross-reference the requested gfn and nr_pages against
the pending exit reason stored in kvm_run.

If the VMM attempts an unsolicited conversion (i.e., there is no
matching exit request in kvm_run, or the addresses do not match), our
current plan is to reject the request and return an error. In the
future, rather than outright rejecting an unsolicited conversion, we
might evolve this to treat it as a host-initiated destructive reclaim,
forcing an unshare and zeroing the memory. For the time being,
explicit rejection is the simplest and safest path.

> ENCRYPTED means that after the conversion, the memory contents are
> retained as-is, with no decryption.
>
> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
>   should return -EOPNOTSUPP.
> + Michael, you mentioned during the call that SNP is planning to
>   introduce a policy that retains the ENCRYPTED version for a special
>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
>   assuming that SNP should only support this policy given some
>   conditions, so would the arch call as described above work?
> + If this policy is specified on conversion from shared to private,
>   always return -EOPNOTSUPP.
> + When this first lands, ENCRYPTED will not be a valid option, but I'm
>   listing it here so we have line of sight to having this support.
>
> READABLE and ENCRYPTED defines the state after conversion clearly
> (instead of DONT_CARE or similar).
>
> DESTROY could be another policy, which means that after the
> conversion, the memory is unreadable. This is the option to address
> what David brought up during the call, for cases where userspace knows
> it is going to free the memory already and doesn't care about the
> state as long as nobody gets to read it. This will not implemented
> when feature first lands, but is presented here just to show how this
> can be extended in future.
>
> Right now, I'm thinking that one of the above policies MUST be
> specified (not specifying a policy will result in -EINVAL).
>
> How does this sound?

I don't think that returning -EINVAL is the right thing to do here. If
userspace omits the policy, the API should default to
SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion. I
believe that, in Linux APIs in general, omitting an optional behavior
flag results in the safest, most standard default action.

Also, returning -EINVAL when no policy is specified makes the policy
parameter strictly mandatory. This makes it difficult for userspace's
to seamlessly request clean-slate, destructive conversions. Software
zeroing ensures deterministic behavior across pKVM, TDX, and SNP,
isolating the KVM uAPI from micro-architectural data destruction
nuances.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-25 11:00       ` Fuad Tabba
@ 2026-02-26  4:16         ` Ackerley Tng
  2026-02-26  8:11           ` Fuad Tabba
  0 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2026-02-26  4:16 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Fuad Tabba <tabba@google.com> writes:

> Hi Ackerley,
>
> Here are my thoughts, at least when it comes to pKVM.
>
>
> On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <ackerleytng@google.com> wrote:
>>
>> Ackerley Tng <ackerleytng@google.com> writes:
>>
>> > Ackerley Tng <ackerleytng@google.com> writes:
>> >
>> >>
>> >> [...snip...]
>> >>
>> > Before this lands, Sean wants, at the very minimum, an in-principle
>> > agreement on guest_memfd behavior with respect to whether or not memory
>> > should be preserved on conversion.
>> >>
>> >> [...snip...]
>> >>
>>
>> Here's what I've come up with, following up from last guest_memfd
>> biweekly.
>>
>> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
>> enum set_memory_attributes_content_policy:
>>
>>     enum set_memory_attributes_content_policy {
>>         SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
>>         SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
>>         SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
>>     }
>>
>> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
>> will make an arch call
>>
>>     kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
>>
>> where every arch will get to return some error if the requested policy
>> is not supported for the given range.
>
> This hook provides the validation mechanism pKVM requires.
>
>> ZERO is the simplest of the above, it means that after the conversion
>> the memory will be zeroed for the next reader.
>>
>> + TDX and SNP today will support ZERO since the firmware handles
>>   zeroing.
>> + pKVM and SW_PROTECTED_VM will apply software zeroing.
>> + Purpose: having this policy in the API allows userspace to be sure
>>   that the memory is zeroed after the conversion - there is no need to
>>   zero again in userspace (addresses concern that Sean pointed out)
>>
>> READABLE means that after the conversion, the memory is readable by
>> userspace (if converting to shared) or readable by the guest (if
>> converting to private).
>>
>> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
>> + SW_PROTECTED_VM will support this and do nothing extra on
>>   conversion, since there is no encryption anyway and all content
>>   remains readable.
>> + pKVM will make use of the arch function above.
>>
>> Here's where I need input: (David's questions during the call about
>> the full flow beginning with the guest prompted this).
>>
>> Since pKVM doesn't encrypt the memory contents, there must be some way
>> that pKVM can say no when userspace requests to convert and retain
>> READABLE contents? I think pKVM's arch function can be used to check
>> if the guest previously made a conversion request. Fuad, to check that
>> the guest made a conversion request, what's other parameters are
>> needed other than gfn and nr_pages?
>
> The gfn and nr_pages parameters are enough I think.
>
> To clarify how pKVM would use this hook: all memory sharing and
> unsharing must be initiated by the guest via a hypercall. When the
> guest issues this hypercall, the pKVM hypervisor (EL2) exits to the
> host kernel (EL1). The host kernel records the exit reason (share or
> unshare) along with the specific memory address in the kvm_run
> structure before exiting to userspace.
>
> We do not track this pending conversion state in the hypervisor. If a
> compromised host kernel wants to lie and corrupt the state, it can
> crash the system or the guest (which is an accepted DOS risk), but it
> cannot compromise guest confidentiality because EL2 still strictly
> enforces Stage-2 permissions. Our primary goal here is to prevent a
> malicious or buggy userspace VMM from crashing the system.
>

Thinking through it again, there's actually no security (in terms of
CoCo confidentiality) risk here, since the conversion ioctl doesn't
actually tell the CoCo vendor/platform to encrypt/decrypt or flip
permissions, it just unmaps the pages as requested.

On TDX, if a rogue private to shared conversion request comes in, the
private pages would get unmapped from the guest, and on the next guest
access, the guest would access the page as private, so kvm's fault
handler would think there's a shared/private mismatch and exit with
KVM_EXIT_MEMORY_FAULT. Userspace now has a zeroed shared page, and the
guest needs to re-accept the page to continue using it (if it knows what
to do with a zeroed page). This would be userspace DOS-ing the guest,
which userspace can do anyway.

On pKVM, rephrasing what you said, even if there is a rogue private to
shared conversion, EL2 still thinks of the page as private. After the
conversion, the page can be faulted in by the host, but any access will
be stopped by EL2.

David, there's no missing piece in the flow!

> When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
> with the READABLE policy, we will use the
> kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
> ioctl. We will cross-reference the requested gfn and nr_pages against
> the pending exit reason stored in kvm_run.
>
> If the VMM attempts an unsolicited conversion (i.e., there is no
> matching exit request in kvm_run, or the addresses do not match), our

Ah I see, so struct kvm_run is not considered "in the hypervisor" since
it is modifiable by host userspace. Would you be using struct
memory_fault in struct kvm_run?

Which vcpu's kvm_run struct would you look up from
kvm_gmem_arch_content_policy_supported()?

For this to land, do you still want the gfn and nr_pages parameters?

Can pKVM just always accept the request, whether the guest requested it
or not? Thinking about it again,
kvm_gmem_arch_content_policy_supported() probably shouldn't be used to
guard solicited vs unsolicited requests anyway (unless you think the
function's name should be changed?)

> current plan is to reject the request and return an error. In the
> future, rather than outright rejecting an unsolicited conversion, we
> might evolve this to treat it as a host-initiated destructive reclaim,
> forcing an unshare and zeroing the memory. For the time being,
> explicit rejection is the simplest and safest path.
>

>> ENCRYPTED means that after the conversion, the memory contents are
>> retained as-is, with no decryption.
>>
>> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
>> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
>>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
>>   should return -EOPNOTSUPP.
>> + Michael, you mentioned during the call that SNP is planning to
>>   introduce a policy that retains the ENCRYPTED version for a special
>>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
>>   assuming that SNP should only support this policy given some
>>   conditions, so would the arch call as described above work?
>> + If this policy is specified on conversion from shared to private,
>>   always return -EOPNOTSUPP.
>> + When this first lands, ENCRYPTED will not be a valid option, but I'm
>>   listing it here so we have line of sight to having this support.
>>
>> READABLE and ENCRYPTED defines the state after conversion clearly
>> (instead of DONT_CARE or similar).
>>
>> DESTROY could be another policy, which means that after the
>> conversion, the memory is unreadable. This is the option to address
>> what David brought up during the call, for cases where userspace knows
>> it is going to free the memory already and doesn't care about the
>> state as long as nobody gets to read it. This will not implemented
>> when feature first lands, but is presented here just to show how this
>> can be extended in future.
>>
>> Right now, I'm thinking that one of the above policies MUST be
>> specified (not specifying a policy will result in -EINVAL).
>>
>> How does this sound?
>
> I don't think that returning -EINVAL is the right thing to do here. If
> userspace omits the policy, the API should default to
> SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion. I
> believe that, in Linux APIs in general, omitting an optional behavior
> flag results in the safest, most standard default action.
>

Makes sense, I'll default to zeroing then. Thanks!

> Also, returning -EINVAL when no policy is specified makes the policy
> parameter strictly mandatory. This makes it difficult for userspace's
> to seamlessly request clean-slate, destructive conversions. Software
> zeroing ensures deterministic behavior across pKVM, TDX, and SNP,
> isolating the KVM uAPI from micro-architectural data destruction
> nuances.
>
> Cheers,
> /fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-26  4:16         ` Ackerley Tng
@ 2026-02-26  8:11           ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2026-02-26  8:11 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Hi Ackerley,

On Thu, 26 Feb 2026 at 04:16, Ackerley Tng <ackerleytng@google.com> wrote:
>
> Fuad Tabba <tabba@google.com> writes:
>
> > Hi Ackerley,
> >
> > Here are my thoughts, at least when it comes to pKVM.
> >
> >
> > On Tue, 24 Feb 2026 at 10:14, Ackerley Tng <ackerleytng@google.com> wrote:
> >>
> >> Ackerley Tng <ackerleytng@google.com> writes:
> >>
> >> > Ackerley Tng <ackerleytng@google.com> writes:
> >> >
> >> >>
> >> >> [...snip...]
> >> >>
> >> > Before this lands, Sean wants, at the very minimum, an in-principle
> >> > agreement on guest_memfd behavior with respect to whether or not memory
> >> > should be preserved on conversion.
> >> >>
> >> >> [...snip...]
> >> >>
> >>
> >> Here's what I've come up with, following up from last guest_memfd
> >> biweekly.
> >>
> >> Every KVM_SET_MEMORY_ATTRIBUTES2 request will be accompanied by an
> >> enum set_memory_attributes_content_policy:
> >>
> >>     enum set_memory_attributes_content_policy {
> >>         SET_MEMORY_ATTRIBUTES_CONTENT_ZERO,
> >>         SET_MEMORY_ATTRIBUTES_CONTENT_READABLE,
> >>         SET_MEMORY_ATTRIBUTES_CONTENT_ENCRYPTED,
> >>     }
> >>
> >> Within guest_memfd's KVM_SET_MEMORY_ATTRIBUTES2 handler, guest_memfd
> >> will make an arch call
> >>
> >>     kvm_gmem_arch_content_policy_supported(kvm, policy, gfn, nr_pages)
> >>
> >> where every arch will get to return some error if the requested policy
> >> is not supported for the given range.
> >
> > This hook provides the validation mechanism pKVM requires.
> >
> >> ZERO is the simplest of the above, it means that after the conversion
> >> the memory will be zeroed for the next reader.
> >>
> >> + TDX and SNP today will support ZERO since the firmware handles
> >>   zeroing.
> >> + pKVM and SW_PROTECTED_VM will apply software zeroing.
> >> + Purpose: having this policy in the API allows userspace to be sure
> >>   that the memory is zeroed after the conversion - there is no need to
> >>   zero again in userspace (addresses concern that Sean pointed out)
> >>
> >> READABLE means that after the conversion, the memory is readable by
> >> userspace (if converting to shared) or readable by the guest (if
> >> converting to private).
> >>
> >> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> >> + SW_PROTECTED_VM will support this and do nothing extra on
> >>   conversion, since there is no encryption anyway and all content
> >>   remains readable.
> >> + pKVM will make use of the arch function above.
> >>
> >> Here's where I need input: (David's questions during the call about
> >> the full flow beginning with the guest prompted this).
> >>
> >> Since pKVM doesn't encrypt the memory contents, there must be some way
> >> that pKVM can say no when userspace requests to convert and retain
> >> READABLE contents? I think pKVM's arch function can be used to check
> >> if the guest previously made a conversion request. Fuad, to check that
> >> the guest made a conversion request, what's other parameters are
> >> needed other than gfn and nr_pages?
> >
> > The gfn and nr_pages parameters are enough I think.
> >
> > To clarify how pKVM would use this hook: all memory sharing and
> > unsharing must be initiated by the guest via a hypercall. When the
> > guest issues this hypercall, the pKVM hypervisor (EL2) exits to the
> > host kernel (EL1). The host kernel records the exit reason (share or
> > unshare) along with the specific memory address in the kvm_run
> > structure before exiting to userspace.
> >
> > We do not track this pending conversion state in the hypervisor. If a
> > compromised host kernel wants to lie and corrupt the state, it can
> > crash the system or the guest (which is an accepted DOS risk), but it
> > cannot compromise guest confidentiality because EL2 still strictly
> > enforces Stage-2 permissions. Our primary goal here is to prevent a
> > malicious or buggy userspace VMM from crashing the system.
> >
>
> Thinking through it again, there's actually no security (in terms of
> CoCo confidentiality) risk here, since the conversion ioctl doesn't
> actually tell the CoCo vendor/platform to encrypt/decrypt or flip
> permissions, it just unmaps the pages as requested.
>
> On TDX, if a rogue private to shared conversion request comes in, the
> private pages would get unmapped from the guest, and on the next guest
> access, the guest would access the page as private, so kvm's fault
> handler would think there's a shared/private mismatch and exit with
> KVM_EXIT_MEMORY_FAULT. Userspace now has a zeroed shared page, and the
> guest needs to re-accept the page to continue using it (if it knows what
> to do with a zeroed page). This would be userspace DOS-ing the guest,
> which userspace can do anyway.
>
> On pKVM, rephrasing what you said, even if there is a rogue private to
> shared conversion, EL2 still thinks of the page as private. After the
> conversion, the page can be faulted in by the host, but any access will
> be stopped by EL2.
>
> David, there's no missing piece in the flow!
>
> > When the VMM subsequently issues the KVM_SET_MEMORY_ATTRIBUTES2 ioctl
> > with the READABLE policy, we will use the
> > kvm_gmem_arch_content_policy_supported() hook in EL1 to validate the
> > ioctl. We will cross-reference the requested gfn and nr_pages against
> > the pending exit reason stored in kvm_run.
> >
> > If the VMM attempts an unsolicited conversion (i.e., there is no
> > matching exit request in kvm_run, or the addresses do not match), our
>
> Ah I see, so struct kvm_run is not considered "in the hypervisor" since
> it is modifiable by host userspace. Would you be using struct
> memory_fault in struct kvm_run?
>
> Which vcpu's kvm_run struct would you look up from
> kvm_gmem_arch_content_policy_supported()?
>
> For this to land, do you still want the gfn and nr_pages parameters?
>
> Can pKVM just always accept the request, whether the guest requested it
> or not? Thinking about it again,
> kvm_gmem_arch_content_policy_supported() probably shouldn't be used to
> guard solicited vs unsolicited requests anyway (unless you think the
> function's name should be changed?)

You spotted a flaw in my proposed validation mechanic. As you said,
KVM_SET_MEMORY_ATTRIBUTES2 is a VM-scoped ioctl, meaning we lack the
vCPU context required to safely inspect a specific kvm_run structure.
Attempting to track this state VM-wide in struct kvm_arch would
require introducing a lock on a hot memory-transition path, which we
want to avoid.

Your assessment of the security implications and the TDX fallback
mechanism is correct and applies to pKVM. Because EL2 maintains the
ultimate source of truth regarding Stage-2 permissions, a rogue
conversion by the VMM cannot breach guest confidentiality.

Given this, it makes sense for pKVM will adopt the TDX approach. We
will drop the requirement to synchronously validate the guest's intent
during the ioctl. We will always accept the request in EL1. If the VMM
is lying and the guest never actually authorized the share, the
resulting attribute mismatch will trigger a KVM_EXIT_MEMORY_FAULT upon
the next guest access, or a SIGBUS/SIGSEGV if the host userspace
attempts to map and read the memory. This will DOS/crash the VMM
without endangering the host kernel.

Therefore, kvm_gmem_arch_content_policy_supported() should not be used
for dynamic state-machine validation. We will use it strictly as a
static capability check (e.g., verifying that the requested policy is
architecturally possible).

I think that we would still require gfn and nr_pages in the hook to
allow for potential range-based capability checks in the future, but
not use them to cross-reference pending conversion requests.

Cheers,
/fuad

> > current plan is to reject the request and return an error. In the
> > future, rather than outright rejecting an unsolicited conversion, we
> > might evolve this to treat it as a host-initiated destructive reclaim,
> > forcing an unshare and zeroing the memory. For the time being,
> > explicit rejection is the simplest and safest path.
> >
>
> >> ENCRYPTED means that after the conversion, the memory contents are
> >> retained as-is, with no decryption.
> >>
> >> + TDX and SNP (today) can't support this, so return -EOPNOTSUPP
> >> + pKVM and SW_PROTECTED_VM can do nothing, but doing nothing retains
> >>   READABLE content, not ENCRYPTED content, hence SW_PROTECTED_VM
> >>   should return -EOPNOTSUPP.
> >> + Michael, you mentioned during the call that SNP is planning to
> >>   introduce a policy that retains the ENCRYPTED version for a special
> >>   GHCB call. ENCRYPTED is meant for that use case. Does it work? I'm
> >>   assuming that SNP should only support this policy given some
> >>   conditions, so would the arch call as described above work?
> >> + If this policy is specified on conversion from shared to private,
> >>   always return -EOPNOTSUPP.
> >> + When this first lands, ENCRYPTED will not be a valid option, but I'm
> >>   listing it here so we have line of sight to having this support.
> >>
> >> READABLE and ENCRYPTED defines the state after conversion clearly
> >> (instead of DONT_CARE or similar).
> >>
> >> DESTROY could be another policy, which means that after the
> >> conversion, the memory is unreadable. This is the option to address
> >> what David brought up during the call, for cases where userspace knows
> >> it is going to free the memory already and doesn't care about the
> >> state as long as nobody gets to read it. This will not implemented
> >> when feature first lands, but is presented here just to show how this
> >> can be extended in future.
> >>
> >> Right now, I'm thinking that one of the above policies MUST be
> >> specified (not specifying a policy will result in -EINVAL).
> >>
> >> How does this sound?
> >
> > I don't think that returning -EINVAL is the right thing to do here. If
> > userspace omits the policy, the API should default to
> > SET_MEMORY_ATTRIBUTES_CONTENT_ZERO and proceed with the conversion. I
> > believe that, in Linux APIs in general, omitting an optional behavior
> > flag results in the safest, most standard default action.
> >
>
> Makes sense, I'll default to zeroing then. Thanks!
>
> > Also, returning -EINVAL when no policy is specified makes the policy
> > parameter strictly mandatory. This makes it difficult for userspace's
> > to seamlessly request clean-slate, destructive conversions. Software
> > zeroing ensures deterministic behavior across pKVM, TDX, and SNP,
> > isolating the KVM uAPI from micro-architectural data destruction
> > nuances.
> >
> > Cheers,
> > /fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-02-14 20:09   ` Ackerley Tng
                       ` (2 preceding siblings ...)
  2026-02-24 10:14     ` Ackerley Tng
@ 2026-03-12  5:44     ` Ackerley Tng
  2026-03-12 15:12       ` Fuad Tabba
  3 siblings, 1 reply; 54+ messages in thread
From: Ackerley Tng @ 2026-03-12  5:44 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86
  Cc: aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tabba, tglx, vannapurve, vbabka, willy, wyihan,
	yan.y.zhao

Ackerley Tng <ackerleytng@google.com> writes:

Here's iteration 2 of the attributes, after getting a much clearer idea
of the use cases across platforms at the last guest_memfd biweekly.

Please comment in this context! I'm planning for this text to make it to
Documentation/virt/kvm/api.rst.

> Ackerley Tng <ackerleytng@google.com> writes:
>
>>
>> [...snip...]
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 23ec0b0c3e22..26e80745c8b4 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -117,7 +117,7 @@ description:
>>        x86 includes both i386 and x86_64.
>>
>>    Type:
>> -      system, vm, or vcpu.
>> +      system, vm, vcpu or guest_memfd.
>>
>>    Parameters:
>>        what parameters are accepted by the ioctl.
>> @@ -6523,11 +6523,22 @@ the capability to be present.
>>  ---------------------------------
>>
>>  :Capability: KVM_CAP_MEMORY_ATTRIBUTES2
>> -:Architectures: x86
>> -:Type: vm ioctl
>> +:Architectures: all
>> +:Type: vm, guest_memfd ioctl
>>  :Parameters: struct kvm_memory_attributes2 (in/out)
>>  :Returns: 0 on success, <0 on error
>>
>> +Errors:
>> +
>> +  ========== ===============================================================
>> +  EINVAL     The specified `offset` or `size` were invalid (e.g. not
>> +             page aligned, causes an overflow, or size is zero).
>> +  EFAULT     The parameter address was invalid.
>> +  EAGAIN     Some page within requested range had unexpected refcounts. The
>> +             offset of the page will be returned in `error_offset`.
>> +  ENOMEM     Ran out of memory trying to track private/shared state
      EOPNOTSUPP The specified content policy is not supported while
                 setting the requested attribute
>> +  ========== ===============================================================
>> +
>>  KVM_SET_MEMORY_ATTRIBUTES2 is an extension to
>>  KVM_SET_MEMORY_ATTRIBUTES that supports returning (writing) values to
>>  userspace.  The original (pre-extension) fields are shared with
>> @@ -6538,15 +6549,42 @@ Attribute values are shared with KVM_SET_MEMORY_ATTRIBUTES.
>>  ::
>>
>>    struct kvm_memory_attributes2 {
>> -	__u64 address;
>> +	/* in */
>> +	union {
>> +		__u64 address;
>> +		__u64 offset;
>> +	};
>>  	__u64 size;
>>  	__u64 attributes;
>>  	__u64 flags;
>> -	__u64 reserved[12];
>> +	/* out */
>> +	__u64 error_offset;
>> +	__u64 reserved[11];
>>    };
>>
>>    #define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
>>
>> +Set attributes for a range of offsets within a guest_memfd to
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE to limit the specified guest_memfd backed
>> +memory range for guest_use. Even if KVM_CAP_GUEST_MEMFD_MMAP is
>> +supported, after a successful call to set
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE, the requested range will not be mappable
>> +into host userspace and will only be mappable by the guest.
>> +
>> +To allow the range to be mappable into host userspace again, call
>> +KVM_SET_MEMORY_ATTRIBUTES2 on the guest_memfd again with
>> +KVM_MEMORY_ATTRIBUTE_PRIVATE unset.
>> +
>> +If this ioctl returns -EAGAIN, the offset of the page with unexpected
>> +refcounts will be returned in `error_offset`. This can occur if there
>> +are transient refcounts on the pages, taken by other parts of the
>> +kernel.
>> +
>> +Userspace is expected to figure out how to remove all known refcounts
>> +on the shared pages, such as refcounts taken by get_user_pages(), and
>> +try the ioctl again. A possible source of these long term refcounts is
>> +if the guest_memfd memory was pinned in IOMMU page tables.
>> +

Memory *content* policies can be requested while setting memory
attributes. This defines:

  - What the host reads after a private to shared conversion
  - What the guest reads after a shared to private conversion (if
    applicable)

The policy definitions below provide more details:

``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)

  On a private to shared conversion, the host will read zeros from the
  converted memory on the next fault after successful return of the
  KVM_SET_MEMORY_ATTRIBUTES2 ioctl.

  This is not supported (-EOPNOTSUPP) for a shared to private
  conversion. While some CoCo implementations do zero memory contents
  such that the guest reads zeros after conversion, the guest is not
  expected to trust host-provided zeroing, hence as a UAPI policy, KVM
  does not make any such guarantees.

  For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
  will support this policy and ensure zeroing for conversions in both
  directions.

``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``

  On private/shared conversions in both directions, memory contents
  will be preserved and readable. As a concrete example, if the host
  writes ``0xbeef`` to memory and converts the memory to shared, the
  guest will also read ``0xbeef``, after any necessary hardware or
  software provided decryption. After a reverse shared to private
  conversion, the host will also read ``0xbeef``.

  pKVM (ARM) is the first user of this policy. Since pKVM does not
  protect memory with encryption, a content policy to preserve memory
  will not will not involve any decryption. The guest will be able to
  read what the host wrote with full content preservation.

  For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
  will support this policy and the contents of converted memory will
  be preserved.

``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_NONE``

  This is an explicit request that KVM provide no guarantees on memory
  contents after conversion. Neither host nor guest should expect any
  guarantees about the memory contents after conversion.

  For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle will
  support this policy and every byte of converted memory will read
  ``0xab``.

>>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
>>
>
> [...snip...]
>

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-12  5:44     ` Ackerley Tng
@ 2026-03-12 15:12       ` Fuad Tabba
  2026-03-12 15:44         ` Sean Christopherson
  0 siblings, 1 reply; 54+ messages in thread
From: Fuad Tabba @ 2026-03-12 15:12 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, seanjc, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Hi Ackerley,

Before getting into the UAPI semantics, thank you for all the heavy
lifting you've done here. Figuring out how to make it all work across
the different platforms is not easy :)

<snip>

> The policy definitions below provide more details:
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)
>
>   On a private to shared conversion, the host will read zeros from the
>   converted memory on the next fault after successful return of the
>   KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>
>   This is not supported (-EOPNOTSUPP) for a shared to private
>   conversion. While some CoCo implementations do zero memory contents
>   such that the guest reads zeros after conversion, the guest is not
>   expected to trust host-provided zeroing, hence as a UAPI policy, KVM
>   does not make any such guarantees.

The rationale for not supporting this in the UAPI isn't quite right
and I think that the prohibition should be removed. It's true that the
guest is not expected to trust host-provided zeroing. However, if the
VMM invokes this ioctl with the ZERO policy, the zeroing is performed
by the hypervisor, not by the (untrusted) host.

Although pKVM handles fresh, zeroed memory provisioning via donation
rather than attribute conversion, stating that the UAPI cannot make
guarantees due to trust boundaries is incorrect. The hypervisor is
precisely the entity the guest trusts to enforce this.

The UAPI should define the semantics for a shared-to-private ZERO
conversion, even if current architectures return -EOPNOTSUPP because
they handle fresh memory provisioning via other mechanisms (like
pKVM's donation path).

How about something like the following:

On a shared to private conversion, the hypervisor will zero the memory
contents before mapping it into the guest's private address space,
preventing the untrusted host from injecting arbitrary data into the
guest. If an architecture handles zeroed-provisioning via mechanisms
other than attribute conversion, it may return -EOPNOTSUPP.

>   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
>   will support this policy and ensure zeroing for conversions in both
>   directions.
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``
>
>   On private/shared conversions in both directions, memory contents
>   will be preserved and readable. As a concrete example, if the host
>   writes ``0xbeef`` to memory and converts the memory to shared, the
>   guest will also read ``0xbeef``, after any necessary hardware or
>   software provided decryption. After a reverse shared to private
>   conversion, the host will also read ``0xbeef``.

I think that this example is backwards. If the host writes to memory,
that memory is already shared, isn't it? Converting it to shared is
redundant. More importantly, if memory undergoes a shared-to-private
conversion, the host must lose access entirely.

Maybe a clearer example would reflect actual payload injection and
bounce buffer sharing:
- Shared-to-Private (Payload Injection): The host writes a payload
(e.g., 0xbeef) to shared memory and converts it to private. The guest
reads 0xbeef in its private address space. The host loses access.
- Private-to-Shared (Bounce Buffer): The guest writes 0xbeef to
private memory and converts it to shared. The host reads 0xbeef.

>   pKVM (ARM) is the first user of this policy. Since pKVM does not
>   protect memory with encryption, a content policy to preserve memory
>   will not will not involve any decryption. The guest will be able to
>   read what the host wrote with full content preservation.

This is correct, but to be precise, I think it should explicitly
mention Stage-2 page tables as the protection mechanism, maybe:

Because pKVM protects private memory via Stage-2 page table isolation
rather than hardware encryption, content preservation does not require
cryptographic operations. The hypervisor (at EL2) simply transfers
page ownership while retaining the data.

Cheers,
/fuad

>   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
>   will support this policy and the contents of converted memory will
>   be preserved.
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_NONE``
>
>   This is an explicit request that KVM provide no guarantees on memory
>   contents after conversion. Neither host nor guest should expect any
>   guarantees about the memory contents after conversion.
>
>   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle will
>   support this policy and every byte of converted memory will read
>   ``0xab``.
>
> >>  See also: :ref: `KVM_SET_MEMORY_ATTRIBUTES`.
> >>
> >
> > [...snip...]
> >

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-12 15:12       ` Fuad Tabba
@ 2026-03-12 15:44         ` Sean Christopherson
  2026-03-12 21:59           ` Ackerley Tng
  0 siblings, 1 reply; 54+ messages in thread
From: Sean Christopherson @ 2026-03-12 15:44 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: Ackerley Tng, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-trace-kernel, x86, aik, andrew.jones, binbin.wu, bp,
	brauner, chao.p.peng, chao.p.peng, chenhuacai, corbet,
	dave.hansen, david, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao

On Thu, Mar 12, 2026, Fuad Tabba wrote:
> Hi Ackerley,
> 
> Before getting into the UAPI semantics, thank you for all the heavy
> lifting you've done here. Figuring out how to make it all work across
> the different platforms is not easy :)
> 
> <snip>
> 
> > The policy definitions below provide more details:

Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
these are not "policy", they are purely properties of the underlying memory.
Userspace will likely use the attributes to implement policy of some kind, but
KVM straight up doesn't care.

> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)

The default behavior absolutely cannot be something that's not supported on
every conversion type.

> >
> >   On a private to shared conversion, the host will read zeros from the
> >   converted memory on the next fault after successful return of the
> >   KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
> >
> >   This is not supported (-EOPNOTSUPP) for a shared to private
> >   conversion. While some CoCo implementations do zero memory contents
> >   such that the guest reads zeros after conversion, the guest is not
> >   expected to trust host-provided zeroing, hence as a UAPI policy, KVM
> >   does not make any such guarantees.
> 
> The rationale for not supporting this in the UAPI isn't quite right
> and I think that the prohibition should be removed. It's true that the
> guest is not expected to trust host-provided zeroing. However, if the
> VMM invokes this ioctl with the ZERO policy, the zeroing is performed
> by the hypervisor, not by the (untrusted) host.

What entity zeros the data doesn't matter as far as KVM's ABI is concerned.  That's
a motivating favor to providing ZERO, e.g. it allow userspace to elide additional
zeroing when it _knows_ the memory holds zeros, but that's orthogonal to KVM's
contract with userspace.

> Although pKVM handles fresh, zeroed memory provisioning via donation
> rather than attribute conversion, stating that the UAPI cannot make
> guarantees due to trust boundaries is incorrect. The hypervisor is

We should avoid using "hypervisor", because (a) it means different things to
different people and (b) even when there's consensus on what "hypervisor" means,
whether or not the hypervisor is trusted varies per implementation.

> need to be careful witho precisely the entity the guest trusts to enforce
> this.
> 
> The UAPI should define the semantics for a shared-to-private ZERO
> conversion, even if current architectures return -EOPNOTSUPP because
> they handle fresh memory provisioning via other mechanisms (like
> pKVM's donation path).
> 
> How about something like the following:
> 
> On a shared to private conversion, the hypervisor will zero the memory

Again, say _nothing_ about "the hypervisor".  _How_ or when anything happens is
completely irrelevant, the only thing that matters here is _what_ happens.

> contents before mapping it into the guest's private address space,
> preventing the untrusted host from injecting arbitrary data into the
> guest. If an architecture handles zeroed-provisioning via mechanisms
> other than attribute conversion, it may return -EOPNOTSUPP.

No.  I am 100% against bleeding vendor specific information into KVM's ABI for
this.  What the vendor code does is irrelevant, the _only_ thing that matters
here is KVM's contract with userspace.

That doesn't mean pKVM guests can't rely on memory being zeroed, but that is a
contract between pKVM and its guests, not between KVM and host userspace.

> >   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
> >   will support this policy and ensure zeroing for conversions in both
> >   directions.
> >
> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``
> >
> >   On private/shared conversions in both directions, memory contents
> >   will be preserved and readable. As a concrete example, if the host
> >   writes ``0xbeef`` to memory and converts the memory to shared, the
> >   guest will also read ``0xbeef``, after any necessary hardware or
> >   software provided decryption. After a reverse shared to private
> >   conversion, the host will also read ``0xbeef``.
> 
> I think that this example is backwards. If the host writes to memory,
> that memory is already shared, isn't it? Converting it to shared is
> redundant. More importantly, if memory undergoes a shared-to-private
> conversion, the host must lose access entirely.

Ya, it's messed up.

> Maybe a clearer example would reflect actual payload injection and
> bounce buffer sharing:
> - Shared-to-Private (Payload Injection): The host writes a payload
> (e.g., 0xbeef) to shared memory and converts it to private. The guest
> reads 0xbeef in its private address space. The host loses access.
> - Private-to-Shared (Bounce Buffer): The guest writes 0xbeef to
> private memory and converts it to shared. The host reads 0xbeef.
> 
> >   pKVM (ARM) is the first user of this policy. Since pKVM does not
> >   protect memory with encryption, a content policy to preserve memory
> >   will not will not involve any decryption. The guest will be able to
> >   read what the host wrote with full content preservation.
> 
> This is correct, but to be precise, I think it should explicitly
> mention Stage-2 page tables as the protection mechanism, maybe:

pKVM shouldn't be mentioned in here at all.

---
By default, KVM makes no guarantees about the in-memory values after memory is
convert to/from shared/private.  Optionally, userspace may instruct KVM to
ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
sharing of data, or as an optimization to avoid having to re-zero memory when
the trusted entity guarantees the memory will be zeroed after conversion.

The behaviors supported by a given KVM instance can be queried via <cap>.  If
the requested behavior is an unsupported, KVM will return -EOPNOTSUPP and
reject the conversion request.  Note!  The "ZERO" request is only support for
private to shared conversion!

``KVM_SET_MEMORY_ATTRIBUTES2_ZERO``

  On conversion, KVM guarantees all entities that have "allowed" access to the
  memory will read zeros.  E.g. on private to shared conversion, both trusted
  and untrusted code will read zeros.

  Zeroing is currently only supported for private-to-shared conversions, as KVM
  in general is untrusted and thus cannot guarantee the guest (or any trusted
  entity) will read zeros after conversion.  Note, some CoCo implementations do
  zero memory contents such that the guest reads zeros after conversion, and
  the guest may choose to rely on that behavior.  But that's a contract between
  the trusted CoCo entity and the guest, not between KVM and the guest.

``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``

  On conversion, KVM guarantees memory contents will be preserved with respect
  to the last written unencrypted value.  As a concrete example, if the host
  writes ``0xbeef`` to shared memory and converts the memory to private, the
  guest will also read ``0xbeef``, even if the in-memory data is encrypted as
  part of the conversion.  And vice versa, if the guest writes ``0xbeef`` to
  private memory and then converts the memory to shared, the host (and guest)
  will read ``0xbeef`` (if the memory is accessible).

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-12 15:44         ` Sean Christopherson
@ 2026-03-12 21:59           ` Ackerley Tng
  2026-03-13  0:36             ` Sean Christopherson
  2026-03-13  8:31             ` Fuad Tabba
  0 siblings, 2 replies; 54+ messages in thread
From: Ackerley Tng @ 2026-03-12 21:59 UTC (permalink / raw)
  To: Sean Christopherson, Fuad Tabba
  Cc: kvm, linux-doc, linux-kernel, linux-kselftest, linux-trace-kernel,
	x86, aik, andrew.jones, binbin.wu, bp, brauner, chao.p.peng,
	chao.p.peng, chenhuacai, corbet, dave.hansen, david, hpa,
	ira.weiny, jgg, jmattson, jroedel, jthoughton, maobibo,
	mathieu.desnoyers, maz, mhiramat, michael.roth, mingo, mlevitsk,
	oupton, pankaj.gupta, pbonzini, prsampat, qperret, ricarkol,
	rick.p.edgecombe, rientjes, rostedt, shivankg, shuah,
	steven.price, tglx, vannapurve, vbabka, willy, wyihan, yan.y.zhao

Sean Christopherson <seanjc@google.com> writes:

> On Thu, Mar 12, 2026, Fuad Tabba wrote:
>> Hi Ackerley,
>>
>> Before getting into the UAPI semantics, thank you for all the heavy
>> lifting you've done here. Figuring out how to make it all work across
>> the different platforms is not easy :)
>>
>> <snip>
>>
>> > The policy definitions below provide more details:
>
> Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> these are not "policy", they are purely properties of the underlying memory.
> Userspace will likely use the attributes to implement policy of some kind, but
> KVM straight up doesn't care.

Policy might have been the wrong word. I think this is a property of the
conversion process/request, not a property of the memory like how
shared/private is a property of the memory?

I'll have to find another word to describe this enum of

* KVM_SET_MEMORY_ATTRIBUTES2_ZERO
* KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE

>
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_ZERO`` (default)
>
> The default behavior absolutely cannot be something that's not supported on
> every conversion type.
>
>> >
>> >   On a private to shared conversion, the host will read zeros from the
>> >   converted memory on the next fault after successful return of the
>> >   KVM_SET_MEMORY_ATTRIBUTES2 ioctl.
>> >
>> >   This is not supported (-EOPNOTSUPP) for a shared to private
>> >   conversion. While some CoCo implementations do zero memory contents
>> >   such that the guest reads zeros after conversion, the guest is not
>> >   expected to trust host-provided zeroing, hence as a UAPI policy, KVM
>> >   does not make any such guarantees.
>>
>> The rationale for not supporting this in the UAPI isn't quite right
>> and I think that the prohibition should be removed. It's true that the
>> guest is not expected to trust host-provided zeroing. However, if the
>> VMM invokes this ioctl with the ZERO policy, the zeroing is performed
>> by the hypervisor, not by the (untrusted) host.
>
> What entity zeros the data doesn't matter as far as KVM's ABI is concerned.  That's
> a motivating favor to providing ZERO, e.g. it allow userspace to elide additional
> zeroing when it _knows_ the memory holds zeros, but that's orthogonal to KVM's
> contract with userspace.
>
>> Although pKVM handles fresh, zeroed memory provisioning via donation
>> rather than attribute conversion, stating that the UAPI cannot make
>> guarantees due to trust boundaries is incorrect. The hypervisor is
>
> We should avoid using "hypervisor", because (a) it means different things to
> different people and (b) even when there's consensus on what "hypervisor" means,
> whether or not the hypervisor is trusted varies per implementation.
>
>> need to be careful witho precisely the entity the guest trusts to enforce
>> this.
>>
>> The UAPI should define the semantics for a shared-to-private ZERO
>> conversion, even if current architectures return -EOPNOTSUPP because
>> they handle fresh memory provisioning via other mechanisms (like
>> pKVM's donation path).
>>
>> How about something like the following:
>>
>> On a shared to private conversion, the hypervisor will zero the memory
>
> Again, say _nothing_ about "the hypervisor".  _How_ or when anything happens is
> completely irrelevant, the only thing that matters here is _what_ happens.
>
>> contents before mapping it into the guest's private address space,
>> preventing the untrusted host from injecting arbitrary data into the
>> guest. If an architecture handles zeroed-provisioning via mechanisms
>> other than attribute conversion, it may return -EOPNOTSUPP.
>
> No.  I am 100% against bleeding vendor specific information into KVM's ABI for
> this.  What the vendor code does is irrelevant, the _only_ thing that matters
> here is KVM's contract with userspace.
>
> That doesn't mean pKVM guests can't rely on memory being zeroed, but that is a
> contract between pKVM and its guests, not between KVM and host userspace.
>

If pKVM's (kernel, or elsewhere) documentation says something like

  Shared to private (in addition to private to shared already specified
  in the userspace/KVM contract) conversions through guest_memfd
  specifying ZERO will have memory contents zeroed.

Would that then cover both perspectives? I see Fuad's point that pKVM
would like to provide guarantees in the shared to private direction too,
and I see Sean's point that the shared to private direction isn't a
userspace/KVM thing.

The awkward part is that we guarantee both directions for PRESERVE but
not for ZERO.

>> >   For testing purposes, the KVM_X86_SW_PROTECTED_VM testing vehicle
>> >   will support this policy and ensure zeroing for conversions in both
>> >   directions.
>> >
>> > ``KVM_SET_MEMORY_ATTRIBUTES2_CONTENT_POLICY_PRESERVE``
>> >
>> >   On private/shared conversions in both directions, memory contents
>> >   will be preserved and readable. As a concrete example, if the host
>> >   writes ``0xbeef`` to memory and converts the memory to shared, the
>> >   guest will also read ``0xbeef``, after any necessary hardware or
>> >   software provided decryption. After a reverse shared to private
>> >   conversion, the host will also read ``0xbeef``.
>>
>> I think that this example is backwards. If the host writes to memory,
>> that memory is already shared, isn't it? Converting it to shared is
>> redundant. More importantly, if memory undergoes a shared-to-private
>> conversion, the host must lose access entirely.
>
> Ya, it's messed up.
>

Omg, it is backwards!! Might have been copypasta...

>> Maybe a clearer example would reflect actual payload injection and
>> bounce buffer sharing:
>> - Shared-to-Private (Payload Injection): The host writes a payload
>> (e.g., 0xbeef) to shared memory and converts it to private. The guest
>> reads 0xbeef in its private address space. The host loses access.
>> - Private-to-Shared (Bounce Buffer): The guest writes 0xbeef to
>> private memory and converts it to shared. The host reads 0xbeef.
>>
>> >   pKVM (ARM) is the first user of this policy. Since pKVM does not
>> >   protect memory with encryption, a content policy to preserve memory
>> >   will not will not involve any decryption. The guest will be able to
>> >   read what the host wrote with full content preservation.
>>
>> This is correct, but to be precise, I think it should explicitly
>> mention Stage-2 page tables as the protection mechanism, maybe:
>
> pKVM shouldn't be mentioned in here at all.
>
> ---
> By default, KVM makes no guarantees about the in-memory values after memory is
> convert to/from shared/private.  Optionally, userspace may instruct KVM to
> ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
> sharing of data, or as an optimization to avoid having to re-zero memory when
> the trusted entity guarantees the memory will be zeroed after conversion.
>

How about:

or as an optimization to avoid having to re-zero memory when userspace
could have relied on the trusted entity to guarantee the memory will be
zeroed as part of the entire conversion process.

> The behaviors supported by a given KVM instance can be queried via <cap>.  If

I started with some implementation and was questioning the value of a
CAP. It seems like there won't be anything dynamic about this?

The userspace code can check what platform it is running on, and then
decide ZERO or PRESERVE based on the platform:

If the VM is running on TDX, it would want to specify ZERO all the
time. If the VM were running on pKVM it would want to specify PRESERVE
if it wants to enable in-place sharing, and ZERO if it wants to zero the
memory.

If someday TDX supports PRESERVE, then there's room for discovery of
which algorithm to choose when running the guest. Perhaps that's when
the CAP should be introduced?

> the requested behavior is an unsupported, KVM will return -EOPNOTSUPP and
> reject the conversion request.  Note!  The "ZERO" request is only support for
> private to shared conversion!
>

Do you mean ZERO is only guaranteed for private to shared? If we say
"ZERO is only guaranteed for private to shared", then pKVM could
additionally guarantee zeroing for shared to private. If we say it is
only supported for private to shared, then should I return -EOPNOTSUPP
and therefore not allow platforms to provide other guarantees?

I think we should stick to guarantees for this

* not specified (default) = no guarantees whatsoever
* ZERO = guaranteed zero for shared to private, no guarantees for
         private to shared. Platforms can add on more guarantees.
* PRESERVE = guaranteed preseved in both directions

-EOPNOTSUPP should probably be understood as "There is no way to
guarantee this" like how TDX would return -EOPNOTSUPP for PRESERVE
(now).

> ``KVM_SET_MEMORY_ATTRIBUTES2_ZERO``
>
>   On conversion, KVM guarantees all entities that have "allowed" access to the
>   memory will read zeros.  E.g. on private to shared conversion, both trusted
>   and untrusted code will read zeros.
>
>   Zeroing is currently only supported for private-to-shared conversions, as KVM
>   in general is untrusted and thus cannot guarantee the guest (or any trusted
>   entity) will read zeros after conversion.  Note, some CoCo implementations do
>   zero memory contents such that the guest reads zeros after conversion, and
>   the guest may choose to rely on that behavior.  But that's a contract between
>   the trusted CoCo entity and the guest, not between KVM and the guest.
>
> ``KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE``
>
>   On conversion, KVM guarantees memory contents will be preserved with respect
>   to the last written unencrypted value.  As a concrete example, if the host
>   writes ``0xbeef`` to shared memory and converts the memory to private, the
>   guest will also read ``0xbeef``, even if the in-memory data is encrypted as
>   part of the conversion.  And vice versa, if the guest writes ``0xbeef`` to
>   private memory and then converts the memory to shared, the host (and guest)
>   will read ``0xbeef`` (if the memory is accessible).

Thank you for this summary :)

I see you dropped any documentation to do with testing. I meant to
document it (at least something about the unspecified case) so it can be
relied on in selftests, with the understanding (already specified
elsewhere in Documentation/virt/kvm/api.rst) that nothing about
KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be
changed anytime. What do you think?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-12 21:59           ` Ackerley Tng
@ 2026-03-13  0:36             ` Sean Christopherson
  2026-03-13  8:32               ` Fuad Tabba
  2026-03-13  8:31             ` Fuad Tabba
  1 sibling, 1 reply; 54+ messages in thread
From: Sean Christopherson @ 2026-03-13  0:36 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Fuad Tabba, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-trace-kernel, x86, aik, andrew.jones, binbin.wu, bp,
	brauner, chao.p.peng, chao.p.peng, chenhuacai, corbet,
	dave.hansen, david, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao

On Thu, Mar 12, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Thu, Mar 12, 2026, Fuad Tabba wrote:
> >> Hi Ackerley,
> >>
> >> Before getting into the UAPI semantics, thank you for all the heavy
> >> lifting you've done here. Figuring out how to make it all work across
> >> the different platforms is not easy :)
> >>
> >> <snip>
> >>
> >> > The policy definitions below provide more details:
> >
> > Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> > these are not "policy", they are purely properties of the underlying memory.
> > Userspace will likely use the attributes to implement policy of some kind, but
> > KVM straight up doesn't care.
> 
> Policy might have been the wrong word. I think this is a property of the
> conversion process/request, not a property of the memory like how
> shared/private is a property of the memory?
> 
> I'll have to find another word to describe this enum of

Or just don't?  I'm 100% serious, because unless we carve out a field _just_ for
these two flags, they're eventually going to get mixed with other stuff.  At that
point, having a precisely named enum container just gets in the way.

> I see you dropped any documentation to do with testing.

Yes.

> I meant to document it (at least something about the unspecified case) so it
> can be relied on in selftests, with the understanding (already specified
> elsewhere in Documentation/virt/kvm/api.rst) that nothing about
> KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be changed
> anytime. What do you think?

KVM_X86_SW_PROTECTED_VM should self-report like all other VM types, and shouldn't
support anything that isn't documented as possible.  I.e. we shouldn't allow
ZERO on shared=>private "for testing".

What I do think we should do is scribble memory on conversions without ZERO or
PRIVATE, probably guarded by a Kconfig or maybe a module param, to do a best
effort enforcement of the ABI, i.e. to try and prevent userspace from depending
on uarch/vendor specific behavior.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-12 21:59           ` Ackerley Tng
  2026-03-13  0:36             ` Sean Christopherson
@ 2026-03-13  8:31             ` Fuad Tabba
  1 sibling, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2026-03-13  8:31 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Sean Christopherson, kvm, linux-doc, linux-kernel,
	linux-kselftest, linux-trace-kernel, x86, aik, andrew.jones,
	binbin.wu, bp, brauner, chao.p.peng, chao.p.peng, chenhuacai,
	corbet, dave.hansen, david, hpa, ira.weiny, jgg, jmattson,
	jroedel, jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao

Hi Ackerley,

<snip>

> > By default, KVM makes no guarantees about the in-memory values after memory is
> > convert to/from shared/private.  Optionally, userspace may instruct KVM to
> > ensure the contents of memory are zeroed or preserved, e.g. to enable in-place
> > sharing of data, or as an optimization to avoid having to re-zero memory when
> > the trusted entity guarantees the memory will be zeroed after conversion.
> >
>
> How about:
>
> or as an optimization to avoid having to re-zero memory when userspace
> could have relied on the trusted entity to guarantee the memory will be
> zeroed as part of the entire conversion process.
>
> > The behaviors supported by a given KVM instance can be queried via <cap>.  If
>
> I started with some implementation and was questioning the value of a
> CAP. It seems like there won't be anything dynamic about this?

We can drop the CAP for now. Probing via the ioctl and handling
-EOPNOTSUPP is entirely sufficient for the VMM to discover whether
ZERO or PRESERVE are supported for a given architecture and conversion
direction.

> The userspace code can check what platform it is running on, and then
> decide ZERO or PRESERVE based on the platform:
>
> If the VM is running on TDX, it would want to specify ZERO all the
> time. If the VM were running on pKVM it would want to specify PRESERVE
> if it wants to enable in-place sharing, and ZERO if it wants to zero the
> memory.
>
> If someday TDX supports PRESERVE, then there's room for discovery of
> which algorithm to choose when running the guest. Perhaps that's when
> the CAP should be introduced?
>
> > the requested behavior is an unsupported, KVM will return -EOPNOTSUPP and
> > reject the conversion request.  Note!  The "ZERO" request is only support for
> > private to shared conversion!

I think that this makes sensefor the UAPI. Returning -EOPNOTSUPP for
shared-to-private ZERO conversions.

For pKVM's specific use cases where the VMM requires a zeroed page to
be injected into the guest's private space via attribute conversion,
the VMM can simply `memset()` the shared memory to zero in userspace,
and then invoke the ioctl with the
`KVM_SET_MEMORY_ATTRIBUTES2_PRESERVE` flag. This completely offloads
the UAPI from making guarantees on behalf of the trusted entity, while
still satisfying pKVM's functional requirements.

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2
  2026-03-13  0:36             ` Sean Christopherson
@ 2026-03-13  8:32               ` Fuad Tabba
  0 siblings, 0 replies; 54+ messages in thread
From: Fuad Tabba @ 2026-03-13  8:32 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Ackerley Tng, kvm, linux-doc, linux-kernel, linux-kselftest,
	linux-trace-kernel, x86, aik, andrew.jones, binbin.wu, bp,
	brauner, chao.p.peng, chao.p.peng, chenhuacai, corbet,
	dave.hansen, david, hpa, ira.weiny, jgg, jmattson, jroedel,
	jthoughton, maobibo, mathieu.desnoyers, maz, mhiramat,
	michael.roth, mingo, mlevitsk, oupton, pankaj.gupta, pbonzini,
	prsampat, qperret, ricarkol, rick.p.edgecombe, rientjes, rostedt,
	shivankg, shuah, steven.price, tglx, vannapurve, vbabka, willy,
	wyihan, yan.y.zhao

Hi,

On Fri, 13 Mar 2026 at 00:36, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Mar 12, 2026, Ackerley Tng wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> >
> > > On Thu, Mar 12, 2026, Fuad Tabba wrote:
> > >> Hi Ackerley,
> > >>
> > >> Before getting into the UAPI semantics, thank you for all the heavy
> > >> lifting you've done here. Figuring out how to make it all work across
> > >> the different platforms is not easy :)
> > >>
> > >> <snip>
> > >>
> > >> > The policy definitions below provide more details:
> > >
> > > Please drop "CONTENT_POLICY" from the KVM documentation.  From KVM's perspective,
> > > these are not "policy", they are purely properties of the underlying memory.
> > > Userspace will likely use the attributes to implement policy of some kind, but
> > > KVM straight up doesn't care.
> >
> > Policy might have been the wrong word. I think this is a property of the
> > conversion process/request, not a property of the memory like how
> > shared/private is a property of the memory?
> >
> > I'll have to find another word to describe this enum of
>
> Or just don't?  I'm 100% serious, because unless we carve out a field _just_ for
> these two flags, they're eventually going to get mixed with other stuff.  At that
> point, having a precisely named enum container just gets in the way.

I agree. It makes sense to drop the enum wrapper and the "policy"
terminology entirely. Let's go with direct flags passed to the ioctl
representing the requested memory properties upon conversion.

> > I see you dropped any documentation to do with testing.
>
> Yes.
>
> > I meant to document it (at least something about the unspecified case) so it
> > can be relied on in selftests, with the understanding (already specified
> > elsewhere in Documentation/virt/kvm/api.rst) that nothing about
> > KVM_X86_SW_PROTECTED_VM is to be relied on in production, and can be changed
> > anytime. What do you think?
>
> KVM_X86_SW_PROTECTED_VM should self-report like all other VM types, and shouldn't
> support anything that isn't documented as possible.  I.e. we shouldn't allow
> ZERO on shared=>private "for testing".
>
> What I do think we should do is scribble memory on conversions without ZERO or
> PRIVATE, probably guarded by a Kconfig or maybe a module param, to do a best
> effort enforcement of the ABI, i.e. to try and prevent userspace from depending
> on uarch/vendor specific behavior.

I strongly agree with scribbling/poisoning the memory on default
conversions. If userspace specifies neither flag, actively destroying
the data in software is the only way to strictly enforce that the ABI
makes no guarantees, preventing the VMM from implicitly relying on
underlying hardware behavior (like TDX automatically zeroing).

Cheers,
/fuad

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2026-03-13  8:33 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-02-02 22:36 [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 01/37] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 02/37] KVM: Rename KVM_GENERIC_MEMORY_ATTRIBUTES to KVM_VM_MEMORY_ATTRIBUTES Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 03/37] KVM: Enumerate support for PRIVATE memory iff kvm_arch_has_private_mem is defined Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 04/37] KVM: Stub in ability to disable per-VM memory attribute tracking Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 05/37] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 06/37] KVM: guest_memfd: Update kvm_gmem_populate() to use gmem attributes Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 07/37] KVM: Introduce KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 08/37] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 09/37] KVM: guest_memfd: Add support for KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-02-14 20:09   ` Ackerley Tng
2026-02-17 23:04     ` Sean Christopherson
2026-02-19 12:43     ` Fuad Tabba
2026-02-24 10:14     ` Ackerley Tng
2026-02-25 11:00       ` Fuad Tabba
2026-02-26  4:16         ` Ackerley Tng
2026-02-26  8:11           ` Fuad Tabba
2026-03-12  5:44     ` Ackerley Tng
2026-03-12 15:12       ` Fuad Tabba
2026-03-12 15:44         ` Sean Christopherson
2026-03-12 21:59           ` Ackerley Tng
2026-03-13  0:36             ` Sean Christopherson
2026-03-13  8:32               ` Fuad Tabba
2026-03-13  8:31             ` Fuad Tabba
2026-02-02 22:29 ` [RFC PATCH v2 10/37] KVM: guest_memfd: Handle lru_add fbatch refcounts during conversion safety check Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 11/37] KVM: Move KVM_VM_MEMORY_ATTRIBUTES config definition to x86 Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 12/37] KVM: Let userspace disable per-VM mem attributes, enable per-gmem attributes Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 13/37] KVM: selftests: Create gmem fd before "regular" fd when adding memslot Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 14/37] KVM: selftests: Rename guest_memfd{,_offset} to gmem_{fd,offset} Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 15/37] KVM: selftests: Add support for mmap() on guest_memfd in core library Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 16/37] KVM: selftests: Add selftests global for guest memory attributes capability Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 17/37] KVM: selftests: Update framework to use KVM_SET_MEMORY_ATTRIBUTES2 Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 18/37] KVM: selftests: Add helpers for calling ioctls on guest_memfd Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 19/37] KVM: selftests: Test using guest_memfd for guest private memory Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 20/37] KVM: selftests: Test basic single-page conversion flow Ackerley Tng
2026-02-02 22:29 ` [RFC PATCH v2 21/37] KVM: selftests: Test conversion flow when INIT_SHARED Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 22/37] KVM: selftests: Test indexing in guest_memfd Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 23/37] KVM: selftests: Test conversion before allocation Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 24/37] KVM: selftests: Convert with allocated folios in different layouts Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 25/37] KVM: selftests: Test precision of conversion Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 26/37] KVM: selftests: Test that truncation does not change shared/private status Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 27/37] KVM: selftests: Test that shared/private status is consistent across processes Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 28/37] KVM: selftests: Test conversion with elevated page refcount Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 29/37] KVM: selftests: Reset shared memory after hole-punching Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 30/37] KVM: selftests: Provide function to look up guest_memfd details from gpa Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 31/37] KVM: selftests: Provide common function to set memory attributes Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 32/37] KVM: selftests: Check fd/flags provided to mmap() when setting up memslot Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 33/37] KVM: selftests: Make TEST_EXPECT_SIGBUS thread-safe Ackerley Tng
2026-02-14 19:49   ` Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 34/37] KVM: selftests: Update private_mem_conversions_test to mmap() guest_memfd Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 35/37] KVM: selftests: Add script to exercise private_mem_conversions_test Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 36/37] KVM: selftests: Update pre-fault test to work with per-guest_memfd attributes Ackerley Tng
2026-02-02 22:30 ` [RFC PATCH v2 37/37] KVM: selftests: Update private memory exits test work with per-gmem attributes Ackerley Tng
2026-02-20  9:09 ` [RFC PATCH v2 00/37] guest_memfd: In-place conversion support Lisa Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox